Re: Subset Matching

2016-03-25 Thread Sujit Pal
Hi Otmar,

Shouldn't Occur.SHOULD alone do what you ask? Documents that match all
terms in the query would be scored higher than documents that match fewer
than all terms.
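A minimal sketch, with a hypothetical field name and term list (the builder
API shown is Lucene 5.x; older versions add clauses to a BooleanQuery
directly):

import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class AllOrSubsetQuery {
  // Documents matching more SHOULD clauses generally score higher,
  // so full matches float to the top of the result list.
  public static BooleanQuery build(String field, List<String> terms) {
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (String t : terms) {
      builder.add(new TermQuery(new Term(field, t)), Occur.SHOULD);
    }
    return builder.build();
  }
}

Setting minimumNumberShouldMatch on the builder then gives you a knob
between pure OR and pure AND behavior.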

-sujit

On Fri, Mar 25, 2016 at 2:20 AM, Otmar Caduff  wrote:

> Hi all
> In Lucene, I know of the possibility of Occur.SHOULD, Occur.MUST and the
> “minimum should match” setting on the boolean query.
>
> Now, when querying, I want to
> - (1)  match the documents which either contain all the terms of the query
> (Occur.MUST for all terms would do that) or,
> - (2)  if all terms for a given field of a document are a subset of the
> query terms, that document should match as well.
>
> Any clue on how to accomplish this?
>
> Otmar
>


Re: Calculate the score of an arbitrary string vs a query?

2015-04-11 Thread Sujit Pal
Hi Ali,

I agree with the others that there is no good way to assign Lucene-like
scores to your external results. But if you have some objective measure of
goodness that doesn't depend on your Lucene scores, you can apply it to
both result sets and merge them that way.

One such measure could be the number of query words found in the title, or,
if you want to take the title length into consideration, the Jaccard
similarity between the query words and the title words.
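A sketch of the Jaccard variant, assuming you tokenize the query and the
title the same way (plain whitespace splitting here, just for illustration):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardScorer {
  // |intersection| / |union| of the two word sets, always in [0,1],
  // so scores are comparable across both result sets
  public static double jaccard(String query, String title) {
    Set<String> q = new HashSet<String>(Arrays.asList(query.toLowerCase().split("\\s+")));
    Set<String> t = new HashSet<String>(Arrays.asList(title.toLowerCase().split("\\s+")));
    Set<String> union = new HashSet<String>(q);
    union.addAll(t);
    Set<String> inter = new HashSet<String>(q);
    inter.retainAll(t);
    return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
  }
}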

I once solved a slightly different (but related) problem using a somewhat
different approach - mentioning it here in case it gives you some ideas. In
my previous job we would concept-map documents using our ontology, so
each document could be thought of as a (weighted) bag of concepts, and our
concept search involved querying this bag of concepts. The indexing process
was expensive, and we had just migrated to a new Java-based annotation
pipeline which assigned very different concept scores to documents, scores
that were intuitively more correct. However, whereas the old system
typically assigned concept scores in the 20,000 range, the new system
assigned scores to similar documents in the 100 range. We also had a set of
huge indexes crawled with the old pipeline that would take us
weeks/months to redo with the new pipeline, so we decided to merge
results from our old index and the newly crawled content (a much smaller
set) for a client. I calculated the z-score (across all concepts) for both
content sets and used that to rescale the concept scores of the old set to
the new set. Although the underlying math was a bit sketchy, the merged
results looked quite good.
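For reference, a sketch of that rescaling (hypothetical helper; it assumes
you have already computed the mean and standard deviation of the scores in
each content set):

public class ScoreRescaler {
  // z = (old - oldMean) / oldStd, then re-expressed in the new
  // distribution as newMean + z * newStd
  public static float rescale(float oldScore, float oldMean, float oldStd,
      float newMean, float newStd) {
    float z = (oldScore - oldMean) / oldStd;
    return newMean + z * newStd;
  }
}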

Hope this helps,

-sujit


On Fri, Apr 10, 2015 at 2:32 PM, Jack Krupansky jack.krupan...@gmail.com
wrote:

 There is doc for tf*idf scoring in the javadoc:

 http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

 The IndexSearcher#explain method returns an Explanation structure which
 details the scoring for a document:

 http://lucene.apache.org/core/5_0_0/core/org/apache/lucene/search/IndexSearcher.html#explain(org.apache.lucene.search.Query, int)

 -- Jack Krupansky

 On Fri, Apr 10, 2015 at 4:15 PM, Gregory Dearing gregdear...@gmail.com
 wrote:

  Hi Ali,
 
  The short answer to your question is... there's no good way to create a
  score from your result string, without using the Lucene index, that will
 be
  directly comparable to the Lucene score.  The reason is that the score
  isn't just a function of the query and the contents of the document.
 It's
  also (usually) a function of the contents of the entire corpus... or
 rather
  how common terms are across the entire corpus.
 
  That being said... the default scoring algorithm is based on tf/idf.  The
  implementation isn't in any one class... every query type (e.g. Term
 Query,
  Boolean Query, etc...) contains its own code for calculating scores.  So
  the complete scoring formula will depend on the type of queries you're
  using.  Many of those implementations also call into the Similarity API
  that you mentioned.
 
  If you'd like to see representative examples of scoring code, then take a
  look at TermWeight/TermScorer, and also BooleanWeight, which has several
  associated scorers.
 
  -Greg
 
 
  On Tue, Apr 7, 2015 at 1:32 AM, Ali Akhtar ali.rac...@gmail.com wrote:
 
   Hello,
  
   I'm in a situation where a search query string is being submitted
   simultaneously to Lucene, and to an external API.
  
   Results are fetched from both sources. I already have a score available
  for
   Lucene results, but I don't have a score for the results fetched from
 the
   external source.
  
   I'd like to calculate scores of results from the API, so that I can
 rank
   the results by the score, and show the top 5 results from both sources.
   (I.e the results would be merged.)
  
   Is there any Lucene API method, to which I can submit a search string
 and
   result string, and get a score back? If not, which class contains the
   source code for calculating the score, so that I can implement my own
   scoring class, using the same algorithm?
  
   I've looked at the Similarity class Javadocs, but it doesn't include
 any
   source code for calculating the score.
  
   Any help would be greatly appreciated. Thanks.
  
 



Re: Proximity query

2015-02-12 Thread Sujit Pal
I did something like this some time back. The objective was to find patterns
surrounding some keywords of interest so I could find keywords similar to
the ones I was looking for - sort of like a poor man's word2vec. It uses
SpanQuery as Jigar said, and you can find the code here (I believe it was
written against Lucene 3.x, so you may have to upgrade it if you are using
Lucene 4.x):

http://sujitpal.blogspot.com/2011/08/implementing-concordance-with-lucene.html
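For a rough idea of what the Spans API gives you (Lucene 3.x signatures, to
match the post; recovering the actual surrounding words from the positions
is up to you, e.g. via term vectors or re-analysis):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;

public class ConcordanceSketch {
  // walk all matches of "japan" and note their token positions
  public static void dumpSpans(IndexReader reader) throws IOException {
    SpanTermQuery q = new SpanTermQuery(new Term("body", "japan"));
    Spans spans = q.getSpans(reader);
    while (spans.next()) {
      int doc = spans.doc();     // matching document
      int start = spans.start(); // position of the first matched token
      int end = spans.end();     // one past the last matched token
      // positions start-2 .. end+1 cover two words on either side
    }
  }
}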

-sujit


On Thu, Feb 12, 2015 at 8:57 AM, Maisnam Ns maisnam...@gmail.com wrote:

 Hi Shah,

 Thanks for your reply. Will try to google SpanQuery meanwhile if you have
 some links can you please share

 Thanks

 On Thu, Feb 12, 2015 at 10:17 PM, Jigar Shah jigaronl...@gmail.com
 wrote:

  This concept is called Proximity Search in general.
 
  In Lucene they are achieved using SpanQuery.
 
  On Thu, Feb 12, 2015 at 10:10 PM, Maisnam Ns maisnam...@gmail.com
 wrote:
 
   Hi,
  
   Can someone help me if this use case is possible or not with lucene
  
   Use case: I have a string say 'Japan' appearing in 10 documents and I
  want
   to get back , say some results which contain two words before 'Japan'
 and
   two words after 'Japan' may be something like this ' Economy of Japan
 is
   growing' etc.
  
If it is not possible where should I look for such queries
  
   Thanks
  
 



Re: Case sensitivity

2014-09-19 Thread Sujit Pal
Hi John,

Take a look at the PerFieldAnalyzerWrapper. As the name suggests, it allows
you to create different analyzers per field.
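A minimal sketch, with hypothetical field names (Lucene 4.x packages;
WhitespaceAnalyzer stands in for whatever case-preserving chain you build):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class TwoCaseConfig {
  public static IndexWriterConfig build() {
    Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
    // "content_cs" keeps case (WhitespaceAnalyzer has no LowerCaseFilter);
    // the default StandardAnalyzer lowercases everything else, e.g. "content_ci"
    perField.put("content_cs", new WhitespaceAnalyzer(Version.LUCENE_47));
    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
        new StandardAnalyzer(Version.LUCENE_47), perField);
    return new IndexWriterConfig(Version.LUCENE_47, analyzer);
  }
}

On the storage concern: the two fields do duplicate postings, but if you
store the original text on only one of them (Store.NO on the other), you at
least avoid doubling the stored data.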

-sujit


On Fri, Sep 19, 2014 at 6:50 AM, John Cecere john.cec...@oracle.com wrote:

 I've considered this, but there are two problems with it. First of all, it
 feels like I'm still taking up twice the storage, I'm just doing it using a
 single index rather than two of them. This doesn't sound like it's buying
 me anything.

 The second problem with this is simply that I haven't figured out how to
 do this. I assume in creating two fields you would implement two separate
 analyzers on them, one using LowerCaseFilter and the other not. I haven't
 made the connection on how to tie an Analyzer to a particular field. It
 seems to be tied to the IndexWriterConfig and the IndexWriter.

 Thanks,
 John


 On 9/19/14 9:36 AM, Paul Libbrecht wrote:

 two fields?

 paul


 On 19 sept. 2014, at 15:07, John Cecere john.cec...@oracle.com wrote:

  Is there a way to set up Lucene so that both case-sensitive and
 case-insensitive searches can be done without having to generate two
 indexes?

 --
 John Cecere
 Principal Engineer - Oracle Corporation
 732-987-4317 / john.cec...@oracle.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org



 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 --
 John Cecere
 Principal Engineer - Oracle Corporation
 732-987-4317 / john.cec...@oracle.com

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Quickest way to collect one field from the searched docs....

2014-09-19 Thread Sujit Pal
Hi Shouvik, not sure if you have already considered this, but you could put
the database primary key for the record into the index - i.e., reverse your
insert to do the DB write first, get the record_id, and then add it to the
Lucene index as a record_id field. During retrieval you can minimize the
network traffic by setting the field list to only this record_id.
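A sketch of both halves (Lucene 4.x API; record_id is the hypothetical
field name):

import java.io.IOException;
import java.util.Collections;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;

public class RecordIdSketch {
  // indexing: store the DB primary key alongside the searchable fields
  public static void addRecordId(Document doc, String recordId) {
    doc.add(new StringField("record_id", recordId, Field.Store.YES));
  }

  // retrieval: load only record_id, keeping per-hit I/O to a minimum
  public static String getRecordId(IndexSearcher searcher, ScoreDoc hit)
      throws IOException {
    Document d = searcher.doc(hit.doc, Collections.singleton("record_id"));
    return d.get("record_id");
  }
}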

-sujit


On Thu, Sep 18, 2014 at 9:23 PM, Shouvik Bardhan sbard...@gisfederal.com
wrote:

 Pardon the length of the question. I have an index with 100 million docs
 (lucene not solr) and term queries (A*, A AND B* type queries) return
 pretty quickly (2 -4 secs) and I pick the lucene docIds up pretty quickly
 with a collector. This is good for us since we take the docIds and do
 further filtering based on another database we maintain whose record ids
 match with the stored lucene doc ids and we are able to do what we want. I
 know that depending on the lucene doc id value is not a good thing, since
 after delete/merge/optimize, the doc ids may change and if that was to
 happen, our other datastore will not line up with lucene doc index and
 chaos will ensue. Thus we do not optimize the index, etc.

 My question is what is the fastest way I can gather 1 field value from the
 docs which are found to match the query? Is there any way to do this as
 fast as (or at least not much slower) I am able to collect the lucene
 docids?  I want to get away from depending on the lucene docids not
 changing if possible.

 Thanks for any suggestions.



Re: How to handle words that stem to stop words

2014-07-10 Thread Sujit Pal
Hi Arjen,

This is kind of a spin on your last observation, that your list of stop
words doesn't change frequently. Build a custom filter that attempts to
stem the incoming token and, only if it stems to the same form as a
stopword, sets the keyword attribute on the original token.

That way your reindex frequency is tied to the stopword change frequency,
not to the frequency of discovering new words that stem to stopwords.
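A rough sketch of such a filter (hypothetical class, Lucene 4.x attributes;
it assumes your downstream stemmer is keyword-aware and reuses the same
Dutch snowball stemmer to test the stem):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.tartarus.snowball.ext.DutchStemmer;

public final class StemsToStopwordGuardFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
  private final CharArraySet stopWords;
  private final DutchStemmer stemmer = new DutchStemmer();

  public StemsToStopwordGuardFilter(TokenStream input, CharArraySet stopWords) {
    super(input);
    this.stopWords = stopWords;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    stemmer.setCurrent(termAtt.toString());
    stemmer.stem();
    // stem collides with a stop word ("vans" -> "van"): protect the
    // original token so a keyword-aware stemmer passes it through
    if (stopWords.contains(stemmer.getCurrent())) {
      keywordAtt.setKeyword(true);
    }
    return true;
  }
}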

-sujit



On Thu, Jul 10, 2014 at 11:57 AM, Arjen van der Meijden 
acmmail...@tweakers.net wrote:

 I'm reluctant to apply either solution:

 Emitting both tokens will likely still provide the user with a very long
 result list. Even though the results with 'vans' in them are likely to be
 ranked at the top, it's still not very user-friendly due to the
 overwhelmingly large number of results (nor is it very good for the
 performance of my application).
 In our specific case we also boost documents based on their age and
 popularity, so the extra results will probably interfere even if
 'vans' results are generally ranked higher.


 The approach with a list of specially treated terms is something we'll
 have to build and maintain by hand. Every time such a list is adjusted,
 it'll require a reindex of the database, which is not a huge problem but
 still not very practical.

 But I'm getting more and more convinced there isn't really a (reasonably
 easy) solution that would leave it dynamically changing without requiring
 database reindexes.
 Luckily the list of stop words shouldn't change that fast and we already
 have more than ten years worth of data, so it should be fairly easy to
 build a list of terms that are stemmed into stop words.

 Best regards,

 Arjen

 On 7-7-2014 23:06 Tri Cao wrote:

 I think emitting two tokens for vans is the right (potentially only)
 way to do it. You could
 also control the dictionary of terms that require this special treatment.

 Any reason you're not happy with this approach?

 On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden
 acmmail...@tweakers.net wrote:

  Hello list,

 We have a fairly large Lucene database for a 30+ million post forum.
 Users post and search for all kinds of things. To make sure users don't
 have to type exact matches, we combine a WordDelimiterFilter with a
 (Dutch) SnowballFilter.

 Unfortunately users sometimes find examples of words that get stemmed to
 a word that's basically a stop word. Or reversely, where a very common
 word is stemmed so that it becomes the same as a rare word.

 We do index stop words, so theoretically they could still find their
 result. But when a rare word is stemmed in such a way it yields a
 million hits, that makes it very unusable...

 One example is the Dutch word 'van' which is the equivalent of 'of' in
 English. A user tried to search for the shoe brand 'vans', which gets
 stemmed to 'van' and obviously gives useless results.

 I already noticed the 'KeywordRepeatFilter' to index/search both 'vans'
 and 'van' and the StemmerOverrideFilter to try and prevent these cases.
 Are there any other solutions for these kinds of problems?

 Best regards,

 Arjen van der Meijden

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org


 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: How to handle words that stem to stop words

2014-07-07 Thread Sujit Pal
Hi Arjen,

You could also mark a token as keyword so the stemmer passes it through
unchanged. For example, per the Javadocs for PorterStemFilter:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/en/PorterStemFilter.html

Note: This filter is aware of the KeywordAttribute. To prevent certain
terms from being passed to the stemmer, KeywordAttribute.isKeyword() should
be set to true in a previous TokenStream.
Note: For including the original term as well as the stemmed version, see
KeywordRepeatFilterFactory:
http://lucene.apache.org/core/4_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html

Assuming your stemmer is also keyword-attribute aware, you could build a
filter that reads a list of words (such as vans) that should be protected
from stemming, marks them with the KeywordAttribute before they reach the
Porter stemmer, and plug it into your analysis chain.
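Lucene actually ships a stock filter for the protected-words part; a minimal
chain might look like this (SetKeywordMarkerFilter in recent 4.x releases,
plain KeywordMarkerFilter in earlier ones):

import java.io.Reader;
import java.util.Arrays;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class ProtectedStemChain {
  public static TokenStream build(Reader reader) {
    CharArraySet protectedWords = new CharArraySet(
        Version.LUCENE_46, Arrays.asList("vans"), true);
    TokenStream ts = new StandardTokenizer(Version.LUCENE_46, reader);
    ts = new SetKeywordMarkerFilter(ts, protectedWords); // sets KeywordAttribute
    return new PorterStemFilter(ts); // keyword-aware: "vans" passes through unchanged
  }
}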

-sujit


On Mon, Jul 7, 2014 at 2:06 PM, Tri Cao tm...@me.com wrote:

 I think emitting two tokens for vans is the right (potentially only) way
 to do it. You could
 also control the dictionary of terms that require this special treatment.

 Any reason you're not happy with this approach?

 On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden 
 acmmail...@tweakers.net wrote:

 Hello list,

 We have a fairly large Lucene database for a 30+ million post forum.
 Users post and search for all kinds of things. To make sure users don't
 have to type exact matches, we combine a WordDelimiterFilter with a
 (Dutch) SnowballFilter.

 Unfortunately users sometimes find examples of words that get stemmed to
 a word that's basically a stop word. Or reversely, where a very common
 word is stemmed so that it becomes the same as a rare word.

 We do index stop words, so theoretically they could still find their
 result. But when a rare word is stemmed in such a way it yields a
 million hits, that makes it very unusable...

 One example is the Dutch word 'van' which is the equivalent of 'of' in
 English. A user tried to search for the shoe brand 'vans', which gets
 stemmed to 'van' and obviously gives useless results.

 I already noticed the 'KeywordRepeatFilter' to index/search both 'vans'
 and 'van' and the StemmerOverrideFilter to try and prevent these cases.
 Are there any other solutions for these kinds of problems?

 Best regards,

 Arjen van der Meijden

 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org




Re: Securing stored data using Lucene

2013-06-25 Thread SUJIT PAL
Hi Rafaela,

I built something along these lines as a proof of concept. All data in the
index was unstored; only the searchable (tokenized and indexed) fields were
kept in the index. The full record was encrypted and stored in a MongoDB
database. A custom Solr component did the search against the index, gathered
up the unique ids of the results, then pulled the encrypted data out of
MongoDB, decrypted it on the fly, and rendered the results.
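The indexing side of that is the easy part; a sketch with the 3.x-era field
API (uid being whatever key you use to look the record up in MongoDB):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SecureDocBuilder {
  public static Document build(String uid, String bodyText) {
    Document doc = new Document();
    // searchable but NOT stored - no plaintext blob to recover from the index
    doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.ANALYZED));
    // stored key used to fetch and decrypt the full record from MongoDB
    doc.add(new Field("uid", uid, Field.Store.YES, Field.Index.NOT_ANALYZED));
    return doc;
  }
}

Note that the indexed terms themselves still sit in the postings in clear,
which is the limitation Adrien describes below.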

You can find the (Scala) code here:
https://github.com/sujitpal/solr4-extras
(under the src/main/scala/com/mycompany/solr4extras/secure folder).

More information (more or less the same as what I wrote but probably a bit more 
readable with inlined code):
http://sujitpal.blogspot.com/2012/12/searching-encrypted-document-collection.html

There are some obvious data sync concerns with this sort of setup, but as 
Adrien points out, you can't index encrypted data.

HTH
Sujit

On Jun 25, 2013, at 4:17 AM, Adrien Grand wrote:

 On Tue, Jun 25, 2013 at 1:03 PM, Rafaela Voiculescu
 rafaela.voicule...@gmail.com wrote:
 Hello,
 
 Hi,
 
 I am sorry I was not a bit more explicit. I am trying to find an acceptable
 way to encrypt the data to prevent any access of it in any way unless the
 person who is trying to access it knows how to decrypt it. As I mentioned,
 I looked a bit through the patch, but I am not sure of its status.
 
 You can encrypt stored fields, but there is no way to do it correctly
 with fields that have positions indexed: attackers could infer the
 actual terms based on the order of terms (the encrypted version must
 sort the same way as the original terms), frequencies and positions.
 
 --
 Adrien
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Payload Matching Query

2013-06-21 Thread SUJIT PAL
Hi Michael,

Instead of putting the annotations in payloads, why not put them in as 
synonyms, i.e. at the same positions as the original tokens (see
SynonymFilter in the LIA book)? So your string would look like this to the
index, with each annotation token stacked at the position of the span it
labels:

W. A. Mozart was born in Salzburg
  __artist__             __city__

so you can query as s:"__artist__ __city__"~slop
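A rough sketch of the indexing side (hypothetical filter; for brevity it
maps single tokens to annotations, whereas multi-token spans like W. A.
Mozart need the full SynonymFilter machinery):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class AnnotationInjectFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final Map<String, String> annotations; // e.g. "salzburg" -> "__city__"
  private State pendingState;
  private String pendingAnnotation;

  public AnnotationInjectFilter(TokenStream input, Map<String, String> annotations) {
    super(input);
    this.annotations = annotations;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingAnnotation != null) {
      restoreState(pendingState);
      termAtt.setEmpty().append(pendingAnnotation);
      posAtt.setPositionIncrement(0); // stacked on the original token, like a synonym
      pendingAnnotation = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String annotation = annotations.get(termAtt.toString());
    if (annotation != null) {
      pendingState = captureState(); // emit the annotation on the next call
      pendingAnnotation = annotation;
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingState = null;
    pendingAnnotation = null;
  }
}

Once the annotation tokens are in the index at the right positions, the
slop query behaves like any other phrase query.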

-sujit

On Jun 20, 2013, at 9:27 AM, michal samek wrote:

 Hi Adrien,
 
 thanks for your reply. If payloads cannot be used for searching, is there
 any workaround how to achieve similar functionality?
 
 What I'd like to accomplish is to be able to search documents with contents
 for example
 W. A. Mozart[artist] was born in Salzburg[city]
 just by specifying the *payload*s [artist] [city].
 
 Thanks
 
 *Michal
 *
 
 
 2013/6/20 Adrien Grand jpou...@gmail.com
 
 Hi Michal,
 
 Although payloads can be used at query time to customize scoring, they
 can't be used for searching. Lucene only allows to search on terms.
 
 --
 Adrien
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread SUJIT PAL
Hi Uwe,

Thanks for the info, I was under the impression that it didn't... I got this 
info (that filters don't have a limit because they are not scoring) from a 
document like the one below. Can't say this is the exact doc, because it's
been a while since I saw it, though.

http://searchhub.org/2009/06/08/bringing-the-highlighter-back-to-wildcard-queries-in-solr-14/


As a response to this performance pitfall on very large indices’s (and the 
infamous TooManyClauses exception), new queries were developed that relied on a 
new Query class called ConstantScoreQuery. ConstantScoreQuerys accept a filter 
of matching documents and then score with a constant value equal to the boost. 
Depending on the qualities of your index, this method can be faster than the 
Boolean expansion method, and more importantly, does not suffer from 
TooManyClauses exceptions. Rather than matching and scoring n BooleanQuery 
clauses (potentially thousands of clauses), a single filter is enumerated and 
then traveled for scoring. On the other hand, constructing and scoring with a 
BooleanQuery containing a few clauses is likely to be much faster than 
constructing and traveling a Filter.


-sujit

On Apr 15, 2013, at 1:04 AM, Uwe Schindler wrote:

 The limit also applies for filters. If you have a list of terms ORed 
 together, the fastest way is not to use a BooleanQuery at all, but instead a 
 TermsFilter (which has no limits).
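 A minimal sketch of that (Lucene 4.x, TermsFilter from the queries module;
 the id field holds the external document ids):
 
 import java.util.ArrayList;
 import java.util.List;
 
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queries.TermsFilter;
 
 public class SubCollectionFilter {
   public static TermsFilter build(List<String> savedIds) {
     List<Term> terms = new ArrayList<Term>(savedIds.size());
     for (String id : savedIds) {
       terms.add(new Term("id", id));
     }
     return new TermsFilter(terms); // no BooleanQuery, so no clause limit
   }
 }
 
 Pass the result to e.g. IndexSearcher.search(query, filter, n).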
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
 -Original Message-
 From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
 Sent: Monday, April 15, 2013 9:53 AM
 To: java-user@lucene.apache.org
 Subject: Re: Statically store sub-collections for search (faceted search?)
 
 Am 12.04.2013 20:08, schrieb SUJIT PAL:
 Hi Carsten,
 
 Why not use your idea of the BooleanQuery but wrap it in a Filter instead?
 Since you are not doing any scoring (only filtering), the max boolean clauses
 limit should not apply to a filter.
 
 Hi Sujit,
 thanks for your suggestion! I wasn't aware that the max clause limit does not
 match for a BooleanQuery wrapped in a filter. I suppose the ideal way would
 be to use a BooleanFilter but not a QueryWrapperFilter, right?
 
 However, I am also not sure how to apply a filter in my use case because I
 perform a SpanQuery. Although SpanQuery#getSpans() does take a Bits
 object as an argument (acceptDocs), I haven't been able to figure out how to
 generate this Bits object correctly from a Filter object.
 
 Best,
 Carsten
 
 --
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation Next Generation Corpus
 Analysis Platform
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Statically store sub-collections for search (faceted search?)

2013-04-15 Thread SUJIT PAL
Hi Uwe,

I see, makes sense, thanks very much for the info. Sorry about giving you the
wrong info, Carsten.

-sujit

On Apr 15, 2013, at 1:06 PM, Uwe Schindler wrote:

 Hi,
 
 Original Message-
 From: Sujit Pal [mailto:sujitatgt...@gmail.com] On Behalf Of SUJIT PAL
 Sent: Monday, April 15, 2013 9:43 PM
 To: java-user@lucene.apache.org
 Subject: Re: Statically store sub-collections for search (faceted search?)
 
 Hi Uwe,
 
 Thanks for the info, I was under the impression that it didn't... I got this 
 info
 (that filters don't have a limit because they are not scoring) from a 
 document
 like the one below. Can't say this is the exact doc because its been a while
 since I saw that, though.
 
 http://searchhub.org/2009/06/08/bringing-the-highlighter-back-to-wildcard-
 queries-in-solr-14/
 
 
 As a response to this performance pitfall on very large indices’s (and the
 infamous TooManyClauses exception), new queries were developed that
 relied on a new Query class called ConstantScoreQuery.
 ConstantScoreQuerys accept a filter of matching documents and then score
 with a constant value equal to the boost. Depending on the qualities of your
 index, this method can be faster than the Boolean expansion method, and
 more importantly, does not suffer from TooManyClauses exceptions. Rather
 than matching and scoring n BooleanQuery clauses (potentially thousands of
 clauses), a single filter is enumerated and then traveled for scoring. On the
 other hand, constructing and scoring with a BooleanQuery containing a few
 clauses is likely to be much faster than constructing and traveling a Filter.
 
 
 This is true, but you misunderstood it: this is about MultiTermQueries (which 
 is the superclass of WildcardQuery, fuzzy, and range queries). Those queries 
 are not native Lucene queries, so they rewrite to basic/native queries. In 
 earlier Lucene versions, wildcards were always rewritten to BooleanQueries 
 with many TermQueries (one for each term that matches the wildcard), leading 
 to the problem with too many terms. This is still the case, but only within 
 limits (this mode is only used if the wildcard expands to few terms). Those 
 BooleanQueries are then used with ConstantScoreQuery(Query).
 The above text talks about another mode (which is used for many terms today): 
 *no* BooleanQuery is built at all; instead, all matching terms' documents are 
 marked in a BitSet, and this BitSet is used with a Filter to construct a 
 different Query type: ConstantScoreQuery(Filter). The BooleanQuery max clause 
 count does not apply, because no BooleanQuery is involved in the whole 
 process. If you use ConstantScoreQuery(BooleanQuery), the limit still 
 applies, but not for ConstantScoreQuery(internalWildcardFilter).
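 To pin down which mode you get, you can also select the rewrite explicitly
 (3.x/4.x API; the default CONSTANT_SCORE_AUTO_REWRITE_DEFAULT switches
 between the two modes based on term count):
 
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.MultiTermQuery;
 import org.apache.lucene.search.WildcardQuery;
 
 public class RewriteModeExample {
   public static WildcardQuery filterBacked(String field, String pattern) {
     WildcardQuery wq = new WildcardQuery(new Term(field, pattern));
     // BitSet-backed rewrite: no BooleanQuery, so no max clause count
     wq.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);
     return wq;
   }
 }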
 
 Uwe
 
 On Apr 15, 2013, at 1:04 AM, Uwe Schindler wrote:
 
 The limit also applies for filters. If you have a list of terms ORed 
 together,
 the fastest way is not to use a BooleanQuery at all, but instead a 
 TermsFilter
 (which has no limits).
 
 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de
 
 
 -Original Message-
 From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
 Sent: Monday, April 15, 2013 9:53 AM
 To: java-user@lucene.apache.org
 Subject: Re: Statically store sub-collections for search (faceted
 search?)
 
 Am 12.04.2013 20:08, schrieb SUJIT PAL:
 Hi Carsten,
 
 Why not use your idea of the BooleanQuery but wrap it in a Filter
 instead?
 Since you are not doing any scoring (only filtering), the max boolean
 clauses limit should not apply to a filter.
 
 Hi Sujit,
 thanks for your suggestion! I wasn't aware that the max clause limit
 does not match for a BooleanQuery wrapped in a filter. I suppose the
 ideal way would be to use a BooleanFilter but not a QueryWrapperFilter,
 right?
 
 However, I am also not sure how to apply a filter in my use case
 because I perform a SpanQuery. Although SpanQuery#getSpans() does
 take a Bits object as an argument (acceptDocs), I haven't been able
 to figure out how to generate this Bits object correctly from a Filter
 object.
 
 Best,
 Carsten
 
 --
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation Next Generation
 Corpus
 Analysis Platform
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Statically store sub-collections for search (faceted search?)

2013-04-12 Thread SUJIT PAL
Hi Carsten,

Why not use your idea of the BooleanQuery but wrap it in a Filter instead? 
Since you are not doing any scoring (only filtering), the max boolean clauses 
limit should not apply to a filter.

-sujit

On Apr 12, 2013, at 7:34 AM, Carsten Schnober wrote:

 Dear list,
 I would like to create a sub-set of the documents in an index that is to
 be used for further searches. However, the criteria that lead to the
 creation of that sub-set are not predefined, so I think that faceted
 search cannot be applied to this use case.
 
 For instance:
 A user searches for documents that contain token 'A' in a field 'text'.
 These results form a set of documents that is persistently stored (in a
 database). Each document in the index has a field 'id' that identifies
 it, so these external IDs are stored in the database.
 
 Later on, a user loads the document IDs from the database and wants to
 execute another search on this set of documents only. However,
 performing a search on the full index and subsequently filtering the
 results against that list of documents takes very long if there are many
 matches. This is obvious as I have to retrieve the external id from each
 matching document and check whether it is part of the desired sub-set.
 Constructing a BooleanQuery in the style id:Doc1 OR id:Doc2 ... is not
 suitable either because there could be thousands of documents exceeding
 any limit for Boolean clauses.
 
 Any suggestions how to solve this? I would have gone for the Lucene
 document numbers and store them as a bit set that I could use as a
 filter during later searches, but I read that the document numbers are
 ephemeral.
 
 One possible way out seems to be to create another index from the
 documents that have matched the initial search, but this seems quite an
 overkill, especially if there are plenty of them...
 
 Thanks for any hint!
 Carsten
 
 -- 
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation
 Next Generation Corpus Analysis Platform
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive analyzer

2013-03-22 Thread SUJIT PAL
Hi Jerome,

How about this one?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
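The underlying filter is ISOLatin1AccentFilter, or its more complete
successor ASCIIFoldingFilter; a sketch of a custom analyzer around it
(3.x-era API):

import java.io.Reader;

import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class AccentInsensitiveAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = new StandardTokenizer(Version.LUCENE_36, reader);
    ts = new LowerCaseFilter(Version.LUCENE_36, ts);
    return new ASCIIFoldingFilter(ts); // é -> e, ü -> u, ñ -> n, ...
  }
}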

Regards,
Sujit

On Mar 22, 2013, at 9:22 AM, Jerome Blouin wrote:

 Hello,
 
 I'm looking for an analyzer that allows performing accent insensitive search 
 in latin languages. I'm currently using the StandardAnalyzer but it doesn't 
 fulfill this need. Could you please point me to the one I need to use? I've 
 checked the javadoc for the various analyzer packages but can't find one. Do 
 I need to implement my own analyzer?
 
 Regards,
 Jerome
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: What is flexible indexing in Lucene 4.0 if it's not the ability to make new postings codecs?

2012-12-13 Thread SUJIT PAL
Hi Glen,

I don't believe you can attach a single payload to multiple tokens. What I did 
for a similar requirement was to combine the tokens into a single '_'-delimited 
token and attach the payload to it. For example:

The Big Bad Wolf huffed and puffed and blew the house of the Three Little Pigs 
down.

Now assume Big Bad Wolf and Three Little Pigs are spans to which I would 
like to attach payloads. I run the tokens through a custom tokenizer that 
produces:

The Big_Bad_Wolf$payload1 huffed and puffed and blew the house of the 
Three_Little_Pigs$payload2 down.

In my case this makes sense, ie I can treat the span as a single unit. Not sure 
about your use case.
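The analysis side can lean on stock filters; a sketch assuming the spans are
pre-joined with '_' and the payload appended after a '$' (Lucene 4.0
packages; IdentityEncoder keeps the payload bytes as-is):

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.IdentityEncoder;
import org.apache.lucene.util.Version;

public class SpanPayloadChain {
  public static TokenStream build(Reader reader) {
    // "Big_Bad_Wolf$payload1" -> token "Big_Bad_Wolf" carrying the payload bytes
    TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_40, reader);
    return new DelimitedPayloadTokenFilter(ts, '$', new IdentityEncoder());
  }
}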

HTH
Sujit

On Dec 13, 2012, at 2:08 PM, Glen Newton wrote:

 Cool! Sounds great!  :-)
 
 Any pointers to a (Lucene) example that attaches a payload to a
 start..end span that is more than one token?
 
 thanks,
 -Glen
 
 On Thu, Dec 13, 2012 at 5:03 PM, Lance Norskog goks...@gmail.com wrote:
 I should not have added that note. The Opennlp patch gives a concrete
 example of adding an annotation to text.
 
 
 On 12/13/2012 01:54 PM, Glen Newton wrote:
 
 It is not clear this is exactly what is needed/being discussed.
 
 From the issue:
 We are also planning a Tokenizer/TokenFilter that can put parts of
 speech as either payloads (PartOfSpeechAttribute?) on a token or at
 the same position.
 
 This adds it to a token, not a span. 'same position' does not suggest
 it also records the end position.
 
 -Glen
 
 On Thu, Dec 13, 2012 at 4:45 PM, Lance Norskog goks...@gmail.com wrote:
 
 Parts-of-speech is available now, in the indexer.
 
 LUCENE-2899 adds OpenNLP to the LuceneSolr codebase. It does
 parts-of-speech, chunking and Named Entity Recognition. OpenNLP is an
 Apache
 project for natural-language processing.
 
 Some parts are in Solr that could be in Lucene.
 
 https://issues.apache.org/jira/browse/lucene-2899
 
 
 On 12/12/2012 12:02 PM, Wu, Stephen T., Ph.D. wrote:
 
 Is there any (preliminary) code checked in somewhere that I can look
 at,
 that would help me understand the practical issues that would need to
 be
 addressed?
 
 Maybe we can make this more concrete: what new attribute are you
 needing to record in the postings and access at search time?
 
 For example:
   - part of speech of a token.
   - syntactic parse subtree (over a span).
   - semantically normalized phrase (to canonical text or ontological
 code).
   - semantic group (of a span).
   - coreference link.
 
 stephen
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 
 
 
 
 -- 
 -
 http://zzzoot.blogspot.com/
 -
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Scoring a document using LDA topics

2011-11-29 Thread Sujit Pal
Hi Stephen,

We precompute a variant of P(z,d) during indexing, and do the first 3
steps. The resulting documents are ordered by payload score, which is
basically z in our case. We don't currently care about P(t,z) but it
seems like a good thing to have for disambiguation purposes.

So anyway, I have never done what you are looking to do, but I guess the
approach you have outlined is the one you would use, although there may be
performance issues when you have a large number of topic matches.

An alternative - since you need to know P(t,z) (the probability of a query
term being in a particular topic), and each PayloadTermQuery in the
BooleanQuery corresponds to a z (topic) - perhaps you could boost each
clause by P(t,z)?
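A sketch of that last idea (hypothetical field name topics_p, Lucene 3.x
payload query API; the P(t,z) map comes from your LDA model, outside
Lucene):

import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

public class TopicQueryBuilder {
  // one payload clause per topic z in the query, boosted by P(t,z);
  // payloads supply P(z,d), so the combined score tracks Sim(q,d)
  public static BooleanQuery build(Map<String, Float> topicWeights) {
    BooleanQuery q = new BooleanQuery();
    for (Map.Entry<String, Float> e : topicWeights.entrySet()) {
      PayloadTermQuery ptq = new PayloadTermQuery(
          new Term("topics_p", e.getKey()), new AveragePayloadFunction(), false);
      ptq.setBoost(e.getValue()); // P(t,z)
      q.add(ptq, Occur.SHOULD);
    }
    return q;
  }
}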

-sujit

On Tue, 2011-11-29 at 10:50 -0500, Stephen Thomas wrote:
 Sujit,
 
 Thanks for your reply, and the link to your blog post, which was
 helpful and got me thinking about Payloads.
 
 I still have one more question. I need to be able to compute the
 Sim(query q, doc d) similarity function, which is defined below:
 
 Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
 
 So, I'm guessing that the only what to do this is to do the following:
 
 - At index time, store the (flattened) topics as a payload for each
 documen, as you suggest in your blog
 
 - At query time, find out which topics are in the query
 - Construct a BooleanQuery, consisting of one PayloadTermQuery per
 topic in the query
 - Search on the BooleanQuery. This essentially tells me which
 documents have the topics in the query
 - Iterate over the TopDocs returns by the search. For each doc, get
 the full payload, unflatten it, and use it to compute Sim(query q, doc
 d).
 - Reorder the results based on the Sim(query q, doc d) results.
 
 Is this the best way? I can't see a way to compute the Sim() metric at
 any other time, because in scorePayload(), we don't have access to the
 full payload, nor to the query.
 
 Thanks again,
 Steve
 
 
 On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal sujit@comcast.net wrote:
  Hi Stephen,
 
  We are doing something similar, and we store as a multifield with each
  document as (d,z) pairs where we store the z's (scores) as payloads for
  each d (topic). We have had to build a custom similarity which
  implements the scorePayload function. So to find docs for a given d
  (topic), we do a simple PayloadTermQuery and the docs come back in
  descending order of z. Simple boolean term queries also work. We turn
  off norms (in the ctor for the PayloadTermQuery) to get scores that are
  identical to the d values.
 
  I wrote about this sometime back...maybe this would help you.
  http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html
 
  -sujit
 
  On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
  List,
 
  I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
  model into Lucene. Briefly, the LDA model extracts topics
  (distribution over words) from a set of documents, and then represents
  each document with topic vectors. For example, documents could be
  represented as:
 
  d1 = (0,  0.5, 0, 0.5)
 
  d2 = (1, 0, 0, 0)
 
  This means that document d1 contains topics 2 and 4, and document d2
  contains topic 1. I.e.,
 
  P(z1, d1) = 0
  P(z2, d1) = 0.5
  P(z3, d1) = 0
  P(z4, d1) = 0.5
  P(z1, d2) = 1
  P(z2, d2) = 0
  ...
 
  Also, topics are represented by the probability that a term appears in
  that topic, so we also have a set of vectors:
 
  z1 = (0, 0, .02, ...)
 
  meaning that topic z1 does not contain terms 1 or 2, but does contain
  term 3. I.e.,
 
  P(t1, z1) = 0
  P(t2, z1) = 0
  P(t3, z1) = .02
  ...
 
  Then, the similarity between a query and a document is computed as:
 
  Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
 
  Basically, for each term in the query, and each topic in existence,
  see how relevant that term is in that topic, and how relevant that
  topic is in the document.
 
 
  I've been thinking about how to do this in Lucene. Assume I already
  have the topics and the topic vectors for each document. I know that I
  need to write my own Similarity class that extends DefaultSimilarity.
  I need to override tf(), queryNorm(), coord(), and computeNorm() to
  all return a constant 1, so that they have no effect. Then, I can
  override idf() to compute the Sim equation above. Seems simple enough.
  However, I have a few practical issues:
 
 
  - Storing the topic vectors for each document. Can I store this in the
  index somehow? If so, how do I retrieve it later in my
  CustomSimilarity class?
 
  - Changing the Boolean model. Instead of only computing the similarity
  on a documents that contain any of the terms in the query (the default
  behavior), I need to compute the similarity on all of the documents.
  (This is the whole  idea behind LDA: you don't need an exact term
  match for there to be a similarity.) I understand that this will
  result in a performance hit

Re: Scoring a document using LDA topics

2011-11-28 Thread Sujit Pal
Hi Stephen,

We are doing something similar: we store a multivalued field in each
document as (d,z) pairs, where the z's (scores) are stored as payloads on
each d (topic). We have had to build a custom similarity which
implements the scorePayload function. So to find docs for a given d
(topic), we do a simple PayloadTermQuery and the docs come back in
descending order of z. Simple boolean term queries also work. We turn
off norms (in the ctor for the PayloadTermQuery) to get scores that are
identical to the d values.

I wrote about this sometime back...maybe this would help you.
http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html 

-sujit

On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
 List,
 
 I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
 model into Lucene. Briefly, the LDA model extracts topics
 (distribution over words) from a set of documents, and then represents
 each document with topic vectors. For example, documents could be
 represented as:
 
 d1 = (0,  0.5, 0, 0.5)
 
 d2 = (1, 0, 0, 0)
 
 This means that document d1 contains topics 2 and 4, and document d2
 contains topic 1. I.e.,
 
 P(z1, d1) = 0
 P(z2, d1) = 0.5
 P(z3, d1) = 0
 P(z4, d1) = 0.5
 P(z1, d2) = 1
 P(z2, d2) = 0
 ...
 
 Also, topics are represented by the probability that a term appears in
 that topic, so we also have a set of vectors:
 
 z1 = (0, 0, .02, ...)
 
 meaning that topic z1 does not contain terms 1 or 2, but does contain
 term 3. I.e.,
 
 P(t1, z1) = 0
 P(t2, z1) = 0
 P(t3, z1) = .02
 ...
 
 Then, the similarity between a query and a document is computed as:
 
 Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
 
 Basically, for each term in the query, and each topic in existence,
 see how relevant that term is in that topic, and how relevant that
 topic is in the document.
 
 
 I've been thinking about how to do this in Lucene. Assume I already
 have the topics and the topic vectors for each document. I know that I
 need to write my own Similarity class that extends DefaultSimilarity.
 I need to override tf(), queryNorm(), coord(), and computeNorm() to
 all return a constant 1, so that they have no effect. Then, I can
 override idf() to compute the Sim equation above. Seems simple enough.
 However, I have a few practical issues:
 
 
 - Storing the topic vectors for each document. Can I store this in the
 index somehow? If so, how do I retrieve it later in my
 CustomSimilarity class?
 
 - Changing the Boolean model. Instead of only computing the similarity
 on a documents that contain any of the terms in the query (the default
 behavior), I need to compute the similarity on all of the documents.
 (This is the whole  idea behind LDA: you don't need an exact term
 match for there to be a similarity.) I understand that this will
 result in a performance hit, but I do not see a way around it.
 
 - Turning off fieldNorm(). How can I set the field norm for each doc
 to a constant 1?
 
 
 Any help is greatly appreciated.
 
 Steve
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Bet you didn't know Lucene can...

2011-10-22 Thread Sujit Pal
Hi Grant,

Not sure if this qualifies as a bet you didn't know, but one could use
Lucene term vectors to construct document vectors for similarity,
clustering and classification tasks. I found this out recently (although
I am probably not the first one), and I think this could be quite
useful.
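The raw material is easy to get at (Lucene 3.x API; the field must have been
indexed with TermVector.YES):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class DocVectorSketch {
  // term -> raw frequency for one document; feed into cosine similarity,
  // clustering, or a classifier (weight by idf as needed)
  public static Map<String, Integer> termVector(IndexReader reader, int docId)
      throws IOException {
    Map<String, Integer> vec = new HashMap<String, Integer>();
    TermFreqVector tfv = reader.getTermFreqVector(docId, "body");
    if (tfv != null) {
      String[] terms = tfv.getTerms();
      int[] freqs = tfv.getTermFrequencies();
      for (int i = 0; i < terms.length; i++) {
        vec.put(terms[i], freqs[i]);
      }
    }
    return vec;
  }
}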

-sujit

On Sat, 2011-10-22 at 11:11 +0200, Grant Ingersoll wrote:
 Hi All,
 
 I'm giving a talk at ApacheCon titled Bet you didn't know Lucene can... 
 (http://na11.apachecon.com/talks/18396).  It's based on my observation, that 
 over the years, a number of us in the community have done some pretty cool 
 things using Lucene that don't fit under the core premise of full text 
 search.  I've got a fair number of ideas for the talk (easily enough for 1 
 hour), but I wanted to reach out to hear your stories of ways you've (ab)used 
 Lucene and Solr to see if we couldn't extend the conversation to a bit more 
 than the conference and also see if I can't inject more ideas beyond the ones 
 I have.  I don't need deep technical details, but just high level use case 
 and the basic insight that led you to believe Lucene could solve the problem.
 
 Thanks in advance,
 Grant
 
 
 Grant Ingersoll
 http://www.lucidimagination.com
 
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How do you see if a tokenstream has tokens without consuming the tokens ?

2011-10-17 Thread Sujit Pal
Hi Paul,

Since you have modified the StandardAnalyzer (I presume you mean
StandardFilter), why not do a check on the term text, and if it's all
punctuation, skip the analysis for that term? Something like this in
your StandardFilter:

public final boolean incrementToken() throws IOException {
  if (!input.incrementToken()) {
    return false;
  }
  CharTermAttribute ta = getAttribute(CharTermAttribute.class);
  if (isAllPunctuation(ta.buffer(), ta.length())) {
    return true; // pass the token through untouched
  } else {
    ... normal processing here
  }
}

If the filters are made keyword attribute aware (I have a bug open on
this, LUCENE-3236, although I only asked for the Lowercase and Stop filters
there), then it's even simpler: you can plug in your own filter that
marks the term as a KeywordAttribute so downstream filters pass it
through.

-sujit

On Mon, 2011-10-17 at 13:12 +0100, Paul Taylor wrote:
 We have a modified version of a Lucene StandardAnalyzer; we use it for 
 tokenizing music metadata such as artist names and song titles, so 
 typically only a few words. Tokenizing usually strips out punctuation, 
 which is correct; however, if the input text consists of only punctuation 
 characters then we end up with nothing. For these particular RARE cases I 
 want to use a mapping filter.
 
 So what I try to do is have my analyzer tokenize as normal, then, if the 
 result is no tokens, retokenize with the mapping filter. I check that it 
 has no tokens using incrementToken(), but then can't see how I would 
 decrementToken(). How can I do this, or is there a more efficient way of 
 doing it? Note that of maybe 10,000,000 records only a few 100 records 
 will have this problem, so I need a solution which doesn't impact 
 performance unreasonably.
 
 NormalizeCharMap specialcharConvertMap = new NormalizeCharMap();
 specialcharConvertMap.add("!", "Exclamation");
 specialcharConvertMap.add("?", "QuestionMark");
 ...
 
 public TokenStream tokenStream(String fieldName, Reader reader) {
     CharFilter specialCharFilter =
         new MappingCharFilter(specialcharConvertMap, reader);
     StandardTokenizer tokenStream =
         new StandardTokenizer(LuceneVersion.LUCENE_VERSION, reader);
     try {
         if (tokenStream.incrementToken() == false) {
             tokenStream = new StandardTokenizer(
                 LuceneVersion.LUCENE_VERSION, specialCharFilter);
         } else {
             // TODO: set the tokenstream back as it was before incrementToken()
         }
     } catch (IOException ioe) {
     }
     TokenStream result =
         new LowerCaseFilter(LuceneVersion.LUCENE_VERSION, tokenStream);
     return result;
 }
 
 thanks for any help
 
 
 Paul
 
 -
 To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-user-h...@lucene.apache.org
 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is there any Query in Lucene can search the term, which is similar as SQL-LIKE?

2011-10-17 Thread Sujit Pal
Hi Mead,

You may want to check out the permuterm index idea.
http://www-nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html 

Basically you write a custom filter that takes a term and generates all
rotations of it (with an end-of-term marker appended). On the query side, you
convert your query so it is always a prefix query, by rotating the characters
so the * is always at the end, and match against the permuterm-indexed field.

I have a simple (and currently incomplete) working implementation: it works
with queries such as *keyword, keyword*, key*rd and *keyword*, but supports
only a single * and no ?, unlike WildcardQuery. But because it is always a
prefix query internally, it does not have the performance penalty of a
leading * in WildcardQuery. Maybe it will give you some ideas...
http://sujitpal.blogspot.com/2011/10/lucene-wildcard-query-and-permuterm.html 
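The rotation trick itself is tiny (hypothetical helper; single * only,
matching the limitation above):

import java.util.ArrayList;
import java.util.List;

public class Permuterm {
  // index side: all rotations of term + '$',
  // e.g. "cat" -> "cat$", "at$c", "t$ca", "$cat"
  public static List<String> rotations(String term) {
    String t = term + "$";
    List<String> out = new ArrayList<String>();
    for (int i = 0; i < t.length(); i++) {
      out.add(t.substring(i) + t.substring(0, i));
    }
    return out;
  }

  // query side: rotate so the '*' trails, then strip it for a prefix query,
  // e.g. "c*t" -> prefix "t$c", "*keyword" -> prefix "keyword$"
  public static String toPrefix(String wildcard) {
    int star = wildcard.indexOf('*');
    return wildcard.substring(star + 1) + "$" + wildcard.substring(0, star);
  }
}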

-sujit

On Thu, 2011-10-13 at 10:10 +0800, Mead Lai wrote:
 Thank you very much.
 With your help, I finally used WildcardQuery to find the right result:
 BooleanQuery resultQuery = new BooleanQuery();
 resultQuery.add(new WildcardQuery(new Term("content", "*keyword*")), Occur.SHOULD);
 TopDocs topDocs = searcher.search(resultQuery, 1000);
 
 But there is also a problem puzzling me: the result can only get 1000 items,
 which is not enough.
 I want to have all the items which match that condition (*keyword*).
 
 Or, may I put a date condition in the query,
 e.g.: select * from table where start_date = 2011-10-12
 
 
 Regards,
 Mead
 
 
 On Tue, Oct 11, 2011 at 11:39 PM, Chris Lu chris...@gmail.com wrote:
 
  You need to analyze the search keyword with the same analyzer that's
  applied
  on the content field.
 
  --
  Chris Lu
  -
  Instant Scalable Full-Text Search On Any Database/Application
  site: http://www.dbsight.net
  demo: http://search.dbsight.com
  Lucene Database Search in 3 minutes:
 
  http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
 
  On Tue, Oct 11, 2011 at 12:11 AM, Mead Lai laiqi...@gmail.com wrote:
 
   Hello all,
   *Background:
   *There are *ONE MILLION* data in a table, and this table has 100 columns
   inside.
   The application need to search the data in EVERY column with one
  'keyword'.
   so, I try it in a clumsy way, using a database view, then search the
  view.
   Just like the following SQL:
   *=Step1*: create a view.
  
   CREATE OR REPLACE VIEW V_MY_VIEW(id,title,content)
   as
   SELECT
  
  
  mv.l_instanceid,mv.c_param1,mv.c_param2||';'||mv.c_param3||';'||mv.c_param4||';'||mv.c_param5||';'||mv.c_param6||';'||mv.c_param7||';'||mv.c_param8||';'||mv.c_param9||';'||mv.c_param10||';'||mv.c_param11||';'||mv.c_param12||';'||mv.c_param13||';'||mv.c_param14||';'||mv.c_param15||';'||mv.c_param16||';'||mv.c_param17||';'||mv.c_param18||';'||mv.c_param19||';'||mv.c_param20||';'||mv.c_param21||';'||mv.c_param22||';'||mv.c_param23||';'||mv.c_param24||';'||mv.c_param25||';'||mv.c_param26||';'||mv.c_param27||';'||mv.c_param28||';'||mv.c_param29||';'||mv.c_param30||';'||mv.c_param31||';'||mv.c_param32||';'||mv.c_param33||';'||mv.c_param34||';'||mv.c_param35||';'||mv.c_param36||';'||mv.c_param37||';'||mv.c_param38||';'||mv.c_param39||';'||mv.c_param40||';'||mv.c_param41||';'||mv.c_param42||';'||mv.c_param43||';'||mv.c_param44||';'||mv.c_param45||';'||mv.c_param46||';'||mv.c_param47||';'||mv.c_param48||';'||mv.c_param49||';'||mv.c_param50||';'||mv.c_param51||';'||mv.c_param52||';'||mv.c_param53||';'||mv.c_param54||';'||mv.c_param55||';'||mv.c_param56||';'||mv.c_param57||';'||mv.c_param58||';'||mv.c_param59||';'||mv.c_param60||';'||mv.c_param61||';'||mv.c_param62||';'||mv.c_param63||';'||mv.c_param64||';'||mv.c_param65||';'||mv.c_param66||';'||mv.c_param67||';'||mv.c_param68||';'||mv.c_param69||';'||mv.c_param70||';'||mv.c_param71||';'||mv.c_param72||';'||mv.c_param73||';'||mv.c_param74||';'||mv.c_param75||';'||mv.c_param76||';'||mv.c_param77||';'||mv.c_param78||';'||mv.c_param79||';'||mv.c_param80||';'||mv.c_param81||';'||mv.c_param82||';'||mv.c_param83||';'||mv.c_param84||';'||mv.c_param85||';'||mv.c_param86||';'||mv.c_param87||';'||mv.c_param88||';'||mv.c_param89||';'||mv.c_param90||';'||mv.c_param91||';'||mv.c_param92||';'||mv.c_param93||';'||mv.c_param94||';'||mv.c_param95||';'||mv.c_param96||';'||mv.c_param97||';'||mv.c_param98||';'||mv.c_param99||';'||mv.c_param100||';'
   FROM MyTable mv
  
   *=Step2*: search the view with LIKE '%keyword%'
  
   SELECT *
   FROM V_MY_VIEW wcv
   WHERE wcv.content LIKE '%keyword%'
  
   Finally, it works nicely, but inefficiently: it costs almost 5~7 seconds,
   because ONE MILLION rows are too huge.
  
   Lucene way:
   So, I use Lucene to store these ONE MILLION rows, with code like:
   document.add(new Field("content", content, Store.YES, Index.ANALYZED));
   // the variable content is the string joined from the 100 columns
   The problem is that if some keyword is not a whole word or term, the
   search will return nothing.
   Usually, the keyword would be a person's name 

Payload Query and Document Boosts

2011-10-12 Thread Sujit Pal
Hi, 

Question about Payload Query and Document Boosts. We are using Lucene
3.2 and Payload queries, with our own PayloadSimilarity class which
overrides the scorePayload method like so:

{code}
  @Override
  public float scorePayload(int docId, String fieldName,
  int start, int end, byte[] payload, int offset, int length) {
if (payload != null) {
  return PayloadHelper.decodeFloat(payload, offset);
} else {
  return 1.0F;
}
  }
{/code}

We are injecting payloads as ID$SCORE pairs using the
DelimitedPayloadTokenFilter, and life was good - when we run a
PayloadTermQuery the scores come back as our score. I have included
code below that illustrates the calling pattern; it's this:

{code}
PayloadTermQuery q = new PayloadTermQuery(new Term("imuids_p",
"2790926"), new AveragePayloadFunction(), false);
{/code}

ie, do not include the span score (the SCORE is calculated as a result
of offline processing and we can't change that value).

Now we would like to boost each document differently (at index time, via
document.setBoost(boost), based on its content type), and we are running
into problems. It looks like the document boost is not applied to the
document score during search if includeSpanScore == false. When we set it
to true, we see a difference in scores (the original score without
document boosts is multiplied by the document boost set), but the
original scores without boost are not the same as SCORE, i.e. they are now
affected by the span score.

My question is - is there some method in DefaultSimilarity that I can
override so that my score is my original SCORE * document boost? The
Similarity documentation does not provide any clues to my problem - I
tried modifying the computeNorm() method to return state.getBoost() but
it looks like its never called.

If not, the other option would be to bake the doc boost into the
SCORE value, by multiplying them on their way into Lucene, so that
SCORE *= doc boost.

Here is my unit test which illustrates the issue:

{code}
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilter;
import org.apache.lucene.analysis.payloads.FloatEncoder;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.lucene.store.RAMDirectory;
import org.junit.Test;

import com.healthline.query.kb.ConceptAnalyzer;
import com.healthline.solr.HlSolrConstants;
import com.healthline.solr.search.PayloadSimilarity;
import com.healthline.util.Config;

public class DocBoostTest {

  private class PayloadAnalyzer extends Analyzer {
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
  TokenStream tokens = new
WhitespaceTokenizer(HlSolrConstants.CURRENT_VERSION, reader);
  tokens = new DelimitedPayloadTokenFilter(tokens, '$', new
FloatEncoder());
  return tokens;
}
  };

  private Analyzer getAnalyzer() {
    Map<String,Analyzer> pfas = new HashMap<String,Analyzer>();
    pfas.put("imuids_p", new PayloadAnalyzer());
    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
      new ConceptAnalyzer(), pfas);
    return analyzer;
  }
  
  private IndexSearcher loadTestData(boolean setBoosts) throws Exception
{
RAMDirectory ramdir = new RAMDirectory();
IndexWriterConfig iwconf = new IndexWriterConfig(
  HlSolrConstants.CURRENT_VERSION, getAnalyzer());
iwconf.setOpenMode(OpenMode.CREATE);
IndexWriter writer = new IndexWriter(ramdir, iwconf);
Document doc1 = new Document();
doc1.add(new Field("itemtitle",
    "Cancer and the Nervous System PARANEOPLASTIC DISORDERS",
    Store.YES, Index.ANALYZED));
doc1.add(new Field("imuids_p", "2790917$52.01 2790926$53.18",
    Store.YES, Index.ANALYZED));
doc1.add(new Field("contenttype", "BK", Store.YES, Index.NOT_ANALYZED));
if (setBoosts) doc1.setBoost(1.2F);
writer.addDocument(doc1);
Document doc2 = new Document();
doc2.add(new Field("itemtitle", "Esophagogastric cancer: Targeted agents",
    Store.YES, Index.ANALYZED));
doc2.add(new Field("imuids_p", "2790926$52.18 2790981$5.19",
    Store.YES, Index.ANALYZED));
doc2.add(new Field("contenttype", "JL", Store.YES, Index.NOT_ANALYZED));

Re: How can i index a Java Bean into Lucene application ?

2011-08-07 Thread Sujit Pal
Depending on what you wanted to do with the Javabean (I assume you want
to make some or all its fields searchable since you are writing to
Lucene), you could use reflection to break it up into field name value
pairs and write them out to the IndexWriter using something like this:

Document d = new Document();
d.add(new Field("fieldname1", fieldvalue1, Store.YES, Index.ANALYZED));
...
writer.addDocument(d);
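
For instance, a minimal reflection-based sketch (this assumes
String-valued getters, and leaves out exception handling and
non-String property types):

import java.beans.Introspector;
import java.beans.PropertyDescriptor;

Document d = new Document();
for (PropertyDescriptor pd : Introspector.getBeanInfo(
    bean.getClass(), Object.class).getPropertyDescriptors()) {
  if (pd.getReadMethod() == null) continue; // skip write-only properties
  Object value = pd.getReadMethod().invoke(bean);
  if (value != null) {
    // one searchable field per readable bean property
    d.add(new Field(pd.getName(), value.toString(),
        Store.YES, Index.ANALYZED));
  }
}
writer.addDocument(d);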

-sujit

On Sat, 2011-08-06 at 18:28 +0530, KARTHIK SHIVAKUMAR wrote:
 Hi
 
 How can I index a Java Bean into a Lucene application, instead of a
 file?
 
 API: IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
   new StandardAnalyzer(Version.LUCENE_CURRENT), true,
   IndexWriter.MaxFieldLength.LIMITED);
 
 Is there any alternative to this?
 
 ex:
 
 package com.web.beans.searchdata;

 public class SearchIndexHtmlData {

   public String CONTENT = "NA";
   public String DATEOFCREATION = "NA";
   public String DATEOFINDEXCREATION = "NA";

   public String getCONTENT() {
     return CONTENT;
   }
   public void setCONTENT(String cONTENT) {
     CONTENT = cONTENT;
   }
   public String getDATEOFCREATION() {
     return DATEOFCREATION;
   }
   public void setDATEOFCREATION(String dATEOFCREATION) {
     DATEOFCREATION = dATEOFCREATION;
   }
   public String getDATEOFINDEXCREATION() {
     return DATEOFINDEXCREATION;
   }
   public void setDATEOFINDEXCREATION(String dATEOFINDEXCREATION) {
     DATEOFINDEXCREATION = dATEOFINDEXCREATION;
   }
 }
 
 





Re: Suggestion: make some more TokenFilters KeywordAttribute aware

2011-06-23 Thread Sujit Pal
Thanks Simon, I have opened a JIRA issue and attached a patch. I have
verified that I haven't broken anything, and I have used the patched
files in my local application and verified that they work.

https://issues.apache.org/jira/browse/LUCENE-3236 

-sujit

On Thu, 2011-06-23 at 08:21 +0200, Simon Willnauer wrote:
 On Wed, Jun 22, 2011 at 8:53 PM, Sujit Pal s...@healthline.com wrote:
  Hello,
 
  I am currently in need of a LowerCaseFilter and StopFilter that will
  recognize KeywordAttribute, similar to the way PorterStemFilter
  currently does (on trunk). Specifically, when a term is flagged as a
  keyword (KeywordAttribute.isKeyword() returns true), they should not
  lowercase or remove it, respectively.
 
  This can be achieved without breaking backward compatibility by
  introducing an extra constructor which takes a boolean ignoreKeyword
  parameter.
 
  If this sounds like a good idea, please let me know and I can open a
  JIRA issue and attach a patch. Currently, I have created my own
  KeywordAwareXXX versions of these filters that do pretty much the
  same thing.
 
 I think you should open an issue and take it from there. I can't
 promise this is going to be added, but it's worth a try!
 
 please go ahead and open an issue.
 
 simon
 
  Thanks
  Sujit
 
 
 



Suggestion: make some more TokenFilters KeywordAttribute aware

2011-06-22 Thread Sujit Pal
Hello,

I am currently in need of a LowerCaseFilter and StopFilter that will
recognize KeywordAttribute, similar to the way PorterStemFilter
currently does (on trunk). Specifically, when a term is flagged as a
keyword (KeywordAttribute.isKeyword() returns true), they should not
lowercase or remove it, respectively.

This can be achieved without breaking backward compatibility by
introducing an extra constructor which takes a boolean ignoreKeyword
parameter.

If this sounds like a good idea, please let me know and I can open a
JIRA issue and attach a patch. Currently, I have created my own
KeywordAwareXXX versions of these filters that do pretty much the same
thing.
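
For reference, the lowercase variant is roughly the following (a
simplified sketch; unlike the real LowerCaseFilter it does not handle
supplementary code points):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class KeywordAwareLowerCaseFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

  public KeywordAwareLowerCaseFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (!keywordAtt.isKeyword()) { // keywords pass through untouched
      char[] buffer = termAtt.buffer();
      int length = termAtt.length();
      for (int i = 0; i < length; i++) {
        buffer[i] = Character.toLowerCase(buffer[i]);
      }
    }
    return true;
  }
}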

Thanks
Sujit






Re: Passage retrieval with Lucene-based application

2011-05-25 Thread Sujit Pal
Hi Leroy,

Would it make sense to index the unit to be searched as Lucene
documents? So if you want paragraphs shown in search results, you could
parse the source document into paragraphs during indexing and index
each paragraph as a separate Lucene document, for example:
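
(a sketch only; the blank-line paragraph split, the field names, and
the sourceText/sourceDocId/writer variables are placeholders)

String[] paragraphs = sourceText.split("\n\\s*\n");
for (int i = 0; i < paragraphs.length; i++) {
  Document doc = new Document();
  // pointer back to the source document, plus the paragraph position
  doc.add(new Field("srcdoc", sourceDocId, Field.Store.YES,
      Field.Index.NOT_ANALYZED));
  doc.add(new Field("paranum", String.valueOf(i), Field.Store.YES,
      Field.Index.NOT_ANALYZED));
  doc.add(new Field("text", paragraphs[i], Field.Store.YES,
      Field.Index.ANALYZED));
  writer.addDocument(doc);
}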

-sujit

On Wed, 2011-05-25 at 15:46 -0400, Leroy Stone wrote:
 Hello!
 I purchased Lucene in Action, 2nd Ed., and posted the
 question below at the Manning Forum. Mike McCandless suggested that I
 send it to you.
 
 Thanks in advance for your attention.
 
  the question I posted ___
 I would like the search program to return segments of a document
 (paragraphs) that contain my search phrase, rather than simply
 pointers to the whole document. In searching among applications based
 upon Lucene, I have found only one that seems to have this
 functionality. It is at http://www.crosswire.org/bibledesktop/ . Can
 someone point me to some other Lucene-based applications where the
 search engine returns text segments from within documents?
 Thanks in advance.
 
 
 N.B. I know Lucene can be modified to do what I wish.  My problem is 
 that my professional obligations do not allow the time for me to 
 build the entire application that I need.  Thus I am searching for 
 one that exists already, that I can adapt quickly, and which has all 
 the code with which I must surround Lucene to make a full-blown 
 application.
 
 The Bible application I cite requires preprocessing of the documents 
 into SWORD format.  I will try that route if that is all that is 
 available. I thought I would look around (with your help) before 
 trying to take on the SWORD-format issue.
 
 
 Thanks.
 



Re: FastVectorHighlighter - can FieldFragList expose fragInfo?

2011-05-24 Thread Sujit Pal
Thank you Koji. I opened LUCENE-3141 for this.
https://issues.apache.org/jira/browse/LUCENE-3141 

-sujit

On Tue, 2011-05-24 at 22:33 +0900, Koji Sekiguchi wrote:
 (11/05/24 3:28), Sujit Pal wrote:
  Hello,
  
  My version: Lucene 3.1.0
  
  I've had to customize the snippet for highlighting based on our
  application requirements. Specifically, instead of the snippet being a
  set of relevant fragments in the text, I need it to be the first
  sentence where a match occurs, with a fixed size from the beginning of
  the sentence.
  
  For this, I built (in my application code, using Lucene jars) a custom
  FragmentsBuilder, subclassing SimpleFragmentsBuilder and overriding the
  createFragment(IndexReader reader, int docId, String fieldName,
  FieldFragList fieldFragList) method.
  
  However, the FieldFragList does not allow access to the
  List<WeightedFragInfo> member variable. I changed this locally to be
  public so my subclass can access it, i.e.:

  public List<WeightedFragInfo> fragInfos =
      new ArrayList<WeightedFragInfo>();
  
  Once this is done, my createFragment method can get at the fragInfos
  from the passed-in fieldFragList and iterate through its
  WeightedFragInfo.SubInfo.Toffs to get the term offsets, which I then
  use to calculate and highlight my snippet (I can provide the code if
  it makes things clearer, but that's the gist).
  
  So my question is - would it be feasible to make the
  FieldFragList.fragInfos variable public in a future release?
 
 No. Please open a jira ticket and attach a patch, if possible.
 I'll take a look.
 
 koji





FastVectorHighlighter - can FieldFragList expose fragInfo?

2011-05-23 Thread Sujit Pal
Hello,

My version: Lucene 3.1.0

I've had to customize the snippet for highlighting based on our
application requirements. Specifically, instead of the snippet being a
set of relevant fragments in the text, I need it to be the first
sentence where a match occurs, with a fixed size from the beginning of
the sentence.

For this, I built (in my application code, using Lucene jars) a custom
FragmentsBuilder, subclassing SimpleFragmentsBuilder and overriding the
createFragment(IndexReader reader, int docId, String fieldName,
FieldFragList fieldFragList) method.

However, the FieldFragList does not allow access to the
List<WeightedFragInfo> member variable. I changed this locally to be
public so my subclass can access it, i.e.:

public List<WeightedFragInfo> fragInfos =
    new ArrayList<WeightedFragInfo>();

Once this is done, my createFragment method can get at the fragInfos
from the passed-in fieldFragList and iterate through its
WeightedFragInfo.SubInfo.Toffs to get the term offsets, which I then use
to calculate and highlight my snippet (I can provide the code if it
makes things clearer, but that's the gist).
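
For illustration, the offset-to-snippet step, stripped of the
highlighter plumbing, looks roughly like this (naive punctuation-based
sentence detection; the helper name is mine):

// back up from the offset of the first matching term to the start of
// the containing sentence, then take a fixed-size window from there
static String firstMatchSnippet(String text, int matchOffset, int snippetSize) {
  int start = 0;
  for (int i = matchOffset - 1; i > 0; i--) {
    char c = text.charAt(i);
    if (c == '.' || c == '!' || c == '?') {
      start = i + 1;
      break;
    }
  }
  while (start < text.length()
      && Character.isWhitespace(text.charAt(start))) {
    start++;
  }
  int end = Math.min(text.length(), start + snippetSize);
  return text.substring(start, end);
}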

So my question is - would it be feasible to make the
FieldFragList.fragInfos variable public in a future release?

If not, is there some other way that I should do what I need to do?

Thanks very much,
Sujit






Re: Reg: Query behavior

2011-04-26 Thread Sujit Pal
Hi Deepak,

Would something like this work in your case?

"Arcos Bioscience"^2.0 Arcos Bioscience

i.e., a BooleanQuery with the boosted full phrase OR'd with a query on
each word?
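
Programmatically, something like this (Lucene 3.x API; the field name
and pre-lowercased terms are assumptions):

BooleanQuery query = new BooleanQuery();

// the exact phrase, boosted so adjacent matches rank first
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("content", "arcos"));
phrase.add(new Term("content", "bioscience"));
phrase.setBoost(2.0f);
query.add(phrase, BooleanClause.Occur.SHOULD);

// the individual terms, so partial matches still show up, just lower
query.add(new TermQuery(new Term("content", "arcos")),
    BooleanClause.Occur.SHOULD);
query.add(new TermQuery(new Term("content", "bioscience")),
    BooleanClause.Occur.SHOULD);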

-sujit

On Tue, 2011-04-26 at 14:46 -0400, Deepak Konidena wrote:
 Hi,
 
 Currently when I type in "Arcos Bioscience" in my lucene search, it
 returns all those documents with either "Arcos" or "Bioscience" at the
 top of the search results, and the actual document containing "Arcos
 Bioscience" somewhere in the middle/bottom.
 
 The desired behavior is to rank the documents that contain the terms
 "Arcos" and "Bioscience" next to each other higher than those that
 contain either of the terms, or contain both terms but far away from
 each other.
 
 When I search the same term with quotes ("Arcos Bioscience"), it gives
 the exact document that contains the term and nothing else.
 
 In general, how would I modify the system in such a way that the
 documents containing the exact term are shown first, and the documents
 matching either term are shown later (without just showing one result)?
 
 Thanks
 Deepak Konidena.
 
 





Re: Searching partial names using Lucene

2011-03-24 Thread Sujit Pal
I don't know if there is already an analyzer available for this, but you
could use GATE or UIMA for Named Entity Extraction against names and
expand the query to include the extra names that are used synonymously.
You could do this outside Lucene or inline using a custom Lucene
tokenizer that embeds either a GATE or UIMA NER.
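
If you just need the expansion part, a crude map-driven sketch at query
time would be something like this (the nickname table, field name, and
queryText variable are assumptions; GATE's ANNIE data files would give
you a much fuller list):

Map<String,String> nicknames = new HashMap<String,String>();
nicknames.put("dan", "daniel");
nicknames.put("will", "william");
nicknames.put("bob", "robert");

BooleanQuery expanded = new BooleanQuery();
for (String token : queryText.toLowerCase().split("\\s+")) {
  expanded.add(new TermQuery(new Term("name", token)),
      BooleanClause.Occur.SHOULD);
  String fullName = nicknames.get(token);
  if (fullName != null) {
    // also match the full form of the nickname
    expanded.add(new TermQuery(new Term("name", fullName)),
        BooleanClause.Occur.SHOULD);
  }
}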

If you go the custom route (and you are not familiar with GATE or UIMA),
you may want to take a look at Dr Manu Konchady's book on Lingpipe,
Lucene and GATE - there is code in there to embed a GATE NER into a
Lucene tokenizer (although it's not a streaming tokenizer due to the
nature of the NER process). The process would be similar for embedding a
UIMA NER.

GATE (ANNIE) contains data files that list the common synonyms (e.g. Bill
== William, Bob == Robert, Tom == Thomas, etc.), which you can leverage
with GATE's Jape rule language. Alternatively, you could use the same
data from UIMA using a custom analysis engine (I prefer this route
because this is all Java, easier learning curve and maintainability).

-sujit

On Thu, 2011-03-24 at 14:31 -0400, Deepak Konidena wrote:
 Hi,
 
 I would like to build a search system where a search for "Dan" would
 also match "Daniel", and a search for "Will", "William". Any ideas on
 how to go about implementing that? I can think of writing a custom
 Analyzer that would map these partial tokens to their full first names
 or last names. But is there an Analyzer in the lucene contrib modules
 or elsewhere that does a similar job for me?
 
 Thanks,
 Deepak Konidena.





Re: How to define different similarity scores per field ?

2011-03-01 Thread Sujit Pal
One way to do this currently is to build a per field similarity wrapper
(that triggers off the field name). I believe there is some work going
on with Lucene Similarity that would make it pluggable for this sort of
stuff, but in the meantime, this is what I did:

public class MyPerFieldSimilarityWrapper extends Similarity {

  private final Similarity defaultSimilarity;
  private final Map<String,Similarity> fieldSimilarityMap;

  public MyPerFieldSimilarityWrapper() {
    this.defaultSimilarity = new DefaultSimilarity();
    this.fieldSimilarityMap = new HashMap<String,Similarity>();
    this.fieldSimilarityMap.put("fieldA", new FieldASimilarity());
    ...
  }

  @Override
  public float lengthNorm(String fieldName, int numTokens) {
    Similarity sim = fieldSimilarityMap.get(fieldName);
    if (sim == null) {
      return defaultSimilarity.lengthNorm(fieldName, numTokens);
    } else {
      return sim.lengthNorm(fieldName, numTokens);
    }
  }
  // same for scorePayload. For the others, I just delegate
  // to defaultSimilarity (all I really need is scorePayload in
  // my case).
}
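
The scorePayload override that the comment alludes to follows the same
map-driven pattern; roughly this (signature as I recall it from the
3.x Similarity API):

@Override
public float scorePayload(int docId, String fieldName, int start, int end,
    byte[] payload, int offset, int length) {
  Similarity sim = fieldSimilarityMap.get(fieldName);
  if (sim == null) {
    return defaultSimilarity.scorePayload(docId, fieldName, start, end,
        payload, offset, length);
  }
  return sim.scorePayload(docId, fieldName, start, end, payload, offset,
      length);
}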

and in the schema.xml, I just set this class to be the similarity class:

  <similarity class="com.mycompany.MyPerFieldSimilarityWrapper"/>

hth
-sujit

On Tue, 2011-03-01 at 20:41 +0100, Patrick Diviacco wrote:
 I need to define different similarity scores per document field.
 
 For example for field A I want to use Lucene tf.idf score, for the numerical
 field B I want to use a different metric (difference between values) and so
 on...
 
 thanks





Re: How to define different similarity scores per field ?

2011-03-01 Thread Sujit Pal
Yes, for the other methods (except scorePayload), I just delegate to
the corresponding method in DefaultSimilarity. The reason is that I
don't have a way to trigger off the field name for those others. For
me, I really only need to distinguish between DefaultSimilarity and
PayloadSimilarity (which needs to be triggered for certain fields in my
index), so I overrode the scorePayload method also in the same
map-driven way.

On Tue, 2011-03-01 at 23:28 +0100, Patrick Diviacco wrote:
 I see, but I don't get one thing... you are actually customizing only
 the lengthNorm method, but not all the other methods that calculate
 the similarity scores...
 
 
 Those methods are called, and they have the implementation from the
 DefaultSimilarity class... right?
 
 
 
 
 On 1 March 2011 21:12, Sujit Pal sujit@comcast.net wrote:
  One way to do this currently is to build a per field similarity
  wrapper (that triggers off the field name). I believe there is some
  work going on with Lucene Similarity that would make it pluggable for
  this sort of stuff, but in the meantime, this is what I did:
 
  public class MyPerFieldSimilarityWrapper extends Similarity {
 
    private final Similarity defaultSimilarity;
    private final Map<String,Similarity> fieldSimilarityMap;
 
    public MyPerFieldSimilarityWrapper() {
      this.defaultSimilarity = new DefaultSimilarity();
      this.fieldSimilarityMap = new HashMap<String,Similarity>();
      this.fieldSimilarityMap.put("fieldA", new FieldASimilarity());
      ...
    }
 
    @Override
    public float lengthNorm(String fieldName, int numTokens) {
      Similarity sim = fieldSimilarityMap.get(fieldName);
      if (sim == null) {
        return defaultSimilarity.lengthNorm(fieldName, numTokens);
      } else {
        return sim.lengthNorm(fieldName, numTokens);
      }
    }
    // same for scorePayload. For the others, I just delegate
    // to defaultSimilarity (all I really need is scorePayload in
    // my case).
  }
 
  and in the schema.xml, I just set this class to be the similarity
  class:
 
    <similarity class="com.mycompany.MyPerFieldSimilarityWrapper"/>
 
  hth
  -sujit
 
  On Tue, 2011-03-01 at 20:41 +0100, Patrick Diviacco wrote:
   I need to define different similarity scores per document field.
 
   For example for field A I want to use Lucene tf.idf score, for the
   numerical field B I want to use a different metric (difference
   between values) and so on...
 
   thanks
 
 
 