Re: PorterStemFilter causes wildcard searches to not work

2011-11-29 Thread Ian Lea
This is very hard to follow.  I for one don't recall what you
described or what you are looking for.

Have you worked through
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F?


--
Ian.

On Tue, Nov 29, 2011 at 7:25 AM, SBS  wrote:
> I am applying the PorterStemFilter at both indexing and search time.
>
> As for schema, I have 3 fields: title, subtitle and notes.  When the user
> enters a query string of "a*itis", my software turns this into an actual
> Lucene query of "title: a*itis OR subtitle: a*itis OR notes: a*itis" and I
> get the results I described.  However, if I run an actual query of just
> "a*itis" I then get the results I am looking for.  I guess in this case
> it's using the default field that I specify in creating the QueryParser
> which is "notes" but if I change the actual query to "notes: a*itis" I
> still get the undesirable results.
>
> Any idea why I am seeing this behaviour?  Again, if I remove the
> PorterStemFilter from my custom analyzer (which I use at both indexing and
> search time) then I get the results I want (albeit I lose the other
> functionality which I also need).
>




Re: PorterStemFilter causes wildcard searches to not work

2011-11-29 Thread SBS
> This is very hard to follow.  I for one don't recall what you 
> described or what you are looking for. 

Sorry about that; I am using the web interface, where the context of my post
is visible to all.

To sum up, my original post was:

> It seems that when I use a PorterStemFilter in my custom analyser,
> wildcard searches malfunction.
> 
> As an example, I have the words "appendicitis" and "sensitisation"
> in our content.  When I enter a query of "a*itis" I would expect
> to have "appendicitis" match but instead I get "sensitisation" and
> not "appendicitis".  If I remove the PorterStemFilter then things
> behave as I would have expected and desired.
> 
> Why is this happening?  Is there a way to apply a PorterStemFilter
> and still be able to use wildcards?
> 
> I am using Lucene 3.2. 

And now I am adding:

> I am applying the PorterStemFilter at both indexing and search time. 
> 
> As for schema, I have 3 fields: title, subtitle and notes.  When the user 
> enters a query string of "a*itis" (without the double quotes of course),
> my software turns this into an actual 
> Lucene query of "title: a*itis OR subtitle: a*itis OR notes: a*itis" and I 
> get the results I described.  However, if I run an actual query of just 
> "a*itis" I then get the results I am looking for.  I guess in this case 
> it's using the default field that I specify in creating the QueryParser 
> which is "notes" but if I change the actual query to "notes: a*itis" I 
> still get the undesirable results. 
> 
> Any idea why I am seeing this behaviour?  Again, if I remove the 
> PorterStemFilter from my custom analyzer (which I use at both indexing and 
> search time) then I get the results I want (albeit I lose the other 
> functionality which I also need). 

I hope that's a bit clearer.  Any ideas to explain and/or resolve this?

Thanks,

-sbs





Re: PorterStemFilter causes wildcard searches to not work

2011-11-29 Thread Ian Lea
A Google search for "lucene stemming wildcards" finds some hits
implying that the two don't work well together.

http://lucene.472066.n3.nabble.com/Conflicts-with-Stemming-and-Wildcard-Prefix-Queries-td540479.html
may be a solution.
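
A common workaround (a sketch only, with made-up field names; not taken
from that thread) is to index an extra unstemmed copy of the text and
point wildcard queries at it, e.g. with a PerFieldAnalyzerWrapper:

  // Lucene 3.x: stemmed analysis for the normal fields, raw tokens for
  // "notes_exact". MyStemmingAnalyzer stands in for your custom analyzer.
  PerFieldAnalyzerWrapper analyzer =
      new PerFieldAnalyzerWrapper(new MyStemmingAnalyzer());
  analyzer.addAnalyzer("notes_exact", new StandardAnalyzer(Version.LUCENE_32));
  // Index each document with both "notes" and "notes_exact", then
  // rewrite user wildcards to notes_exact:a*itis at query time.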


--
Ian.





Error while re-indexing - cannot overwrite 0.fdt

2011-11-29 Thread Rohan A Ambasta

Hi,

I get the error "Cannot overwrite 0.fdt" when I start indexing.

Detailed test case:

1) Performing indexing for the first time works fine.
2) Then I do a search and get the search results.
3) After the search, if I start indexing again I get the error "Cannot
overwrite 0.fdt".


Has anybody faced such an error before? How can I resolve it?

Thanks,
Rohan Ambasta






Re: Error while re-indexing - cannot overwrite 0.fdt

2011-11-29 Thread Ian Lea
Close the first index writer?

http://lmgtfy.com/?q=lucene+Cannot+overwrite+%22_0.fdt%22+file

If you can't find the answer and need to post again, include, as a
minimum, details of the OS and Lucene version that you are using.
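
In other words, something like this (a sketch, assuming Lucene 3.1+; the
usual cause of "Cannot overwrite _0.fdt" is an IndexWriter left open from
the first indexing pass):

  IndexWriter writer = new IndexWriter(dir,
      new IndexWriterConfig(Version.LUCENE_32, analyzer));
  try {
      // ... addDocument() calls ...
  } finally {
      // releases the write lock so a later (re)indexing run can start cleanly
      writer.close();
  }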


--
Ian.




Quoted search on Analyzed fields

2011-11-29 Thread Mihai Caraman
field = new Field("author", author.toLowerCase(), Field.Store.NO,
    Field.Index.NOT_ANALYZED);
field.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
field.setOmitNorms(true);

When, in the above configuration, I switched from NOT_ANALYZED to ANALYZED,
Luke's results for author:"john doe" stopped showing (after rebuilding the
index). Why?

Also, how can Luke, on the same index, show me results for author:"john
doe" (using NOT_ANALYZED), while the IndexSearcher I debug, which receives
a query parsed with the same StandardAnalyzer as at indexing time (and
seems to be correct), returns no results?!

Thank you,
Mihai C.


Re: Quoted search on Analyzed fields

2011-11-29 Thread Robert Muir
If you use StandardAnalyzer it will break "john doe" into 2 tokens and
form a phrase query. If you want to do phrase queries, don't set the
index options to DOCS_ONLY; otherwise they won't work.

If what you want is for "john doe" to be a single term without
positions, then use KeywordAnalyzer; DOCS_ONLY is fine in that case
because you don't need any positions.
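
For example (a minimal sketch, assuming Lucene 3.x):

  // KeywordAnalyzer keeps the whole value as one token, so the parser
  // produces a single TermQuery for the quoted input instead of a
  // phrase query that would need positions.
  QueryParser parser = new QueryParser(Version.LUCENE_32, "author",
      new KeywordAnalyzer());
  Query q = parser.parse("\"john doe\""); // -> TermQuery author:john doe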




-- 
lucidimagination.com




Re: Scoring a document using LDA topics

2011-11-29 Thread Stephen Thomas
Sujit,

Thanks for your reply, and the link to your blog post, which was
helpful and got me thinking about Payloads.

I still have one more question. I need to be able to compute the
Sim(query q, doc d) similarity function, which is defined below:

Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)

So, I'm guessing that the only way to do this is the following:

- At index time, store the (flattened) topics as a payload for each
document, as you suggest in your blog.

- At query time, find out which topics are in the query.
- Construct a BooleanQuery consisting of one PayloadTermQuery per
topic in the query.
- Search on the BooleanQuery. This essentially tells me which
documents have the topics in the query.
- Iterate over the TopDocs returned by the search. For each doc, get
the full payload, unflatten it, and use it to compute Sim(query q, doc
d).
- Reorder the results based on the Sim(query q, doc d) values.

Is this the best way? I can't see a way to compute the Sim() metric at
any other time, because in scorePayload(), we don't have access to the
full payload, nor to the query.

Thanks again,
Steve


On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal  wrote:
> Hi Stephen,
>
> We are doing something similar, and we store as a multifield with each
> document as (d,z) pairs where we store the z's (scores) as payloads for
> each d (topic). We have had to build a custom similarity which
> implements the scorePayload function. So to find docs for a given d
> (topic), we do a simple PayloadTermQuery and the docs come back in
> descending order of z. Simple boolean term queries also work. We turn
> off norms (in the ctor for the PayloadTermQuery) to get scores that are
> identical to the d values.
>
> I wrote about this sometime back...maybe this would help you.
> http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html
>
> -sujit
>
> On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
>> List,
>>
>> I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
>> model into Lucene. Briefly, the LDA model extracts topics
>> (distribution over words) from a set of documents, and then represents
>> each document with topic vectors. For example, documents could be
>> represented as:
>>
>> d1 = (0,  0.5, 0, 0.5)
>>
>> d2 = (1, 0, 0, 0)
>>
>> This means that document d1 contains topics 2 and 4, and document d2
>> contains topic 1. I.e.,
>>
>> P(z1, d1) = 0
>> P(z2, d1) = 0.5
>> P(z3, d1) = 0
>> P(z4, d1) = 0.5
>> P(z1, d2) = 1
>> P(z2, d2) = 0
>> ...
>>
>> Also, topics are represented by the probability that a term appears in
>> that topic, so we also have a set of vectors:
>>
>> z1 = (0, 0, .02, ...)
>>
>> meaning that topic z1 does not contain terms 1 or 2, but does contain
>> term 3. I.e.,
>>
>> P(t1, z1) = 0
>> P(t2, z1) = 0
>> P(t3, z1) = .02
>> ...
>>
>> Then, the similarity between a query and a document is computed as:
>>
>> Sim (query q, doc d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
>>
>> Basically, for each term in the query, and each topic in existence,
>> see how relevant that term is in that topic, and how relevant that
>> topic is in the document.
>>
>>
>> I've been thinking about how to do this in Lucene. Assume I already
>> have the topics and the topic vectors for each document. I know that I
>> need to write my own Similarity class that extends DefaultSimilarity.
>> I need to override tf(), queryNorm(), coord(), and computeNorm() to
>> all return a constant 1, so that they have no effect. Then, I can
>> override idf() to compute the Sim equation above. Seems simple enough.
>> However, I have a few practical issues:
>>
>>
>> - Storing the topic vectors for each document. Can I store this in the
>> index somehow? If so, how do I retrieve it later in my
>> CustomSimilarity class?
>>
>> - Changing the Boolean model. Instead of only computing the similarity
>> on documents that contain any of the terms in the query (the default
>> behavior), I need to compute the similarity on all of the documents.
>> (This is the whole idea behind LDA: you don't need an exact term
>> match for there to be a similarity.) I understand that this will
>> result in a performance hit, but I do not see a way around it.
>>
>> - Turning off fieldNorm(). How can I set the field norm for each doc
>> to a constant 1?
>>
>>
>> Any help is greatly appreciated.
>>
>> Steve



Re: Quoted search on Analyzed fields

2011-11-29 Thread Mihai Caraman
Still no difference; it may be because of some other hidden bug. Anyway,
adding freqs and positions will be a no-no because of space :) so
bye bye quotes.

Thank you


Re: Quoted search on Analyzed fields

2011-11-29 Thread Robert Muir
Again, there is nothing wrong with the quotes: it's instead how you are
configuring the analysis for this field.

If you put stuff in quotes and your analyzer breaks it into multiple
tokens, then queryparser forms a phrase query. You must index
positions to support phrase queries.

Normally DOCS_ONLY is only used for fields that contain a *single
term*, like a numeric field. If you want to exclude positions for a
field but at the same time allow tokenized queries against it like you
are doing, then you need to adjust your queryparsing to do the right
thing if someone enters quoted text like "john doe", such as forming a
boolean query (john AND doe) instead.

The way to do this is to subclass the QueryParser and do something like:

  @Override
  protected Query getFieldQuery(String field, String queryText, boolean quoted)
      throws ParseException {
    if (quoted && field.equals("myfieldwithoutpositions")) {
      // my special logic to form boolean queries or something else
    } else {
      return super.getFieldQuery(field, queryText, quoted);
    }
  }
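
One possible body for the quoted branch (just an illustration, assuming
whitespace-separated lowercase terms in that field; not part of the
original suggestion):

  // AND the words of the quoted text together instead of forming a
  // phrase query, since the field has no positions.
  BooleanQuery bq = new BooleanQuery();
  for (String word : queryText.toLowerCase().split("\\s+")) {
      bq.add(new TermQuery(new Term(field, word)), BooleanClause.Occur.MUST);
  }
  return bq;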





-- 
lucidimagination.com




Custom Filter for Splitting CamelCase?

2011-11-29 Thread Stephen Thomas
List,

I have written my own CustomAnalyzer, as follows:

public TokenStream tokenStream(String fieldName, Reader reader) {

// TODO: add calls to RemovePunctuation and SplitIdentifiers here

// First, convert to lower case
TokenStream out = new  LowerCaseTokenizer(reader);

if (this.doStopping){
out = new StopFilter(true, out, customStopSet);
}

if (this.doStemming){
out = new PorterStemFilter(out);
}

return out;
  }



What I need to do is write two custom filters that do the following:

- RemovePunctuation() removes all punctuation, keeping only alphanumeric
characters and preserving case. E.g.,

"foo=bar*45;" ==> "foo bar 45"
"fooBar" ==> "fooBar"
"\"stho...@cs.queensu.ca\"" ==> "sthomas cs queensu ca"


- SplitIdentifiers() breaks up words based on camelCase notation:

"fooBar" ==> "foo Bar"
"ABCCompany" ==> "ABC Company"

(I have the regex for this.)

Note this step must be performed before LowerCaseTokenizer, because we
need case information to do the splitting.


How can I write custom filters, and how do I call them before
LowerCaseTokenizer()?
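
In other words, I imagine a chain roughly like this, if it is possible
(filter names are hypothetical; the stream must preserve case until the
very end):

  TokenStream out = new WhitespaceTokenizer(reader);
  out = new RemovePunctuationFilter(out);  // hypothetical custom filter
  out = new SplitIdentifiersFilter(out);   // hypothetical custom filter
  out = new LowerCaseFilter(out);          // lowercase only after splitting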


Thanks in advance,
Steve




RE: Custom Filter for Splitting CamelCase?

2011-11-29 Thread Uwe Schindler
Hi,

There is a WordDelimiterFilter in Solr that was also ported to the Lucene
analysis module in Lucene trunk (4.0). In 3.x you can still add solr.jar
to your classpath and use WordDelimiterFilterFactory to produce one
(WordDelimiterFilter itself is package-private).
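
If you would rather write the filters yourself, a TokenFilter skeleton in
the 3.1+ attribute API looks roughly like this (only a sketch; the
splitting logic itself is yours, and a filter that emits extra tokens
needs captureState()/restoreState() across incrementToken() calls):

  public final class SplitIdentifiersFilter extends TokenFilter {
    private final CharTermAttribute termAtt =
        addAttribute(CharTermAttribute.class);

    public SplitIdentifiersFilter(TokenStream in) {
      super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      // inspect or rewrite termAtt.buffer()/termAtt.length() here
      return true;
    }
  }

To keep case information, tokenize with WhitespaceTokenizer first and
apply LowerCaseFilter after the case-sensitive filters, instead of
starting with LowerCaseTokenizer.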

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de





Re: Custom Filter for Splitting CamelCase?

2011-11-29 Thread Stephen Thomas
How do you use the WordDelimiterFilterFactory()? I tried the following code:


TokenStream out = new  LowerCaseTokenizer(reader);
WordDelimiterFilterFactory wdf = new WordDelimiterFilterFactory();
out = wdf.create(out);
...

But I am getting a runtime error:

Exception in thread "main" java.lang.AbstractMethodError:
org.apache.lucene.analysis.TokenStream.incrementToken()Z
at 
org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:141)
at 
org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:54)
...

I can't construct a WordDelimiterFilter directly, because the class
is package-private.

Any ideas?

Thanks,
Steve







RE: Custom Filter for Splitting CamelCase?

2011-11-29 Thread Uwe Schindler
Hi,

Be sure to use the same Solr version as your Lucene version (if >= 3.1).
Here is example code from a test case:

WordDelimiterFilterFactory fact = new WordDelimiterFilterFactory();
// we don't need this if we don't load external exclusion files:
// ResourceLoader loader = new SolrResourceLoader(null, null);
Map<String,String> args = new HashMap<String,String>();
args.put("generateWordParts", "1");
args.put("generateNumberParts", "1");
args.put("catenateWords", "1");
args.put("catenateNumbers", "1");
args.put("catenateAll", "0");
args.put("splitOnCaseChange", "1");
fact.init(args);
// fact.inform(loader);

TokenStream ts = fact.create(new LowerCaseTokenizer(reader));


For all args params look here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de



Re: Scoring a document using LDA topics

2011-11-29 Thread Sujit Pal
Hi Stephen,

We precompute a variant of P(z,d) during indexing, and do the first 3
steps. The resulting documents are ordered by payload score, which is
basically z in our case. We don't currently care about P(t,z) but it
seems like a good thing to have for disambiguation purposes.

So anyway, I have never done what you are looking to do, but I guess the
approach you have outlined is the one you would use, although there may
be performance issues when you have a large number of topic matches.

An alternative: since you need to know P(t,z) (the probability of the
terms in the query being in a particular topic), and each
PayloadTermQuery in the BooleanQuery corresponds to a z (topic), perhaps
you could boost each clause by P(t,z)?
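
For example (a rough sketch with hypothetical names: "topics" is the
payload field, and probabilityOfQueryTermsInTopic() stands in for however
you compute sum_{t in q} P(t,z)):

  BooleanQuery q = new BooleanQuery();
  for (int z : topicsInQuery) {
      PayloadTermQuery ptq = new PayloadTermQuery(
          new Term("topics", "z" + z), new AveragePayloadFunction(), false);
      // weight the clause by how strongly the query terms belong to topic z
      ptq.setBoost(probabilityOfQueryTermsInTopic(z));
      q.add(ptq, BooleanClause.Occur.SHOULD);
  }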

-sujit
