Re: PorterStemFilter causes wildcard searches to not work
This is very hard to follow. I for one don't recall what you described or
what you are looking for.

Have you worked through
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F ?

--
Ian.

On Tue, Nov 29, 2011 at 7:25 AM, SBS wrote:
> [...]
Re: PorterStemFilter causes wildcard searches to not work
> This is very hard to follow. I for one don't recall what you
> described or what you are looking for.

Sorry about that, I am using the web interface where the context of my post
is visible to all.

To sum up, my original post was:

> It seems that when I use a PorterStemFilter in my custom analyser,
> wildcard searches malfunction.
>
> As an example, I have the words "appendicitis" and "sensitisation" in our
> content. When I enter a query of "a*itis" I would expect "appendicitis"
> to match, but instead I get "sensitisation" and not "appendicitis". If I
> remove the PorterStemFilter then things behave as I would have expected
> and desired.
>
> Why is this happening? Is there a way to apply a PorterStemFilter and
> still be able to use wildcards?
>
> I am using Lucene 3.2.

And now I am adding:

> I am applying the PorterStemFilter at both indexing and search time.
>
> As for schema, I have 3 fields: title, subtitle and notes. When the user
> enters a query string of "a*itis" (without the double quotes, of course),
> my software turns this into an actual Lucene query of "title: a*itis OR
> subtitle: a*itis OR notes: a*itis" and I get the results I described.
> However, if I run an actual query of just "a*itis" I then get the results
> I am looking for. I guess in this case it's using the default field that
> I specify in creating the QueryParser, which is "notes", but if I change
> the actual query to "notes: a*itis" I still get the undesirable results.
>
> Any idea why I am seeing this behaviour? Again, if I remove the
> PorterStemFilter from my custom analyzer (which I use at both indexing
> and search time) then I get the results I want (albeit I lose the other
> functionality which I also need).

I hope that's a bit clearer. Any ideas to explain and/or resolve this?

Thanks,

-sbs
Re: PorterStemFilter causes wildcard searches to not work
A google search of "lucene stemming wildcards" finds some hits implying
these don't work well together.

http://lucene.472066.n3.nabble.com/Conflicts-with-Stemming-and-Wildcard-Prefix-Queries-td540479.html
may be a solution.

--
Ian.

On Tue, Nov 29, 2011 at 10:39 AM, SBS wrote:
> [...]
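A minimal sketch of one common workaround (assuming the Lucene 3.x API; not
necessarily the fix proposed in the thread linked above): wildcard queries
are not analyzed, so "a*itis" is matched against the stemmed terms in the
index and misses them. Indexing an extra unstemmed copy of each field and
pointing wildcard queries at it sidesteps the conflict. The field name
"notes_exact", the MyStemmingAnalyzer class, and the "text" variable are
hypothetical stand-ins for the poster's setup:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.util.Version;

    // stem the normal fields, but leave "notes_exact" unstemmed;
    // pass 'analyzer' to the IndexWriter
    PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new MyStemmingAnalyzer()); // hypothetical
    analyzer.addAnalyzer("notes_exact",
        new StandardAnalyzer(Version.LUCENE_32));

    // index the same text into both fields
    Document doc = new Document();
    doc.add(new Field("notes", text, Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("notes_exact", text, Field.Store.NO,
        Field.Index.ANALYZED));

    // run wildcard queries against the unstemmed field only
    Query q = new WildcardQuery(new Term("notes_exact", "a*itis"));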
Error while re-indexing - cannot overwrite 0.fdt
Hi,

I get the error "Cannot overwrite 0.fdt" when I start indexing.

Detailed test case:

1) Indexing for the first time works fine.
2) Then I do a search and I get the search results.
3) After the search, if I start indexing again I get the error "Cannot
   overwrite 0.fdt".

Has anybody faced this error before? How can I resolve it?

Thanks,
Rohan Ambasta
Re: Error while re-indexing - cannot overwrite 0.fdt
Close the first index writer?

http://lmgtfy.com/?q=lucene+Cannot+overwrite+%22_0.fdt%22+file

If you can't find the answer and need to post again, include as a minimum
details of the OS and Lucene version that you are using.

--
Ian.

On Tue, Nov 29, 2011 at 12:15 PM, Rohan A Ambasta wrote:
> [...]
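A minimal sketch of the usual fix, assuming Lucene 3.x: "Cannot overwrite
0.fdt" typically means a previous IndexWriter still holds the index files
when a new indexing run starts. Close the writer when each run finishes;
"directory" and "analyzer" stand in for whatever the application already
uses:

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.util.Version;

    IndexWriter writer = new IndexWriter(directory,
        new IndexWriterConfig(Version.LUCENE_32, analyzer));
    try {
        // writer.addDocument(...) calls for this indexing run
    } finally {
        writer.close(); // releases files and the write lock for later runs
    }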
Quoted search on Analyzed fields
    field = new Field("author", author.toLowerCase(), Field.Store.NO,
        Field.Index.NOT_ANALYZED);
    field.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY);
    field.setOmitNorms(true);

When, in the above configuration, I switched from NOT_ANALYZED to ANALYZED,
Luke's results for author:"john doe" stopped showing (after rebuilding the
index). Why?

Also, how can Luke show me results on the same index for author:"john doe"
(using NOT_ANALYZED), while the IndexSearcher, which receives a query
parsed with the same StandardAnalyzer as at indexing time (and which seems
to be correct when I debug it), returns no results?

Thank you,
Mihai C.
Re: Quoted search on Analyzed fields
If you use StandardAnalyzer, it will break "john doe" into 2 tokens and
form a phrase query. If you want to do phrase queries, don't set the index
options to DOCS_ONLY; otherwise they won't work.

If what you want is for "john doe" to be only 1 term without positions,
then use KeywordAnalyzer, and DOCS_ONLY is then ok because you don't need
any positions.

On Tue, Nov 29, 2011 at 10:18 AM, Mihai Caraman wrote:
> [...]

--
lucidimagination.com
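A minimal sketch of the KeywordAnalyzer option, assuming Lucene 3.x;
"defaultAnalyzer" stands in for whatever the rest of the schema uses:

    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;

    // index time: analyze "author" with KeywordAnalyzer so the whole
    // lowercased value ("john doe") is emitted as a single token
    PerFieldAnalyzerWrapper wrapper =
        new PerFieldAnalyzerWrapper(defaultAnalyzer);
    wrapper.addAnalyzer("author", new KeywordAnalyzer());

    Field field = new Field("author", author.toLowerCase(),
        Field.Store.NO, Field.Index.ANALYZED);
    field.setIndexOptions(FieldInfo.IndexOptions.DOCS_ONLY); // one term, ok
    field.setOmitNorms(true);

    // search time: look the whole value up as one term (no QueryParser)
    TermQuery q = new TermQuery(new Term("author", "john doe"));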
Re: Scoring a document using LDA topics
Sujit,

Thanks for your reply, and the link to your blog post, which was helpful
and got me thinking about Payloads.

I still have one more question. I need to be able to compute the
Sim(query q, doc d) similarity function, which is defined below:

    Sim(q, d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)

So, I'm guessing that the only way to do this is the following:

- At index time, store the (flattened) topics as a payload for each
  document, as you suggest in your blog
- At query time, find out which topics are in the query
- Construct a BooleanQuery, consisting of one PayloadTermQuery per topic
  in the query
- Search on the BooleanQuery. This essentially tells me which documents
  have the topics in the query
- Iterate over the TopDocs returned by the search. For each doc, get the
  full payload, unflatten it, and use it to compute Sim(q, d).
- Reorder the results based on the Sim(q, d) values.

Is this the best way? I can't see a way to compute the Sim() metric at any
other time, because in scorePayload() we don't have access to the full
payload, nor to the query.

Thanks again,
Steve

On Mon, Nov 28, 2011 at 1:51 PM, Sujit Pal wrote:
> Hi Stephen,
>
> We are doing something similar, and we store as a multifield with each
> document as (d,z) pairs, where we store the z's (scores) as payloads for
> each d (topic). We have had to build a custom Similarity which implements
> the scorePayload function. So to find docs for a given d (topic), we do a
> simple PayloadTermQuery and the docs come back in descending order of z.
> Simple boolean term queries also work. We turn off norms (in the ctor for
> the PayloadTermQuery) to get scores that are identical to the d values.
>
> I wrote about this some time back... maybe this would help you:
> http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html
>
> -sujit
>
> On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
>> List,
>>
>> I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic
>> model into Lucene. Briefly, the LDA model extracts topics (distributions
>> over words) from a set of documents, and then represents each document
>> with topic vectors. For example, documents could be represented as:
>>
>> d1 = (0, 0.5, 0, 0.5)
>> d2 = (1, 0, 0, 0)
>>
>> This means that document d1 contains topics 2 and 4, and document d2
>> contains topic 1. I.e.,
>>
>> P(z1, d1) = 0
>> P(z2, d1) = 0.5
>> P(z3, d1) = 0
>> P(z4, d1) = 0.5
>> P(z1, d2) = 1
>> P(z2, d2) = 0
>> ...
>>
>> Also, topics are represented by the probability that a term appears in
>> that topic, so we also have a set of vectors:
>>
>> z1 = (0, 0, .02, ...)
>>
>> meaning that topic z1 does not contain terms 1 or 2, but does contain
>> term 3. I.e.,
>>
>> P(t1, z1) = 0
>> P(t2, z1) = 0
>> P(t3, z1) = .02
>> ...
>>
>> Then, the similarity between a query and a document is computed as:
>>
>> Sim(q, d) = sum_{t in q} sum_{z} P(t, z) * P(z, d)
>>
>> Basically, for each term in the query, and each topic in existence, see
>> how relevant that term is in that topic, and how relevant that topic is
>> in the document.
>>
>> I've been thinking about how to do this in Lucene. Assume I already have
>> the topics and the topic vectors for each document. I know that I need
>> to write my own Similarity class that extends DefaultSimilarity. I need
>> to override tf(), queryNorm(), coord(), and computeNorm() to all return
>> a constant 1, so that they have no effect. Then, I can override idf() to
>> compute the Sim equation above. Seems simple enough. However, I have a
>> few practical issues:
>>
>> - Storing the topic vectors for each document. Can I store this in the
>> index somehow? If so, how do I retrieve it later in my CustomSimilarity
>> class?
>>
>> - Changing the Boolean model. Instead of only computing the similarity
>> on documents that contain any of the terms in the query (the default
>> behavior), I need to compute the similarity on all of the documents.
>> (This is the whole idea behind LDA: you don't need an exact term match
>> for there to be a similarity.) I understand that this will result in a
>> performance hit, but I do not see a way around it.
>>
>> - Turning off fieldNorm(). How can I set the field norm for each doc to
>> a constant 1?
>>
>> Any help is greatly appreciated.
>>
>> Steve
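A minimal sketch of the re-ranking loop described in the steps above
(iterate the TopDocs, recompute Sim, reorder), assuming Lucene 3.x. The
helpers unflatten() and loadPayload(), and the per-term P(t,z) vectors in
pTzPerQueryTerm (a double[][], one row per query term), are hypothetical
scaffolding; the nested sums implement Sim(q,d) = sum_t sum_z P(t,z)*P(z,d):

    // run the BooleanQuery of PayloadTermQuery clauses first
    TopDocs top = searcher.search(booleanQuery, 1000);

    List<ScoreDoc> rescored = new ArrayList<ScoreDoc>();
    for (ScoreDoc sd : top.scoreDocs) {
        // hypothetical: read and unflatten this doc's payload into P(z,d)
        double[] pZd = unflatten(loadPayload(searcher, sd.doc));
        double sim = 0.0;
        for (double[] pTz : pTzPerQueryTerm) {    // sum over t in q
            for (int z = 0; z < pZd.length; z++)  // sum over z
                sim += pTz[z] * pZd[z];
        }
        rescored.add(new ScoreDoc(sd.doc, (float) sim));
    }

    // reorder by the recomputed Sim, descending
    Collections.sort(rescored, new Comparator<ScoreDoc>() {
        public int compare(ScoreDoc a, ScoreDoc b) {
            return Float.compare(b.score, a.score);
        }
    });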
Re: Quoted search on Analyzed fields
Still no difference; it may be because of some other hidden bug. Anyway,
adding freqs and positions will be a no-no because of space :) so bye bye
quotes.

Thank you
Re: Quoted search on Analyzed fields
Again, there is nothing wrong with the quotes: it's instead how you are
configuring the analysis for this field.

If you put stuff in quotes and your analyzer breaks it into multiple
tokens, then the query parser forms a phrase query. You must index
positions to support phrase queries. Normally DOCS_ONLY is only used for
fields that contain a *single term*, like a numeric field.

If you want to exclude positions for a field but at the same time allow
tokenized queries against it like you are doing, then you need to adjust
your query parsing to do the right thing if someone enters quoted text
like "john doe", such as forming a boolean query (john AND doe) instead.

The way to do this is to subclass the query parser and do something like:

    @Override
    protected Query getFieldQuery(String field, String queryText,
        boolean quoted) throws ParseException {
      if (quoted && field.equals("myfieldwithoutpositions")) {
        // my special logic to form boolean queries or something else
      } else {
        return super.getFieldQuery(field, queryText, quoted);
      }
    }

On Tue, Nov 29, 2011 at 10:58 AM, Mihai Caraman wrote:
> [...]

--
lucidimagination.com
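A minimal sketch of what the "special logic" branch above might contain,
under the assumption that a quoted string on this field should become a
conjunction of term queries (john AND doe) rather than a phrase query;
BooleanQuery, TermQuery, and BooleanClause are from
org.apache.lucene.search:

    // split the quoted text on whitespace and require every term,
    // so "john doe" becomes: john AND doe
    BooleanQuery bq = new BooleanQuery();
    for (String word : queryText.toLowerCase().split("\\s+")) {
        bq.add(new TermQuery(new Term(field, word)),
            BooleanClause.Occur.MUST);
    }
    return bq;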
Custom Filter for Splitting CamelCase?
List,

I have written my own CustomAnalyzer, as follows:

    public TokenStream tokenStream(String fieldName, Reader reader) {

        // TODO: add calls to RemovePunctuation and SplitIdentifiers here

        // First, convert to lower case
        TokenStream out = new LowerCaseTokenizer(reader);

        if (this.doStopping) {
            out = new StopFilter(true, out, customStopSet);
        }

        if (this.doStemming) {
            out = new PorterStemFilter(out);
        }

        return out;
    }

What I need to do is write two custom filters that do the following:

- RemovePunctuation() replaces punctuation with whitespace, preserving
  case. E.g.,

    "foo=bar*45;" ==> "foo bar 45"
    "fooBar" ==> "fooBar"
    "\"stho...@cs.queensu.ca\"" ==> "sthomas cs queensu ca"

- SplitIdentifiers() breaks up words based on camelCase notation:

    "fooBar" ==> "foo Bar"
    "ABCCompany" ==> "ABC Company"

  (I have the regex for this.)

Note this step must be performed before LowerCaseTokenizer, because we need
case information to do the splitting.

How can I write custom filters, and how do I call them before
LowerCaseTokenizer()?

Thanks in advance,
Steve
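A minimal sketch of the general shape of a custom TokenFilter in the 3.x
attribute API, shown here as a simplified stand-in for RemovePunctuation
(it rewrites each token in place; splitting one token into several, as
SplitIdentifiers needs, additionally requires buffering state inside
incrementToken(), which the WordDelimiterFilter suggested in the replies
already implements). To run case-sensitive filters before lowercasing,
start from a tokenizer that preserves case (e.g. WhitespaceTokenizer) and
append a LowerCaseFilter afterwards:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class StripPunctuationFilter extends TokenFilter {
        private final CharTermAttribute termAtt =
            addAttribute(CharTermAttribute.class);

        public StripPunctuationFilter(TokenStream in) {
            super(in);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // keep only letters, preserving case for later camelCase logic
            String cleaned = termAtt.toString().replaceAll("[^a-zA-Z]", "");
            termAtt.setEmpty().append(cleaned);
            return true;
        }
    }

The chain would then be assembled with lowercasing moved to the end,
sketched as:

    TokenStream out = new WhitespaceTokenizer(reader); // preserves case
    out = new StripPunctuationFilter(out);
    // out = new SplitIdentifiersFilter(out);          // hypothetical
    out = new LowerCaseFilter(out);                    // lowercase last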
RE: Custom Filter for Splitting CamelCase?
Hi,

There is WordDelimiterFilter in Solr that was also ported to the Lucene
analysis module in Lucene trunk (4.0). In 3.x you can still add solr.jar
to your classpath and use WordDelimiterFilterFactory to produce one
(WordDelimiterFilter itself is package-private).

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Stephen Thomas
> Sent: Tuesday, November 29, 2011 5:20 PM
> Subject: Custom Filter for Splitting CamelCase?
>
> [...]
Re: Custom Filter for Splitting CamelCase?
How do you use the WordDelimiterFilterFactory? I tried the following code:

    TokenStream out = new LowerCaseTokenizer(reader);
    WordDelimiterFilterFactory wdf = new WordDelimiterFilterFactory();
    out = wdf.create(out);
    ...

But I am getting a runtime error:

    Exception in thread "main" java.lang.AbstractMethodError:
    org.apache.lucene.analysis.TokenStream.incrementToken()Z
        at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:141)
        at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:54)
        ...

I can't create a class of type WordDelimiterFilter directly, because it is
package-private.

Any ideas?

Thanks,
Steve

On Tue, Nov 29, 2011 at 12:44 PM, Uwe Schindler wrote:
> [...]
RE: Custom Filter for Splitting CamelCase?
Hi,

Be sure to use the same Solr version as your Lucene version (if >= 3.1).
This is example code from a test case:

    WordDelimiterFilterFactory fact = new WordDelimiterFilterFactory();

    // we don't need this if we don't load external exclusion files:
    // ResourceLoader loader = new SolrResourceLoader(null, null);

    Map<String, String> args = new HashMap<String, String>();
    args.put("generateWordParts", "1");
    args.put("generateNumberParts", "1");
    args.put("catenateWords", "1");
    args.put("catenateNumbers", "1");
    args.put("catenateAll", "0");
    args.put("splitOnCaseChange", "1");
    fact.init(args);
    // fact.inform(loader);

    TokenStream ts = fact.create(new LowerCaseTokenizer(reader));

For all args params look here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Stephen Thomas
> Sent: Tuesday, November 29, 2011 7:39 PM
> Subject: Re: Custom Filter for Splitting CamelCase?
>
> [...]
Re: Scoring a document using LDA topics
Hi Stephen,

We precompute a variant of P(z,d) during indexing, and do the first 3
steps. The resulting documents are ordered by payload score, which is
basically z in our case. We don't currently care about P(t,z), but it
seems like a good thing to have for disambiguation purposes.

So anyway, I have never done what you are looking to do, but I guess the
approach you have outlined would be the one you would use, although there
may be performance issues when you have a large number of topic matches.

An alternative: since you need to know P(t,z) (the probability of the
terms in the query being in a particular topic), and each PayloadTermQuery
in the BooleanQuery corresponds to a z (topic), perhaps you could boost
each clause by P(t,z)?

-sujit

On Tue, 2011-11-29 at 10:50 -0500, Stephen Thomas wrote:
> [...]
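A minimal sketch of this boosting suggestion, assuming Lucene 3.x payload
queries, a "topics" field whose terms name the topics ("z0", "z1", ...),
and a hypothetical float[] pTz holding P(t,z) aggregated over the query
terms:

    // one PayloadTermQuery per topic present in the query, boosted by
    // P(t,z), so the Lucene score approximates sum_z P(t,z) * P(z,d)
    BooleanQuery query = new BooleanQuery();
    for (int z = 0; z < pTz.length; z++) {
        if (pTz[z] == 0f) continue; // topic not present in the query
        PayloadTermQuery ptq = new PayloadTermQuery(
            new Term("topics", "z" + z),        // hypothetical topic term
            new AveragePayloadFunction(),
            false);                             // false: payload score only
        ptq.setBoost(pTz[z]);                   // weight clause by P(t,z)
        query.add(ptq, BooleanClause.Occur.SHOULD);
    }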