Re: Custom Filter for Splitting CamelCase?

2011-11-29 Thread Stephen Thomas
>> -----Original Message-----
>> From: stephen.warner.tho...@gmail.com
>> [mailto:stephen.warner.tho...@gmail.com] On Behalf Of Stephen Thomas
>> Sent: Tuesday, November 29, 2011 5:20 PM
>> To: java-user@lucene.apache.org
>> Subject: Custom Filter for Splitting CamelCase?

Custom Filter for Splitting CamelCase?

2011-11-29 Thread Stephen Thomas
List,

I have written my own CustomAnalyzer, as follows:

public TokenStream tokenStream(String fieldName, Reader reader) {
    // TODO: add calls to RemovePuncation, and SplitIdentifiers here
    // First, convert to lower case
    TokenStream

Re: Scoring a document using LDA topics

2011-11-29 Thread Stephen Thomas
ote about this sometime back...maybe this would help you.
> http://sujitpal.blogspot.com/2011/01/payloads-with-solr.html
>
> -sujit
>
> On Mon, 2011-11-28 at 12:29 -0500, Stephen Thomas wrote:
>> List,
>>
>> I am trying to incorporate the Latent Dirichlet Allocation

Scoring a document using LDA topics

2011-11-28 Thread Stephen Thomas
List, I am trying to incorporate the Latent Dirichlet Allocation (LDA) topic model into Lucene. Briefly, the LDA model extracts topics (distributions over words) from a set of documents, and then represents each document as a topic vector. For example, documents could be represented as: d1 = (0,

Do duplicate documents affect term scoring?

2011-11-27 Thread Stephen Thomas
List, I am indexing a subset of Wikipedia. I have 4 years' worth of data, and have taken a snapshot of each document in each month of the 4-year span. Thus, I have 4*12=48 versions of each document. (I keep track of the timestamp in a field.) I have noticed that in many cases, a Wikipedia document d