lucene mailing list archives zip?
Dear list -- is there any archive "proper" of the lucene dev and user Mailman lists? A link per-month or zip or tar.gz of the mbox files would be terrific. Thanks in advance gregor - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Numerical ids for terms?
Thanks Toke and Kirill -- I guess that's the way to go (at least until v4.0).
Best regards
gregor
On 4/13/11 3:42 PM, Toke Eskildsen wrote:
> On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:
>> Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level. As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core.
> Maybe you're thinking about something like TermsEnum? https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html
> It provides ordinal access to terms, represented with longs. In order to make the access at index level rather than segment level you will have to perform a merge of the ordinals from the different segments. Unfortunately it is optional whether the codec supports ordinal-based terms access and the default codec does not, so you will have to explicitly select a codec when you build your index.
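The ordinal merge Toke mentions can be sketched in plain Java, without Lucene on the classpath. This is only an illustration under stated assumptions: `GlobalOrdinals` and its methods are hypothetical names, each segment's term dictionary is modelled as a sorted, duplicate-free list of strings, and the sort order is Java's natural String order rather than whatever order a given codec uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper, not part of Lucene: maps per-segment term ordinals to index-wide ordinals. */
final class GlobalOrdinals {
    private final List<String> globalTerms; // sorted, duplicate-free union of all segment terms
    private final int[][] segmentToGlobal;  // [segment][segmentOrd] -> globalOrd

    /** Each inner list must hold one segment's terms, sorted and without duplicates. */
    GlobalOrdinals(List<List<String>> segments) {
        TreeSet<String> union = new TreeSet<>();
        for (List<String> segment : segments) {
            union.addAll(segment);
        }
        globalTerms = new ArrayList<>(union);

        segmentToGlobal = new int[segments.size()][];
        for (int s = 0; s < segments.size(); s++) {
            List<String> segment = segments.get(s);
            int[] map = new int[segment.size()];
            int g = 0;
            for (int ord = 0; ord < segment.size(); ord++) {
                // Both lists are sorted, so the global cursor only ever moves forward.
                while (!globalTerms.get(g).equals(segment.get(ord))) {
                    g++;
                }
                map[ord] = g;
            }
            segmentToGlobal[s] = map;
        }
    }

    /** Number of distinct terms across all segments. */
    int numTerms() {
        return globalTerms.size();
    }

    /** Index-wide ordinal (0 to numTerms()-1) of a term given by segment and segment-local ordinal. */
    int globalOrd(int segment, int segmentOrd) {
        return segmentToGlobal[segment][segmentOrd];
    }

    /** Inverse direction: the term behind an index-wide ordinal. */
    String term(int globalOrd) {
        return globalTerms.get(globalOrd);
    }
}
```

With two segments holding [apple, cherry] and [banana, cherry, date], the shared term "cherry" gets the same index-wide ordinal from both segments, which is the property a term-document matrix at index level needs.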
Re: Numerical ids for terms?
Thanks for the quick response. Please be a bit more concrete than "some form" of term--id mapping: do you refer to subclassing SegmentReader with an appropriate Map implementation, or is there a tested structure in the existing API that I've overlooked?
Regarding a Directory abstraction backed by a memory-mapping API: my question refers to using the Lucene API because, even if it may be perceived as "dumb", it hides a lot of boilerplate code. Are there any efforts going on regarding this?
Cheers
gregor
On 4/12/11 1:21 PM, Earwin Burrfoot wrote:
> On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich wrote:
>> Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level. As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core.
> Lucene index already provides term<->id mapping in some form.
>> Another interesting feature would be to use Lucene's Directory backend for storage of large dense matrices, for instance for data-mining tasks from within Lucene.
> Lucene's Directory is a dumb abstraction for random-access named write-once byte streams. It doesn't add /any/ value over mmap.
>> Any suggestions?
> *troll mode on* Use numpy/scipy? :)
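For comparison, the plain mmap route Earwin alludes to might look like the following sketch, which uses only java.nio and no Lucene classes. `MappedFloatMatrix` is a hypothetical name, the row-major layout is an assumption, and a single mapping is limited to 2 GB, so a really large matrix would need several mappings.

```java
import java.io.IOException;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, not a Lucene class: a row-major dense float matrix in a memory-mapped file. */
final class MappedFloatMatrix {
    private final FloatBuffer data;
    private final int cols;

    private MappedFloatMatrix(FloatBuffer data, int cols) {
        this.data = data;
        this.cols = cols;
    }

    /** Maps (and, if necessary, grows) the file so that it holds rows * cols floats. */
    static MappedFloatMatrix open(Path file, int rows, int cols) throws IOException {
        long bytes = (long) rows * cols * Float.BYTES;
        if (bytes > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("a single mapping is limited to 2 GB");
        }
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping remains valid after the channel has been closed.
            FloatBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytes).asFloatBuffer();
            return new MappedFloatMatrix(buffer, cols);
        }
    }

    float get(int row, int col) {
        return data.get(row * cols + col);
    }

    void set(int row, int col, float value) {
        data.put(row * cols + col, value);
    }
}
```

What this leaves out is exactly the boilerplate Gregor refers to: file naming, write-once semantics, and coordination with commits all have to be handled by the caller.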
Numerical ids for terms?
Hi -- has there been any effort to create a numerical representation of Lucene indices? That is, to use the Lucene Directory backend as a large term-document matrix at index level. As this would require a bijective mapping between terms (per-field, as customary in Lucene) and a numerical index (integer, monotonic from 0 to numTerms()-1), I guess this requires some special modifications to the Lucene core. Another interesting feature would be to use Lucene's Directory backend for storage of large dense matrices, for instance for data-mining tasks from within Lucene. Any suggestions? Best regards and thanks gregor
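To make the request concrete, here is a minimal in-memory sketch of a bijective term<->id mapping for a single field together with a sparse term-document count matrix. It is plain Java and does not touch Lucene; `TermDocumentMatrix` and its methods are hypothetical names, and documents are assumed to arrive as already-tokenized lists of strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical illustration, not a Lucene class: term ids 0..numTerms()-1 plus sparse counts. */
final class TermDocumentMatrix {
    private final List<String> idToTerm;                  // id -> term, in sorted term order
    private final Map<String, Integer> termToId;          // term -> id, inverse of idToTerm
    private final List<Map<Integer, Integer>> docCounts;  // per document: termId -> frequency

    TermDocumentMatrix(List<List<String>> tokenizedDocs) {
        // Collect the vocabulary in sorted order so that ids are stable for a given corpus.
        TreeSet<String> vocabulary = new TreeSet<>();
        for (List<String> doc : tokenizedDocs) {
            vocabulary.addAll(doc);
        }
        idToTerm = new ArrayList<>(vocabulary);
        termToId = new HashMap<>();
        for (int id = 0; id < idToTerm.size(); id++) {
            termToId.put(idToTerm.get(id), id);
        }
        // Count term occurrences per document, keyed by term id.
        docCounts = new ArrayList<>();
        for (List<String> doc : tokenizedDocs) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (String token : doc) {
                counts.merge(termToId.get(token), 1, Integer::sum);
            }
            docCounts.add(counts);
        }
    }

    int numTerms() {
        return idToTerm.size();
    }

    /** Returns the id of a term, or -1 if the term is unknown. */
    int id(String term) {
        Integer id = termToId.get(term);
        return id == null ? -1 : id;
    }

    String term(int id) {
        return idToTerm.get(id);
    }

    /** Matrix entry: how often the term with this id occurs in the given document. */
    int count(int termId, int doc) {
        return docCounts.get(doc).getOrDefault(termId, 0);
    }
}
```

For the two documents [b, a, b] and [c, a], the ids come out as a=0, b=1, c=2, and the entry for term 1 in document 0 is 2.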
Re: Release schedule Lucene 4?
Hi Mike, all -- a belated thanks for this response ;) From the ensuing discussion, it sounds like there's a LOT to be in v4, and not raising false expectations by giving dates is appreciated ;) The only thing is: are we talking about some time in 2011 or in 2012, just to have a coarse-grained estimate without any assumptions attached?
Best
gregor
On 1/15/11 3:20 PM, Michael McCandless wrote:
> This is unfortunately hard to say! There's tons of good stuff in 4.0, so we'd really like to release sooner rather than later. But then there's also a lot of work remaining, e.g. we have 3 feature branches in flight right now that we need to wrap up and land on trunk:
> * realtime (gives us concurrent flushing during indexing)
> * docvalues (adds column-stride fields)
> * bulkpostings (gives good search speedup for intblock codecs)
> Plus many open Jira issues. So it's hard to predict when all of this will be done.
> Mike
> On Fri, Jan 14, 2011 at 12:31 PM, Gregor Heinrich wrote:
>> Dear Lucene team, I am wondering whether there is an updated Lucene release schedule for the v4.0 stream. Any earliest/latest alpha/beta/stable date? And if not yet, where to track such info? Thanks in advance from Germany gregor
Release schedule Lucene 4?
Dear Lucene team, I am wondering whether there is an updated Lucene release schedule for the v4.0 stream. Any earliest/latest alpha/beta/stable date? And if not yet, where to track such info? Thanks in advance from Germany gregor
Re: lucene.index.*: extending Lucene to store topic model data ?
Hi Uwe -- thanks for this great hint. Is it considered stable enough to throw corpora of around 100 MB of raw text at it?
ps -- sorry for staying cryptic about the actual application. I tried to abstract its relation to Lucene... Basically it's about automatically associating queries and documents with groups of related terms (topics) and thus improving recall. I wrote an introductory note about this stuff that may give an overview and cites much of the original literature: http://www.arbylon.net/publications/text-est2.pdf .
All the best
gregor
On 11/19/10 9:07 AM, Uwe Schindler wrote:
> Hi Gregor, I do not come from your area, so I don't understand all the stuff you are writing about, but from what you write, it looks like you are interested in the new flexible indexing coming with Lucene 4.0 aka Lucene trunk? Currently flexible indexing only allows modifying the term dictionary and posting lists (the 4-dim Enum API in Lucene), but in the future we will also allow modifying the index format of stored fields/term vectors. We already started to have patches that allow per-field/document statistics for BM25 scoring.
> Uwe
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> -----Original Message-----
> From: Gregor Heinrich [mailto:gre...@arbylon.net]
> Sent: Friday, November 19, 2010 8:50 AM
> To: dev@lucene.apache.org
> Subject: lucene.index.*: extending Lucene to store topic model data ?
>> Dear list -- a question on potential storage of data originating from "topic models" like LSA (latent semantic analysis) and LDA (latent Dirichlet allocation). Packages like Mahout or SemanticVectors allow extraction of latent topics from an existing Lucene corpus. They don't have the storage of the actual latent concepts integrated into Lucene's efficient backend. So storing those data within Lucene's segments may be a benefit for them.
>> My question: In the IndexWriter backend, is there any reasonable way you can think of to store extra information after segments have been created but before a commit()? (This way any IndexSearcher/Reader always sees a consistent index.) Further, after the optimize() step, another modification of the extra information in the index should be possible.
>> Example scenario: An IndexWriter.preCommit() starts the LDA algorithm from the information in the index and stores topic-related data with the segments currently active for indexing, but in extra files. The extra files contain document-specific topic float vectors as well as segment-global float vectors. During commit(), the extra files are merged with the segments (which involves some math processing again). At the end of the indexing process, the LDA algorithm is rerun, improving the topic model globally, thus again modifying the extra files.
>> What may be a point of departure? Adding a modified TermVector-like storage approach and hooking it to extended Segment* classes?
>> Best regards gregor
lucene.index.*: extending Lucene to store topic model data ?
Dear list -- a question on potential storage of data originating from "topic models" like LSA (latent semantic analysis) and LDA (latent Dirichlet allocation). Packages like Mahout or SemanticVectors allow extraction of latent topics from an existing Lucene corpus. They don't have the storage of the actual latent concepts integrated into Lucene's efficient backend. So storing those data within Lucene's segments may be a benefit for them.
My question: In the IndexWriter backend, is there any reasonable way you can think of to store extra information after segments have been created but before a commit()? (This way any IndexSearcher/Reader always sees a consistent index.) Further, after the optimize() step, another modification of the extra information in the index should be possible.
Example scenario: An IndexWriter.preCommit() starts the LDA algorithm from the information in the index and stores topic-related data with the segments currently active for indexing, but in extra files. The extra files contain document-specific topic float vectors as well as segment-global float vectors. During commit(), the extra files are merged with the segments (which involves some math processing again). At the end of the indexing process, the LDA algorithm is rerun, improving the topic model globally, thus again modifying the extra files.
What may be a point of departure? Adding a modified TermVector-like storage approach and hooking it to extended Segment* classes?
Best regards
gregor
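The "extra files" in the scenario could, as a rough sketch, look like the per-segment sidecar below. Everything here is an assumption made for illustration: the class name `TopicSidecar`, the byte layout, and in particular the merge rule, which simply appends the documents and averages the segment-global vectors weighted by document count as a stand-in for the unspecified "math processing". None of it hooks into IndexWriter or the Segment* classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical per-segment sidecar for topic data; layout and names do not come from Lucene. */
final class TopicSidecar {
    final float[] segmentTopics; // one segment-global weight per topic
    final float[][] docTopics;   // [docId][topic], document-specific topic vectors

    TopicSidecar(float[] segmentTopics, float[][] docTopics) {
        this.segmentTopics = segmentTopics;
        this.docTopics = docTopics;
    }

    /** Assumed layout: numDocs, numTopics, segment-global vector, then one vector per document. */
    byte[] toBytes() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(docTopics.length);
            out.writeInt(segmentTopics.length);
            for (float v : segmentTopics) {
                out.writeFloat(v);
            }
            for (float[] doc : docTopics) {
                for (float v : doc) {
                    out.writeFloat(v);
                }
            }
        }
        return bytes.toByteArray();
    }

    static TopicSidecar fromBytes(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int numDocs = in.readInt();
            int numTopics = in.readInt();
            float[] segment = new float[numTopics];
            for (int k = 0; k < numTopics; k++) {
                segment[k] = in.readFloat();
            }
            float[][] docs = new float[numDocs][numTopics];
            for (int d = 0; d < numDocs; d++) {
                for (int k = 0; k < numTopics; k++) {
                    docs[d][k] = in.readFloat();
                }
            }
            return new TopicSidecar(segment, docs);
        }
    }

    /** Placeholder merge (an assumption): b's documents follow a's; global vectors are averaged by doc count. */
    static TopicSidecar merge(TopicSidecar a, TopicSidecar b) {
        int na = a.docTopics.length;
        int nb = b.docTopics.length;
        float[][] docs = new float[na + nb][];
        System.arraycopy(a.docTopics, 0, docs, 0, na);
        System.arraycopy(b.docTopics, 0, docs, na, nb);
        float[] segment = new float[a.segmentTopics.length];
        for (int k = 0; k < segment.length; k++) {
            segment[k] = (na * a.segmentTopics[k] + nb * b.segmentTopics[k]) / (na + nb);
        }
        return new TopicSidecar(segment, docs);
    }
}
```

The document order after a merge mirrors how merged segments renumber documents (second segment's ids shifted by the first segment's size), which is the main thing any real implementation would have to keep in step with the index.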
contrib/fast-vector-highlighter classes: package-private fields
Dear list -- I was wondering why in the fast-vector-highlighter some fields are package-private and at the same time don't have accessor methods. Are subclasses supposed to be put in the same package then?
Example: subclassing ScoreOrderFragmentsBuilder with a new method like this:
    @Override
    public List<WeightedFragInfo> getWeightedFragInfoList(List<WeightedFragInfo> src) {
        Collections.sort(src, new ScoreComparator());
        super.getWeightedFragInfoList(src);
        for (int i = 0; i < src.size(); i++) {
            // ??? every field package private in FieldFragList.WeightedFragInfo
            WeightedFragInfo u = src.get(i);
            u.startOffset -= 20;
            u.endOffset += 20;
            src.set(i, u);
        }
        return src;
    }
I'd vote for protected access for all those fields in the 11 classes where this issue applies. IMO, the package is worth having this extra flexibility. Then it really deserves its attribute "fast" also in terms of developing with it.
Best wishes
gregor
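What the requested flexibility would enable can be sketched with stand-in classes. `FragInfo` below is a hypothetical substitute for FieldFragList.WeightedFragInfo that exposes its offsets through accessors; it is not the actual contrib class. The clamping to the text bounds is an addition not present in the snippet above, included because subtracting a fixed margin can otherwise produce negative offsets.

```java
import java.util.List;

/** Hypothetical stand-in for FieldFragList.WeightedFragInfo with accessor methods. */
final class FragInfo {
    private int startOffset;
    private int endOffset;

    FragInfo(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    int getStartOffset() {
        return startOffset;
    }

    int getEndOffset() {
        return endOffset;
    }

    void setOffsets(int startOffset, int endOffset) {
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

/** Hypothetical helper showing code outside the owning package adjusting fragments. */
final class FragmentWidener {
    /** Widens every fragment by margin characters on both sides, clamped to [0, textLength]. */
    static List<FragInfo> widen(List<FragInfo> fragments, int margin, int textLength) {
        for (FragInfo fragment : fragments) {
            int start = Math.max(0, fragment.getStartOffset() - margin);
            int end = Math.min(textLength, fragment.getEndOffset() + margin);
            fragment.setOffsets(start, end);
        }
        return fragments;
    }
}
```

With accessors like these (or protected fields, as proposed), a subclass living in an application package could apply the same adjustment inside getWeightedFragInfoList without being placed in the highlighter's own package.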