lucene mailing list archives zip?

2011-06-05 Thread Gregor Heinrich
Dear list -- is there any archive "proper" of the lucene dev and user Mailman 
lists?  A link per-month or zip or tar.gz of the mbox files would be terrific.


Thanks in advance

gregor

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Numerical ids for terms?

2011-04-13 Thread Gregor Heinrich

Thanks Toke and Kirill -- I guess that's the way to go (at least until v4.0).

Best regards

gregor

On 4/13/11 3:42 PM, Toke Eskildsen wrote:

On Tue, 2011-04-12 at 11:41 +0200, Gregor Heinrich wrote:

Hi -- has there been any effort to create a numerical representation of Lucene
indices? That is, to use the Lucene Directory backend as a large term-document
matrix at index level. As this would require a bijective mapping between terms
(per-field, as customary in Lucene) and a numerical index (integer, monotonic
from 0 to numTerms()-1), I guess this requires some special modifications
to the Lucene core.

Maybe you're thinking about something like TermsEnum?
https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/index/TermsEnum.html
It provides ordinal-access to terms, represented with longs. In order to
make the access at index-level rather than segment-level you will have
to perform a merge of the ordinals from the different segments.

Unfortunately it is optional whether the codec supports ordinal-based
terms access and the default codec does not, so you will have to
explicitly select a codec when you build your index.
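
The ordinal merge described above can be sketched without touching the Lucene API. The following is only an illustration under stated assumptions: each segment is represented by a plain sorted list of unique term strings (in Lucene the terms would come from a per-segment TermsEnum and are ordered by bytes, not by String.compareTo), and the names GlobalTermOrdinals, mergeTerms and segmentToGlobal are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Sketch: merge per-segment term ordinals into index-level ordinals. */
final class GlobalTermOrdinals {

    /** Index-level dictionary: the sorted union of all segment terms. */
    static List<String> mergeTerms(List<List<String>> segmentTerms) {
        TreeSet<String> union = new TreeSet<>();
        for (List<String> terms : segmentTerms) {
            union.addAll(terms);
        }
        return new ArrayList<>(union);
    }

    /** For one segment: result[segmentOrd] is the index-level ordinal of that term. */
    static int[] segmentToGlobal(List<String> segmentTerms, List<String> globalTerms) {
        int[] map = new int[segmentTerms.size()];
        int g = 0;
        for (int s = 0; s < segmentTerms.size(); s++) {
            // Both lists are sorted the same way, so one forward scan suffices.
            while (!globalTerms.get(g).equals(segmentTerms.get(s))) {
                g++;
            }
            map[s] = g;
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> seg0 = Arrays.asList("apple", "lucene", "term");
        List<String> seg1 = Arrays.asList("index", "lucene", "zip");
        List<String> global = mergeTerms(Arrays.asList(seg0, seg1));
        System.out.println(global);                                         // [apple, index, lucene, term, zip]
        System.out.println(Arrays.toString(segmentToGlobal(seg0, global))); // [0, 2, 3]
        System.out.println(Arrays.toString(segmentToGlobal(seg1, global))); // [1, 2, 4]
    }
}
```

The per-segment arrays are what make the index-level view cheap to use afterwards: a posting read from a segment can be translated to an index-level term id with a single array lookup.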





Re: Numerical ids for terms?

2011-04-12 Thread Gregor Heinrich
Thanks for the quick response. Could you be a bit more concrete than "some form" of
term--id mapping? Do you mean subclassing SegmentReader with an
appropriate Map implementation, or is there a tested structure in the existing
API that I've overlooked? Regarding a Directory abstraction backed by a
memory-mapping API: my question refers to using the Lucene API because, even if
it may be perceived as "dumb", it hides a lot of boilerplate code. Are there any
efforts going on regarding this?


Cheers

gregor

On 4/12/11 1:21 PM, Earwin Burrfoot wrote:

On Tue, Apr 12, 2011 at 13:41, Gregor Heinrich wrote:

Hi -- has there been any effort to create a numerical representation of
Lucene indices? That is, to use the Lucene Directory backend as a large
term-document matrix at index level. As this would require a bijective mapping
between terms (per-field, as customary in Lucene) and a numerical index
(integer, monotonic from 0 to numTerms()-1), I guess this requires
some special modifications to the Lucene core.

Lucene index already provides term <-> id mapping in some form.


Another interesting feature would be to use Lucene's Directory backend for
storage of large dense matrices, for instance for data-mining tasks from
within Lucene.

Lucene's Directory is a dumb abstraction for random-access named
write-once byte streams.
It doesn't add /any/ value over mmap.
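
To make the mmap comparison concrete, here is a minimal sketch of a dense float matrix kept in a memory-mapped file using only java.nio. It is an illustration, not a Lucene facility: the class name MappedMatrix and its methods are hypothetical, the layout is assumed to be row-major, and a single mapping is limited to about 2 GB.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: a row-major dense float matrix backed by a memory-mapped file. */
final class MappedMatrix {
    private final MappedByteBuffer buf;
    private final int cols;

    MappedMatrix(Path file, int rows, int cols) throws IOException {
        this.cols = cols;
        long bytes = (long) rows * cols * Float.BYTES;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Mapping past the current end of the file grows the file.
            // The mapping stays valid after the channel is closed.
            buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);
        }
    }

    float get(int row, int col) {
        return buf.getFloat(offset(row, col));
    }

    void set(int row, int col, float value) {
        buf.putFloat(offset(row, col), value);
    }

    /** Ask the OS to write modified pages back to the file. */
    void flush() {
        buf.force();
    }

    private int offset(int row, int col) {
        return Math.toIntExact(((long) row * cols + col) * Float.BYTES);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("matrix", ".bin");
        MappedMatrix m = new MappedMatrix(file, 3, 4);
        m.set(2, 1, 0.5f);
        m.flush();
        System.out.println(m.get(2, 1)); // 0.5
    }
}
```

What this sketch lacks, compared with going through a Directory, is exactly the boilerplate in question: file naming, lifecycle, and keeping the file consistent with the rest of the index.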


Any suggestions?

*troll mode on* Use numpy/scipy? :)






Numerical ids for terms?

2011-04-12 Thread Gregor Heinrich
Hi -- has there been any effort to create a numerical representation of Lucene
indices? That is, to use the Lucene Directory backend as a large term-document
matrix at index level. As this would require a bijective mapping between terms
(per-field, as customary in Lucene) and a numerical index (integer, monotonic
from 0 to numTerms()-1), I guess this requires some special modifications
to the Lucene core.
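
To pin down what is meant by a bijective mapping, here is a minimal sketch outside Lucene. It assumes the terms of one field fit in memory as strings; the class TermIdMap and its methods are hypothetical names, not part of any Lucene API.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

/** Sketch: bijective mapping between the terms of one field and ids 0..numTerms()-1. */
final class TermIdMap {
    private final String[] terms; // sorted and unique, so the array index is the id

    TermIdMap(Collection<String> fieldTerms) {
        terms = new TreeSet<>(fieldTerms).toArray(new String[0]);
    }

    int numTerms() {
        return terms.length;
    }

    /** Returns the id of the term, or -1 if the term is unknown. */
    int idOf(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? i : -1;
    }

    String termOf(int id) {
        return terms[id];
    }

    public static void main(String[] args) {
        TermIdMap map = new TermIdMap(Arrays.asList("matrix", "index", "term", "index"));
        System.out.println(map.numTerms());      // 3
        System.out.println(map.idOf("matrix"));  // 1
        System.out.println(map.termOf(2));       // term
    }
}
```

With such ids, a term-document matrix entry is addressed by the pair (idOf(term), docId).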


Another interesting feature would be to use Lucene's Directory backend for
storage of large dense matrices, for instance for data-mining tasks from within
Lucene.


Any suggestions?

Best regards and thanks

gregor





Re: Release schedule Lucene 4?

2011-01-17 Thread Gregor Heinrich

Hi Mike, all --

a (regrettably slow) thanks for this response ;)

From the ensuing discussion, it sounds like there's a LOT going into v4, and I
appreciate that you avoid raising false expectations by giving dates ;)


The only thing is: are we talking about some time in 2011 or in 2012, just to
have a coarse-grained estimate without any assumptions attached?


Best

gregor





On 1/15/11 3:20 PM, Michael McCandless wrote:

This is unfortunately hard to say!

There's tons of good stuff in 4.0, so we'd really like to release
sooner rather than later.

But then there's also a lot of work remaining, eg we have 3 feature
branches in flight right now, that we need to wrap up and land on
trunk:

   * realtime (gives us concurrent flushing during indexing)

   * docvalues (adds column-stride fields)

   * bulkpostings (gives good search speedup for intblock codecs)

Plus many open Jira issues.  So it's hard to predict when all of this
will be done.

Mike

On Fri, Jan 14, 2011 at 12:31 PM, Gregor Heinrich wrote:

Dear Lucene team,

I am wondering whether there is an updated Lucene release schedule for the
v4.0 stream.

Any earliest/latest alpha/beta/stable date? And if not yet, where to track
such info?

Thanks in advance from Germany

gregor




Release schedule Lucene 4?

2011-01-14 Thread Gregor Heinrich

Dear Lucene team,

I am wondering whether there is an updated Lucene release schedule for the v4.0 
stream.


Any earliest/latest alpha/beta/stable date? And if not yet, where to track such 
info?


Thanks in advance from Germany

gregor




Re: lucene.index.*: extending Lucene to store topic model data ?

2010-11-19 Thread Gregor Heinrich
Hi Uwe -- thanks for this great hint. Is it considered stable enough to throw
corpora at it that have 100 MB or more of raw text?


ps -- sorry for staying cryptic about the actual application. I tried to 
abstract its relation to Lucene... Basically it's about automatically 
associating queries and documents with groups of related terms (topics) and thus 
improving recall. I wrote an introductory note about this stuff that may give an 
overview and cites much of the original literature: 
http://www.arbylon.net/publications/text-est2.pdf .


All the best

gregor


On 11/19/10 9:07 AM, Uwe Schindler wrote:

Hi Gregor,

I do not come from your area, so I don't understand all the stuff you are
writing about, but from what you write, it looks like you are interested in
the new flexible indexing coming with Lucene 4.0 aka Lucene trunk. Currently
flexible indexing only allows modifying the term dictionary and posting lists
(the 4-dim Enum API in Lucene), but in the future we will also
allow modifying the index format of stored fields/term vectors. We have already
started to have patches that allow per-field/document statistics for BM25
scoring.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-----Original Message-----
From: Gregor Heinrich [mailto:gre...@arbylon.net]
Sent: Friday, November 19, 2010 8:50 AM
To: dev@lucene.apache.org
Subject: lucene.index.*: extending Lucene to store topic model data ?

Dear list -- a question on potential storage of data originating from "topic
models" like LSA (latent semantic analysis) and LDA (latent Dirichlet
allocation).

Packages like Mahout or SemanticVectors allow extraction of latent topics from
an existing Lucene corpus. They don't have the storage of the actual latent
concepts integrated into Lucene's efficient backend. So storing those data
within Lucene's segments may be a benefit for them.

My question: In the IndexWriter backend, is there any reasonable way you can
think of to store extra information after segments have been created but
before a commit()? (This way any IndexSearcher/Reader always sees a
consistent index.) Further, after the optimize() step, another modification of
the extra information in the index should be possible.

Example scenario: An IndexWriter.preCommit() starts the LDA algorithm from
the information in the index and stores topic related data with the segments
currently active for indexing, but in extra files. The extra files contain
document-specific topic float vectors as well as segment-global float vectors.
During commit(), the extra files are merged with the segments (which involves
some math processing again). At the end of the indexing process, the LDA
algorithm is rerun, improving the topic model globally, thus again modifying
the extra files.

What may be a point of departure? Adding a modified TermVector-like storage
approach and hooking it to extended Segment* classes?

Best regards

gregor






lucene.index.*: extending Lucene to store topic model data ?

2010-11-18 Thread Gregor Heinrich
Dear list -- a question on potential storage of data originating from "topic 
models" like LSA (latent semantic analysis) and LDA (latent Dirichlet 
allocation). Packages like Mahout or SemanticVectors allow extraction of latent 
topics from an existing Lucene corpus. They don't have the storage of the actual 
latent concepts integrated into Lucene's efficient backend. So storing those 
data within Lucene's segments may be a benefit for them.


My question: In the IndexWriter backend, is there any reasonable way you can
think of to store extra information after segments have been created but before
a commit()? (This way any IndexSearcher/Reader always sees a consistent index.)
Further, after the optimize() step, another modification of the extra
information in the index should be possible.


Example scenario: An IndexWriter.preCommit() starts the LDA algorithm from the 
information in the index and stores topic related data with the segments 
currently active for indexing, but in extra files. The extra files contain 
document-specific topic float vectors as well as segment-global float vectors. 
During commit(), the extra files are merged with the segments (which involves 
some math processing again). At the end of the indexing process, the LDA 
algorithm is rerun, improving the topic model globally, thus again modifying the 
extra files.
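
To illustrate the "extra files" part of the scenario, here is a minimal sketch of writing document-specific topic vectors to a side file so that a reader sees either the old file or the complete new one. It is not tied to IndexWriter and is not how Lucene itself commits; the class TopicVectorFile, the file layout (row count, column count, then floats) and the ".tmp" suffix are assumptions made for this example. It also assumes all rows have the same length and that the file system supports atomic renames.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch: per-document topic vectors in a side file, published atomically. */
final class TopicVectorFile {

    /** Write to a temporary file first, then rename it into place. */
    static void publish(Path target, float[][] docTopics) throws IOException {
        int numTopics = docTopics.length == 0 ? 0 : docTopics[0].length;
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(tmp))) {
            out.writeInt(docTopics.length);
            out.writeInt(numTopics);
            for (float[] row : docTopics) {
                for (float v : row) {
                    out.writeFloat(v);
                }
            }
        }
        // Readers never observe a half-written target file.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    static float[][] read(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            int numDocs = in.readInt();
            int numTopics = in.readInt();
            float[][] docTopics = new float[numDocs][numTopics];
            for (float[] row : docTopics) {
                for (int k = 0; k < numTopics; k++) {
                    row[k] = in.readFloat();
                }
            }
            return docTopics;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("topics");
        Path file = dir.resolve("doc-topics.bin");
        publish(file, new float[][] {{0.7f, 0.3f}, {0.1f, 0.9f}});
        System.out.println(read(file)[1][1]); // 0.9
    }
}
```

The open question from the scenario remains: how to tie the lifetime of such a file to the segments it describes, so that merges and optimize() keep both in step.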


What may be a point of departure? Adding a modified TermVector-like storage 
approach and hooking it to extended Segment* classes?


Best regards

gregor






contrib/fast-vector-highlighter classes: package-private fields

2010-11-18 Thread Gregor Heinrich
Dear list -- I was wondering why in the fast-vector-highlighter some fields are 
set package-private and at the same time don't have accessor methods. Are 
subclasses supposed to be put in the same package then?


Example: Subclassing ScoreOrderFragmentsBuilder with a new method like this:

@Override
public List<WeightedFragInfo> getWeightedFragInfoList(List<WeightedFragInfo> src) {
    // The superclass implementation sorts the fragments by score.
    src = super.getWeightedFragInfoList(src);
    for (int i = 0; i < src.size(); i++) {
        // ??? every field is package-private in FieldFragList.WeightedFragInfo,
        // so the two assignments below only compile inside the same package
        WeightedFragInfo u = src.get(i);
        u.startOffset -= 20;
        u.endOffset += 20;
        src.set(i, u);
    }
    return src;
}

I'd vote for protected access for all those fields in the 11 classes where this 
issue applies. IMO, the package is worth having this extra flexibility. Then it 
really deserves its attribute "fast" also in terms of developing with it.
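
Accessor methods would serve the same purpose as protected fields. The following sketch shows the idea on a hypothetical stand-in class; FragInfo and MarginDemo are not Lucene classes, and the clamping of the start offset at 0 is an addition of this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: widening fragment offsets through accessors instead of field access. */
final class MarginDemo {

    /** Hypothetical stand-in for FieldFragList.WeightedFragInfo, with accessors. */
    static class FragInfo {
        private int startOffset;
        private int endOffset;

        FragInfo(int startOffset, int endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        public int getStartOffset() { return startOffset; }
        public int getEndOffset() { return endOffset; }
        public void setStartOffset(int startOffset) { this.startOffset = startOffset; }
        public void setEndOffset(int endOffset) { this.endOffset = endOffset; }
    }

    /** Grows every fragment by the given margin on both sides, in place. */
    static List<FragInfo> widen(List<FragInfo> src, int margin) {
        for (FragInfo f : src) {
            f.setStartOffset(Math.max(0, f.getStartOffset() - margin));
            f.setEndOffset(f.getEndOffset() + margin);
        }
        return src;
    }

    public static void main(String[] args) {
        List<FragInfo> frags = new ArrayList<>();
        frags.add(new FragInfo(10, 50));
        frags.add(new FragInfo(100, 140));
        widen(frags, 20);
        System.out.println(frags.get(0).getStartOffset() + ".." + frags.get(0).getEndOffset()); // 0..70
        System.out.println(frags.get(1).getStartOffset() + ".." + frags.get(1).getEndOffset()); // 80..160
    }
}
```

Either variant would let a subclass live in its own package.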


Best wishes

gregor




