ported lucandra: lucene index on HBase

2010-03-25 Thread Thomas Koch
Hi,

Lucandra stores a lucene index on cassandra:
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend

As the author of lucandra writes: "I'm sure something similar could be built 
on hbase."

So here it is:
http://github.com/thkoch2001/lucehbase

This is only a first prototype which has not been tested on anything real yet. 
But if you're interested, please join me to get it production ready!

I propose to keep this thread on hbase-user and java-dev only.
Would it make sense to aim for this project to become an hbase contrib? Or a 
lucene contrib?

Best regards,

Thomas Koch, http://www.koch.ro

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849639#action_12849639
 ] 

Michael McCandless commented on LUCENE-2215:


This is a neat collector!

I like the idea of chaining/filtering... couldn't we put this in core
(under TFC/TSDC.create), but instead of doubling the 12 specialized
(anonymous) impls we now have, just delegate?

Ie, we'd make a FilteredCollector, taking another collector when it's
created, and then on every collect call, only if the hit is weak
enough (ie is worse than what the app provided as prev low score/doc)
would it forward it to the delegate?  I guess we should test perf w/
(the new additions to benchmark -- yay!) to see if specializing the
code (even anonymously) is warranted.
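
The delegating idea above could be sketched roughly as follows. This is a minimal, self-contained illustration and not Lucene's Collector API: HitCollector, PagingFilterCollector and ListCollector are hypothetical names, and the tie-break on doc id (equal score, higher doc id counts as weaker) is an assumption about how the previous page was ordered.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a collector; NOT Lucene's Collector API. */
interface HitCollector {
    void collect(int doc, float score);
}

/** Trivial delegate that just remembers the doc ids it was given. */
final class ListCollector implements HitCollector {
    final List<Integer> docs = new ArrayList<>();

    @Override
    public void collect(int doc, float score) {
        docs.add(doc);
    }
}

/**
 * Forwards a hit to the delegate only if it ranks strictly after the last
 * hit of the previous page: lower score, or equal score and higher doc id
 * (the tie-break is an assumption).
 */
final class PagingFilterCollector implements HitCollector {
    private final HitCollector delegate;
    private final float prevLowScore;
    private final int prevLowDoc;

    PagingFilterCollector(HitCollector delegate, float prevLowScore, int prevLowDoc) {
        this.delegate = delegate;
        this.prevLowScore = prevLowScore;
        this.prevLowDoc = prevLowDoc;
    }

    @Override
    public void collect(int doc, float score) {
        boolean weaker = score < prevLowScore
            || (score == prevLowScore && doc > prevLowDoc);
        if (weaker) {
            delegate.collect(doc, score);
        }
    }
}
```

The extra comparison per hit lives only in the wrapper, so searches that do not page never pay for it; whether the added virtual call matters is what the benchmark would have to show.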

The indent whitespace needs to be fixed to 2 spaces...


 paging collector
 

 Key: LUCENE-2215
 URL: https://issues.apache.org/jira/browse/LUCENE-2215
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Search
Affects Versions: 2.4, 3.0
Reporter: Adam Heinz
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: IterablePaging.java, LUCENE-2215.patch, 
 PagingCollector.java, TestingPagingCollector.java


 http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
 Somebody assign this to Aaron McCurry and we'll see if we can get enough 
 votes on this issue to convince him to upload his patch.  :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-25 Thread Michael McCandless
On Mon, Mar 22, 2010 at 12:45 PM, Marvin Humphrey
mar...@rectangular.com wrote:
 On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:
 Also, will Lucy store the original stats?

 These?

   * Total number of tokens in the field.
   * Number of unique terms in the field.
   * Doc boost.
   * Field boost.

Also sum(tf).  Robert can generate more :)

 That would depend on which Similarity the user specs for that field.  In
 other words, it's just another data-reduction decision: if the Sim needs it,
 keep it, and if it doesn't, throw it away.

OK.

 Incidentally, what are you planning to do about field boost if it's not always
 1.0?  Are you going to store full 32-bit floats?

For starters, yes.  We may (later) want to make a new attr that sets
the #bits (levels/precision) you want... then uses packed ints to
encode.
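
A rough sketch of what "pick the #bits, then pack" could look like. This is not Lucene's packed ints implementation: PackedBoosts is a hypothetical class, and the fixed [min, max] range with linear quantization is an assumption made only for illustration. Quantizing loses precision; with b bits there are only 2^b distinct levels.

```java
/** Hypothetical sketch: quantize float boosts to a fixed number of bits and pack them into longs. */
final class PackedBoosts {
    private final long[] blocks;
    private final int bits;
    private final float min;
    private final float max;

    PackedBoosts(float[] boosts, int bits, float min, float max) {
        if (bits < 1 || bits > 31) throw new IllegalArgumentException("bits must be in 1..31");
        if (!(max > min)) throw new IllegalArgumentException("max must be greater than min");
        this.bits = bits;
        this.min = min;
        this.max = max;
        this.blocks = new long[(int) (((long) boosts.length * bits + 63) / 64)];
        for (int i = 0; i < boosts.length; i++) {
            set(i, quantize(boosts[i]));
        }
    }

    /** Highest level index, e.g. 15 for 4 bits. */
    private long maxLevel() {
        return (1L << bits) - 1;
    }

    /** Clamp to [min, max] and map linearly onto 0..maxLevel. */
    private long quantize(float v) {
        float clamped = Math.max(min, Math.min(max, v));
        return Math.round((double) (clamped - min) / (max - min) * maxLevel());
    }

    private void set(int index, long value) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= value << shift;
        int spill = shift + bits - 64; // bits that do not fit in this block
        if (spill > 0) {
            blocks[block + 1] |= value >>> (bits - spill);
        }
    }

    /** Returns the dequantized boost, i.e. the level's value, not the original float. */
    float get(int index) {
        long bitPos = (long) index * bits;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bits - 64;
        if (spill > 0) {
            value |= blocks[block + 1] << (bits - spill);
        }
        value &= maxLevel();
        return min + (max - min) * value / maxLevel();
    }
}
```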

 Ie so the chosen Sim can properly recompute all boost bytes (if it uses
 those), for scoring models that pivot based on avg's of these stats?

 Yes, we could support that.

 It's not high on my todo-list for core Lucy, though: poor payoff for all the
 complexity it would introduce, particularly file format complexity with its
 heavy backwards compatibility burden.  Right now, we only have the boost
 bytes, and the fact that they are used for length normalization, field boost,
 and doc boost is incidental.  If we add all the raw stats, that's a bunch of
 stuff we have to support for a long time, yet which doesn't yield practical
 advantages for us yet.

 I'd be much more interested in finding a way to support such a feature as an
 extension.

I was specifically asking if Lucy will allow the user to force true
average to be recomputed, ie, at commit time from the writer.  It's
more costly and often not needed (ie, once your index is large enough,
new docs typically won't shift the average much).  But I imagine
some users will want true average.
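
To put rough numbers on "won't shift the average much", here is a small arithmetic sketch. AverageDrift is a hypothetical helper, and the document counts and lengths in the note below are made-up inputs, not measurements.

```java
/** Hypothetical helper: how far does a batch of new docs move the average field length? */
final class AverageDrift {
    /** Average length after adding newDocs documents of length newLen to n docs averaging avg. */
    static double updatedAverage(long n, double avg, long newDocs, double newLen) {
        return (n * avg + newDocs * newLen) / (n + newDocs);
    }

    /** Relative change of the average, e.g. 0.004 means 0.4%. */
    static double relativeShift(long n, double avg, long newDocs, double newLen) {
        return Math.abs(updatedAverage(n, avg, newDocs, newLen) - avg) / avg;
    }
}
```

With these assumed inputs, adding 1,000 docs of length 500 to 1,000 docs averaging 100 moves the average to 300, while adding the same batch to 1,000,000 docs moves it by well under 1%.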

  In any case, the proposal to start delaying Sim choice to search-time -- 
  while
  a nice feature for Lucene -- is a non-starter for Lucy.   We can't do that
  because it would kill the cheap-Searcher model to generate boost bytes at
  Searcher construction time and cache them within the object.  We need those
  boost bytes written to disk so we can mmap them and share them amongst many
  cheap Searchers.

 It'd seem like Lucy could re-gen the boost bytes if a different Sim
 were selected, or, the current Sim hadn't yet computed & cached its
 bytes?  But then logically this means a reader needs write
 permission to the index dir, which is not good...

 Whatever's reading the boost bytes can't tell the difference between process
 RAM and mmap'd RAM, so write-permission on the index dir isn't required.

Hmm if you could somehow soften this... so that a custom Sim could
regen its boost bytes (if it needed to), write them into the index,
and then whoever's reading can mmap... that'd buy you some
flexibility back.

 What's trickier is that Schemas are not normally mutable, and that they are
 part of the index.  You don't have to supply an Analyzer, or a Similarity, or
 anything else when opening a Searcher -- you just provide the location of the
 index, and the Schema gets deserialized from the latest schema_NNN.json file.
 That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much
 a thing of the past for us.

That's nice... though... is it too rigid?  Do users even want to pick
a different analyzer at search time?

 But it makes your feature request of runtime settability for
 Similarity awkward to implement: by the time you have a Schema
 object to work with, the Searcher is already open.

  Searcher searcher = new Searcher("/path/to/index");
  Schema schema = searcher.getSchema();
  schema.setSim("content", altSim); // Too late, and not implemented anyway.

I see...

  To my mind, these are all related data reduction tasks:
 
   * Omit doc-boost and field-boost, replacing them with a single float
 docXfield multiplier -- because you never need doc-boost on its own.
   * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
 replacing them all with a single boost byte -- because for the kind of
 scoring you want to do, you don't need all those raw stats.
   * Omit the boost byte, because you don't need to do scoring at all.
   * Omit positions because you don't need PhraseQueries, etc. to match.

 I wouldn't group this one with the others -- I mean technically it is
 data reduction -- but omitting positions means certain queries
 (PhraseQuery) won't work even in match only searching.  Whereas the
 rest of these examples affect how scoring is done (or whether it's
 done).

 Couldn't disagree more.  Omitting positions is *exactly* the kind of data
 reduction task which we know is safe to perform when a user specifically tells
 us they don't need PhraseQueries by specifying a MinimalSimilarity.

Hmmm... it just seems to be different categories to me.  One category
prevents certain kinds of queries 

[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-2345:
--

Attachment: LUCENE-2345_3.0.patch

Here's a patch against 3.0 that provides the SegmentReaderFactory ability
(not tested yet, but i'll be doing that shortly as i integrate this 
functionality)

It adds a SegmentReaderFactory.

The IndexWriter now has a getter and setter for setting this

SegmentReader has a new protected method init() which is called after the 
segment reader has been initialized (to allow subclasses to hook this action 
and do additional initialization, etc.)

added 2 new IndexReader.open() calls that allow specifying the 
SegmentReaderFactory



 Make it possible to subclass SegmentReader
 --

 Key: LUCENE-2345
 URL: https://issues.apache.org/jira/browse/LUCENE-2345
 Project: Lucene - Java
  Issue Type: Wish
  Components: Index
Reporter: Tim Smith
 Fix For: 3.1

 Attachments: LUCENE-2345_3.0.patch


 I would like the ability to subclass SegmentReader for numerous reasons:
 * to capture initialization/close events
 * attach custom objects to an instance of a segment reader (caches, 
 statistics, so on and so forth)
 * override methods on segment reader as needed
 currently this isn't really possible
 I propose adding a SegmentReaderFactory that would allow creating custom 
 subclasses of SegmentReader
 default implementation would be something like:
 {code}
 public class SegmentReaderFactory {
   public SegmentReader get(boolean readOnly) {
     return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
   }
   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
     // default: create a fresh instance the same way get() does
     return get(readOnly);
   }
 }
 {code}
 It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
 (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
 etc)
 I could prepare a patch if others think this has merit
 Obviously, this API would be experimental/advanced/will change in future




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849728#action_12849728
 ] 

Shai Erera commented on LUCENE-2345:


bq. The IndexWriter now has a getter and setter for setting this

If this is not expected to change during the lifetime of IW, I think it should 
be added to IWC when you upgrade the patch to 3.1.




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849731#action_12849731
 ] 

Tim Smith commented on LUCENE-2345:
---

that was my plan




Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-25 Thread Marvin Humphrey
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
  Also, will Lucy store the original stats?
 
  These?
 
* Total number of tokens in the field.
* Number of unique terms in the field.
* Doc boost.
* Field boost.
 
 Also sum(tf).  Robert can generate more :)

Hmm, aren't "Total number of tokens in the field" and sum(tf) normally
equivalent?  I guess there might be analyzers for which that isn't true, e.g.
those which perform synonym-injection?

In any case, sum(tf) is probably a better definition, because it makes no
ancillary claims...
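
A toy illustration of how the two numbers can diverge under synonym injection. Everything here is hypothetical: Tok and FieldStats are made-up names, and treating "number of tokens" as "number of positions occupied" is only one possible reading of that stat.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical token: a term plus its position increment (0 = stacked on the previous token). */
final class Tok {
    final String term;
    final int posInc;

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

final class FieldStats {
    /** sum(tf): add up the per-term frequencies, which counts every emitted token. */
    static int sumTf(List<Tok> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (Tok t : tokens) {
            tf.merge(t.term, 1, Integer::sum);
        }
        int sum = 0;
        for (int f : tf.values()) {
            sum += f;
        }
        return sum;
    }

    /** Counts only tokens that advance the position, so injected synonyms are skipped. */
    static int positionsOccupied(List<Tok> tokens) {
        int n = 0;
        for (Tok t : tokens) {
            if (t.posInc > 0) {
                n++;
            }
        }
        return n;
    }

    /** "quick fox" with the synonym "fast" injected at the same position as "quick". */
    static List<Tok> example() {
        return Arrays.asList(new Tok("quick", 1), new Tok("fast", 0), new Tok("fox", 1));
    }
}
```

For the example stream, sum(tf) is 3 while only 2 positions are occupied; without the injected synonym both would be 2.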

  Incidentally, what are you planning to do about field boost if it's not 
  always
  1.0?  Are you going to store full 32-bit floats?
 
 For starters, yes.  

OK, how are those going to be encoded?  IEEE 754?  Big-endian?

http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness
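
For reference, one possible answer is sketched below: IEEE 754 single precision written most significant byte first, which is also what java.nio.ByteBuffer does by default. BoostCodec is a hypothetical name, and nothing here says which encoding Lucene or Lucy would actually pick.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical codec: a float boost as 4 bytes, IEEE 754, big-endian. */
final class BoostCodec {
    static byte[] encode(float boost) {
        return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putFloat(boost).array();
    }

    static float decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getFloat();
    }
}
```

With this layout 1.0f is stored as the bytes 3F 80 00 00.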

 We may (later) want to make a new attr that sets
 the #bits (levels/precision) you want... then uses packed ints to
 encode.

I'm concerned that the bit-wise entropy of floats may make them a poor match
for compression via packed ints.  We'll probably get a compressed
representation which is larger than the original.

Are there any standard algorithms out there for compressing IEEE 754 floats?
RLE works, but only with certain data patterns.

... [ time passes ] ...

Hmm, maybe not:


http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

 I was specifically asking if Lucy will allow the user to force true
 average to be recomputed, ie, at commit time from the writer. 

That's theoretically possible.  We'd have to implement the reader the same way
we have DeletionsReader -- the most recent segment may contain data which
applies to older segments.  

Here's the DeletionsReader code, which searches backwards through the segments
looking for a particular file:

/* Start with deletions files in the most recently added segments and work
 * backwards.  The first one we find which addresses our segment is the
 * one we need. */
for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
    Segment *other_seg = (Segment*)VA_Fetch(segments, i);
    Hash *metadata 
        = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
    if (metadata) {
        Hash *files = (Hash*)CERTIFY(
            Hash_Fetch_Str(metadata, "files", 5), HASH);
        Hash *seg_files_data 
            = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
        if (seg_files_data) {
            Obj *count = (Obj*)CERTIFY(
                Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
            del_count = (i32_t)Obj_To_I64(count);
            del_file  = (CharBuf*)CERTIFY(
                Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
            break;
        }
    }
}

What we'd do is write the regenerated boost bytes for *all* segments to the
most recent segment.  It would be roughly analogous to building up an NRT
reader.

  What's trickier is that Schemas are not normally mutable, and that they are
  part of the index.  You don't have to supply an Analyzer, or a Similarity, 
  or
  anything else when opening a Searcher -- you just provide the location of 
  the
  index, and the Schema gets deserialized from the latest schema_NNN.json 
  file.
  That has many advantages, e.g. inadvertent Analyzer conflicts are pretty 
  much
  a thing of the past for us.
 
 That's nice... though... is it too rigid?  Do users even want to pick
 a different analyzer at search time?

It's not common.  

To my mind, the way a field is tokenized is part of its field definition, thus
the Analyzer is part of the field definition, thus the analyzer is part of the
schema and needs to be stored with the index.

Still, we support different Analyzers at search time by way of QueryParser.
QueryParser's constructor requires a Schema, but also accepts an optional
Analyzer which if supplied will be used instead of the Analyzers from the
Schema.

  Maybe aggressive automatic data-reduction makes more sense in the context of
  flexible matching, which is more expansive than flexible scoring?
 
 I think so.  Maybe it shouldn't be called a Similarity (which to me
 (though, carrying a heavy curse of knowledge burden...) means
 scoring)?  Matcher?

Heh.  Matcher is taken.  It's a crucial class, too, roughly combining the
roles of Lucene's Scorer and DocIDSetIterator.

The first alternative that comes to mind is Relevance, because not only can
one thing's relevance to another be continuously variable (i.e. score), it can
also be binary: relevant/not-relevant (i.e. match).

But I don't see why Relevance, Matcher, or anything else would be so much
better than Similarity.  I think this is your hang up.  ;) 

  I'm +0 (FWIW) on search-time Sim settability for Lucene.  It's a nice 
  feature,
  but I don't think we've worked out all the problems yet.  If we can, I might
  switch to +1 (FWIW).
 
 What 

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849806#action_12849806
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Michael, I'm guessing this patch needs to be updated as per LUCENE-2329?  

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1

 Attachments: lucene-2324-no-pooling.patch


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.
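
A toy sketch of the "N fully independent RAM segments" idea summarized above. It is not Lucene's DocumentsWriter: TinyIndexWriter and RamSegment are hypothetical names, strings stand in for documents, and plain lists stand in for flushed segments. Each indexing thread gets its own buffer and flushes it on its own, so threads never contend on a shared buffer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical writer: one private in-RAM segment per indexing thread. */
final class TinyIndexWriter {
    private final int maxBufferedDocs;
    private final List<List<String>> flushed =
        Collections.synchronizedList(new ArrayList<>());
    private final Queue<RamSegment> all = new ConcurrentLinkedQueue<>();
    private final ThreadLocal<RamSegment> local;

    TinyIndexWriter(int maxBufferedDocs) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.local = ThreadLocal.withInitial(() -> {
            RamSegment s = new RamSegment();
            all.add(s); // remembered so close() can flush leftovers
            return s;
        });
    }

    /** Adds the doc to the calling thread's private segment. */
    void addDocument(String doc) {
        local.get().add(doc);
    }

    /** Flushes every partially filled segment; call after indexing threads are done. */
    List<List<String>> close() {
        for (RamSegment s : all) {
            s.flush();
        }
        return new ArrayList<>(flushed);
    }

    private final class RamSegment {
        private List<String> docs = new ArrayList<>();

        // The lock is per segment, so it is uncontended while indexing.
        synchronized void add(String doc) {
            docs.add(doc);
            if (docs.size() >= maxBufferedDocs) {
                flush();
            }
        }

        synchronized void flush() {
            if (!docs.isEmpty()) {
                flushed.add(docs); // each flush yields its own small segment
                docs = new ArrayList<>();
            }
        }
    }
}
```

In this sketch the small flushed segments would then be combined by ordinary segment merging, which is left out here.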




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849808#action_12849808
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Actually, I just browsed the patch again, I don't think it implements private 
doc writers as of yet?  

I think you're right, we can get this issue completed.  LUCENE-2312's path 
looks clear at this point.  Shall I take a whack at it?




[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: (was: lucene-2324-no-pooling.patch)




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849819#action_12849819
 ] 

Michael Busch commented on LUCENE-2324:
---

Hey Jason,

Disregard my patch here.  I just experimented with removal of pooling, but then 
did LUCENE-2329 instead.  TermsHash and TermsHashPerThread are now much 
simpler, because all the pooling code is gone after 2329 was committed.  Should 
make it a little easier to get this patch done.

Sure, it'd be awesome if you could provide a patch here.  I can help you; we 
should just post patches here frequently so that we don't both work on the same 
areas.






[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849843#action_12849843
 ] 

Grant Ingersoll commented on LUCENE-2215:
-

Mike, don't you think, though, that through a fairly simple update of some of 
the clauses to appropriately short-circuit things, we can just hook this into 
the existing collectors w/o any need for delegation or changes?  Let me try 
a patch.  Now that the benchmark stuff is in, we should be able to test.





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849844#action_12849844
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Michael, I'm working on a patch and will post one (hopefully) shortly.




[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849851#action_12849851
 ] 

Uwe Schindler commented on LUCENE-2215:
---

Hey, and I want to fix the NaN thing in TSDC: LUCENE-2271

Maybe when we delegate, we can also use my cool code that switches the delegate 
to remove one comparison after the queue is full.




[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849863#action_12849863
 ] 

Michael McCandless commented on LUCENE-2215:


bq. ...through a fairly simple update of some of the clauses to appropriately 
short-circuit things, we can just hook this into the existing collectors 
w/o any need for delegation or changes? Let me try a patch. Now that the 
benchmark stuff is in, we should be able to test.

This'd make me nervous...

Ie I don't think we should insert bytecodes for the 99.9% of searches that 
wouldn't make use of this, even if we can't uncover a slowdown with 
benchmarking.

We should still benchmark it though (I'm curious)... we should also benchmark 
the delegate solution.




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849899#action_12849899
 ] 

Michael Busch commented on LUCENE-2324:
---

Awesome!

 Per thread DocumentsWriters that write their own private segments
 -

 Key: LUCENE-2324
 URL: https://issues.apache.org/jira/browse/LUCENE-2324
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


 See LUCENE-2293 for motivation and more details.
 I'm copying here Mike's summary he posted on 2293:
 Change the approach for how we buffer in RAM to a more isolated
 approach, whereby IW has N fully independent RAM segments
 in-process and when a doc needs to be indexed it's added to one of
 them. Each segment would also write its own doc stores and
 normal segment merging (not the inefficient merge we now do on
 flush) would merge them. This should be a good simplification in
 the chain (eg maybe we can remove the *PerThread classes). The
 segments can flush independently, letting us make much better
 concurrent use of IO & CPU.




[jira] Created: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search

2010-03-25 Thread Michael Busch (JIRA)
Explore other in-memory postinglist formats for realtime search
---

 Key: LUCENE-2346
 URL: https://issues.apache.org/jira/browse/LUCENE-2346
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


The current in-memory posting list format might not be optimal for searching. 
VInt decoding performance and the lack of skip lists would arguably be the 
biggest bottlenecks.

For LUCENE-2312 we should investigate other formats.

Some ideas:
- PFOR or packed ints for posting slices?
- Maybe even int[] slices instead of byte slices? This would be great for 
search performance, but the additional memory overhead might not be acceptable.
- For realtime search it's usually desirable to evaluate the most recent 
documents first.  So using backward pointers instead of forward pointers and 
having the postinglist pointer point to the most recent docID in a list is 
something to consider.
- Skipping: if we use fixed-length postings ([packed] ints) we can do binary 
search within a slice.  We can then also locate a pointer without scanning and 
thus skip entire slices quickly.  Is that sufficient or would we need more 
skipping layers, so that it's possible to skip directly to particular slices?
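
The skipping idea in the last bullet can be sketched as follows. This is only an illustration under stated assumptions, not code from Lucene: the class name IntSliceSkipper and both method names are hypothetical, and a slice is modeled as a plain int[] holding ascending docIDs with a fill length.

```java
/** Hypothetical sketch: skipping inside a slice of fixed-width (int) postings. */
final class IntSliceSkipper {
  private IntSliceSkipper() {}

  /**
   * Returns the index of the first docID >= target in slice[0..length),
   * or length if every docID in the slice is smaller than target.
   * Assumes docIDs are stored in ascending order.
   */
  static int advance(int[] slice, int length, int target) {
    int lo = 0;
    int hi = length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow on large indexes
      if (slice[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** A whole slice can be skipped without scanning when its last docID is below target. */
  static boolean canSkipSlice(int[] slice, int length, int target) {
    return length == 0 || slice[length - 1] < target;
  }
}
```

Because every posting has the same width, the last docID of a slice is directly addressable, so checking it is enough to decide whether the slice can be skipped. Whether that is sufficient without extra skipping layers is the open question raised above.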


It would be awesome to find a format that doesn't slow down normal indexing, 
but is very efficient for in-memory searches.  If we can't find such a fits-all 
format, we should have a separate indexing chain for real-time indexing.




[jira] Created: (LUCENE-2347) Dump WordNet to SOLR Synonym format

2010-03-25 Thread Bill Bell (JIRA)
Dump WordNet to SOLR Synonym format
---

 Key: LUCENE-2347
 URL: https://issues.apache.org/jira/browse/LUCENE-2347
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0.1
Reporter: Bill Bell


This enhancement dumps WordNet v2 to the SOLR synonym format, so that all of 
your synonyms can be loaded easily.

1. You can load all synonyms from WordNet V2 
(http://wordnetcode.princeton.edu/2.0/) into SOLR by starting from the 
Syns2Index program:
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/Syns2Index.html

Get WNprolog from http://wordnetcode.princeton.edu/2.0/

2. We modified this program to work with SOLR (see attached) on 
amidev.kaango.com in /vol/src/lucene/contrib/wordnet:
vi /vol/src/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Solr.java

3. Run ant

4. java -classpath /vol/src/lucene/build/contrib/wordnet/lucene-wordnet-3.1-dev.jar org.apache.lucene.wordnet.Syns2Solr prolog/wn_s.pl solr > index_synonyms.txt
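
For readers without the attachment, the conversion can be pictured with the sketch below. It is not the attached Syns2Solr.java, which is not reproduced in this message. The class name PrologSynonymSketch is hypothetical, and the code assumes that wn_s.pl consists of facts of the form s(synset_id,w_num,'word',ss_type,sense_number,tag_count). and that the target is the comma-separated, one-group-per-line synonym format. Words that themselves contain commas are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: group WordNet prolog "s" facts by synset, one line per group. */
final class PrologSynonymSketch {
  // Assumed fact layout, e.g.: s(100001740,1,'entity',n,1,11).
  private static final Pattern S_FACT =
      Pattern.compile("^s\\((\\d+),\\d+,'(.*)',[a-z],\\d+,\\d+\\)\\.$");

  private PrologSynonymSketch() {}

  static List<String> toSynonymLines(List<String> prologLines) {
    // Keep synsets in first-seen order so the output is deterministic.
    Map<String, List<String>> bySynset = new LinkedHashMap<>();
    for (String line : prologLines) {
      Matcher m = S_FACT.matcher(line.trim());
      if (!m.matches()) {
        continue; // ignore anything that is not an s(...) fact
      }
      String word = m.group(2).replace("''", "'"); // prolog doubles embedded quotes
      List<String> words = bySynset.get(m.group(1));
      if (words == null) {
        words = new ArrayList<>();
        bySynset.put(m.group(1), words);
      }
      if (!words.contains(word)) {
        words.add(word);
      }
    }
    List<String> out = new ArrayList<>();
    for (List<String> words : bySynset.values()) {
      if (words.size() > 1) { // a synset with a single word carries no synonym
        out.add(String.join(",", words));
      }
    }
    return out;
  }
}
```

Each output line lists the words of one synset, so a synset containing 'car' and 'auto' becomes the line car,auto.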




[jira] Updated: (LUCENE-2347) Dump WordNet to SOLR Synonym format

2010-03-25 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated LUCENE-2347:
--

Attachment: Syns2Solr.java




[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849961#action_12849961
 ] 

Grant Ingersoll commented on LUCENE-2215:
-

Yeah, but one could make the argument, Mike, that the existing optimizations 
are useless for the most common case, since I think it's safe to say most 
applications implement paging.  Of course, that being said, most users don't 
page all that deeply.  Also, for something like Solr that prefetches the top 50 
it might not be good, either.  Still, in my mind it is one additional boolean 
check, as in:
{code}
if ( (current stuff) || (pagingInfoPresent == true && paging check) )
...
{code}

pagingInfoPresent can be determined at construction time and that whole clause 
would be short circuited very quickly.

That being said, delegation could be done at construction time, too and more 
cleanly separates things.  I'll try to put up my version tomorrow.
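
For comparison, the delegation variant discussed in this thread might look roughly like the sketch below. It is an illustration only: HitCollector is a simplified, hypothetical stand-in and not Lucene's Collector API, PagingFilterCollector and RecordingCollector are made-up names, and the tie-break (equal score, larger doc ID ranks later) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a collector; not Lucene's Collector API. */
interface HitCollector {
  void collect(int doc, float score);
}

/** Forwards to the delegate only hits that rank after the previous page's last hit. */
final class PagingFilterCollector implements HitCollector {
  private final HitCollector delegate;
  private final int prevDoc;
  private final float prevScore;

  PagingFilterCollector(HitCollector delegate, int prevDoc, float prevScore) {
    this.delegate = delegate;
    this.prevDoc = prevDoc;
    this.prevScore = prevScore;
  }

  @Override
  public void collect(int doc, float score) {
    // Assumed ordering: lower score is weaker; on equal score, larger doc ID is weaker.
    if (score < prevScore || (score == prevScore && doc > prevDoc)) {
      delegate.collect(doc, score);
    }
  }
}

/** Trivial delegate that records the forwarded doc IDs, used only to demonstrate the filter. */
final class RecordingCollector implements HitCollector {
  final List<Integer> docs = new ArrayList<>();

  @Override
  public void collect(int doc, float score) {
    docs.add(doc);
  }
}
```

The wrapper is chosen once, at construction time, so searches that do not page deeply never execute the extra comparison. The cost is one additional call per hit for those that do, which is what the benchmarks would need to measure.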




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849965#action_12849965
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

I'm a little confused by the flushedDocCount and remap-deletes conversion portions 
of DocWriter.  flushedDocCount is used as a global counter; however, when we 
move to per thread doc writers, it won't be global anymore.  Is there a 
different (easier) way to perform remap deletes?




[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850002#action_12850002
 ] 

Shai Erera commented on LUCENE-2215:


bq. since I think it's safe to say most applications implement paging

Let's be careful about the semantics here, Grant. Most if not all applications 
implement paging indeed, but I believe only a FEW actually store user contexts 
between searches. PagingCollector relies on the application to store the lowest 
ranking doc that was returned previously, which means storing context between 
a user's searches.

I agree w/ Mike's statement that 99.9% of the searches would never run that 
code, which is why I've proposed a delegation/wrapper approach from the 
beginning. I also think that we should make some allowances here and there for 
the non-common case, and prefer better software design over specialized 
code. A Collector filter approach for some rare (or even less common) cases 
seems very reasonable to me.

Also, I think that if we add to TSDC a create method which takes into account 
the previously scored lowest doc, it will confuse people. Now they will need to 
think "where do I get this low score from?" But perhaps after I see the code, 
it won't seem such a bad thing. I just have a feeling TSDC and TFC should be 
left on their own, and extreme paging stuff should either be its own 
specialized collector, or a wrapper.




[jira] Created: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers

2010-03-25 Thread Trejkaz (JIRA)
DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment 
readers
-

 Key: LUCENE-2348
 URL: https://issues.apache.org/jira/browse/LUCENE-2348
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9.2
Reporter: Trejkaz


DuplicateFilter currently works by building a single doc ID set, without taking 
into account that getDocIdSet() will be called once per segment and only with 
each segment's local reader.
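
The mismatch between a single index-wide doc ID set and per-segment calls can be pictured with the sketch below. It is neither the DuplicateFilter code nor a proposed fix: the class PerSegmentDocIds and the parameters docBase and maxDoc are hypothetical names, and java.util.BitSet stands in for whatever doc ID set the filter builds.

```java
import java.util.BitSet;

/** Hypothetical sketch: translating index-wide doc IDs into segment-local ones. */
final class PerSegmentDocIds {
  private PerSegmentDocIds() {}

  /**
   * Returns the bits of globalBits that fall inside [docBase, docBase + maxDoc),
   * shifted so that the segment's first document has ID 0.
   */
  static BitSet localBits(BitSet globalBits, int docBase, int maxDoc) {
    BitSet local = new BitSet(maxDoc);
    int end = docBase + maxDoc;
    for (int g = globalBits.nextSetBit(docBase); g >= 0 && g < end;
        g = globalBits.nextSetBit(g + 1)) {
      local.set(g - docBase); // segment readers number their docs from 0
    }
    return local;
  }
}
```

A set built once for the whole index and returned unchanged for every segment skips this translation, so its doc IDs would not line up with the IDs each segment reader expects.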





[jira] Updated: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers

2010-03-25 Thread Trejkaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trejkaz updated LUCENE-2348:


Component/s: (was: Search)
 contrib/*

Changing to contrib, only just realised it was in that location...





[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850012#action_12850012
 ] 

Robert Muir commented on LUCENE-2323:
-

Committed 927696 (and 927697 for the solr piece).

Will keep the issue open and work on a patch for the next part.

 reorganize contrib modules
 --

 Key: LUCENE-2323
 URL: https://issues.apache.org/jira/browse/LUCENE-2323
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Robert Muir
Assignee: Robert Muir
 Attachments: LUCENE-2323.patch


 It would be nice to reorganize contrib modules, so that they are bundled 
 together by functionality.
 For example:
 * the wikipedia contrib is a tokenizer; I think it really belongs in 
 contrib/analyzers
 * there are two highlighters; I think they could be one highlighters package.
 * there are many queryparsers and queries in different places in contrib
