ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
Due to the odd behaviour of a custom Scorer of mine I discovered that 
ConjunctionScorer.doNext() could loop indefinitely.
It does not bail out as soon as any scorer.advance() call it makes reports back 
"NO_MORE_DOCS". Is there not a performance optimisation to be gained by exiting 
as soon as this happens?
At this stage I cannot see any point in continuing to advance the other scorers - 
a quick look at TermScorer suggests that any questionable calls made by 
ConjunctionScorer to advance to NO_MORE_DOCS receive no special treatment, and 
disk will be hit as a consequence.
I added an extra condition to the while loop in the 3.5 source:

    while ((doc != NO_MORE_DOCS) && ((firstScorer = scorers[first]).docID() < doc)) {

and the JUnit tests passed. I haven't been able to benchmark the performance 
improvement, but it looks like it would be sensible to make the change anyway.
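[Editor's note: for clarity, here is a self-contained sketch of the leapfrog loop with the proposed extra `doc != NO_MORE_DOCS` check. This is not Lucene's actual code - `FakeScorer`, `ConjunctionSketch`, `doNext` and `conjunction` are invented stand-ins that replay the same logic over in-memory sorted doc lists (assumed non-empty).]

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a DocIdSetIterator positioned over a sorted doc-id list.
class FakeScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1;

    FakeScorer(int... docs) { this.docs = docs; }

    int docID() {
        if (pos < 0) return -1;
        return pos >= docs.length ? NO_MORE_DOCS : docs[pos];
    }

    // Advance to the first doc >= target (callers only pass target > docID()).
    int advance(int target) {
        while (++pos < docs.length && docs[pos] < target) {}
        return docID();
    }
}

class ConjunctionSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Leapfrog all scorers to the next doc they agree on, with the proposed
    // extra "doc != NO_MORE_DOCS" bail-out so no scorer is asked to advance
    // again once any one of them is exhausted.
    static int doNext(FakeScorer[] scorers) {
        int first = 0;
        int doc = scorers[scorers.length - 1].docID();
        FakeScorer firstScorer;
        while (doc != NO_MORE_DOCS && (firstScorer = scorers[first]).docID() < doc) {
            doc = firstScorer.advance(doc);
            first = (first == scorers.length - 1) ? 0 : first + 1;
        }
        return doc;
    }

    // Drive the conjunction over in-memory doc lists (each assumed non-empty)
    // and collect every doc present in all of them.
    static List<Integer> conjunction(int[]... docLists) {
        FakeScorer[] scorers = new FakeScorer[docLists.length];
        for (int i = 0; i < docLists.length; i++) {
            scorers[i] = new FakeScorer(docLists[i]);
            scorers[i].advance(0); // position each scorer on its first doc
        }
        List<Integer> hits = new ArrayList<>();
        int doc = doNext(scorers);
        while (doc != NO_MORE_DOCS) {
            hits.add(doc);
            scorers[scorers.length - 1].advance(doc + 1); // move past the match
            doc = doNext(scorers);
        }
        return hits;
    }
}
```

With the extra condition, the second example below returns immediately after the first scorer reports exhaustion instead of asking the remaining scorers to advance(NO_MORE_DOCS).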

Cheers,
Mark

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
I got round to some benchmarking of this change on Wikipedia content which 
shows a small improvement:   http://goo.gl/60wJG

Aside from the small performance gain to be had, it just feels more logical if 
ConjunctionScorer does not ask its sub-scorers to advance to "NO_MORE_DOCS".







Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
I would have assumed the many int comparisons would cost less than the 
superfluous disk accesses? (I bow to your considerable experience in this area!)
What is the worst-case scenario on added disk reads? Could it be as bad 
as numberOfSegments x numberOfOtherscorers before the query winds up?
On the index I tried, it looked like an improvement - the spreadsheet I linked 
to has the source for the benchmark on a second worksheet if you want to give 
it a whirl on a different dataset.



- Original Message -
From: Michael McCandless 
To: dev@lucene.apache.org; mark harwood 
Cc: 
Sent: Thursday, 1 March 2012, 13:31
Subject: Re: ConjunctionScorer.doNext() overstays?

Hmm, the tradeoff is an added per-hit check (doc != NO_MORE_DOCS), vs
the one-time cost at the end of calling advance(NO_MORE_DOCS) for each
sub-clause?  I think in general this isn't a good tradeoff?

Ie what about the case where we AND high-freq, and similarly freq'd,
terms together?  Then, the per-hit check will at some point dominate?

It's valid to pass NO_MORE_DOCS to DocsEnum.advance.

Mike McCandless

http://blog.mikemccandless.com




Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread mark harwood
Fair points.
I've tried several differently sized indexes and blends of query term frequencies now, and 
the results swing only marginally between the two implementations.
Sometimes the "exiting early" logic is marginally faster and other times 
marginally slower. Using a larger index seemed to reduce the improvement I had 
seen in my initial results.

So overall, not a clear improvement and not worth bothering with because, as 
you suggest, various disk caching strategies probably mitigate the cost of the 
added reads.

Based on your comments re the added int comparison cost in that "hot" loop, it 
made me think the abstract DocIdSetIterator.docID() method call could be 
questioned on that basis too.
It looks like all DocIdSetIterator subclasses maintain a doc variable, mutated 
elsewhere in advance() and nextDoc() calls, and docID() is meant to be idempotent, 
so presumably a shared variable in the base class could avoid a docID() method 
invocation?
Anyhoo, the profiler did not show that method up as any sort of hotspot so I 
don't think it's an issue.
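[Editor's note: the idea above could be sketched as follows. This is a hypothetical illustration, not Lucene's actual API - `SketchDocIdSetIterator` and `IntArrayIterator` are invented names; the point is that the base class owns the current-doc field so docID() needs no virtual dispatch.]

```java
// Hypothetical sketch: instead of every subclass implementing an abstract
// docID(), the base class owns the current-doc field and exposes it via a
// final accessor, avoiding a virtual method call per position check.
abstract class SketchDocIdSetIterator {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    protected int doc = -1; // subclasses mutate this in nextDoc()/advance()

    // Final and non-abstract: no dispatch needed to learn the position.
    final int docID() { return doc; }

    abstract int nextDoc();
    abstract int advance(int target);
}

// Toy subclass iterating a sorted int array, keeping `doc` up to date.
class IntArrayIterator extends SketchDocIdSetIterator {
    private final int[] docs;
    private int pos = -1;

    IntArrayIterator(int... docs) { this.docs = docs; }

    @Override int nextDoc() {
        return doc = (++pos < docs.length) ? docs[pos] : NO_MORE_DOCS;
    }

    @Override int advance(int target) {
        while (++pos < docs.length && docs[pos] < target) {}
        return doc = (pos < docs.length) ? docs[pos] : NO_MORE_DOCS;
    }
}
```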


Thanks, Mike.




- Original Message -
From: Michael McCandless 
To: dev@lucene.apache.org; mark harwood 
Cc: 
Sent: Thursday, 1 March 2012, 14:18
Subject: Re: ConjunctionScorer.doNext() overstays?

On Thu, Mar 1, 2012 at 8:49 AM, mark harwood  wrote:
> I would have assumed the many int comparisons would cost less than the 
> superfluous disk accesses? (I bow to your considerable experience in this 
> area!)
> What is the worst-case scenario on added disk reads? Could it be as bad 
> as numberOfSegments x numberOfOtherscorers before the query winds up?

Well, it depends -- the disk access is a one-time thing but the added
per-hit check is per-hit.  At some point it'll cross over...

I think likely the advance(NO_MORE_DOCS) will not usually hit disk:
our skipper impl fully pre-buffers (in RAM) the top skip lists I
think?  Even if we do go to disk it's likely the OS pre-cached those
bytes in its IO buffer.

> On the index I tried, it looked like an improvement - the spreadsheet I 
> linked to has the source for the benchmark on a second worksheet if you want 
> to give it a whirl on a different dataset.

Maybe try it on a more balanced case?  Ie, N high-freq terms whose
freq is "close-ish"?  And on slow queries (I think the results in your
spreadsheet are very fast queries right?  The slowest one was ~0.95
msec per query, if I'm reading it right?).

In general I think not slowing down the worst-case queries is much
more important that speeding up the super-fast queries.

Mike




Re: ConjunctionScorer.doNext() overstays?

2012-03-01 Thread Mark Harwood

> Ideally, consumers of DISI should hold onto the int docID returned
> from next/advance and use that... (ie, don't call docID() again,
> unless it's too hard to hold onto the returned doc).
> 

Yes, I remember raising that way back when: 
https://issues.apache.org/jira/browse/LUCENE-584?focusedCommentId=12564415&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12564415

Back then Mike B raised the issue of backwards compatibility, so I don't know if 
the 4.0 release presents the opportunity to revisit that idea.






Re: GSOC 2012?

2012-03-02 Thread mark harwood
>>Does anyone have any ideas?

A framework for match metadata?

Similar to the way tokenization was changed to allow tokenizers to enrich a 
stream of tokens with arbitrary "attributes", Scorers could provide 
"MatchAttributes" to supply arbitrary metadata about the stream of matches 
they produce.
The same model is used - callers decide in advance which attribute decorations they 
want to consume, and Scorers modify a singleton object which can be cloned if 
multiple attributes need to be retained by the caller.

This helps support highlighting and explain, and enables communication of added 
information between query objects in the tree.
LUCENE-1999 was an example of a horrible workaround where additional match 
information was smuggled through by bit-twiddling the score - 
this is because the score is the only bit of match context we currently pass in 
Lucene APIs.
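[Editor's note: the proposed model might look something like the sketch below, mirroring how TokenStream attributes work. All names here - `MatchAttribute`, `MatchPositionsAttribute`, `MatchAttributeSource` - are invented for illustration; no such API exists.]

```java
import java.util.HashMap;
import java.util.Map;

// Marker for a piece of per-match metadata a Scorer can populate.
interface MatchAttribute extends Cloneable {}

// Example attribute: which positions contributed to the current match.
class MatchPositionsAttribute implements MatchAttribute {
    int start, end;

    @Override public MatchPositionsAttribute clone() {
        MatchPositionsAttribute copy = new MatchPositionsAttribute();
        copy.start = start;
        copy.end = end;
        return copy;
    }
}

// Holds one shared instance per attribute class, TokenStream-style: the
// caller declares up front which decorations it wants, and the scorer then
// mutates that single instance as it moves from match to match.
class MatchAttributeSource {
    private final Map<Class<? extends MatchAttribute>, MatchAttribute> attrs = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T extends MatchAttribute> T addAttribute(Class<T> clazz) {
        return (T) attrs.computeIfAbsent(clazz, c -> {
            try {
                return c.getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                throw new IllegalArgumentException(e);
            }
        });
    }

    // Lets a scorer skip populating metadata nobody asked for.
    boolean hasAttribute(Class<? extends MatchAttribute> clazz) {
        return attrs.containsKey(clazz);
    }
}
```

In this model a highlighter would clone() the attribute whenever it needs to retain match data across hits, while the scorer keeps mutating the one shared instance.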

Cheers
Mark





From: Robert Muir 
To: dev@lucene.apache.org 
Sent: Friday, 2 March 2012, 10:30
Subject: GSOC 2012?

Hello,

I was asked by a student if we are participating in GSOC this year. I
hope the answer is yes?

If we are planning to, I think it would be good if we came up with a
list on the wiki of potential tasks. Does anyone have any ideas?

One suggested idea I had (similar to LUCENE-2959 last year) would be
to add a flexible query expansion framework.

-- 
lucidimagination.com




Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread mark harwood
I've been spending quite a bit of time recently benchmarking various Key-Value 
stores for a demanding project and have been largely disappointed with the results.
However, I have developed a promising implementation based on these concepts:  
http://www.slideshare.net/MarkHarwood/lucene-kvstore

The code needs some packaging before I can release it but the slide deck should 
give a good overview of the design.


Is this something that is likely to be of interest as a contrib module here?
I appreciate this is a departure from the regular search focus but it builds on 
some common ground in Lucene core and may have some applications here.

Cheers,
Mark





Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread Mark Harwood


> Mark, can you share more on what K-V (NoSQL) stores have you've been 
> benchmarking and what have been the results?
> 

Mongo, Cassandra, Krati, BDB, a Java version of Bitcask, Lucene, MySQL.

I was interested in benchmarking the single-server stores rather than a 
distributed setup because your choice of store could be plugged into the likes 
of Voldemort for scale out. 

The design is similar to the Bitcask paper but keeps only hashes of keys in RAM, 
not the full keys. 

My implementation was the only store that didn't degrade noticeably as you get 
into 10s of millions of keys in the store. 





> Did you try all the well known ones?
> http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
> 
> -- J
> 


Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-22 Thread Mark Harwood
> 
> Random question: Do you basically end up with something very similar to 
> LevelDB that many people where talking about a few weeks ago ? 
> 


Haven't looked at LevelDB because I was concentrating on Java implementations.

Riak's Bitcask is the most similar in principle but I didn't like the idea of 
holding keys in RAM. See  http://downloads.basho.com/papers/bitcask-intro.pdf








Re: Proposal - a high performance Key-Value store based on Lucene APIs/concepts

2012-03-24 Thread Mark Harwood
OK I have some code and benchmarks for this solution up on a Google Code 
project here: http://code.google.com/p/graphdb-load-tester/

The project exists to address the performance challenges I have encountered 
when dealing with large graphs. It uses all of the Wikipedia links as a test 
dataset and a choice of graph databases (most of which use Lucene, BTW).
The test data is essentially 130 million edges representing links between pages, 
e.g. Communism->Russia.
To load the data, all of the graph databases have to translate user-defined keys 
like "Russia" into an internally generated node ID using a service that looks 
like this: 
    interface KeyService {
        // Returns the existing node id, or -1 if it is not already in the store
        public long getGraphNodeId(String udk);

        // Adds a new record - assumption is the client has checked the
        // user-defined key (udk) is not stored already using getGraphNodeId
        public void put(String udk, long graphNodeId);
    }

This is a challenge on a dataset of this size. I tried using a Lucene-based 
implementation for this service with the following optimisations:
1) a Bloom filter to quickly "know what we don't know"
2) an LRU cache to hold on to commonly referenced vertices, e.g. the Wikipedia 
article for "United States"
3) a hashmap representing the unflushed state of Lucene's IndexWriter, to avoid 
the need for excessive flushing with an NRT reader etc.
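[Editor's note: the lookup path those optimisations describe can be sketched as below. All names (`CachedKeyService`, `backingIndex`, etc.) are invented; a HashMap stands in for the Lucene index and a deliberately simplistic single-hash Bloom filter is used for illustration.]

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the cached key-lookup path: a Bloom filter answers "definitely
// not present" cheaply, an LRU cache serves hot keys, and only misses fall
// through to the (expensive) backing index lookup.
class CachedKeyService {
    private static final int NOT_FOUND = -1;
    private static final int BLOOM_BITS = 1 << 20;

    private final BitSet bloom = new BitSet(BLOOM_BITS);
    private final Map<String, Long> lru;
    private final Map<String, Long> backingIndex = new HashMap<>(); // stand-in for the Lucene index
    long backingLookups = 0; // how often we had to go to the "index"

    CachedKeyService(final int cacheSize) {
        // An access-ordered LinkedHashMap gives a simple LRU cache.
        lru = new LinkedHashMap<String, Long>(cacheSize, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > cacheSize;
            }
        };
    }

    // Single hash for brevity; a real Bloom filter would use several.
    private int bloomBit(String udk) {
        return (udk.hashCode() & 0x7fffffff) % BLOOM_BITS;
    }

    long getGraphNodeId(String udk) {
        if (!bloom.get(bloomBit(udk))) return NOT_FOUND; // "know what we don't know"
        Long cached = lru.get(udk);
        if (cached != null) return cached;               // hot key, no index hit
        backingLookups++;
        Long stored = backingIndex.get(udk);             // possible Bloom false positive
        if (stored == null) return NOT_FOUND;
        lru.put(udk, stored);
        return stored;
    }

    void put(String udk, long graphNodeId) {
        bloom.set(bloomBit(udk));
        backingIndex.put(udk, graphNodeId);
        lru.put(udk, graphNodeId);
    }
}
```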

The search/write performance showed the familiar saw-toothing as the Lucene 
index grew in size and merge operations kicked in.

The KVStore implementation I wrote attempts to tackle this problem using a 
fundamentally different form of index. The results from the KVStore runs show 
it was twice as fast as this Lucene solution and maintains constant 
performance without the saw-toothing effect.

Benchmark figures are here: http://goo.gl/VQ027
The KVStore source code is here: http://goo.gl/ovkop and the Lucene 
implementation I compare against is also in the project.

Cheers
Mark








Continuous stream indexing and time-based segment management

2012-06-19 Thread mark harwood
There are a number of scenarios where Lucene might be used to index a fixed 
time range on a continuous stream of data e.g. a news feed.

In these scenarios I imagine the following facilities would be useful:

a) A MergePolicy that organized content into segments on the basis of 
increasing time units e.g. 5min->10 min->1 hour->1 day
b) The ability to drop entire segments e.g. the day-level segment from exactly 
a week ago 
c) Various new analysis functions comparing term frequencies across time e.g 
discovery of "trending" topics.

I can see that a) could be implemented using a custom MergePolicy and c) can be 
done via existing APIs, but I'm not sure if there is a way to simply drop entire 
segments currently?

Anyone else had thoughts in this area?

Cheers
Mark





Re: Continuous stream indexing and time-based segment management

2012-06-19 Thread mark harwood
> you can do that by subclassing IW and call some package private APIs /


To date I have used separate physical indexes, combining them with a MultiReader 
and then dropping the outdated indexes.
At least this has the benefit that a custom MergePolicy is not required to keep 
content from the different dates segregated.
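[Editor's note: that scheme can be sketched as below. This is a plain-Java illustration with invented names (`TimeBucketedIndex` etc.); lists of strings stand in for the per-bucket physical indexes and the flattened view stands in for the MultiReader.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of "one physical index per time bucket": docs land in the bucket for
// their timestamp, whole outdated buckets are dropped on expiry, and queries
// run over a combined view of the live buckets (the MultiReader analogue).
class TimeBucketedIndex {
    private final long bucketMillis;
    private final long retainBuckets;
    // bucket id -> the "index" for that time slice
    private final TreeMap<Long, List<String>> buckets = new TreeMap<>();

    TimeBucketedIndex(long bucketMillis, long retainBuckets) {
        this.bucketMillis = bucketMillis;
        this.retainBuckets = retainBuckets;
    }

    void add(long timestamp, String doc) {
        buckets.computeIfAbsent(timestamp / bucketMillis, b -> new ArrayList<>()).add(doc);
    }

    // Drop whole outdated buckets - the cheap analogue of dropping segments,
    // with no per-document delete cost.
    void expire(long now) {
        long oldestLiveBucket = now / bucketMillis - retainBuckets + 1;
        buckets.headMap(oldestLiveBucket).clear();
    }

    // Combined view over all live buckets, like searching via a MultiReader.
    List<String> searchAll() {
        List<String> all = new ArrayList<>();
        for (List<String> bucket : buckets.values()) all.addAll(bucket);
        return all;
    }
}
```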

Where I saw the potential is when looking at S4 or Esper stream processing 
technologies when they try to count things in time windows.
It struck me that careful organisation of Lucene segments along time units 
could provide an efficient means of accessing and comparing counts of many 
things over time.
It looked like the "Hello World" example in S4 for counting top Twitter topics 
instantiated a Java object per unique topic String which was then responsible 
for maintaining counts on things - this seems a fairly inefficient way of 
modelling things.

>>If you are willing/able to close the IndexWriter, it's easy to drop segments 
>>by reading the SegmentInfos, editing, and writing back.

My assumption was that ultimately that's what it comes down to - I just wonder 
if this is likely to be a common requirement, deserving of a supported API.



> members. We can certainly make that easier but I personally don't want
> to open this as a public API. I can certainly imagine to have a
> protected API that allows dropping entire segment.
>
> simon
>
>> c) Various new analysis functions comparing term frequencies across time e.g 
>> discovery of "trending" topics.
>>
>> I can see that a) could be implemented using a custom MergePolicy and c) can 
>> be done via existing APIs but I'm not sure if there is way to simply drop 
>> entire segments currently?
>>
>> Anyone else had thoughts in this area?

> I had some ideas to add statistics to DocValues that get created
> during index time. You can already do that and expose it via
> Attributes; maybe we can add some API to DocValues you can hook into so
> that you don't need to write your own DV impl.



Re: Welcome Greg Bowyer

2012-06-21 Thread mark harwood
Good to have you aboard, Greg!


- Original Message -
From: Erick Erickson 
To: dev@lucene.apache.org
Cc: 
Sent: Thursday, 21 June 2012, 11:56
Subject: Welcome Greg Bowyer

I'm pleased to announce that Greg Bowyer has been added as a
Lucene/Solr committer.

Greg:
It's a tradition that you reply with a brief bio.

Your SVN access should be set up and ready to go.

Congratulations!

Erick Erickson




Adding another dimension to Lucene searches

2010-05-07 Thread mark harwood
I have been working on a hierarchical search capability for a while now and 
wanted to see if there was general interest in adopting some of the thinking 
into Lucene.

The idea needs a little explanation so I've put some slides up here to kick 
things off:

http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

Cheers
Mark








Re: Adding another dimension to Lucene searches

2010-05-08 Thread Mark Harwood
OK, seems like there is some interest.
I'll work on packaging the code/unit tests/demos and make it available.


> matching ids ... but I didn't quite catch from the slides how you encode
> the parent-child link... is it just "the next docs are sub-documents
> until the next parent doc"? 

Yes - using physical proximity avoids any kind of costly look-ups and allows 
efficient streaming/skipTo logic to work as per usual.

The downside is the need to maintain sequences of related docs in the same 
segment - something Lucene currently doesn't make easy with its limited control 
over when segments are flushed. I suspect we'll need some discussion on how 
best to support this.
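[Editor's note: the physical-proximity encoding can be sketched in a few lines. All names here - `BlockJoinSketch`, `addBlock`, `parentOf` - are invented for illustration, not the actual patch: children are assigned doc ids immediately before their parent, so a bitset of parent positions maps any child hit to its parent with a single nextSetBit call.]

```java
import java.util.BitSet;

// Sketch of the block-indexing scheme described above: child docs are laid out
// immediately before their parent, so a plain BitSet marking parent positions
// is enough to map any child hit to its parent - no stored-field lookup and no
// key join, which is what keeps streaming/skipTo logic cheap.
class BlockJoinSketch {
    private final BitSet parents = new BitSet();
    private int nextDocId = 0;

    // Index the children first, then the parent - the physical-proximity
    // contract. Returns the parent's doc id.
    int addBlock(int numChildren) {
        nextDocId += numChildren;    // child doc ids fill this range
        int parentDoc = nextDocId++; // parent comes last in the block
        parents.set(parentDoc);
        return parentDoc;
    }

    // For a hit on any doc (child or parent), the owning parent is simply the
    // next parent-flagged doc at or after it.
    int parentOf(int doc) {
        return parents.nextSetBit(doc);
    }
}
```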

Another dependency is that Lucene maintains the sequencing of documents when 
merging segments together - this is something I think we can rely on currently 
(please correct me if I'm wrong) but I would like to formalise it with a 
JUnit test or some other form of commitment which guarantees this state of 
affairs.

Cheers
Mark


On 8 May 2010, at 08:32, Andrzej Bialecki wrote:

> On 2010-05-07 18:25, mark harwood wrote:
>> I have been working on a hierarchical search capability for a while now and 
>> wanted to see if there was general interest in adopting some of the thinking 
>> into Lucene.
>> 
>> The idea needs a little explanation so I've put some slides up here to kick 
>> things off:
>> 
>> http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene
> 
> Very cool stuff. If I understand the design correctly, the cost of the
> query is roughly the same as constructing a Filter Query from the parent
> query, and then executing the child query with this filter. You probably
> use childScorer.skipTo(nextParentId) to avoid actually traversing all
> matching ids ... but I didn't quite catch from the slides how you encode
> the parent-child link... is it just "the next docs are sub-documents
> until the next parent doc"? or is it a field in the children that points
> to a unique id field of the parent?
> 
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com



Re: Adding another dimension to Lucene searches

2010-05-10 Thread mark harwood
I've put up code, example data and tests for the Nested Document feature here: 
http://www.inperspective.com/lucene/LuceneNestedDocumentSupport.zip

The data used in the unit tests is chosen to illustrate practical use of 
real-world content.
The final unit tests will work on more abstract data for more formal/exhaustive 
testing of functionality.

This packaging changes no existing Lucene code and is bundled with 3.0.1 but 
should work with 2.9.1. The readme.txt highlights the issues with segment 
flushing that may need addressing before adoption.


Cheers
Mark








Re: Adding another dimension to Lucene searches

2010-05-10 Thread mark harwood
Having implemented this code on a few projects, I find that the major challenge 
shifts from the back end to the front end: how to get end 
users to articulate the questions Lucene can answer with this.
Certainly an interesting challenge, but that's another topic...





- Original Message 
From: J. Delgado 
To: dev@lucene.apache.org
Sent: Mon, 10 May, 2010 16:47:50
Subject: Re: Adding another dimension to Lucene searches

Hierarchical documents are a key concept towards a unified
structured+unstructured search. It should allow us to fully implement
things such as XQuery + Full-Text
(http://www.w3.org/TR/xquery-full-text/).

Additionally it solves a century-old problem: how to deal with
sections/sub-sections in very large documents. A long time ago I was
indexing text books (in PDF) and had to break each book down into pages
and store the main doc ID in a field as a pointer to maintain the
relation.

Mark, way to go!

-- Joaquin

On Mon, May 10, 2010 at 8:03 AM, Grant Ingersoll  wrote:
> Very cool stuff, Mark.
>
> Can you just open a JIRA and attach there?
>



Re: Web-Based Luke

2010-07-09 Thread Mark Harwood
See 
http://search.lucidimagination.com/search/document/63cef9e98692a126/webluke_include_jetty_in_lucene_binary_distribution

There's a link to a zip file with source there which should still be available. 



On 9 Jul 2010, at 15:14, Mark Miller  wrote:

> Did the GWT version of Luke that Mark Harwood started ever get dumped to
> JIRA or anything? All I can find is a link to a war, but not the source.
> Mark? Anyone?
> 
> - Mark
> 



Re: Web-Based Luke

2010-07-11 Thread Mark Harwood
I had concerns about the bloat the GWT compiler would add to the source distro. 
If of interest, it could do with upgrading to the latest GWT. Ideally all Luke 
front ends (Swing/GWT/Thinlet) would share the same back-end API. Decoupling 
from Thinlet as done in this webluke code is the first step down the road to a 
version of Luke that is Apache-license-friendly and can therefore be maintained 
by the Apache community.  

---

On 12 Jul 2010, at 00:02, John Wang  wrote:

> Mark:
> 
>This is a super useful tool! 
> 
>Any plans of putting it under lucene contrib?
> 
> Thanks
> 
> -John
> 


Re: Web-Based Luke

2010-07-12 Thread Mark Harwood
Agreed. I think Apache is a preferable home. 
The major change to Luke in providing a Luke core API is the need to be 
remotable, i.e. use of an interface and serializable data objects used for args. 
GWT RPC should take care of the marshalling and I've used similar frameworks 
for applet clients. 

Like Andrzej I have limited time to work on this though. :(


On 12 Jul 2010, at 08:54, Andrzej Bialecki  wrote:

> On 2010-07-12 09:14, John Wang wrote:
>> share FE with luke is defn a good idea.
>> 
>> any thoughts on putting webluke up on goog code or github?
> 
> Guys, if you want to move forward with webluke, I think it's better to do 
> this under Lucene contrib. The reason is that if there's a substantial 
> development done outside Apache then it will need a code grant, and also it's 
> more difficult for other Lucene committers to participate in the outside 
> development and to bring its results back to ASF.
> 
> I'm perfectly willing to donate all Luke's code to ASF, as I've said many 
> times in the past, if there's any chance of someone stepping in and removing 
> the Thinlet dependency. I'm also willing to work together as a Lucene 
> committer on an abstracted Luke core, if not on the GWT front-end (I don't 
> know GWT and I have too little time to learn it now).
> 



Re: Nested Document support in Lucene

2011-03-22 Thread mark harwood
>AFAIK this is still under heavy development and it doesn't seem to be ready in 
>the near future.

It's stable as far as I'm concerned. 
LUCENE-2454 includes the code and JUnit tests that work with the latest 3.0.3 
release. I have versions of this running in production with 2.4 and 2.9-based 
releases.
The only concern for users is the need to carefully control when flushing occurs, 
and the accompanying readme.txt gives advice on how to achieve this.







From: Kapil Charania 
To: simon.willna...@gmail.com
Cc: Simon Willnauer ; dev@lucene.apache.org
Sent: Tue, 22 March, 2011 9:12:20
Subject: Re: Nested Document support in Lucene

May I know in which release will it ready to use.


On Sat, Mar 19, 2011 at 2:23 PM, Simon Willnauer 
 wrote:

On Sat, Mar 19, 2011 at 9:39 AM, Kapil Charania
> wrote:
>> Hi,
>>
>> I am a newbie to Lucene. I have already created indexes for my project. But
>> now requirement is to go with Nested Document. I googled a lot but can not
>> find much implementation of nested documents.
>>
>> My I know if its already implemented in any release of Lucene.
>>
>> Thanks in Advances !!!
>
>AFAIK this is still under heavy development and it doesn't seem to be
>ready in the near future. It has not yet been released.
>
>simon
>>
>> --
>> Kapil Charania.
>>
>


-- 
Kapil Charania.



  

Re: revisit naming for grouping/join?

2011-07-01 Thread mark harwood
>> I think what would be best is a smallish but feature complete demo,

For the nested stuff I had a reasonable demo on LUCENE-2454 that was based 
around resumes - that use case has the one-to-many characteristics that lend 
themselves to nesting, e.g. a person has many different qualifications and 
records of employment.
This scenario was illustrated here: 
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

I also had the "book search" type scenario where a book has many sections and, 
for the purposes of efficient highlighting/summarisation, these sections were 
treated as child docs which could be read quickly (rather than highlighting a 
whole book).

I'm not sure what the "parent" was in your doctor and cities example, Mike. If 
a doctor is in only one city then there is no point making city a child doc, as 
the one city's info can happily be combined with the doctor info into a single 
document with no conflict (doctors have different properties to cities).
If the city is the parent with many child doctor docs that makes more sense, 
but it feels like a less likely use case, e.g. "find me a city with doctor x 
and a different doctor y".
Searching for a person with excellent Java and preferably good Lucene skills 
feels like a more real-world example.

It feels like documenting some of the trade-offs behind index design choices 
is useful too, e.g. nesting is not too great for very volatile content with 
constantly changing children, while search-time join is more costly in RAM and 
requires 2-pass processing.

Cheers
Mark



- Original Message 
From: Michael McCandless 
To: dev@lucene.apache.org
Sent: Fri, 1 July, 2011 13:51:04
Subject: Re: revisit naming for grouping/join?

I think joining and grouping are two different functions, and we
should keep different modules for them...

On Thu, Jun 30, 2011 at 10:30 PM, Robert Muir  wrote:
> Hi,
>
> when looking at just a very quick glance at some of the newer
> grouping/join features, I found myself a little confused about what is
> exactly what, and I think users might too.

They are confusing!

> I discussed some of this with hossman, and it only seemed to make me
> even more totally confused about:
> * difference between field collapsing and grouping

I like the name grouping better here: I think field collapsing
undersells (it's only one specific way to use grouping).  EG, grouping
w/o collapsing is useful (eg, Best Buy grouping hits by product
category and showing the top 5 in each).

> * difference between nested documents and the index-time join

Similarly I think nested docs undersells index-time join: you can
join (either during indexing or during searching) in many different
ways, and nested docs is just one use case.

EG, maybe your docs are doctors but during indexing you join to a city
table with facts about that city (each doctor's office is in a
specific city) and then you want to run queries like "city's avg
annual temp > 60 and doctor has good bedside manner" or something.

> * difference between index-time-join/nested documents and single-pass
> index-time grouping. Is the former only a more general case of the
> latter?

Grouping is purely a presentation concern -- you are not altering
which docs hit; you are simply changing how you pick which hits to
display ("top N by group").  So we only have collectors here.

The "generic" (requires 2 passes) collectors can group on anything at
search time; the "doc block" collector requires that you indexed all
docs in each group as a block.
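The "top N by group" idea above can be sketched independently of Lucene's
collector API. The sketch below is plain Java with made-up names (Hit,
topNByGroup) - an illustration of grouping as pure presentation over
already-collected hits, not the grouping module itself:

```java
import java.util.*;

public class TopNByGroup {
  public static class Hit {
    public final int doc; public final float score; public final String group;
    public Hit(int doc, float score, String group) {
      this.doc = doc; this.score = score; this.group = group;
    }
  }

  // Bucket hits by group key, then keep only the top-n scorers per group.
  public static Map<String, List<Hit>> topNByGroup(List<Hit> hits, int n) {
    Map<String, List<Hit>> groups = new LinkedHashMap<String, List<Hit>>();
    for (Hit h : hits) {
      List<Hit> g = groups.get(h.group);
      if (g == null) { g = new ArrayList<Hit>(); groups.put(h.group, g); }
      g.add(h);
    }
    for (List<Hit> g : groups.values()) {
      Collections.sort(g, new Comparator<Hit>() {
        public int compare(Hit a, Hit b) { return Float.compare(b.score, a.score); }
      });
      if (g.size() > n) g.subList(n, g.size()).clear();  // trim to top n
    }
    return groups;
  }

  public static void main(String[] args) {
    List<Hit> hits = new ArrayList<Hit>();
    hits.add(new Hit(1, 0.9f, "tv"));
    hits.add(new Hit(2, 0.5f, "tv"));
    hits.add(new Hit(3, 0.7f, "tv"));
    hits.add(new Hit(4, 0.8f, "audio"));
    System.out.println(topNByGroup(hits, 2).keySet());
  }
}
```

Note the hit set itself is unchanged; only the selection of which hits to
display per group is computed, matching the "presentation concern" point.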

Join is both about restricting matches and also presentation of hits,
because your query needs to match fields from different [logical]
tables (so, the module has a Query and a Collector).  When you get the
results back, you may or may not be interested in retaining the table
structure in your result set (ie, you may not have selected fields
from the child table).

Similarly, "generic" joining (in Solr/ElasticSearch today but I'd like
to factor into the join module) can do any join at search time, while
the "doc block" collector requires that you did the necessary join(s)
during indexing.

> * difference between the above joinish capabilities and solr's join
> impl... other than the single-pass/index-time limitation (which is
> really an implementation detail), I'm talking about use cases.

Solr's/ElasticSearch's join is more general because you can join
anything at search time (even, across 2 different indexes), vs doc
block join where you must pick which joins you will ever want to use
and then build the index accordingly.

You can also mix the two.  Maybe you do certain joins while indexing,
but then at search time you do other joins "generically".  That's
fine.  (Same is true for grouping).

> I think its especially interesting since the join module depends on
> the grouping module.

The join module does currently depend on the grouping module, but for
a silly reason: just for the TopGroups, to represent the returned
hits.  We could mov

BlockJoin concerns

2011-10-14 Thread mark harwood
I've been looking at the BlockJoin stuff in 3.4 in relation to children of 
multiple types and have a couple of concerns which are either issues, or my 
ignorance of the API:

Concern #1

If I only retrieve children of type A all is well.

If I only retrieve children of type B all is well.
If I try to retrieve children of type A and then B I get a null TopGroups 
returned for B.
(test code for this at the end of this email)


Concern #2

I'm not sure where I get to control how many children of type A and of type B 
are returned per parent.
BlockJoinCollector's constructor only controls how many parents are collected.

*Post-search* I can call 
BlockJoinCollector.getTopGroups(childQueryA,...maxDocsPerGroup..) to define how 
many children I get back. Does this imply that if I ask for more child docs 
than are cached by the collector the search is somehow automatically repeated?
If so, what would be the "default" number of child docs cached by the 
collector, and where would I set that?

Cheers
Mark


Below is the code I added to the existing TestBlockJoin which exercises the 
above.

//=========================================================================

  public void testMultiChildTypes() throws Exception {

    final Directory dir = newDirectory();
    final RandomIndexWriter w = new RandomIndexWriter(random, dir);

    final List<Document> docs = new ArrayList<Document>();

    docs.add(makeJob("java", 2007));
    docs.add(makeJob("python", 2010));
    docs.add(makeQualification("maths", 1999));
    docs.add(makeResume("Lisa", "United Kingdom"));
    w.addDocuments(docs);

    IndexReader r = w.getReader();
    w.close();
    IndexSearcher s = new IndexSearcher(r);

    // Create a filter that defines "parent" documents in the index - in this case resumes
    Filter parentsFilter = new CachingWrapperFilter(
        new QueryWrapperFilter(new TermQuery(new Term("docType", "resume"))));

    // Define child document criteria (finds an example of relevant work experience)
    BooleanQuery childJobQuery = new BooleanQuery();
    childJobQuery.add(new BooleanClause(new TermQuery(new Term("skill", "java")), Occur.MUST));
    childJobQuery.add(new BooleanClause(
        NumericRangeQuery.newIntRange("year", 2006, 2011, true, true), Occur.MUST));

    BooleanQuery childQualificationQuery = new BooleanQuery();
    childQualificationQuery.add(new BooleanClause(
        new TermQuery(new Term("qualification", "maths")), Occur.MUST));
    childQualificationQuery.add(new BooleanClause(
        NumericRangeQuery.newIntRange("year", 1980, 2000, true, true), Occur.MUST));

    // Define parent document criteria (find a resident in the UK)
    Query parentQuery = new TermQuery(new Term("country", "United Kingdom"));

    // Wrap each child document query to 'join' any matches
    // up to the corresponding parent:
    BlockJoinQuery childJobJoinQuery = new BlockJoinQuery(
        childJobQuery, parentsFilter, BlockJoinQuery.ScoreMode.Avg);
    BlockJoinQuery childQualificationJoinQuery = new BlockJoinQuery(
        childQualificationQuery, parentsFilter, BlockJoinQuery.ScoreMode.Avg);

    // Combine the parent and nested child queries into a single query for a candidate
    BooleanQuery fullQuery = new BooleanQuery();
    fullQuery.add(new BooleanClause(parentQuery, Occur.MUST));
    fullQuery.add(new BooleanClause(childJobJoinQuery, Occur.MUST));
    fullQuery.add(new BooleanClause(childQualificationJoinQuery, Occur.MUST));

    //? How do I control the volume of jobs vs qualifications per parent?
    BlockJoinCollector c = new BlockJoinCollector(Sort.RELEVANCE, 10, true, false);

    s.search(fullQuery, c);

    // Examine "job" children
    boolean showNullPointerIssue = true;
    if (showNullPointerIssue) {
      TopGroups<Integer> jobResults = c.getTopGroups(childJobJoinQuery, null, 0, 10, 0, true);

      //assertEquals(1, results.totalHitCount);
      assertEquals(1, jobResults.totalGroupedHitCount);
      assertEquals(1, jobResults.groups.length);

      final GroupDocs<Integer> group = jobResults.groups[0];
      assertEquals(1, group.totalHits);

      Document childJobDoc = s.doc(group.scoreDocs[0].doc);
      //System.out.println("  doc=" + group.scoreDocs[0].doc);
      assertEquals("java", childJobDoc.get("skill"));
      assertNotNull(group.groupValue);
      Document parentDoc = s.doc(group.groupValue);
      assertEquals("Lisa", parentDoc.get("name"));
    }

    // Now examine qualification children
    TopGroups<Integer> qualificationResults =
        c.getTopGroups(childQualificationJoinQuery, null, 0, 10, 0, true);

    //! This next line can null pointer - but only if the prior "jobs" section ran first
    assertEquals(1, qualificationResults.totalGroupedHitCount);
    assertEquals(1, qualificationResults.groups.length);

    final GroupDocs<Integer> qGroup = qualificationResults.groups[0];
    assertEquals(1, qGroup.totalHits);

    Document childQualificationDoc = s.doc(qGroup.scoreDocs[0].doc);
    assertEquals("maths", childQualificationDoc.get("qualification"));
    assertNotNull(qGroup.groupValue);
    Document parentDoc = s.doc(qGroup.groupValue);
    assertEquals("Lisa", parentDoc.get("name"));
  }

Re: BlockJoin concerns

2011-10-14 Thread mark harwood
>>I opened LUCENE-3519 for the unexpected null when pulling the
>>TopGroups, and added your test case (thanks!).


Great, thanks. 

>>the collector internally gathers all child docIDs for a given collected 
>>parent docID

OK - I guess that scales OK because the number of docIDs per parent is 
naturally limited by the number of docs you can hold in RAM as part of the 
original IW.addDocuments call - i.e. not in the millions.

Cheers,
Mark



- Original Message -
From: Michael McCandless 
To: dev@lucene.apache.org; mark harwood 
Cc: 
Sent: Friday, 14 October 2011, 13:56
Subject: Re: BlockJoin concerns

Hi Mark,

I opened LUCENE-3519 for the unexpected null when pulling the
TopGroups, and added your test case (thanks!).

On Concern #2, this is not limited today: the collector internally
gathers all child docIDs for a given collected parent docID, and only
at the end, when you ask for the top groups, does it sort the child docIDs
within each group and keep the topN you passed to it.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Oct 14, 2011 at 7:09 AM, mark harwood  wrote:
> I've been looking at the BlockJoin stuff in 3.4 in relation to children of 
> multiple types and have a couple of concerns which are either issues, or my 
> ignorance of the API:
>
> Concern #1
> 
> If I only retrieve children of type A all is well.
>
> If I only retrieve children of type B all is well.
> If I try to retrieve children of type A and then B I get a null TopGroups 
> returned for B.
> (test code for this at the end of this email)
>
>
> Concern #2
> 
> I'm not sure where I get to control how many children of type A and of Type B 
> are returned per parent?
> BlockJoinCollector's constructor only controls how many parents are collected.
>
> *Post-search* I can 
> call BlockJoinCollector'.getTopGroups(childQueryA,...maxDocsPerGroup..) to 
> define how many children I get back. Does this imply if I ask for more child 
> docs than are cached by the collector the search is somehow automatically 
> repeated?
> If so, what would be the "default" number of child docs cached by the 
> collector and where would I set that?
>
> Cheers
> Mark
>
>

Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

2010-09-10 Thread mark harwood
Hi Mark
I've played with shingles recently in some auto-categorisation work where my 
starting assumption was that multi-word terms will hold more information value 
than individual words, and that phrase queries on separate terms will not give 
these term combos their true reward (in terms of IDF) - or, if they did compute 
the true IDF, would require lots of disk IO to do so. Shingles present a 
conveniently pre-aggregated score for these combos.
Looking at the results of MoreLikeThis queries based on a shingling analyzer, 
the results I saw generally seemed good, but I did not formally benchmark this 
against non-shingled indexes. Not everything was rosy, in that I did see some 
tendency to over-reward certain rare shingles (e.g. a shared mention of "New 
Years Eve Party" pulled otherwise mostly unrelated news articles together). 
This led me to look at using the links in resulting documents to help identify 
clusters of on-topic and potentially off-topic results to tune these 
discrepancies out, but that's another topic.
BTW, the Luke tool has a "Zipf" plugin that you may find useful in examining 
term distributions in Lucene indexes.

Cheers
Mark



From: Mark Bennett 
To: java-...@lucene.apache.org
Sent: Fri, 10 September, 2010 1:42:11
Subject: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

I want to boost the relevancy of some Question and Answer content. I'm using 
stop words, Dismax, and I'm already a fan of Phrase Boosting and have cranked 
that up a bit. But I'm considering using long Shingles to make use of some of 
the normally stopped out "junk words" in the content to help relevancy further.

Reminder: "Shingles" are artificial tokens created by gluing together adjacent 
words.
Input text: This is a sentence
Normal tokens: this, is, a, sentence  (without removing stop words)
2+3 word shingles: this-is, is-a, a-sentence, this-is-a, is-a-sentence
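The gluing step above can be sketched in plain Java. This is not Lucene's
ShingleFilter API (which works on a TokenStream) - ShingleDemo and its
shingles() method are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleDemo {
  // Produce all shingles of sizes minSize..maxSize from a token sequence,
  // joining adjacent tokens with '-'.
  public static List<String> shingles(String[] tokens, int minSize, int maxSize) {
    List<String> out = new ArrayList<String>();
    for (int size = minSize; size <= maxSize; size++) {
      for (int i = 0; i + size <= tokens.length; i++) {
        StringBuilder sb = new StringBuilder(tokens[i]);
        for (int j = 1; j < size; j++) {
          sb.append('-').append(tokens[i + j]);
        }
        out.add(sb.toString());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    String[] tokens = {"this", "is", "a", "sentence"};
    // 2+3 word shingles: this-is, is-a, a-sentence, this-is-a, is-a-sentence
    System.out.println(shingles(tokens, 2, 3));
  }
}
```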

A few questions on relevance and shingles:

1: How do the relevancy calculations compare between shingles and exact 
phrases?

I've seen material saying that shingles can give better performance than normal 
phrase searching, and I'm assuming this is exact phrase (vs. allowing for 
phrase slop).

But do the relevancy calculations for normal exact phrase and shingles wind up 
being *identical*, for the same documents and searches?  That would seem an 
unlikely coincidence, but possibly it could have been engineered to 
intentionally behave that way.

2: What's the latest on Shingles and Dismax?

The front-end low-level tokenization in Dismax would seem to be a problem, 
but does the new parser stuff help with this?

3: I'm thinking of a minimum 3 word shingle, does anybody have comments on 
shingle length?

Eyeballing the 2 word shingles, they don't seem much better than stop words.  
Obviously my shingle field bypasses stop words.

But the 3 word shingles start to look more useful, expressing more intent, such 
as "how do i", "do i need" and "it work with", etc.

Has there been any Lucene/Solr studies specifically on shingle length?

and finally...

4: Is it useful to examine your token occurrences against a Power-Law log-log 
curve?

So, with either single words or shingles, you do a histogram, and then plot 
the histogram in an X-Y graph, with both axes being logarithmic. Then see if 
the resulting graph follows (or diverges from) a straight line.  This "Long 
Tail" / Pareto / power-law mathematics was very popular a few years ago for 
looking at histograms of DVD rentals and human activities, and prior to the 
web, the power law and 80/20 rules had been observed in many other situations, 
both man-made and natural.

Also of interest, when a distribution is expected to follow a power line, but 
the actual data deviates from that theoretical line, then this might indicate 
some other factors at work, or so the theory goes.

So if users' searches follow any type of histogram with a hidden powerlaw line, 
then it makes sense to me that the source content might also follow a similar 
distribution.  Is the normal IDF ranking inspired by that type of curve?

And *if* word occurrences, in either searches or source documents, were 
expected to follow a power law distribution, then possibly shingles would 
follow such a curve as well.

Thinking that document text, like many other things in nature, might follow 
such 
a curve, I used the Lucene index to generate such a curve. And I did the same 
thing for 3 word tokens. The 2 curves do have different slopes, but neither is 
very straight.

So I was wondering if anybody else has looked at IDF curves (actually 
non-inverted document frequency curves) or raw word instance counts and power 
law graphs?  I haven't found a smoking gun in my online searches, but I'm 
thinking some of you would know this.


--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513




Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

2010-09-11 Thread mark harwood
>>What is the "best practices" formula for determining above average 
>>correlations 
>>of adjacent terms

I gave this some thought in https://issues.apache.org/jira/browse/LUCENE-474
I found the Jaccard coefficient favoured rare words too strongly and so went 
for a blend as shown below:


public float getScore()
{
    float overallIntersectionPercent = coIncidenceDocCount
            / (float) (termADocFreq + termBDocFreq);
    float termBIntersectionPercent = coIncidenceDocCount
            / (float) (termBDocFreq);

    // using just the termB intersection favours common words as
    // coincidents, e.g. "new" food
    //   return termBIntersectionPercent;
    // using just the overall intersection favours rare words as
    // coincidents, e.g. "szechuan" food
    //   return overallIntersectionPercent;
    // so here we take an average of the two:
    return (termBIntersectionPercent + overallIntersectionPercent) / 2;
}
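The same blend can be exercised standalone to see the behaviour the comments
describe. The class name and the document frequencies below are made up purely
for illustration:

```java
public class CoincidenceScore {
  // Same blend as getScore() above, but with the frequencies passed in.
  public static float score(int coDocCount, int termADocFreq, int termBDocFreq) {
    float overall = coDocCount / (float) (termADocFreq + termBDocFreq);
    float termB = coDocCount / (float) termBDocFreq;
    return (termB + overall) / 2;
  }

  public static void main(String[] args) {
    // Term A is "food" (df=300); each partner co-occurs with it 40 times.
    // The blend rewards the rare partner (df=50) over the common one (df=200).
    System.out.println(score(40, 300, 200));  // common partner, e.g. "new"
    System.out.println(score(40, 300, 50));   // rare partner, e.g. "szechuan"
  }
}
```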





From: Mark Bennett 
To: dev@lucene.apache.org
Sent: Fri, 10 September, 2010 18:44:31
Subject: Re: Relevancy, Phrase Boosting, Shingles and Long Tail Curves

Thanks Mark H,

Maybe I'll look at MLT (More Like This) again.  I'll also check out zipf.

It's claimed that Question and Answer wording is different enough from generic 
text content that different techniques might be indicated. From what I remember:
1: Though nouns normally convey 60% of relevancy in general text, Q&A content 
is skewed a bit more towards verbs.
2: Questions may contain more noise words (though perhaps in useful groupings)
3: Vocabulary mismatch of interrogative vs. declarative / narrative (Q vs A)
4: Vocabulary mismatch of novices vs experts (Q vs A)
It was item 2 that I was hoping to capitalize on with NGrams / Shingles.

Still waiting for the relevancy math nerds to chime in about the log-log and 
IDF 
stuff ... ;-)

I was thinking a bit more about the math involved here

What is the "best practices" formula for determining above-average correlations 
of adjacent terms, beyond what random chance would give? So you notice that 
"white" and "house" appear next to each other more often than chance 
distribution would explain, and you decide it's an important NGram.

The "noise floor" isn't too bad for the typical shopping cart items calculation.
You analyze the items present or not present in 1,000 shopping cart receipts.
If grocery items were completely independent then "random" level is  just 
the odds of the 2 items multiplied together:
1,000 shopping carts
200 have cereal
250 have milk
chance of
cereal = 200/1,000 = 20%
milk = 250/1,000 = 25%
IF independent then
P(cereal AND milk) = P(cereal) * P(milk)
20% * 25% = 5%
So 50 carts likely to have both cereal and milk
And if MORE than 50 carts have cereal and milk, then it's worth  noting.
The classic example is diapers and beer, which is a bit apocryphal and NOT 
expected, but I like the breakfast cereal and milk example better because it IS 
expected.
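The shopping-cart arithmetic above is compact enough to capture as a tiny
helper (class and method names are made up for illustration):

```java
public class CartExpectation {
  // Expected number of carts containing both items if the items were
  // statistically independent: P(A) * P(B) * totalCarts.
  public static double expectedBoth(int totalCarts, int withA, int withB) {
    double pA = withA / (double) totalCarts;
    double pB = withB / (double) totalCarts;
    return pA * pB * totalCarts;
  }

  public static void main(String[] args) {
    // 1,000 carts, 200 with cereal, 250 with milk -> 50 expected to have both.
    // Counts above this baseline are the ones worth noting.
    System.out.println(expectedBoth(1000, 200, 250));  // prints 50.0
  }
}
```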

Now back to word-A appearing directly before word-B, and finding the base level 
number of times you'd expect just from random chance.

Although Lucene/Luke gives you total word instances and document counts, what 
you'd really want is the number of possible N-Grams, which is affected by 
document boundaries, so it gets a little weird.

Some other differences between the word-A word-B calculation vs milk and cereal:
1: I want ordered pairs, "white" before "house"
2: A document is NOT like a shopping cart in that I DO care how many times 
"white" appears before "house", whereas in the shopping carts I only cared 
about 
present or not present, so document count is less helpful here.

I'm sure some companies and PHD's have super secret formulas for this, but I'd 
be content to just compare it to baseline random chance.

Mark B

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513



On Fri, Sep 10, 2010 at 3:17 AM, mark harwood  wrote:

Hi Mark
>I've played with Shingles recently in some auto-categorisation work where my 
>starting assumption was that multi-word terms will hold more information value 
>than individual words [...]

Document links

2010-09-20 Thread mark harwood
I've been looking at Graph Databases recently (neo4j, OrientDb, InfiniteGraph) 
as a faster alternative to relational stores. I notice they either embed Lucene 
for indexing node properties or (in the case of OrientDB) are talking about 
doing this. 

I think their fundamental performance advantage over relational stores is that 
they don't have to de-reference foreign keys in a b-tree index to get from a 
source node to a target node. Instead they use internally-generated IDs to act 
like pointers with more-or-less direct references between nodes/vertexes.  As a 
result they can follow links very quickly. This got me thinking could Lucene 
adopt the idea of creating links between documents that were equally fast using 
Lucene doc ids?

Maybe the user API would look something like this...

indexWriter.addLink(fromDocId, toDocId);
DocIdSet reader.getInboundLinks(docId);
DocIdSet reader.getOutboundLinks(docId);
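The behaviour of such an API can be sketched as a plain in-memory adjacency
store. DocLinkStore below is hypothetical: it returns plain Sets rather than
DocIdSets and ignores the on-disk format and segment-merge id remapping a real
implementation would need:

```java
import java.util.*;

public class DocLinkStore {
  // Forward and reverse adjacency, so both directions are O(1) lookups.
  private final Map<Integer, Set<Integer>> outbound = new HashMap<Integer, Set<Integer>>();
  private final Map<Integer, Set<Integer>> inbound = new HashMap<Integer, Set<Integer>>();

  public void addLink(int fromDocId, int toDocId) {
    get(outbound, fromDocId).add(toDocId);
    get(inbound, toDocId).add(fromDocId);
  }

  public Set<Integer> getOutboundLinks(int docId) { return get(outbound, docId); }
  public Set<Integer> getInboundLinks(int docId)  { return get(inbound, docId); }

  private static Set<Integer> get(Map<Integer, Set<Integer>> m, int id) {
    Set<Integer> s = m.get(id);
    if (s == null) { s = new HashSet<Integer>(); m.put(id, s); }
    return s;
  }

  public static void main(String[] args) {
    DocLinkStore links = new DocLinkStore();
    links.addLink(1, 56);  // e.g. a "like" doc pointing at a content doc
    System.out.println(links.getOutboundLinks(1));  // prints [56]
    System.out.println(links.getInboundLinks(56));  // prints [1]
  }
}
```

The double bookkeeping is the point: graph databases pay this cost at write
time so that link traversal at read time needs no index lookup.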


Internally a new index file structure would be needed to record link info. Any 
recorded links that connect documents from different segments would need 
careful 
adjustment of referenced link IDs when segments merge and Lucene doc IDs are 
shuffled.

As well as handling typical graphs (social networks, web data) this could 
potentially be used to support tagging operations where apps could create "tag" 
documents and then link them to existing documents that are being tagged 
without 
having to update the target doc. There are probably a ton of applications for 
this stuff.

I see the Graph DBs busy recreating transactional support, indexes, segment 
merging etc and it seems to me that Lucene has a pretty good head start with 
this stuff.
Anyone else think this might be an area worth exploring?

Cheers
Mark







Re: Document links

2010-09-21 Thread mark harwood
>>Wouldn't that be sufficient?

Not for some apps. I tried playing the "Kevin Bacon" game on a Lucene index of 
IMDB data using actorID and movieID keys.
The difference between that and Neo4j on the same data and query is night and 
day. The graph databases are really onto something when resolving a 
relationship doesn't first require an index to look up endpoints.





- Original Message 
From: Paul Elschot 
To: dev@lucene.apache.org
Sent: Tue, 21 September, 2010 17:25:31
Subject: Re: Document links

When the (primary) key values are provided by the user,
one could use additional small documents to only store/index
these relations whenever they change.

Wouldn't that be sufficient?

Regards,
Paul Elschot



Op dinsdag 21 september 2010 00:35:02 schreef mark harwood:
> I've been looking at Graph Databases recently (neo4j, OrientDb, InfiniteGraph)
> as a faster alternative to relational stores. I notice they either embed Lucene
> for indexing node properties or (in the case of OrientDB) are talking about
> doing this.




Re: Document links

2010-09-21 Thread Mark Harwood

> It should be possible to randomly add and delete such relationships after
> indexWriter.addDocument(), is that the idea?


Yes. A "like" action may, for example, allow me to tag an existing document by 
connecting 2 documents - my personal "like" document and a document with 
content of interest.
   doc 1   = [user:mark   tag:like]
   doc 56 = [title:Lucene   body:Lucene is a search library...]

I then call:
   indexWriter.addLink(1, 56)

If this was my first "like" then I may need to contemplate using a variation of 
the above API that allows a yet-to-be-committed "Document" object in place of 
the doc ids.


> Adding such relationships by docId would need the addition of
> a separate (from the segments) index structure


Yes, I need to think about the detail of file structures next. For now I'm 
sticking with thinking about user API and functionality and assuming we can 
maintain cross-segment docid references that get updated somehow at merge time.

> 
> 
> Would each link also have an attribute (think payload)?

I was thinking that if attributes are needed (e.g. a star rating on my 
document "like" example) then this could be catered for with a document, e.g. 
rather than linking the single doc [user:mark tag:like] to all my liked docs I 
could create specific doc instances of [user:mark rating:5 tag:like] and link 
via that.

> Would such relationships be named (sth like foreign key field names)?

For now I was thinking of storing simple docid->docid links.

Once we have these links we could do some funky things:
{pseudo code:}
   // My fave docs from last week
   int myLikesDocId = searchForLuceneDocWithUserNameAndTag("mark", "like");
   DocIdSet myLikedDocs = indexReader.getOutboundLinks(myLikesDocId);
   searcher.search(lastWeekRangeQuery, new Filter(myLikedDocs));

   // Other users who share my interests
   DocIdSet usersWhoLikeWhatILike = indexReader.getInboundLinks(myLikedDocs);

Cheers
Mark



Re: Document links

2010-09-22 Thread mark harwood
Some inital thoughts on the challenges in maintaining docid->docid links:

https://spreadsheets.google.com/ccc?key=0AsKVSn5SGg_wdHhMUW9ya0xxUFI3VXBHZGZHVUo4RkE&hl=en&authkey=CLOmwrgL#gid=0




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Document links

2010-09-24 Thread mark harwood
This slideshow has a first-cut on the Lucene file format extensions required to 
support fast linking between documents:

http://www.slideshare.net/MarkHarwood/linking-lucene-documents


Interested in any of your thoughts.

Cheers,
Mark








Re: Document links

2010-09-24 Thread mark harwood
>>While not exactly equivalent, it reminds me of our earlier discussion around 
>>"layered segments" for dealing with field updates

Right. Fast discovery of document relations is a foundation on which lots of 
things like this can build. Relations can be given types to support a number of 
different use cases.



- Original Message 
From: Grant Ingersoll 
To: dev@lucene.apache.org
Sent: Fri, 24 September, 2010 16:26:27
Subject: Re: Document links

While not exactly equivalent, it reminds me of our earlier discussion around 
"layered segments" for dealing with field updates [1], [2], albeit this is a 
bit 
more generic since one could not only use the links for relating documents, but 
one could use "special" links underneath the covers in Lucene to maintain/mark 
which fields have been updated and then traverse to them.

[1] 
http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384

[2] 
http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e


On Sep 24, 2010, at 10:36 AM, mark harwood wrote:

> This slideshow has a first-cut on the Lucene file format extensions required 
> to 
>
> support fast linking between documents:
> 
> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
> 
> 
> Interested in any of your thoughts.
> 
> Cheers,
> Mark
> 
> 
> 
> 
> 
> 

--
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8









Re: Document links

2010-09-25 Thread Mark Harwood
My starting point in the solution I propose was to eliminate linking via any 
type of key. Key lookups mean indexes and indexes mean disk seeks. Graph 
traversals have exponential numbers of links and so all these index disk seeks 
start to stack up. The solution I propose uses doc ids as more-or-less direct 
pointers into file structures avoiding any index lookup.
I've started coding up some tests using the file structures I outlined and will 
compare that with a traditional key-based approach.

For reference - playing the "Kevin Bacon game" on a traditional Lucene index of 
IMDB data took 18 seconds to find a short path that Neo4j finds in 200 
milliseconds on the same data (and this was a disk based graph of 3m nodes, 10m 
edges).
Going from actor->movies->actors->movies produces a lot of key lookups and the 
difference between key indexes and direct node pointers becomes clear.
I know path finding analysis is perhaps not a typical Lucene application but 
other forms of link analysis e.g. recommendation engines require similar 
performance.

Cheers
Mark



On 25 Sep 2010, at 11:41, Paul Elschot wrote:

> Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>>>> While not exactly equivalent, it reminds me of our earlier discussion 
>>>> around 
>>>> "layered segments" for dealing with field updates
>> 
>> Right. Fast discovery of document relations is a foundation on which lots of 
>> things like this can build. Relations can be given types to support a number 
>> of 
>> different use cases.
> 
> How about using this (bsd licenced) tree as a starting point:
> http://bplusdotnet.sourceforge.net/
> It has various keys: ao. byte array, String and long.
> 
> A fixed size byte array as key seems to be just fine: two bytes for a field 
> number,
> four for the segment number and four for the in-segment document id.
> The separate segment number would allow to minimize the updates
> in the tree during merges. One could also use the normal doc id directly.
> 
> The value could then be a similar to the key, but without
> the field number, and with an indication of the direction of the link.
> Or perhaps the direction of the link should be added to the key.
> A link would be present twice, once for each direction.
> Also both directions could have their own payloads.
> 
> It could be put in its own file as a separate 'segment', or maybe
> each segment could allow for allocation of a part of this tree.
> 
> I like this somehow, in case it is done right one might never
> need a relational database again. Well, almost...
> 
> Regards,
> Paul Elschot
> 
> 
>> 
>> 
>> 
>> - Original Message 
>> From: Grant Ingersoll 
>> To: dev@lucene.apache.org
>> Sent: Fri, 24 September, 2010 16:26:27
>> Subject: Re: Document links
>> 
>> While not exactly equivalent, it reminds me of our earlier discussion around 
>> "layered segments" for dealing with field updates [1], [2], albeit this is a 
>> bit 
>> more generic since one could not only use the links for relating documents, 
>> but 
>> one could use "special" links underneath the covers in Lucene to 
>> maintain/mark 
>> which fields have been updated and then traverse to them.
>> 
>> [1] 
>> http://www.lucidimagination.com/search/document/c871ea4672dda844/aw_incremental_field_updates#7ef11a70cdc95384
>> 
>> [2] 
>> http://www.lucidimagination.com/search/document/ee102692c8023548/incremental_field_updates#13ffdd50440cce6e
>> 
>> 
>> On Sep 24, 2010, at 10:36 AM, mark harwood wrote:
>> 
>>> This slideshow has a first-cut on the Lucene file format extensions 
>>> required to 
>>> 
>>> support fast linking between documents:
>>> 
>>> http://www.slideshare.net/MarkHarwood/linking-lucene-documents
>>> 
>>> 
>>> Interested in any of your thoughts.
>>> 
>>> Cheers,
>>> Mark
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --
>> Grant Ingersoll
>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
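Paul's fixed-size key layout quoted above (2 bytes field number, 4 bytes segment number, 4 bytes in-segment doc id) could be packed as, for example (purely illustrative; no such structure exists in Lucene):

```java
import java.nio.ByteBuffer;

// Sketch of the 10-byte fixed-size B+ tree key Paul describes:
// [fieldNum:2][segment:4][docId:4]. Hypothetical helper, for illustration only.
public class LinkKey {
    static byte[] pack(short fieldNum, int segment, int docId) {
        return ByteBuffer.allocate(10)
                .putShort(fieldNum)  // bytes 0-1
                .putInt(segment)     // bytes 2-5
                .putInt(docId)       // bytes 6-9
                .array();
    }

    static int docIdOf(byte[] key) {
        return ByteBuffer.wrap(key).getInt(6);  // doc id lives in the last 4 bytes
    }

    public static void main(String[] args) {
        byte[] k = pack((short) 3, 7, 123456);
        System.out.println(k.length + " " + docIdOf(k));  // prints: 10 123456
    }
}
```

Keeping the segment number separate from the doc id, as Paul suggests, is what would let merges rewrite only the affected key ranges.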





Re: Document links

2010-09-25 Thread mark harwood
>> Both these on disk data structures and the ones in a B+ tree have seek
>> offsets into files that require disk seeks. And both could use document
>> ids as key values.

Yep. However my approach doesn't use a doc id as a key that is searched in any 
B+ tree index (which involves disk seeks) - it is used as a direct offset into a 
file to get the pointer into a "links" data structure. 
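The "doc id as direct offset" idea can be sketched as follows, assuming fixed-width records (a hypothetical file layout, not an actual Lucene structure):

```java
import java.io.File;
import java.io.RandomAccessFile;

// With fixed-width records, the pointer for doc N lives at N * RECORD_SIZE:
// one seek, no key lookup at all. Illustrative sketch only.
public class DirectOffset {
    static final int RECORD_SIZE = 8;  // e.g. a long pointer into the links file

    static long linkPointer(RandomAccessFile f, int docId) throws Exception {
        f.seek((long) docId * RECORD_SIZE);
        return f.readLong();
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("links", ".ptr");
        tmp.deleteOnExit();
        try (RandomAccessFile f = new RandomAccessFile(tmp, "rw")) {
            for (int doc = 0; doc < 5; doc++) f.writeLong(doc * 100L);  // fake pointers
            System.out.println(linkPointer(f, 3));  // prints: 300
        }
    }
}
```

This is the trade-off being debated: the direct-offset file saves the B+ tree's index seeks but must be rewritten when doc ids shuffle at merge time.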



>> But do these disk data structures support dynamic addition and deletion
>> of (larger numbers of) document links?

Yes, the slide deck I linked to shows how links (like documents) spend the early 
stages of life being merged frequently in the smaller, newer segments and over 
time migrate into larger, more stable segments as part of Lucene transactions.

That's the theory - I'm currently benchmarking an early prototype.



- Original Message 
From: Paul Elschot 
To: dev@lucene.apache.org
Sent: Sat, 25 September, 2010 22:03:28
Subject: Re: Document links

Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
> My starting point in the solution I propose was to eliminate linking via any 
>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph 
>traversals have exponential numbers of links and so all these index disk seeks 
>start to stack up. The solution I propose uses doc ids as more-or-less direct 
>pointers into file structures avoiding any index lookup.
> I've started coding up some tests using the file structures I outlined and 
> will 
>compare that with a traditional key-based approach.

Both these on disk data structures and the ones in a B+ tree have seek offsets 
into files that require disk seeks. And both could use document ids as key values.

But do these disk data structures support dynamic addition and deletion of 
(larger numbers of) document links?

B+ trees are a standard solution for problems like this one, and it would 
probably
not be easy to outperform them.
It may be possible to improve performance of B+ trees somewhat by specializing 
for the fairly simple keys that would be needed, and by encoding very short 
lists of links for a single document directly into a seek offset to avoid the 
actual seek, but that's about it.

Regards,
Paul Elschot

> 
> For reference - playing the "Kevin Bacon game" on a traditional Lucene index 
> of 
>IMDB data took 18 seconds to find a short path that Neo4j finds in 200 
>milliseconds on the same data (and this was a disk based graph of 3m nodes, 
>10m 
>edges).
> Going from actor->movies->actors->movies produces a lot of key lookups and 
> the 
>difference between key indexes and direct node pointers becomes clear.
> I know path finding analysis is perhaps not a typical Lucene application but 
>other forms of link analysis e.g. recommendation engines require similar 
>performance.
> 
> Cheers
> Mark
> 
> 
> 
> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
> 
> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
> >>>> While not exactly equivalent, it reminds me of our earlier discussion 
>around 
>
> >>>> "layered segments" for dealing with field updates
> >> 
> >> Right. Fast discovery of document relations is a foundation on which lots 
> >> of 
>
> >> things like this can build. Relations can be given types to support a 
> >> number 
>of 
>
> >> different use cases.
> > 
> > How about using this (bsd licenced) tree as a starting point:
> > http://bplusdotnet.sourceforge.net/
> > It has various keys: ao. byte array, String and long.
> > 
> > A fixed size byte array as key seems to be just fine: two bytes for a field 
>number,
> > four for the segment number and four for the in-segment document id.
> > The separate segment number would allow to minimize the updates
> > in the tree during merges. One could also use the normal doc id directly.
> > 
> > The value could then be a similar to the key, but without
> > the field number, and with an indication of the direction of the link.
> > Or perhaps the direction of the link should be added to the key.
> > A link would be present twice, once for each direction.
> > Also both directions could have their own payloads.
> > 
> > It could be put in its own file as a separate 'segment', or maybe
> > each segment could allow for allocation of a part of this tree.
> > 
> > I like this somehow, in case it is done right one might never
> > need a relational database again. Well, almost...
> > 
> > Regards,
> > Paul Elschot
> > 
> > 
> >> 
> >> 
> >> 
> >> - Original Message 
>

Re: Polymorphic Index

2010-10-21 Thread Mark Harwood
Perhaps another way of thinking about the problem:

Given a large range of IDs (e.g. your 300 million) you could constrain the number 
of unique terms using a double-hashing technique, e.g.
Pick a number "n" for the max number of unique terms you'll tolerate e.g. 1 
million and store 2 terms for every primary key using a different hashing 
function e.g.

int hashedKey1 = hashFunction1(myKey) % maxNumUniqueTerms;
int hashedKey2 = hashFunction2(myKey) % maxNumUniqueTerms;

Then queries to retrieve/delete a record use a search for hashedKey1 AND 
hashedKey2. The probability of having the same collision on two different 
hashing functions is minimal and should return the original record only.
Obviously you would still have the postings recorded but these would be 
slightly more compact e.g each of your 1 million unique terms would have ~300 
gap-encoded vints entries as opposed to 300m postings of one full int.
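As a sketch of the scheme (the hash function choices here are illustrative assumptions, not a prescription):

```java
import java.nio.charset.StandardCharsets;

// Two independent hash functions, each reduced modulo a cap on unique terms,
// give two terms per primary key; a conjunction query on both terms almost
// always matches only the original record. Hash choices are illustrative.
public class DoubleHashedKey {
    static final int MAX_UNIQUE_TERMS = 1_000_000;

    // First hash: Java's built-in String.hashCode()
    static int hashedKey1(String key) {
        return Math.floorMod(key.hashCode(), MAX_UNIQUE_TERMS);
    }

    // Second, independent hash: FNV-1a over the UTF-8 bytes
    static int hashedKey2(String key) {
        int h = 0x811c9dc5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return Math.floorMod(h, MAX_UNIQUE_TERMS);
    }

    public static void main(String[] args) {
        String uid = "some-32-byte-uid";
        // Index both values, e.g. in fields "h1" and "h2"; retrieve/delete with
        // a query requiring h1:<hashedKey1> AND h2:<hashedKey2>.
        System.out.println(hashedKey1(uid) + " " + hashedKey2(uid));
    }
}
```

(As Toke points out in the reply below, two hashes over this key space are not actually collision-free at 300M keys.)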

Cheers
Mark

On 21 Oct 2010, at 20:44, eks dev wrote:

> Hi All, 
> I am trying to figure out a way to implement following use case with 
> lucene/solr. 
> 
> 
> In order to support simple incremental updates (master) I need to index  and 
> store UID Field on 300Mio collection. (My UID is a 32 byte  sequence). But I 
> do 
> not need indexed (only stored) it during normal  searching (slaves). 
> 
> 
> The problem is that my term dictionary gets blown away with sheer number  of 
> unique IDs. Number of unique terms on this collection, excluding UID  is less 
> than 7Mio.
> I can tolerate resources hit on Updater (big hardware, on disk index...).
> 
> This is a master slave setup, where searchers run from RAMDisk and  having 
> 300Mio * 32 (give or take prefix compression) plus pointers to  postings and 
> postings is something I would really love to avoid as this  is significant 
> compared to really small documents I have. 
> 
> 
> Cutting to the chase:
> How I can have Indexed UID field, and when done with indexing:
> 1) Load "searchable" index into ram from such an index on disk without one 
> field? 
> 
> 2) create 2 Indices in sync on docIDs, One containing only indexed UID
> 3) somehow transform index with indexed UID by droping UID field, preserving 
> docIs. Kind of tool smart index-editing tool. 
> 
> Something else already there i do not know?
> 
> Preserving docIds is crucial, as I need support for lovely incremental  
> updates 
> (like in solr master-slave update). Also Stored field should  remain!
> I am not looking for "use MMAPed Index and let OS deal with it advice"... 
> I do not mind doing it with flex branch 4.0, nut being in a hurry.
> 
> Thanks in advance, 
> Eks 
> 
> 
> 
> 
> 





Re: Polymorphic Index

2010-10-21 Thread Mark Harwood
Good point, Toke. Forgot about that. Of course doubling the number of hash 
algos used to 4 increases the space massively. 

On 21 Oct 2010, at 22:51, Toke Eskildsen  wrote:

> Mark Harwood [markharw...@yahoo.co.uk]:
>> Given a large range of IDs (eg your 300 million) you could constrain
>> the number of unique terms using a double-hashing technique e.g.
>> Pick a number "n" for the max number of unique terms you'll tolerate
>> e.g. 1 million and store 2 terms for every primary key using a 
>> different hashing function e.g.
> 
>> int hashedKey1=hashFunction1(myKey)%maxNumUniqueTerms.
>> int hashedKey2=hashFunction2(myKey)%maxNumUniqueTerms.
> 
>> Then queries to retrieve/delete a record use a search for hashedKey1
>> AND hashedKey2. The probability of having the same collision on two
>> different hashing functions is minimal and should return the original record 
>> only.
> 
> I am sorry, but this won't work. It is a variation of the birthday paradox:
> http://en.wikipedia.org/wiki/Birthday_problem
> 
> Assuming the two hash-functions are ideal so that there will be 1M different 
> values from each after the modulo, the probability for any given pair of 
> different UIDs having the same hashes is 1/(1M * 1M). That's very low. 
> Another way to look at it would be to say that there are 1M * 1M possible 
> values for the aggregated hash function.
> 
> Using the recipe from
> http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem
> we have
> n = 300M
> d = 1M * 1M
> and the formula 1-((d-1)/d)^(n*(n-1)/2) which gets expanded to
> http://www.google.com/search?q=1-(((1-1)/1)^(3*(3-1)/2)
> 
> We see that the probability of a collision is ... 1. Or rather, so close to 1 
> that Google's calculator will not show any decimals. Turning the number of 
> UIDs down to just 3M, we still get the probability 0.99881 for a 
> collision. It does not really help to increase the number of unique hashes as 
> there are simply too many possible pairs of UIDs.
> 
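Toke's conclusion can be reproduced numerically using the standard birthday-problem approximation P ≈ 1 − exp(−pairs/d) (a sketch, not from the thread):

```java
// Collision probability for n = 300M UIDs hashed into d = 1M x 1M buckets.
// 1 - ((d-1)/d)^pairs is computed in log space to avoid underflow.
public class CollisionProbability {
    public static void main(String[] args) {
        double n = 300e6;          // number of UIDs
        double d = 1e6 * 1e6;      // aggregated hash space (1M * 1M)
        double pairs = n * (n - 1) / 2;
        double p = -Math.expm1(pairs * Math.log1p(-1.0 / d));
        System.out.println(p);  // prints: 1.0 (a collision is effectively certain)
    }
}
```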




Re: Using filters to speed up queries

2010-10-23 Thread Mark Harwood
Look at BooleanQuery with 2 "must" clauses - one for the query, one for a 
ConstantScoreQuery wrapping the filter.
BooleanQuery should then automatically use skips when reading matching docs 
from the main query, skipping to the next docs identified by the filter.
Give it a try; otherwise you may be looking at using separate indexes.
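The skip behaviour being relied on can be pictured as a leapfrog intersection of two sorted doc-id streams; this standalone sketch (not actual Lucene scorer code) shows the idea:

```java
import java.util.ArrayList;
import java.util.List;

// Abstract sketch of how a conjunction of two MUST clauses proceeds: each side
// repeatedly advances toward the other's current doc id, skipping non-matches
// instead of scoring every document in the index.
public class Leapfrog {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;  // like scorer.advance(b[j]) in Lucene
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] queryDocs  = {2, 5, 9, 14, 20};  // docs matching the user query
        int[] filterDocs = {5, 14, 33};        // docs matching the user-id filter
        System.out.println(intersect(queryDocs, filterDocs));  // prints: [5, 14]
    }
}
```

This is why putting the filter into the BooleanQuery as a MUST clause can beat a post-hoc filter: the sparse side drives the skipping.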


On 23 Oct 2010, at 23:18, Khash Sajadi wrote:

> My index contains documents for different users. Each document has the user 
> id as a field on it.
> 
> There are about 500 different users with 3 million documents.
> 
> Currently I'm calling Search with the query (parsed from user) and 
> FieldCacheTermsFilter for the user id.
> 
> It works but the performance is not great.
> 
> Ideally, I would like to perform the search only on the documents that are 
> relevant, this should make it much faster. However, it seems Search(Query, 
> Filter) runs the query first and then applies the filter.
> 
> Is there a way to improve this? (i.e. run the query only on a subset of 
> documents)
> 
> Thanks





Re: How can I get started for investigating the source code of Lucene ?

2010-11-01 Thread mark harwood
Here's a rough overview I mapped out as a sequence diagram for the search side 
of things some time ago:  http://goo.gl/lE6a


- Original Message 
From: Jeff Zhang 
To: dev@lucene.apache.org
Sent: Mon, 1 November, 2010 5:43:08
Subject: How can I get started for investigating the source code of Lucene ?

Hi all,

I'd like to study the source code of Lucene, but I found there's not
so much documents about the internal structure of lucene. And the
classes are so big that not so readable, could anyone give me
suggestion about How can I get started for investigating the source
code of Lucene ? Any document or blog post would be good .

Thanks


-- 
Best Regards

Jeff Zhang








Re: Document links

2010-11-08 Thread mark harwood
I came to the conclusion that the transient meaning of document ids is too 
deeply ingrained in Lucene's design to use them to underpin any reliable 
linking.
While it might work for relatively static indexes, any index with a reasonable 
number of updates or deletes will invalidate any stored document references in 
ways which are very hard to track. Lucene's compaction shuffles IDs without 
taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling 
IDs" here: http://goo.gl/5UbJi )


Cheers,
Mark


- Original Message 
From: Ryan McKinley 
To: dev@lucene.apache.org
Sent: Mon, 8 November, 2010 19:03:59
Subject: Re: Document links

Any updates/progress with this?

I'm looking at ways to implement an RTree with lucene -- and this
discussion seems relevant

thanks
ryan


On Sat, Sep 25, 2010 at 5:42 PM, mark harwood  wrote:
>>>Both these on disk data structures and the ones in a B+ tree have seek 
offsets
>>>into files
>>>that require disk seeks. And both could use document ids as key values.
>
> Yep. However my approach doesn't use a doc id as a key that is searched in any
> B+ tree index (which involves disk seeks) - it is used as direct offset into a
> file to get the pointer into a "links" data structure.
>
>
>
>>>But do these disk data structures support dynamic addition and deletion of
>>>(larger
>>>numbers of) document links?
>
> Yes, the slide deck I linked to shows how links (like documents) spend the 
>early
> stages of life being merged frequently in the smaller, newer segments and over
> time migrate into larger, more stable segments as part of Lucene transactions.
>
> That's the theory - I'm currently benchmarking an early prototype.
>
>
>
> - Original Message 
> From: Paul Elschot 
> To: dev@lucene.apache.org
> Sent: Sat, 25 September, 2010 22:03:28
> Subject: Re: Document links
>
> Op zaterdag 25 september 2010 15:23:39 schreef Mark Harwood:
>> My starting point in the solution I propose was to eliminate linking via any
>>type of key. Key lookups mean indexes and indexes mean disk seeks. Graph
>>traversals have exponential numbers of links and so all these index disk seeks
>>start to stack up. The solution I propose uses doc ids as more-or-less direct
>>pointers into file structures avoiding any index lookup.
>> I've started coding up some tests using the file structures I outlined and 
>will
>>compare that with a traditional key-based approach.
>
> Both these on disk data structures and the ones in a B+ tree have seek offsets
> into files
> that require disk seeks. And both could use document ids as key values.
>
> But do these disk data structures support dynamic addition and deletion of
> (larger
> numbers of) document links?
>
> B+ trees are a standard solution for problems like this one, and it would
> probably
> not be easy to outperform them.
> It may be possible to improve performance of B+ trees somewhat by specializing
> for the fairly simple keys that would be needed, and by encoding very short
> lists of links
> for a single document directly into a seek offset to avoid the actual seek, 
but
> that's
> about it.
>
> Regards,
> Paul Elschot
>
>>
>> For reference - playing the "Kevin Bacon game" on a traditional Lucene index 
>of
>>IMDB data took 18 seconds to find a short path that Neo4j finds in 200
>>milliseconds on the same data (and this was a disk based graph of 3m nodes, 
10m
>>edges).
>> Going from actor->movies->actors->movies produces a lot of key lookups and 
the
>>difference between key indexes and direct node pointers becomes clear.
>> I know path finding analysis is perhaps not a typical Lucene application but
>>other forms of link analysis e.g. recommendation engines require similar
>>performance.
>>
>> Cheers
>> Mark
>>
>>
>>
>> On 25 Sep 2010, at 11:41, Paul Elschot wrote:
>>
>> > Op vrijdag 24 september 2010 17:57:45 schreef mark harwood:
>> >>>> While not exactly equivalent, it reminds me of our earlier discussion
>>around
>>
>> >>>> "layered segments" for dealing with field updates
>> >>
>> >> Right. Fast discovery of document relations is a foundation on which lots 
>of
>>
>> >> things like this can build. Relations can be given types to support a 
>number
>>of
>>
>> >> different use cases.
>> >
>> > How about using this (bsd licenced) tree as a starting point:
>> > http://bplusdotnet.sourceforge.net/
>> > It has various keys: ao. byt

Re: Document links

2010-11-08 Thread Mark Harwood
What about if we define an id field (like in solr)?


Last time I floated the idea of supporting primary keys as a core concept in 
Lucene (in the context of helping doc updates, not linking) there were 
objections along the lines of "lucene shouldn't try to be a database" 


On 8 Nov 2010, at 20:47, Ryan McKinley  wrote:

On Mon, Nov 8, 2010 at 2:52 PM, mark harwood  wrote:
I came to the conclusion that the transient meaning of document ids is too
deeply ingrained in Lucene's design to use them to underpin any reliable
linking.

What about if we define an id field (like in solr)?

Whatever does the traversal would need to make a Map, but
that is still better then then needing to do a query for each link.


While it might work for relatively static indexes, any index with a reasonable
number of updates or deletes will invalidate any stored document references in
ways which are very hard to track. Lucene's compaction shuffles IDs without
taking care to preserve identity, unlike graph DBs like Neo4j (see "recycling
IDs" here: http://goo.gl/5UbJi )


oh ya -- and it is even more akward since each subreader often reuses
the same docId

ryan










Re: Document links

2010-11-09 Thread Mark Harwood
I was using within-segment doc ids stored in link files named after both the 
source and target segments (a link after all is 2 endpoints).
For a complete solution you ultimately have to deal with the fact that doc ids 
could be references to:
* Stable, committed docs (the easy case)
* Flushed but not yet committed docs
* Buffered but not yet flushed docs
* Flushed/committed but currently merging docs

...all of which are happening in different threads e.g reader has one view of 
the world, a background thread is busy merging segments to create a new view of 
the world even after commits have completed.

All very messy.



Re: New Lucene features and Solr indexes

2013-02-13 Thread mark harwood
>>Instead of making other APIs to accomodate BloomFilter's current
>>brokenness: remove its custom per-field logic so it works with
>>PerFieldPostingsFormat, like every other PF.



Not looked at it in a while but I'm pretty certain, like every other PF, you 
can go ahead and use PerFieldPF with Bloom filter just fine.

What was broken was (is?) that in this configuration PFPF isn't smart enough to 
avoid creating twice as many files as required - see LUCENE-4093.
Until that is resolved (and I have noted my pessimism about that being fixed 
easily) BloomPF contains an optimisation for those that want to avoid this 
inefficiency.
The use of that optimisation is entirely optional for users.
Internally to BloomPF, the implementation of that optimisation is trivial  - if 
a null bloom set is returned for a given field it ignores the usual bloom 
filtering logic and delegates directly to the wrapped codec. 
You can choose to implement a BloomFilterFactory that adds this field-choice 
optimisation or, more simply run the default PerFieldPF-managed configuration 
and live with the increased numbers of files.
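As a rough illustration of the mechanism being argued over (not the actual BloomPF code; the hash choices here are made up): a per-field bitset answers "definitely not present" cheaply before the real, disk-backed terms dictionary is consulted.

```java
import java.util.BitSet;

// Minimal Bloom-filter sketch of the idea behind BloomPF: set a few bits per
// indexed term; at query time, an unset bit proves the term is absent, so the
// expensive terms-dictionary lookup can be skipped entirely.
public class BloomSketch {
    private final BitSet bits;
    private final int size;

    BloomSketch(int size) { this.size = size; this.bits = new BitSet(size); }

    // Two cheap hash positions per term (illustrative, not BloomPF's hashing)
    private int h1(String t) { return Math.floorMod(t.hashCode(), size); }
    private int h2(String t) { return Math.floorMod(t.hashCode() * 31 + 7, size); }

    void add(String term) { bits.set(h1(term)); bits.set(h2(term)); }

    // false => term certainly not in the segment; true => check the real dictionary
    boolean mightContain(String term) { return bits.get(h1(term)) && bits.get(h2(term)); }

    public static void main(String[] args) {
        BloomSketch b = new BloomSketch(1 << 16);
        b.add("lucene");
        System.out.println(b.mightContain("lucene"));  // prints: true
        System.out.println(b.mightContain("zzz"));     // almost certainly false
    }
}
```

The per-field wrinkle in the thread is only about *where* the decision to build (or skip) such a bitset for a given field lives: in PerFieldPF or in the factory passed to BloomPF.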

Arguably, the inefficiencies of the PerFieldPF framework are the real issue to 
be addressed here.

>>I brought this up before it was committed, and i was ignored

You stopped engaging in the debate when I outlined the 3 proposed options for 
moving BloomPF forward :  http://goo.gl/mxtP9
Those options were:
1) ignore the inefficiencies in PFPF
2) sort out the issues in PFPF (4093 but probably a more complex solution)
3) work around existing PFPF issues with a simple but entirely optional 
optimisation to BloomPF

I opted for 3) and gave notice that I'd take it out if anyone objected. 
I don't think there's been any movement on 2) so I guess you're still happy 
with option 1)? I recall you didn't think the business of extra files was that 
much of a concern: http://goo.gl/eJWo3


(Incidentally, probably best following up on the relevant Jiras rather than 
here)

Cheers
Mark




 From: Robert Muir 
To: dev@lucene.apache.org 
Sent: Wednesday, 13 February 2013, 13:01
Subject: Re: New Lucene features and Solr indexes
 
On Wed, Feb 13, 2013 at 4:42 AM, Adrien Grand  wrote:
> Hi Shawn,
>
> On Tue, Feb 12, 2013 at 8:58 PM, Shawn Heisey  wrote:
>> Some of these, like compressed stored fields and compressed termvectors, are
>> being turned on by default, which is awesome.  I'm already running a 4.2
>> snapshot, so I've got those in place.
>
> Excellent!
>
>> One thing that I know I would like to do is use the new BloomFilter for a
>> couple of my fields that contain only unique values.  Last time I checked
>> (which was before the 4.1 release), if you added the lucene-codecs jar, Solr
>> had a BloomFilter postings format, but didn't have any way to specify the
>> underlying format.  See SOLR-3950 and LUCENE-4394.
>
> BloomFilterPostingsFormat is a little special compared to other
> postings formats because it can wrap any postings format. So maybe it
> should require special support, like an additional attribute in the
> field type definition?

-1

Instead of making other APIs to accomodate BloomFilter's current
brokenness: remove its custom per-field logic so it works with
PerFieldPostingsFormat, like every other PF.

In other words, it should work just like pulsing.

I brought this up before it was committed, and i was ignored. Thats
fine, but I'll be damned if i let its incorrect design complicate
other parts of the codebase too. I'd rather it continue to stay
difficult to integrate and continue walking its current path to an
open source death instead.


Re: New Lucene features and Solr indexes

2013-02-13 Thread mark harwood
>>should be a stupid simple postings format like any other postings format with 
>>a default configuration

It does have a default config. It just needs a PF delegate in the constructor, 
just like Pulsing.
Like Rob said:
>>In other words, it should work just like pulsing.


So far so good.

Now where people are getting upset (for no particularly good reason in my view) 
around per-field stuff:  if you really, really want to you can supply a 
subclass of BloomFilterFactory to your BloomPF constructor which allows 
customised control over choice of hashing algo, bitset sizing and saturation 
policies if the DefaultBloomFilterFactory fails to make the right choices.  
99.9% of people will not do this. The reason it is a factory object and not 
some dumb settings is that it is called on a per-segment basis with state info 
that is useful context in making sizing choices.  Now, (horror of horrors), the 
factory's API is passed a FieldInfo object in the method designed to produce a 
bitset. It is conceivable that some rogue agents could choose to implement some 
per-field decisions here if the same BloomPF instance was registered to handle 
>1 field. In addition, BloomPF has some common-sense defensive coding that 
checks if the factory returns null for the bitset - in which case it delegates 
all calls un-bloomed directly to the delegate codec. 

None of this prevents the use of BloomPF with the prescribed PerFieldPF manner 
for handling field-specific choices.

I happen to use a custom BloomFilterFactory to implement a more efficient 
indexing pipeline than the prescribed PerFieldPF route of implementing all 
per-field policies "up high" in the stack -  but none of that is at the cost of 
a clean BloomPF API or with any unnecessary duplication of PerFieldPF logic. 

If anything needs changing here there may be a case for providing a convenience 
class that weds BloomPF and a default choice of Lucene40 codec so it can help 
with whatever Solr and other config-driven engines may need, i.e. zero-arg 
constructors, if that's how their registry of codecs works.

Cheers
Mark













 From: Uwe Schindler 
To: dev@lucene.apache.org 
Sent: Wednesday, 13 February 2013, 16:47
Subject: RE: New Lucene features and Solr indexes
 
Hi Shawn,

I was also arguing about this at the time it was committed. I fully agree with 
Robert: the current API is not in good shape!
I have the same feeling: Bloom postings should be a stupid-simple postings 
format like any other postings format, with a default configuration. If you 
really want to change its configuration, you can subclass it as a separate 
postings format.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -Original Message-
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: Wednesday, February 13, 2013 3:59 PM
> To: dev@lucene.apache.org
> Subject: Re: New Lucene features and Solr indexes
> 
> >> BloomFilterPostingsFormat is a little special compared to other
> >> postings formats because it can wrap any postings format. So maybe it
> >> should require special support, like an additional attribute in the
> >> field type definition?
> >
> > -1
> >
> > Instead of making other APIs to accommodate BloomFilter's current
> > brokenness: remove its custom per-field logic so it works with
> > PerFieldPostingsFormat, like every other PF.
> >
> > In other words, it should work just like pulsing.
> >
> > I brought this up before it was committed, and i was ignored. Thats
> > fine, but I'll be damned if i let its incorrect design complicate
> > other parts of the codebase too. I'd rather it continue to stay
> > difficult to integrate and continue walking its current path to an
> > open source death instead.
> 
> Robert,
> 
> I have to send you a general thank you for your dedication to the quality of
> this project, and for your amazing ability to seemingly keep the entire design
> for Lucene in your head at all times.
> 
> I'm not sure what exactly you want to die here, or what you think would be
> the best option for me, the Solr end-user.  Is BloomFilter something that's
> not worth pursuing, or would you just like it to be integrated in a different
> way?
> 
> Thanks,
> Shawn
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
> commands, e-mail: dev-h...@lucene.apache.org


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Created] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-18 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-4069:


 Summary: Segment-level Bloom filters for a 2 x speed up on rare 
term searches
 Key: LUCENE-4069
 URL: https://issues.apache.org/jira/browse/LUCENE-4069
 Project: Lucene - Java
  Issue Type: Improvement
  Components: core/index
Affects Versions: 3.6
Reporter: Mark Harwood
Priority: Minor
 Fix For: 3.6.1


An addition to each segment which stores a Bloom filter for selected fields in 
order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many 
segments but also speeds up general searching in my tests.

Overview slideshow here: 
http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no API changes currently - to play just add a field with "_blm" on 
the end of the name to invoke special indexing/querying capability. Clearly a 
new Field or schema declaration(!) would need adding to APIs to configure the 
service properly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-18 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: MHBloomFilterOn3.6Branch.patch

Initial patch

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: MHBloomFilterOn3.6Branch.patch
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PrimaryKey40PerformanceTestSrc.zip
BloomFilterCodec40.patch

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterCodec40.patch
PrimaryKey40PerformanceTestSrc.zip

I've ported this Bloom Filtering code to work as a 4.0 Codec now.
I see a 35% improvement over standard Codecs on random lookups on a warmed 
index. 

I also notice that the PulsingCodec is no longer faster than the standard Codec 
- is this news to people? I thought it was supposed to be the way forward.

My test rig (adapted from Mike's original primary key test rig here 
http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html)
 is attached as a zip.
The new BloomFilteringCodec is also attached here as a patch.

Searches against plain text fields also look to be faster (using AOL500k 
queries searching Wikipedia English) but obviously that particular test rig is 
harder to include as an attachment here.

I can open a separate JIRA issue for this 4.0 version of the code if that makes 
more sense.



> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: PrimaryKey40PerformanceTestSrc.zip)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283583#comment-13283583
 ] 

Mark Harwood commented on LUCENE-4069:
--

My current focus is speeding up primary key lookups, but this may have 
applications outside of that (Zipf tells us there is a lot of low-frequency 
stuff in free text).

Following the principle that the best IO is no IO, the Bloom filter helps us 
quickly understand which segments are even worth looking in. That has to be a 
win overall.

I started trying to write this Codec as a wrapper for any other Codec (it 
simply listens to a stream of terms and stores a bitset of recorded hashes in a 
".blm" file). However, that was trickier than I expected - I'd need to encode a 
special entry in my blm files just to know the name of the delegated codec I 
needed to instantiate at read-time because Lucene's normal Codec-instantiation 
logic would be looking for "BloomCodec" and I'd have to discover the delegate 
that was used to write all of the non-blm files.

Not looked at FixedBitSet but I imagine that could be used instead.
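
The fast-fail idea above can be sketched in a few lines. This is a hedged illustration with hypothetical names, not the patch's code: each segment records term hashes in a bitset, and a lookup only touches the segment's terms dictionary (and hence disk) when the filter says the term might be present.

```java
import java.util.BitSet;

// Minimal per-segment Bloom filter sketch (hypothetical, single hash
// function for brevity; the patch uses MurmurHash for better spread).
class SegmentBloomFilter {
    private final BitSet bits;
    private final int mask; // size is a power of two, so mask = size - 1

    SegmentBloomFilter(int sizePow2) {
        this.bits = new BitSet(sizePow2);
        this.mask = sizePow2 - 1;
    }

    private int hash(String term) {
        // Placeholder hash, spread the high bits into the low bits.
        int h = term.hashCode();
        return (h ^ (h >>> 16)) & mask;
    }

    void add(String term) {
        bits.set(hash(term));
    }

    // false => term is definitely absent: skip this segment, no disk access.
    // true  => term may be present (false positives possible): fall through
    //          to the segment's terms dictionary.
    boolean mightContain(String term) {
        return bits.get(hash(term));
    }
}
```

A negative answer is authoritative while a positive one is only probable, which is exactly why this suits rare-term and primary-key lookups across many segments.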

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-25 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283615#comment-13283615
 ] 

Mark Harwood commented on LUCENE-4069:
--

Update- I've discovered this Bloom Filter Codec currently has a bug where it 
doesn't handle indexes with >1 field.
It's probably all tangled up in the "PerField..." codec logic so I need to do 
some more digging.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>    Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: MHBloomFilterOn3.6Branch.patch, 
> PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterCodec40.patch

Fixed the issue with >1 field in an index.
Tests on random lookups on Wikipedia titles (unique keys) now show a 3 x speed 
up for a Bloom-filtered index over standard 4.0 Codec for fully warmed indexes.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>    Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-28 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284536#comment-13284536
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  I wonder if that also helps indexing in terms of applying deletes. did you 
test that

I've not looked into that particularly but I imagine this may be relevant. 

Thanks for the tips for making the patch more generic. I'll get on it tomorrow 
and change to FixedBitSet while I'm at it.

bq. I also wonder if we can extract a "bloomfilter" class into utils

There's some reusable stuff in this patch for downsizing the bitset (according 
to desired saturation levels) after accumulating a stream of values.
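
The downsizing idea can be sketched as follows. This is a hedged illustration, not the patch's actual code: hashes go into a generously sized power-of-two bitset while terms stream past, then the set is folded in half (OR-ing the two halves) until further halving would push saturation past a target. A real implementation would also report the final size back so lookups can use it as their hash mask.

```java
import java.util.BitSet;

// Hypothetical sketch of right-sizing a bloom bitset after accumulation.
class RightSizer {

    // Fold a power-of-two sized bitset down while the folded set's
    // saturation (fraction of set bits) stays at or below maxSaturation.
    static BitSet rightSize(BitSet bits, int size, double maxSaturation) {
        while (size > 1) {
            int half = size / 2;
            BitSet folded = new BitSet(half);
            for (int i = 0; i < half; i++) {
                // OR the two halves together; membership is preserved
                // under the smaller mask (i & (half - 1)).
                folded.set(i, bits.get(i) || bits.get(i + half));
            }
            if ((double) folded.cardinality() / half > maxSaturation) {
                break; // halving again would make the set too saturated
            }
            bits = folded;
            size = half;
        }
        return bits;
    }
}
```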

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: MHBloomFilterOn3.6Branch.patch, 
> PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterCodec40.patch

Updated to work with trunk.
* Changed to use FixedBitSet
* Is now a PostingsFormat abstract base class
* Added missing MurmurHash class

TODOs
* Move Bloom filter logic to common utils classes
* Use Service Providers for pluggable choice of hash algos?
* Expose settings for memory/saturation

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no API changes currently - to play just add a field with "_blm" on 
> the end of the name to invoke special indexing/querying capability. Clearly a 
> new Field or schema declaration(!) would need adding to APIs to configure the 
> service properly.







[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

  Description: 
An addition to each segment which stores a Bloom filter for selected fields in 
order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many 
segments but also speeds up general searching in my tests.

Overview slideshow here: 
http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no 3.6 API changes currently - to play just add a field with "_blm" 
on the end of the name to invoke special indexing/querying capability. Clearly 
a new Field or schema declaration(!) would need adding to APIs to configure the 
service properly.

Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat

  was:
An addition to each segment which stores a Bloom filter for selected fields in 
order to give fast-fail to term searches, helping avoid wasted disk access.

Best suited for low-frequency fields e.g. primary keys on big indexes with many 
segments but also speeds up general searching in my tests.

Overview slideshow here: 
http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments

Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU

Patch based on 3.6 codebase attached.
There are no API changes currently - to play just add a field with "_blm" on 
the end of the name to invoke special indexing/querying capability. Clearly a 
new Field or schema declaration(!) would need adding to APIs to configure the 
service properly.

Affects Version/s: 4.0
Fix Version/s: 4.0

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterCodec40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat
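For readers new to the fast-fail idea in the description above: before any terms-dictionary seek, a per-segment Bloom filter can answer "definitely absent" with a couple of cheap bit probes. A minimal self-contained sketch (class name and hash choices are illustrative only, not Lucene's actual API):

```java
import java.util.BitSet;

// Minimal sketch of the fast-fail idea: before any disk seek, a per-segment
// Bloom filter answers "definitely absent" with a couple of bit probes.
// Class name and hash choices here are illustrative, not Lucene's.
public class BloomSketch {
  private final BitSet bits;
  private final int size;

  public BloomSketch(int size) {
    this.size = size;
    this.bits = new BitSet(size);
  }

  // two cheap hash functions derived from String.hashCode()
  private int h1(String term) { return Math.floorMod(term.hashCode(), size); }
  private int h2(String term) { return Math.floorMod(term.hashCode() * 31 + 7, size); }

  public void add(String term) {   // called once per indexed term
    bits.set(h1(term));
    bits.set(h2(term));
  }

  // false => term certainly absent from this segment: skip the disk access.
  // true  => maybe present: fall through to the real terms dictionary.
  public boolean mayContain(String term) {
    return bits.get(h1(term)) && bits.get(h2(term));
  }

  public static void main(String[] args) {
    BloomSketch bloom = new BloomSketch(1024);
    bloom.add("primaryKey-42");
    System.out.println(bloom.mayContain("primaryKey-42")); // true
  }
}
```

A real filter would size the bit set against the expected term count to keep the false-positive rate low; a positive answer still has to be confirmed against the terms dictionary.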

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-29 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285100#comment-13285100
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I think you should really not provide any field handling at all.

By that I think you mean keep the abstract BloomFilteringPostingsFormatBase and 
dispense with the BloomFilteringLucene40Codec (and 
BloomFilteredLucene40PostingsFormat?). I was trying to limit the extensions 
apps would have to write to use this service (one custom PostingsFormat 
subclass, one custom Codec subclass and one custom SPI config file), but 
equally I can see that we shouldn't offer implementations for all the many 
different service permutations.

I'll look at adding something to RandomCodec for Bloom-plus-random delegate 
PostingsFormat.

bq. I am still worried about the TermsEnum reuse code, are you planning to look 
into this?
bq.  you keep on switching back an forth creating new delegated TermsEnum 
instances

I'm not sure what you mean by creating new delegated TermsEnum instances?
In my impl of "iterator(TermsEnum reuse)" I take care to unwrap my wrapper for 
TermsEnum to find the original delegate's TermsEnum and then call the 
delegateTerms iterator method with this object as the reuse parameter. At this 
stage shouldn't the delegate Terms just recycle that unwrapped TermsEnum as per 
the normal reuse contract when no wrapping has been done? 
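Stripped of Lucene's concrete types, the reuse contract being described boils down to unwrap-then-delegate. A minimal sketch with stand-in classes (every name here is hypothetical):

```java
// Sketch of the reuse contract described above, with stand-in classes (none
// of these are Lucene's real types): iterator(reuse) unwraps its own wrapper
// and hands the naked delegate enum back so the delegate can recycle it.
public class ReuseSketch {
  static class TermsEnumStub {}  // stands in for a delegate's TermsEnum

  static class FilteredTermsEnum extends TermsEnumStub {
    final TermsEnumStub delegate;
    FilteredTermsEnum(TermsEnumStub delegate) { this.delegate = delegate; }
  }

  // mimics iterator(TermsEnum reuse): unwrap, then let the delegate recycle
  static FilteredTermsEnum iterator(TermsEnumStub reuse) {
    TermsEnumStub unwrapped = (reuse instanceof FilteredTermsEnum)
        ? ((FilteredTermsEnum) reuse).delegate
        : reuse;
    // a real impl would call delegateTerms.iterator(unwrapped) here; we
    // simulate the delegate honouring the reuse contract
    TermsEnumStub recycled = (unwrapped != null) ? unwrapped : new TermsEnumStub();
    return new FilteredTermsEnum(recycled);
  }

  public static void main(String[] args) {
    FilteredTermsEnum first = iterator(null);
    FilteredTermsEnum second = iterator(first);
    // the underlying enum was recycled across calls
    System.out.println(first.delegate == second.delegate); // true
  }
}
```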

bq. you should also add license headers to the files you are adding

Will do.
 




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285518#comment-13285518
 ] 

Mark Harwood commented on LUCENE-4069:
--

Thanks for the comment, Rob.
While the choice of Codec can be an anonymous inner class, resolving the choice 
of PostingsFormat is trickier.
BloomFilterPostingsFormat is now intended to wrap any other choice of 
PostingsFormat, and Simon has suggested leaving the Bloom support purely 
abstract.
However, as an end user if I want to use Bloom support on the standard Lucene 
codec I would then have to write one of these:
{code:title=MyBloomFilteredLucene40Postings.java}
public class MyBloomFilteredLucene40Postings extends
    BloomFilteringPostingsFormatBase {

  public MyBloomFilteredLucene40Postings() {
    // declare my choice of PostingsFormat to be wrapped and provide a
    // unique name for this combo of Bloom-plus-delegate
    super("myBL40", new Lucene40PostingsFormat());
  }
}
{code}
The resulting index files are then named [segname]_myBL40.[filetypesuffix].
At read-time the "myBL40" bit of the filename is used to lookup via Service 
Provider registrations the decoding class so 
"com.xx.MyBloomFilteredLucene40Postings" would need adding to a 
o.a.l.codecs.PostingsFormat file for the registration to work.

I imagine Bloom-plus-Lucene40Postings would be a common combo and if both are 
in core it would be annoying to have to code support for this in each app or 
for things like Luke to have to have classpaths redefined to access the 
app-specific class that was created purely to bind this combo of core 
components.

I think a better option might be to change the Bloom filtering base class to 
record the choice of delegate PostingsFormat in its own "blm" file at 
write-time and instantiate the appropriate delegate instance at read-time using 
the recorded name. The BloomFilteringBaseClass would need changing to a final 
class rather than an abstract one, so that core Lucene could load it as the 
handler for [segname]_BloomPosting.xxx files; it would then examine the 
[segname].blm file to discover and instantiate the chosen delegate 
PostingsFormat using the standard service registration mechanism. At write-time 
clients would instantiate the BloomFilterPostingsFormat, passing a choice of 
PostingsFormat delegate to the constructor. At read-time Lucene core would 
invoke a zero-arg constructor.
I'll look into this as an approach.
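A minimal sketch of that approach, with a plain Map standing in for the real SPI registry; the file layout and all names here are purely illustrative:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.function.Supplier;

// Sketch of the approach floated above: the wrapper records its delegate
// PostingsFormat's name in its own ".blm" file at write-time, and a zero-arg
// instance re-reads that name at read-time. A Map stands in for the real SPI
// registry; all names and the file layout are illustrative only.
public class DelegateRecordingSketch {
  static final Map<String, Supplier<Object>> REGISTRY =
      Map.of("Lucene40", Object::new); // name -> delegate factory

  static void writeBlm(Path dir, String delegateName) {
    try {
      Files.writeString(dir.resolve("_1.blm"), delegateName); // write-time
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static Object openDelegate(Path dir) {
    try {
      String name = Files.readString(dir.resolve("_1.blm")); // read-time
      return REGISTRY.get(name).get(); // SPI lookup in the real thing
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static boolean roundTrips() {
    try {
      Path dir = Files.createTempDirectory("blm");
      writeBlm(dir, "Lucene40");
      return openDelegate(dir) != null;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrips()); // true
  }
}
```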






 




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285706#comment-13285706
 ] 

Mark Harwood commented on LUCENE-4069:
--

Aaaargh. Unless I've missed something, I have concerns with the fundamental 
design of the current Codec loading mechanism.

It seems too tied to the concept of a ServiceProvider class-loading mechanism, 
forcing users to write new SPI-registered classes in order to simply declare 
what amount to index schema configuration choices.

Example: If I take Rob's sample Codec above and choose to use a subtly 
different configuration of the same PostingsFormat class for different fields 
it breaks:

{code:title=ThisBreaks.java}
  Codec fooCodec = new Lucene40Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field)) {
        return new FooPostingsFormat(1);
      }
      if ("title".equals(field)) {
        // same impl as "text" field, different constructor settings
        return new FooPostingsFormat(2);
      }
      return super.getPostingsFormatForField(field);
    }
  };
{code}
This causes a file overwrite error as PerFieldPostingsFormat uses the same name 
from FooPostingsFormat(1) and FooPostingsFormat(2) to create files.
In order to safely make use of differently configured choices of the same 
PostingsFormat we are forced to declare a brand new subclass with a unique new 
service name and entry in the service provider registration. This is 
essentially where I have got to in trying to integrate this Bloom filtering 
logic.
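Stripped of Lucene specifics, the collision is easy to reproduce; in this hypothetical sketch, two differently configured instances share one getName() and therefore derive the same file name:

```java
import java.util.HashSet;
import java.util.Set;

// Tiny illustration of the collision described above: two differently
// configured instances of one (hypothetical) PostingsFormat report the same
// name, so per-field file names derived from that name clash.
public class NameCollisionSketch {
  static class FooPostingsFormat {
    final int setting;
    FooPostingsFormat(int setting) { this.setting = setting; }
    String getName() { return "Foo"; } // ignores the constructor config
  }

  static String fileFor(String segment, FooPostingsFormat pf) {
    return segment + "_" + pf.getName() + ".ram";
  }

  static boolean collides() {
    Set<String> written = new HashSet<>();
    written.add(fileFor("_2", new FooPostingsFormat(1)));
    // second add fails: identical file name, hence "Cannot overwrite"
    return !written.add(fileFor("_2", new FooPostingsFormat(2)));
  }

  public static void main(String[] args) {
    System.out.println(collides()); // true
  }
}
```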

This dependency on writing custom classes seems to make everything a bit 
fragile, no? What hope has Luke got of opening the average index without 
careful assembly of classpaths etc?
If I contrast this with the world of database schemas it seems absurd to have a 
reliance on writing custom classes with no behaviour simply in order to 
preserve a configuration of an application's schema settings. Even an IOC 
container with XML declarations would offer a more agile means of assembling 
pre-configured *beans* rather than relying on a Service Provider mechanism that 
is only serving as a registry of *classes*.

Anyone else see this as a major pain?














[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285744#comment-13285744
 ] 

Mark Harwood commented on LUCENE-4069:
--

This fails if you add docs with title and text fields:
{code:title=ThisCrashes.java}
  Codec fooCodec = new Lucene40Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field)) {
        return new MemoryPostingsFormat(false);
      }
      if ("title".equals(field)) {
        return new MemoryPostingsFormat(true);
      }
      return super.getPostingsFormatForField(field);
    }
  };
{code}

  Exception in thread "main" java.io.IOException: Cannot overwrite: 
C:\temp\luceneCodecs\_2_Memory.ram

This also fails:

{code:title=ThisToo.java}
  Codec fooCodec = new Lucene40Codec() {
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field)) {
        return new SimpleTextPostingsFormat();
      }
      if ("title".equals(field)) {
        return new SimpleTextPostingsFormat();
      }
      return super.getPostingsFormatForField(field);
    }
  };
{code}
with 
  Exception in thread "main" java.io.IOException: Cannot overwrite: 
C:\temp\luceneCodecs\_1_SimpleText.pst

Whereas sharing the same instance of a PostingsFormat class across fields works:

{code:title=ThisWorks.java}
  Codec fooCodec = new Lucene40Codec() {
    final SimpleTextPostingsFormat theSimple = new SimpleTextPostingsFormat();
    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
      if ("text".equals(field) || "title".equals(field)) {
        return theSimple;
      }
      return super.getPostingsFormatForField(field);
    }
  };
{code}





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285765#comment-13285765
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. its just an issue with PerFieldPostingsFormat

OK, thanks. My guess is you'll effectively have to supplement 
PostingsFormat.getName() with an object instance ID in file names.
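One hypothetical way such an instance-ID supplement could look (purely a sketch of the guess above, not the actual LUCENE-4090 fix):

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Purely hypothetical sketch of the guess above (not LUCENE-4090's actual
// fix): keep the format's name but append a per-instance counter when
// building segment file suffixes, so two differently configured instances
// of the same format no longer map to the same file name.
public class SuffixSketch {
  private final Map<Object, Integer> ids = new IdentityHashMap<>();

  String suffixFor(String formatName, Object formatInstance) {
    // assign each distinct instance a stable small id on first sight
    int id = ids.computeIfAbsent(formatInstance, k -> ids.size());
    return formatName + "_" + id;
  }

  public static void main(String[] args) {
    SuffixSketch s = new SuffixSketch();
    Object pulsing1 = new Object(), pulsing2 = new Object();
    System.out.println(s.suffixFor("Pulsing", pulsing1)); // Pulsing_0
    System.out.println(s.suffixFor("Pulsing", pulsing2)); // Pulsing_1
    System.out.println(s.suffixFor("Pulsing", pulsing1)); // Pulsing_0 again
  }
}
```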





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-30 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285773#comment-13285773
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. When I run all tests with the bloom 4.0 postings format (ant test-core 
-Dtests.postingsformat=BloomFilteredLucene40PostingsFormat), 

Thanks for the pointer on targeting codec testing. I have another patch to 
upload with various tweaks e.g. configurable choice of hash functions, 
RandomCodec additions so I will concentrate testing on that before uploading.




[jira] [Commented] (LUCENE-4090) PerFieldPostingsFormat cannot use name as suffix

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286411#comment-13286411
 ] 

Mark Harwood commented on LUCENE-4090:
--

Thanks for the quick fix, Rob :)
Working fine for me here now.

> PerFieldPostingsFormat cannot use name as suffix
> 
>
> Key: LUCENE-4090
> URL: https://issues.apache.org/jira/browse/LUCENE-4090
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 4.0
>Reporter: Robert Muir
>Assignee: Robert Muir
> Fix For: 4.0, 5.0
>
> Attachments: LUCENE-4090.patch, LUCENE-4090.patch
>
>
> Currently PFPF just records the name in the metadata, which matches up to the 
> segment suffix. But this isnt enough, e.g. someone can use Pulsing(1) on one 
> field and Pulsing(2) on another field.
> See Mark Harwood's examples struggling with this on LUCENE-4069.




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostings40.patch

This is looking more promising.

Running "ant test-core 
-Dtests.postingsformat=TestBloomFilteredLucene40Postings" now passes all tests 
but causes OOM exception on 3 tests:
* TestConsistentFieldNumbers.testManyFields
* TestIndexableField.testArbitraryFields
* TestIndexWriter.testManyFields

Any pointers on how to annotate or otherwise avoid the BloomFilter class for 
"many-field" tests would be welcome. These are not realistic tests for this 
class (we don't expect indexes with 100s of primary-key like fields).

In this patch I've
* added an SPI lookup mechanism for pluggable hash algos.
* documented the file format
* fixed issues with TermVector tests
* changed the API


To use:
BloomFilteringPostingsFormat now takes a delegate PostingsFormat and a set of 
field names that are to have Bloom filters created.
Fields that are not listed in the filter set can safely be indexed as normal, 
and doing so is beneficial because it allows filtered and non-filtered field 
data to co-exist in the same physical files created by the delegate 
PostingsFormat.





[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterCodec40.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostings40.patch)




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostings40.patch

Added missing class

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostings40.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKey40PerformanceTestSrc.zip
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286598#comment-13286598
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. Instead i think the concrete Bloom+Lucene40 that you have in tests should 
be moved into src/java and registered there

What problem would that be trying to solve? Registration (or creation) of any 
BloomFilteringPostingsFormat subclasses is not necessary to decode index 
contents. Offering a "Bloom40" would only buy users a pairing of 
Lucene40Postings and Bloom filtering but they would still have to declare which 
fields they want Bloom filtering on at write time. This isn't too hard using 
the code in the existing patch:

{code:title=ThisWorks.java}
final Set<String> bloomFilteredFields = new HashSet<String>();
bloomFilteredFields.add(PRIMARY_KEY_FIELD_NAME);

iwc.setCodec(new Lucene40Codec() {
  BloomFilteringPostingsFormat postingOptions =
      new BloomFilteringPostingsFormat(new Lucene40PostingsFormat(), bloomFilteredFields);
  @Override
  public PostingsFormat getPostingsFormatForField(String field) {
    return postingOptions;
  }
});
{code}
No extra subclasses/registration required here to read the index built with the 
above setup.
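For anyone new to the underlying technique: the fast-fail behaviour discussed here comes from a standard Bloom filter — a bitset plus several hash probes, where any clear bit proves the term was never indexed. A minimal stdlib-only sketch of the idea (purely illustrative, not the patch's actual implementation):

```java
import java.util.BitSet;

/**
 * Minimal Bloom filter sketch: a clear bit proves a term is absent,
 * so lookups for unindexed terms can fail fast without touching disk.
 */
class BloomSketch {
  private final BitSet bits;
  private final int size;
  private final int numHashes;

  BloomSketch(int size, int numHashes) {
    this.bits = new BitSet(size);
    this.size = size;
    this.numHashes = numHashes;
  }

  // Cheap seeded hash for illustration only - real implementations mix harder.
  private int probe(String term, int seed) {
    int h = term.hashCode() * 31 + seed;
    h ^= (h >>> 16);
    return Math.floorMod(h, size);
  }

  void add(String term) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(probe(term, i));
    }
  }

  /** false => definitely absent (fast-fail); true => check the real postings. */
  boolean mightContain(String term) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(probe(term, i))) {
        return false; // clear bit: no disk access needed
      }
    }
    return true;
  }
}
```

A filter like this can return false positives but never false negatives, which is exactly the property a fast-fail front to the terms dictionary needs.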





[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286600#comment-13286600
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. An alternative would be to just pick this less often in RandomCodec: see 
the SimpleText hack 

Another option might be to make the TestBloomFilteredLucene40Postings pick a 
ludicrously small Bitset sizing option for each field so that we can 
accommodate tests that create silly numbers of fields. The bitsets being so 
small will just quickly reach saturation and force all reads to hit the 
underlying FieldsProducer.
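The saturation effect can be seen with java.util.BitSet directly: with a ludicrously small sizing, a handful of terms sets every bit, after which the filter can never fast-fail and degrades — correctly, just slowly — into always consulting the delegate. A small demonstration (illustrative names, not the patch's API):

```java
import java.util.BitSet;

/** Shows a tiny Bloom bitset saturating: once full, it can never fast-fail. */
class SaturationDemo {
  static BitSet fill(int size, int numTerms) {
    BitSet bits = new BitSet(size);
    for (int t = 0; t < numTerms; t++) {
      // One probe per term, as a single-hash Bloom filter would do.
      bits.set(Math.floorMod(("term" + t).hashCode(), size));
    }
    return bits;
  }

  public static void main(String[] args) {
    BitSet bits = fill(8, 100); // ludicrously small sizing, many terms
    // Every bit is set, so every membership probe answers "maybe present"
    // and all reads fall through to the underlying FieldsProducer.
    System.out.println("saturated = " + (bits.cardinality() == 8));
  }
}
```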




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286616#comment-13286616
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I dont understand why this handles fields. Someone should just pick that 
with perfieldpostingsformat.

That would be inefficient because your PFPF will see 
BloomFilteringPostingsFormat(field1 + Lucene40) and 
BloomFilteringPostingsFormat(field2 + Lucene40) as fundamentally different 
PostingsFormat instances and consequently create multiple files named 
differently because it assumes these instances may be capable of using 
radically different file structures.
In reality, the choice of BloomFilter with field 1 or BloomFilter with field 2 
or indeed no BloomFilter does not fundamentally alter the underlying delegate 
PostingFormat's file format - it only adds a supplementary "blm" file on the 
side with the field summaries. For this reason it is a mistake to configure 
separate BloomFilterPostingsFormat instances on a per-field basis if they can 
share a common delegate.






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286707#comment-13286707
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq.  To solve what you speak of we just need to resolve LUCENE-4093. 

Presumably the main objective here is that in order to cut down on the number 
of files we store, content consumers of various types should aim to consolidate 
multiple fields' contents into a single file (if they share common config 
choices). 

bq. Then multiple postings format instances that are 'the same' will be 
deduplicated correctly.

The complication in this case is that we essentially have 2 consumers (Bloom 
and Lucene40), one wrapped in the other with different but overlapping choices 
of fields e.g we want a single Lucene40 to process all fields but we want Bloom 
to handle only a subset of these fields. This will be a tough one for PFPF to 
untangle while we are stuck with a delegating model for composing consumers. 

This may be made easier if instead of delegating a single stream we have a 
*stream-splitting* capability via a multicast subscription e.g. Bloom filtering 
consumer registers interest in content streams for fields A and B while 
Lucene40 is consolidating content from fields A, B, C and D. A broadcast 
mechanism feeds each consumer a copy of the relevant stream and each consumer 
is responsible for inventing their own file-naming convention that avoids 
muddling files.

While that may help for writing streams it doesn't solve the re-assembly of 
"producer" streams at read-time where BloomFilter absolutely has to position 
itself in front of the standard Lucene40 producer in order to offer fast-fail 
lookups. 

In the absence of a fancy optimised routing mechanism (this all may be 
overkill) my current solution was to put BloomFilter in the delegate chain 
armed with a subset of fieldnames to observe as a larger array of fields flow 
past to a common delegate. I added some Javadocs to describe the need to do it 
this way for an efficient configuration.
You are right that this is messy (ie open to bad configuration) but operating 
this deep down in Lucene that's always a possibility regardless of what we put 
in place.
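The multicast subscription idea sketched above could look roughly like this — every name here is hypothetical, nothing of this shape exists in Lucene's codec API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-field stream consumer, e.g. a Bloom writer or Lucene40-style writer. */
interface FieldConsumer {
  void consume(String field, String term);
}

/** Hypothetical broadcaster: feeds each consumer only the fields it subscribed to. */
class FieldBroadcaster {
  private final Map<FieldConsumer, Set<String>> subscriptions = new LinkedHashMap<>();

  /** A null field set means "subscribe to everything". */
  void subscribe(FieldConsumer consumer, Set<String> fields) {
    subscriptions.put(consumer, fields);
  }

  void write(String field, String term) {
    for (Map.Entry<FieldConsumer, Set<String>> e : subscriptions.entrySet()) {
      if (e.getValue() == null || e.getValue().contains(field)) {
        e.getKey().consume(field, term); // each subscriber gets its own copy
      }
    }
  }
}
```

Here a Bloom consumer would subscribe to fields A and B while a Lucene40-style consumer subscribes to everything, each naming its own files — which is the write-side half of the idea; read-time reassembly remains the harder problem noted above.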








[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286754#comment-13286754
 ] 

Mark Harwood commented on LUCENE-4069:
--

It's true to say that Bloom is a different case to Pulsing - Bloom does not 
interfere in any way with the normal recording of content in the wrapped delegate, 
whereas Pulsing does.
It may prove useful for us to mark a formal distinction between these 
mutating/non-mutating types so we can treat them differently and provide 
optimisations?


bq. And separately, you can always contain the number of files even today by 
using only unique instances yourself when writing

Contained but not optimal - roughly double the number of required files if I 
want the common case of a primary key indexed with Bloom. I can't see a way of 
indexing with Bloom-plus-Lucene40 on field "A" and indexing with just Lucene40 
on fields B,C and D and winding up with only one Lucene40 set of files with a 
common segment suffix. The way I did find of achieving this was to add a 
"bloomFilteredFields" set into my single Bloom+Lucene40 instance used for all 
fields. Is there any other option here currently? 

Looking to the future, 4093 may have more capabilities at optimising if it 
understands the distinction between mutating wrappers and non-mutating ones and 
how they are composed?






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286815#comment-13286815
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. Its not worth the complexity

There's no real added complexity in BloomFilterPostingsFormat - it has to be 
capable of storing blooms for >1 field anyway and using the fieldname set is 
roughly 2 extra lines of code to see if a TermsConsumer needs wrapping or not.
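The "roughly 2 extra lines" boil down to a guard of this shape (a paraphrase for illustration, not the patch's exact code):

```java
import java.util.Set;

class WrapDecision {
  /** A null field set means "bloom-filter every field"; otherwise only the named ones. */
  static boolean shouldWrap(Set<String> bloomFields, String field) {
    return bloomFields == null || bloomFields.contains(field);
  }
}
```

The TermsConsumer for a field is wrapped in a Bloom-building layer only when this returns true; all other fields pass straight through to the shared delegate.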


From a client side you don't have to use this feature - the fieldname set can 
be null, in which case it will wrap all fields sent its way. If you do choose to 
supply a set, the wrapped PostingsFormat will have the advantage of being 
shared for bloomed and non-bloomed fields. We could add a constructor that 
removes the set and mark the others "expert".

For me this falls into one of the many faster-if-you-know-about-it 
optimisations like FieldSelectors or recycling certain objects. Basically a 
useful hint to Lucene to save some extra effort, but one which you don't *need* 
to use.

Lucene-4093 may in future resolve the multi-file issue but I'm not sure it will 
do so without significant complication.




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-05-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286916#comment-13286916
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. why is this a speed improvement?

Sorry - misleading. Replace the word "faster" in my comment with "better" and 
that makes more sense - I mean better in terms of resource usage and reduced 
open file handles. This seemed relevant given the earlier comments about Solr's 
use of non-compound files:

bq. [Solr] create massive amounts of files if we did so (add to the fact it 
disables compound files by default and its a disaster...)

I can see there is a useful simplification being sought for here if PerFieldPF 
can consider each of the unique top-level PFs presented to it as looking after 
an exclusive set of files. As the centralised allocator of file names it can 
then simply call each unique PF with a choice of segment suffix to name its 
various files without conflicting with other PFs. Lucene 4093 is all about 
better determining which PF is unique using .equals(). Unfortunately I don't 
think this approach is sufficient. In order to avoid allocating any 
unnecessary file names PerFieldPF would have to further understand the nuances 
of which PFs were being wrapped by other PFs and which wrapped PFs would be 
reusable outside of their wrapping PF (as is the case with BloomPF's wrapped 
PF). That seems a more complex task than implementing equals(). 

So it seems we have 3 options:
1) Ignore the problems of creating too many files in the case of BloomPF and 
any other examples of "wrapping" PFs
2) Create a PerFieldPF implementation that reuses wrapped PFs using some 
generic means of discovering recyclable wrapped PFs (i.e. go further than what 
4093 currently proposes in adding .equals support)
3) Retain my BloomPF-specific solution to the problem for those prepared to use 
lower-level APIs.

Am I missing any other options and which one do you want to go for?






[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-01 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287258#comment-13287258
 ] 

Mark Harwood commented on LUCENE-4069:
--

I've thought some more about option 2 (PerFieldPF reusing wrapped PFs) and it 
looks to get very ugly very quickly.
There's only so much PerFieldPF can do to rationalize a random jumble of PF 
instances presented to it by clients. I think the right place to draw the line 
is Lucene-4093 i.e. a simple .equals() comparison on top-level PFs to eliminate 
any duplicates. Any other approach that also tries to de-dup nested PFs looks 
to be adding a lot of complexity, especially when you consider what that does 
to the model of read-time object instantiation. This would be significant added 
complexity to solve a problem you have already suggested is insignificant (i.e. 
too many files doesn't really matter when using CFS).

I can remove the per-field stuff from BloomPF if you want, but I imagine I will 
routinely subclass it to add this optimisation back into my apps.







[jira] [Commented] (LUCENE-3772) Highlighter needs the whole text in memory to work

2012-10-15 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476044#comment-13476044
 ] 

Mark Harwood commented on LUCENE-3772:
--

For bigger-than-memory docs is it not possible to use nested documents to 
represent subsections (e.g. a child doc for each of the chapters in a book) and 
then use BlockJoinQuery to select the best child docs?
Highlighting can then be used on a more-manageable subset of the original 
content and Lucene's ranking algos are being used to select the best "fragment" 
rather than the highlighter's own attempts to reproduce this logic.

Obviously depends on the shape of your content/queries but books-and-chapters 
is probably a good fit for this approach.

> Highlighter needs the whole text in memory to work
> --
>
> Key: LUCENE-3772
> URL: https://issues.apache.org/jira/browse/LUCENE-3772
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 3.5
> Environment: Windows 7 Enterprise x64, JRE 1.6.0_25
>Reporter: Luis Filipe Nassif
>  Labels: highlighter, improvement, memory
>
> Highlighter methods getBestFragment(s) and getBestTextFragments only accept a 
> String object representing the whole text to highlight. When dealing with 
> very large docs simultaneously, it can lead to heap consumption problems. It 
> would be better if the API could accept a Reader objetct additionally, like 
> Lucene Document Fields do.




[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException

2012-10-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476854#comment-13476854
 ] 

Mark Harwood commented on SOLR-3950:


BloomFilterPostingsFormat is designed to wrap another choice of PostingsFormat 
and adds ".blm" files to the other files created by the choice of delegate.

However your code has instantiated a BloomFilterPostingsFormat without passing 
a choice of delegate - presumably using the zero-arg constructor. 
The comments in the code for this zero-arg constructor state:

  // Used only by core Lucene at read-time via Service Provider instantiation -
  // do not use at Write-time in application code.





> Attempting postings="BloomFilter" results in UnsupportedOperationException
> --
>
> Key: SOLR-3950
> URL: https://issues.apache.org/jira/browse/SOLR-3950
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.1
> Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 
> SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> [root@bigindy5 ~]# java -version
> java version "1.7.0_07"
> Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>Reporter: Shawn Heisey
> Fix For: 4.1
>
>
> Tested on branch_4x, checked out after BlockPostingsFormat was made the 
> default by LUCENE-4446.
> I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and 
> copied it into my sharedLib directory.  When I subsequently tried 
> postings="BloomFilter" I got the following exception in the log:
> {code}
> Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.UnsupportedOperationException: Error - 
> org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been 
> constructed without a choice of PostingsFormat
> {code}




[jira] [Commented] (SOLR-3950) Attempting postings="BloomFilter" results in UnsupportedOperationException

2012-10-16 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477036#comment-13477036
 ] 

Mark Harwood commented on SOLR-3950:


bq. If there is some schema config that will tell Solr to do the right thing, 
please let me know.

Right now BloomPF is like an abstract class - you need to fill-in-the-blanks as 
to what delegate it will use before you can use it at write-time.
I think we have 3 options:

1) Solr (or you) provide a new PF impl that weds BloomPF with a choice of PF 
e.g. Lucene40PF so you would have a zero-arg-constructor class named something 
like BloomLucene40PF or...
2) Solr extends config file format to provide a generic means of assembling 
"wrapper" PFs like Bloom in their config e.g:
   postingsFormat="BloomFilter" delegatePostingsFormat="FooPF" 
   and Solr then does reflection magic to call constructors appropriately, or...
3) Core Lucene is changed so that BloomPF is wedded to a default PF (e.g. 
Lucene40PF) if users e.g. Solr fail to nominate a choice of delegate for 
BloomPF.

Of these 1) feels like "the right thing".

Cheers
Mark

> Attempting postings="BloomFilter" results in UnsupportedOperationException
> --
>
> Key: SOLR-3950
> URL: https://issues.apache.org/jira/browse/SOLR-3950
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.1
> Environment: Linux bigindy5 2.6.32-279.9.1.el6.centos.plus.x86_64 #1 
> SMP Wed Sep 26 03:52:55 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> [root@bigindy5 ~]# java -version
> java version "1.7.0_07"
> Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>Reporter: Shawn Heisey
> Fix For: 4.1
>
>
> Tested on branch_4x, checked out after BlockPostingsFormat was made the 
> default by LUCENE-4446.
> I used 'ant generate-maven-artifacts' to create the lucene-codecs jar, and 
> copied it into my sharedLib directory.  When I subsequently tried 
> postings="BloomFilter" I got the following exception in the log:
> {code}
> Oct 15, 2012 11:14:02 AM org.apache.solr.common.SolrException log
> SEVERE: java.lang.UnsupportedOperationException: Error - 
> org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat has been 
> constructed without a choice of PostingsFormat
> {code}




[jira] [Created] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)
Mark Harwood created LUCENE-4275:


 Summary: Threaded tests with MockDirectoryWrapper delete active 
PostingFormat files
 Key: LUCENE-4275
 URL: https://issues.apache.org/jira/browse/LUCENE-4275
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/codecs, general/test
Affects Versions: 4.0-ALPHA
 Environment: Win XP 64bit Sun JDK 1.6
Reporter: Mark Harwood
 Fix For: 4.0


As part of testing Lucene-4069 I have encountered sporadic issues with files 
going missing. I believe this is a bug in the test framework (multi-threading 
issues in MockDirectoryWrapper?) so have raised a separate issue with 
simplified test PostingFormat class here.
Using this test PF, roughly one in four executions of this test will fail due 
to a missing file:
ant test-core  -Dtestcase=TestIndexWriterCommit 
-Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
-Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
-Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Updated] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4275:
-

Attachment: Lucene-4275-TestClass.patch

Attached a simple PostingsFormat used to illustrate cases of files going missing 
in PF tests.

> Threaded tests with MockDirectoryWrapper delete active PostingFormat files
> --
>
> Key: LUCENE-4275
> URL: https://issues.apache.org/jira/browse/LUCENE-4275
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs, general/test
>Affects Versions: 4.0-ALPHA
> Environment: Win XP 64bit Sun JDK 1.6
>    Reporter: Mark Harwood
> Fix For: 4.0
>
> Attachments: Lucene-4275-TestClass.patch
>
>
> As part of testing Lucene-4069 I have encountered sporadic issues with files 
> going missing. I believe this is a bug in the test framework (multi-threading 
> issues in MockDirectoryWrapper?) so have raised a separate issue with 
> simplified test PostingFormat class here.
> Using this test PF will fail due to a missing file roughly one in four times 
> of executing this test:
> ant test-core  -Dtestcase=TestIndexWriterCommit 
> -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
> -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
> -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-07-31 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425895#comment-13425895
 ] 

Mark Harwood commented on LUCENE-4275:
--

Thanks, Rob. This test requires a call to "ant clean" between each run before 
it will consistently work. However, I don't consider that a fix and assume we 
are still looking for a bug, as there's an index consistency issue lurking 
somewhere. I've tried adding the setting -Dtests.directory=RAMDirectory but the 
test still looks to have some "memory" between runs.

I added some logging of creates and deletes as you suggest and it looks like, 
on a second, un-cleansed run, my PF is being asked to open a high-numbered 
segment which I suspect was created by an earlier run - the logging shows no 
sign of the PF being asked to create content for this (or any other) segment as 
part of the current run. At this point it fails because there is no longer a 
copy of the "foobar" file listed by the directory.
I have noticed in the logs from previous runs that MDW is asked to delete the 
segment's "foobar" file by IndexWriter as part of compaction into a compound 
CFS.

Hope this sheds some light as I'm finding this a complex one to debug.


> Threaded tests with MockDirectoryWrapper delete active PostingFormat files
> --
>
> Key: LUCENE-4275
> URL: https://issues.apache.org/jira/browse/LUCENE-4275
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs, general/test
>    Affects Versions: 4.0-ALPHA
> Environment: Win XP 64bit Sun JDK 1.6
>Reporter: Mark Harwood
> Fix For: 4.0
>
> Attachments: Lucene-4275-TestClass.patch
>
>
> As part of testing Lucene-4069 I have encountered sporadic issues with files 
> going missing. I believe this is a bug in the test framework (multi-threading 
> issues in MockDirectoryWrapper?) so have raised a separate issue with 
> simplified test PostingFormat class here.
> Using this test PF will fail due to a missing file roughly one in four times 
> of executing this test:
> ant test-core  -Dtestcase=TestIndexWriterCommit 
> -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
> -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
> -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Commented] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-08-01 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426481#comment-13426481
 ] 

Mark Harwood commented on LUCENE-4275:
--


Nailed it, Mike. Yet another beer I owe you.
I removed the IllegalStateException and it looks like the retry logic is now 
kicking in and all tests pass.

This reliance on throwing a particular exception type feels like an important 
contract to document. Currently the comments in PostingsFormat.fieldsProducer() 
read as follows:

bq. Reads a segment. NOTE: by the time this call returns, it must hold open 
any files it will need to use; else, those files may be deleted. 

I propose adding:

bq. Additionally, required files may be deleted during the execution of this 
call before there is a chance to open them. Under these circumstances an 
IOException should be thrown by the implementation. IOExceptions are expected 
and will automatically cause a retry of the segment-opening logic with the 
newly revised segments.

I'll roll that documentation addition into my Lucene-4069 patch
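
The retry contract described above amounts to something like the following 
loop. This is just a schematic of the behaviour, not the actual Lucene reader 
code; SegmentSource and its two methods are made-up names for illustration:

```java
import java.io.IOException;
import java.util.List;

// Schematic of the retry behaviour: if opening a segment's files throws
// IOException (because a concurrent commit/merge deleted them mid-open),
// re-read the now-revised segment list and try again rather than failing hard.
public class RetryOpenSketch {
    interface SegmentSource {
        List<String> listCurrentSegments() throws IOException;
        Object openSegments(List<String> segments) throws IOException;
    }

    static Object openWithRetry(SegmentSource src, int maxRetries) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                // List and open must both happen inside the loop: the segment
                // set may have changed since the failed attempt.
                return src.openSegments(src.listCurrentSegments());
            } catch (IOException e) {
                last = e; // expected - files vanished under us, so retry
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        final int[] calls = {0};
        Object opened = openWithRetry(new SegmentSource() {
            public List<String> listCurrentSegments() { return List.of("_2"); }
            public Object openSegments(List<String> segs) throws IOException {
                // First attempt fails the way a deleted "foobar" file would.
                if (calls[0]++ == 0) throw new IOException("file deleted mid-open");
                return "opened:" + segs;
            }
        }, 3);
        System.out.println(opened); // opened:[_2]
    }
}
```

The key point for PF implementors is the catch clause: only IOException signals 
"retry me" - which is exactly why my IllegalStateException defeated the logic.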


> Threaded tests with MockDirectoryWrapper delete active PostingFormat files
> --
>
> Key: LUCENE-4275
> URL: https://issues.apache.org/jira/browse/LUCENE-4275
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs, general/test
>Affects Versions: 4.0-ALPHA
> Environment: Win XP 64bit Sun JDK 1.6
>Reporter: Mark Harwood
> Fix For: 4.0
>
> Attachments: Lucene-4275-TestClass.patch
>
>
> As part of testing Lucene-4069 I have encountered sporadic issues with files 
> going missing. I believe this is a bug in the test framework (multi-threading 
> issues in MockDirectoryWrapper?) so have raised a separate issue with 
> simplified test PostingFormat class here.
> Using this test PF will fail due to a missing file roughly one in four times 
> of executing this test:
> ant test-core  -Dtestcase=TestIndexWriterCommit 
> -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
> -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
> -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Closed] (LUCENE-4275) Threaded tests with MockDirectoryWrapper delete active PostingFormat files

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood closed LUCENE-4275.


Resolution: Not A Problem

> Threaded tests with MockDirectoryWrapper delete active PostingFormat files
> --
>
> Key: LUCENE-4275
> URL: https://issues.apache.org/jira/browse/LUCENE-4275
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/codecs, general/test
>Affects Versions: 4.0-ALPHA
> Environment: Win XP 64bit Sun JDK 1.6
>    Reporter: Mark Harwood
> Fix For: 4.0
>
> Attachments: Lucene-4275-TestClass.patch
>
>
> As part of testing Lucene-4069 I have encountered sporadic issues with files 
> going missing. I believe this is a bug in the test framework (multi-threading 
> issues in MockDirectoryWrapper?) so have raised a separate issue with 
> simplified test PostingFormat class here.
> Using this test PF will fail due to a missing file roughly one in four times 
> of executing this test:
> ant test-core  -Dtestcase=TestIndexWriterCommit 
> -Dtests.method=testCommitThreadSafety -Dtests.seed=EA320250471B75AE 
> -Dtests.slow=true -Dtests.postingsformat=TestNonCoreDummyPostingsFormat 
> -Dtests.locale=no -Dtests.timezone=Europe/Belfast -Dtests.file.encoding=UTF-8 




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, 
> LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated with a fix for the issue explored in LUCENE-4275

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostingsBranch4x.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, LUCENE-4069-tryDeleteDocument.patch, 
> LUCENE-4203.patch, MHBloomFilterOn3.6Branch.patch, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-01 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated patch to bring it in line with the latest core API changes.
All tests now pass cleanly so I will commit soon.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Resolved] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood resolved LUCENE-4069.
--

Resolution: Fixed
  Assignee: Mark Harwood

Committed to 4.0 branch, revision 1368442

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>    Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427322#comment-13427322
 ] 

Mark Harwood commented on LUCENE-4069:
--

Will do.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-08-02 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Fix Version/s: 5.0

Applied to trunk in revision 1368567

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0-ALPHA
>Reporter: Mark Harwood
>    Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 5.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters

2012-08-13 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433045#comment-13433045
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. Removing misleading 2X perf gain: it seems to depend heavily on the exact 
use case.

Fair enough - the original patch targeted Lucene 3.6 which benefited more 
heavily from this technique. The issue then morphed into a 4.x patch where 
performance gains were harder to find. 
I think the sweet spot is in primary key searches on indexes with ongoing heavy 
changes (more segment fragmentation, less OS-level caching?). This is the use 
case I am targeting currently and my final tests using our primary-key-counting 
test rig saw a 10 to 15% improvement over Pulsing.

bq. I'm asking because I need his feature but I'm stuck with 3.x for a while. 

I have a client in a similar situation who are contemplating using the 3.6 
patch.

bq. Is there bugs which should be fixed in initial 3.6 patch? 

It has been a while since I looked at it - a quick run of "ant test" on my copy 
here showed no errors. I will give it a closer review if my client decides to 
go down this route and can post any fixes here.
I expect that if you use the patch and get into trouble you can use an 
un-patched version of 3.6 to read the same index files (it should just ignore 
the extra "blm" files created by the patched version).
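
For anyone picking this thread up late, the core fast-fail idea is small. 
Here's a toy per-segment Bloom filter in plain Java - illustrative only: the 
real patch uses a tuned hashing/bitset implementation and wires it into a 
PostingsFormat rather than this crude two-hash scheme:

```java
import java.util.BitSet;

// Toy segment-level Bloom filter: a term that hashes to an unset bit is
// definitely absent from the segment, so the (expensive, possibly disk-bound)
// terms-dictionary lookup can be skipped entirely.
public class BloomSketch {
    private final BitSet bits;
    private final int size;

    BloomSketch(int size) { this.bits = new BitSet(size); this.size = size; }

    private int[] hashes(String term) {
        int h1 = term.hashCode();
        int h2 = 31 * h1 + term.length();   // crude second hash, for illustration
        return new int[] { Math.floorMod(h1, size), Math.floorMod(h2, size) };
    }

    void add(String term) {
        for (int h : hashes(term)) bits.set(h);
    }

    // false => term is certainly not in this segment (fast fail);
    // true  => term *may* be present, fall through to the real lookup.
    boolean mayContain(String term) {
        for (int h : hashes(term)) if (!bits.get(h)) return false;
        return true;
    }

    public static void main(String[] args) {
        BloomSketch bloom = new BloomSketch(1 << 16);
        bloom.add("doc-12345");
        System.out.println(bloom.mayContain("doc-12345")); // true
        // Most absent keys fail fast without touching the terms dictionary:
        System.out.println(bloom.mayContain("doc-99999"));
    }
}
```

This is why the win is biggest for primary-key lookups across many segments: 
each segment's filter lets most "not here" answers come back without disk 
access.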


> Segment-level Bloom filters
> ---
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>    Affects Versions: 3.6, 4.0-ALPHA
>    Reporter: Mark Harwood
>Assignee: Mark Harwood
>Priority: Minor
> Fix For: 4.0-BETA, 5.0
>
> Attachments: 4069Failure.zip, BloomFilterPostingsBranch4x.patch, 
> LUCENE-4069-tryDeleteDocument.patch, LUCENE-4203.patch, 
> MHBloomFilterOn3.6Branch.patch, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PKLookupUpdatePerfTest.java, 
> PKLookupUpdatePerfTest.java, PrimaryKeyPerfTest40.java
>
>
> An addition to each segment which stores a Bloom filter for selected fields 
> in order to give fast-fail to term searches, helping avoid wasted disk access.
> Best suited for low-frequency fields e.g. primary keys on big indexes with 
> many segments but also speeds up general searching in my tests.
> Overview slideshow here: 
> http://www.slideshare.net/MarkHarwood/lucene-bloomfilteredsegments
> Benchmarks based on Wikipedia content here: http://goo.gl/X7QqU
> Patch based on 3.6 codebase attached.
> There are no 3.6 API changes currently - to play just add a field with "_blm" 
> on the end of the name to invoke special indexing/querying capability. 
> Clearly a new Field or schema declaration(!) would need adding to APIs to 
> configure the service properly.
> Also, a patch for Lucene4.0 codebase introducing a new PostingsFormat




[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful

2012-09-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452900#comment-13452900
 ] 

Mark Harwood commented on LUCENE-4369:
--

SingleTermField?

Not sure "matching vs searching" is a commonly understood differentiation.

> StringFields name is unintuitive and not helpful
> 
>
> Key: LUCENE-4369
> URL: https://issues.apache.org/jira/browse/LUCENE-4369
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
> Attachments: LUCENE-4369.patch
>
>
> There's a huge difference between TextField and StringField, StringField 
> screws up scoring and bypasses your Analyzer.
> (see java-user thread "Custom Analyzer Not Called When Indexing" as an 
> example.)
> The name we use here is vital, otherwise people will get bad results.
> I think we should rename StringField to MatchOnlyField.




[jira] [Commented] (LUCENE-4369) StringFields name is unintuitive and not helpful

2012-09-11 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452914#comment-13452914
 ] 

Mark Harwood commented on LUCENE-4369:
--

Agreed on the need for a change - names are important.

I have a problem with using "match" on its own because the word is often 
associated with partial matching, e.g. "best match" or "fuzzy match".
A quick Google search suggests "match" has more connotations of fuzziness than 
exactness - there are 162m results for "best match" vs. only 45m for 
"exact match".

So how about "ExactMatchField"?




> StringFields name is unintuitive and not helpful
> 
>
> Key: LUCENE-4369
> URL: https://issues.apache.org/jira/browse/LUCENE-4369
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
> Attachments: LUCENE-4369.patch




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: BloomFilterPostings40.patch)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: MHBloomFilterOn3.6Branch.patch, 
> PrimaryKey40PerformanceTestSrc.zip




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: BloomFilterPostingsBranch4x.patch

Updated as follows:
* Extracted the Bloom filter functionality into a new oal.util.FuzzySet class - 
renamed because Bloom filtering is only one application of a FuzzySet; fuzzy 
count-distincts is another.
* BloomFilterPostingsFormat now takes a factory that can tailor the choice of 
Bloom filter per field (bitset size/saturation settings and choice of hash 
algorithm). A default factory implementation is provided.
* All unit tests pass now that I have a test PostingsFormat class that uses 
very small bitsets; previously the many-field unit tests would cause an OOM. 

Will follow up with benchmarks when I have more time to run and document them. 
Initial results from my large-scale tests show a nice flat line in the face of 
a growing index, whereas a non-Bloomed index saw-tooths upwards as segments 
grow and merge.
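The per-field factory idea can be sketched roughly as follows; the interface and method names here are illustrative guesses, not the patch's actual API:

```java
// Illustrative sketch of a per-field Bloom settings factory; the names
// BloomFilterSettingsFactory and bitsetSizeFor are hypothetical, not the
// actual API added by the patch.
interface BloomFilterSettingsFactory {
    // Returns the bitset size in bits to use for this field's Bloom
    // filter, or 0 to disable filtering for the field.
    int bitsetSizeFor(String fieldName);
}

class DefaultSettingsFactory implements BloomFilterSettingsFactory {
    @Override
    public int bitsetSizeFor(String fieldName) {
        // Primary-key-like fields get a large, low-saturation bitset;
        // other fields get a modest default.
        return fieldName.endsWith("_id") ? 1 << 22 : 1 << 16;
    }
}
```

The design point is that primary-key-like fields, which hold many unique values, warrant a larger bitset to keep saturation (and hence the false-positive rate) low.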

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: (was: PrimaryKey40PerformanceTestSrc.zip)

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch




[jira] [Updated] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-08 Thread Mark Harwood (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Harwood updated LUCENE-4069:
-

Attachment: PrimaryKeyPerfTest40.java

Benchmark tool adapted from Mike's original Pulsing codec benchmark. Now 
includes Bloom postings example.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395773#comment-13395773
 ] 

Mark Harwood commented on LUCENE-4069:
--

Interesting results, Mike - thanks for taking the time to run them.

bq.  BloomFilteredFieldsProducer should just pass through intersect to the 
delegate?

I have tried to make the BloomFilteredFieldsProducer get out of the way of the 
client app and the delegate PostingsFormat as soon as it is safe to do so, i.e. 
when the user is safely focused on a non-filtered field. While there is any 
chance the client may call TermsEnum.seekExact(..) on a filtered field, I need 
a wrapper object in place that is in a position to intercept that call. All 
other method invocations just end up delegating, so I wonder if these extra 
method calls are the cause of the slowdown you see, e.g. when Fuzzy is 
enumerating over many terms. 
The only other alternatives to endlessly wrapping in this way are:
a) An API change - e.g. allow TermsEnum.seekExact to have a pluggable call-out 
for just this one method.
b) Messing around with byte-code manipulation techniques to weave in Bloom 
filtering (the sort of thing I recall Hibernate resorts to).

Neither of these seems a particularly appealing option, so I think we may have 
to live with fuzzy+bloom not being as fast as straight fuzzy.
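The wrapping dilemma above can be illustrated with a simplified stand-in interface; Lucene's real TermsEnum is an abstract class with many more methods, each needing the same pass-through, and all names here are hypothetical:

```java
import java.util.Set;

// Minimal sketch of the delegating-wrapper pattern: one method is
// intercepted for fast-fail, everything else just forwards to the
// delegate. SimpleTermsEnum is a stand-in, not Lucene's TermsEnum.
interface SimpleTermsEnum {
    boolean seekExact(String term);
    String next();
}

class BloomFilteredTermsEnumSketch implements SimpleTermsEnum {
    private final SimpleTermsEnum delegate;
    private final Set<String> bloom; // stand-in for the real bitset filter

    BloomFilteredTermsEnumSketch(SimpleTermsEnum delegate, Set<String> bloom) {
        this.delegate = delegate;
        this.bloom = bloom;
    }

    @Override
    public boolean seekExact(String term) {
        // The one interception that matters: fast-fail before the
        // delegate (and therefore the disk) is ever touched.
        if (!bloom.contains(term)) {
            return false;
        }
        return delegate.seekExact(term);
    }

    @Override
    public String next() {
        // Everything else is pure delegation - the per-call overhead
        // discussed above.
        return delegate.next();
    }
}
```

Only seekExact carries the fast-fail logic; every other call pays a delegation hop, which is the suspected overhead on term-heavy queries such as fuzzy.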

For completeness' sake - I don't have access to your benchmarking code, but I 
would hope PostingsFormat.fieldsProducer() isn't called more than once for the 
same segment, as that's where the Bloom filters get loaded from disk, so 
there's an inherent cost there too. I can't imagine this is the case.

BTW I've just finished a long-running set of tests which mixes up reads and 
writes here: http://goo.gl/KJmGv
This benchmark represents how graph databases such as Neo4j use Lucene for an 
index when loading (I typically use the Wikipedia links as a test set). I see 
a 3.5 x speedup in Lucene 4, and the Bloom-filtered 3.6 gets nearly a 9 x 
speedup over the comparatively slower stock 3.6 codebase.


> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java




[jira] [Commented] (LUCENE-4069) Segment-level Bloom filters for a 2 x speed up on rare term searches

2012-06-18 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13395857#comment-13395857
 ] 

Mark Harwood commented on LUCENE-4069:
--

bq. I think the fix is simple: you are not overriding Terms.intersect now, in 
BloomFilteredTerms

Good catch - a quick test indeed shows a speed up on fuzzy queries. 
I'll prepare a new patch.

I'm not sure why 3.6+Bloom is faster than 4+Bloom in my tests. I'll take a 
closer look at your benchmark.

> Segment-level Bloom filters for a 2 x speed up on rare term searches
> 
>
> Key: LUCENE-4069
> URL: https://issues.apache.org/jira/browse/LUCENE-4069
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/index
>Affects Versions: 3.6, 4.0
>Reporter: Mark Harwood
>Priority: Minor
> Fix For: 4.0, 3.6.1
>
> Attachments: BloomFilterPostingsBranch4x.patch, 
> MHBloomFilterOn3.6Branch.patch, PrimaryKeyPerfTest40.java



