Re: TestCodecs running time

2010-04-14 Thread Shai Erera
I see you already did that, Mike :). Thanks! Now the tests run in 2s.

Shai

On Fri, Apr 9, 2010 at 12:49 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> It's also slow because it repeats all the tests for each of the core
> codecs (standard, sep, pulsing, intblock).
>
> I think it's fine to reduce the number of iterations -- just make sure
> there's no seed to newRandom() so the distributed testing is
> "effective".
>
> Mike
>
> On Fri, Apr 9, 2010 at 12:43 AM, Shai Erera  wrote:
> > Hi
> >
> > I've noticed that TestCodecs takes an insanely long time to run on my
> > machine - between 35 and 40 seconds. Is that expected?
> > The reason it runs so long seems to be that its threads each make
> > 4000 iterations ... is that really required to ensure correctness?
> >
> > Shai
> >
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


Re: Proposal about Version API "relaxation"

2010-04-14 Thread Shai Erera
So then I don't understand this:

{quote}
* A major release always bumps the major release number (2.x ->
   3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
   releases along that branch

* There is no back compat across major releases (index nor APIs),
   but full back compat within branches.

{quote}

What's different from what's done today? How can we remove Version in that
world, if we need to maintain full back-compat between 3.1 and 3.2, index-
and API-wise? We'll still need to deprecate and come up w/ new classes every
time, and we'll still need to maintain back-compat for runtime changes.

Unless you're telling me we'll start putting out major releases more often?
Well ... then we're saying the same thing, only I think that instead of
releasing 4, 5, 6, 7, 8 every 6 months, we can release 3.1, 3.2, 3.5 ...
because if you look back, every minor release included API deprecations as
well as back-compat breaks. That means every minor release should have
been a major release, right?

Point is, if I understand correctly and you agree w/ my statement above - I
don't see why anyone would release a 3.x after 4.0 is out unless someone
really wants to work hard on maintaining back-compat for some features.

If it's just a numbering thing, then I don't think it matters what is
defined as 'major' vs. 'minor'. One way is to define 'major' as X and 'minor'
as X.Y, and another is to define 'major' as X.Y and 'minor' as X.Y.Z. I prefer
the latter but don't have any strong feelings against the former. Just
pointing out that X will grow more rapidly than today. That's all.

So did I get it right?

Shai

On Thu, Apr 15, 2010 at 8:19 AM, Mark Miller  wrote:

> I don't read what you wrote and what Mike wrote as even close to the same.
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Apr 15, 2010, at 12:05 AM, Shai Erera  wrote:
>
> Ahh ... a dream finally comes true ... what a great way to start a day :).
> +1 !!!
>
> I have some questions/comments though:
>
> * Index back compat should be maintained between major releases, like it is
> today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their
> segments when they move from 2.x to 3.x before 4.0 lands and they'll need to
> call optimize() to ensure 4.0 still works on their index. I hope that will
> still be the case? Otherwise I don't see how we can prevent reindexing by
> apps.
> ** Index behavioral/runtime changes, like those of Analyzers, are ok to
> require a reindex, as proposed.
>
> So after 3.1 is out, trunk can break the API and 3.2 will have a new set of
> API? Cool and convenient. For how long do we keep the 3.1 branch around?
> Also, it used to only fix bugs, but from now on it'll be allowed to
> introduce new features, if they maintain back-compat? So 3.1.1 can have
> 'flex' (going for the extreme on purpose) if someone maintains back-compat?
>
> I think the back-compat on branches should be only for index runtime
> changes. There's no point, in my opinion, to maintain API back-compat
> anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1
> just to get a new feature but get it API back-supported? As soon as they
> upgrade to 3.2, that means a new set of API right?
>
> Major releases will just change the index structure format then? Or move to
> Java 1.6? Well ... not even that because as I understand it, 3.2 can move to
> Java 1.6 ... no API back-compat right :).
>
> That's definitely a great step forward !
>
> Shai
>
> On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda < 
> va...@osafoundation.org> wrote:
>
>>
>> On Thu, 15 Apr 2010, Earwin Burrfoot wrote:
>>
>>  Can't believe my eyes.
>>>
>>> +1
>>>
>>
>> Likewise. +1 !
>>
>> Andi..
>>
>>
>>> On Thu, Apr 15, 2010 at 01:22, Michael McCandless
>>> < luc...@mikemccandless.com> wrote:
>>>
 On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
 < mar...@rectangular.com> wrote:

  Essentially, we're free to break back compat within "Lucy" at any time,
> but
> we're not able to break back compat within a stable fork like "Lucy1",
> "Lucy2", etc.  So what we'll probably do during normal development with
> Analyzers is just change them and note the break in the Changes file.
>

 So... what if we change up how we develop and release Lucene:

  * A major release always bumps the major release number (2.x ->
3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
releases along that branch

  * There is no back compat across major releases (index nor APIs),
but full back compat within branches.

 This would match how many other projects work (KS/Lucy, as Marvin
 describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

 The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
 if any devs have the itch, they could freely back-port improvements
 from trunk as long as they kept back-compat within the branch.

 I think in such a fu

Re: Proposal about Version API "relaxation"

2010-04-14 Thread Mark Miller
I don't read what you wrote and what Mike wrote as even close to the  
same.


- Mark

http://www.lucidimagination.com (mobile)

On Apr 15, 2010, at 12:05 AM, Shai Erera  wrote:

Ahh ... a dream finally comes true ... what a great way to start a  
day :). +1 !!!


I have some questions/comments though:

* Index back compat should be maintained between major releases,  
like it is today, STRUCTURE-wise. So apps get a chance to  
incrementally upgrade their segments when they move from 2.x to 3.x  
before 4.0 lands and they'll need to call optimize() to ensure 4.0  
still works on their index. I hope that will still be the case?  
Otherwise I don't see how we can prevent reindexing by apps.
** Index behavioral/runtime changes, like those of Analyzers, are ok  
to require a reindex, as proposed.


So after 3.1 is out, trunk can break the API and 3.2 will have a new  
set of API? Cool and convenient. For how long do we keep the 3.1  
branch around? Also, it used to only fix bugs, but from now on it'll  
be allowed to introduce new features, if they maintain back-compat?  
So 3.1.1 can have 'flex' (going for the extreme on purpose) if  
someone maintains back-compat?


I think the back-compat on branches should be only for index runtime  
changes. There's no point, in my opinion, to maintain API back- 
compat anymore for jars drop-in, if apps will need to upgrade from  
3.1 to 3.1.1 just to get a new feature but get it API back- 
supported? As soon as they upgrade to 3.2, that means a new set of  
API right?


Major releases will just change the index structure format then? Or  
move to Java 1.6? Well ... not even that because as I understand it,  
3.2 can move to Java 1.6 ... no API back-compat right :).


That's definitely a great step forward !

Shai

On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda  
 wrote:


On Thu, 15 Apr 2010, Earwin Burrfoot wrote:

Can't believe my eyes.

+1

Likewise. +1 !

Andi..


On Thu, Apr 15, 2010 at 01:22, Michael McCandless
 wrote:
On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
 wrote:

Essentially, we're free to break back compat within "Lucy" at any  
time, but

we're not able to break back compat within a stable fork like "Lucy1",
"Lucy2", etc.  So what we'll probably do during normal development  
with

Analyzers is just change them and note the break in the Changes file.

So... what if we change up how we develop and release Lucene:

 * A major release always bumps the major release number (2.x ->
   3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
   releases along that branch

 * There is no back compat across major releases (index nor APIs),
   but full back compat within branches.

This would match how many other projects work (KS/Lucy, as Marvin
describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
if any devs have the itch, they could freely back-port improvements
from trunk as long as they kept back-compat within the branch.

I think in such a future world, we could:

 * Remove Version entirely!

 * Not worry at all about back-compat when developing on trunk

 * Give proper names to new improved classes instead of
    StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
   today; rename existing classes.

 * Let analyzers freely, incrementally improve

 * Use interfaces without fear

 * Stop spending the truly substantial time (look @ Uwe's awesome
   back-compat layer for analyzers!) that we now must spend when
   adding new features, for back-compat

 * Be more free to introduce very new not-fully-baked features/APIs,
   marked as experimental, on the expectation that once they are used
   (in trunk) they will iterate/change/improve vs trying so hard to
   get things right on the first go for fear of future back compat
   horrors.

Thoughts...?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org





--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)

Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Shai Erera
Also, we will still need to maintain the Backwards section in CHANGES (or
move it to API Changes), to help people upgrade from release to release.
Just pointing that out as well.

Shai

On Thu, Apr 15, 2010 at 7:05 AM, Shai Erera  wrote:

> Ahh ... a dream finally comes true ... what a great way to start a day :).
> +1 !!!
>
> I have some questions/comments though:
>
> * Index back compat should be maintained between major releases, like it is
> today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their
> segments when they move from 2.x to 3.x before 4.0 lands and they'll need to
> call optimize() to ensure 4.0 still works on their index. I hope that will
> still be the case? Otherwise I don't see how we can prevent reindexing by
> apps.
> ** Index behavioral/runtime changes, like those of Analyzers, are ok to
> require a reindex, as proposed.
>
> So after 3.1 is out, trunk can break the API and 3.2 will have a new set of
> API? Cool and convenient. For how long do we keep the 3.1 branch around?
> Also, it used to only fix bugs, but from now on it'll be allowed to
> introduce new features, if they maintain back-compat? So 3.1.1 can have
> 'flex' (going for the extreme on purpose) if someone maintains back-compat?
>
> I think the back-compat on branches should be only for index runtime
> changes. There's no point, in my opinion, to maintain API back-compat
> anymore for jars drop-in, if apps will need to upgrade from 3.1 to 3.1.1
> just to get a new feature but get it API back-supported? As soon as they
> upgrade to 3.2, that means a new set of API right?
>
> Major releases will just change the index structure format then? Or move to
> Java 1.6? Well ... not even that because as I understand it, 3.2 can move to
> Java 1.6 ... no API back-compat right :).
>
> That's definitely a great step forward !
>
> Shai
>
>
> On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda wrote:
>
>>
>> On Thu, 15 Apr 2010, Earwin Burrfoot wrote:
>>
>>  Can't believe my eyes.
>>>
>>> +1
>>>
>>
>> Likewise. +1 !
>>
>> Andi..
>>
>>
>>> On Thu, Apr 15, 2010 at 01:22, Michael McCandless
>>>  wrote:
>>>
 On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
  wrote:

  Essentially, we're free to break back compat within "Lucy" at any time,
> but
> we're not able to break back compat within a stable fork like "Lucy1",
> "Lucy2", etc.  So what we'll probably do during normal development with
> Analyzers is just change them and note the break in the Changes file.
>

 So... what if we change up how we develop and release Lucene:

  * A major release always bumps the major release number (2.x ->
3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
releases along that branch

  * There is no back compat across major releases (index nor APIs),
but full back compat within branches.

 This would match how many other projects work (KS/Lucy, as Marvin
 describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

 The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
 if any devs have the itch, they could freely back-port improvements
 from trunk as long as they kept back-compat within the branch.

 I think in such a future world, we could:

  * Remove Version entirely!

  * Not worry at all about back-compat when developing on trunk

  * Give proper names to new improved classes instead of
> StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
today; rename existing classes.

  * Let analyzers freely, incrementally improve

  * Use interfaces without fear

  * Stop spending the truly substantial time (look @ Uwe's awesome
back-compat layer for analyzers!) that we now must spend when
adding new features, for back-compat

  * Be more free to introduce very new not-fully-baked features/APIs,
marked as experimental, on the expectation that once they are used
(in trunk) they will iterate/change/improve vs trying so hard to
get things right on the first go for fear of future back compat
horrors.

 Thoughts...?

 Mike

 -
 To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: java-dev-h...@lucene.apache.org



>>>
>>>
>>> --
>>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>>
>>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>>> ICQ: 104465785
>>>
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.o

Re: Proposal about Version API "relaxation"

2010-04-14 Thread Shai Erera
Ahh ... a dream finally comes true ... what a great way to start a day :).
+1 !!!

I have some questions/comments though:

* Index back compat should be maintained between major releases, like it is
today, STRUCTURE-wise. So apps get a chance to incrementally upgrade their
segments when they move from 2.x to 3.x before 4.0 lands and they'll need to
call optimize() to ensure 4.0 still works on their index. I hope that will
still be the case? Otherwise I don't see how we can prevent reindexing by
apps.
** Index behavioral/runtime changes, like those of Analyzers, are ok to
require a reindex, as proposed.

So after 3.1 is out, trunk can break the API and 3.2 will have a new set of
APIs? Cool and convenient. For how long do we keep the 3.1 branch around?
Also, it used to only get bug fixes, but from now on it'll be allowed to
introduce new features, as long as they maintain back-compat? So 3.1.1 can have
'flex' (going for the extreme on purpose) if someone maintains back-compat?

I think the back-compat on branches should be only for index runtime
changes. There's no point, in my opinion, in maintaining API back-compat
anymore for jar drop-ins, if apps will need to upgrade from 3.1 to 3.1.1
just to get a new feature but have it API back-supported. As soon as they
upgrade to 3.2, that means a new set of APIs, right?

Major releases will just change the index structure format then? Or move to
Java 1.6? Well ... not even that because as I understand it, 3.2 can move to
Java 1.6 ... no API back-compat right :).

That's definitely a great step forward !

Shai

On Thu, Apr 15, 2010 at 1:34 AM, Andi Vajda  wrote:

>
> On Thu, 15 Apr 2010, Earwin Burrfoot wrote:
>
>  Can't believe my eyes.
>>
>> +1
>>
>
> Likewise. +1 !
>
> Andi..
>
>
>> On Thu, Apr 15, 2010 at 01:22, Michael McCandless
>>  wrote:
>>
>>> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
>>>  wrote:
>>>
>>>  Essentially, we're free to break back compat within "Lucy" at any time,
 but
 we're not able to break back compat within a stable fork like "Lucy1",
 "Lucy2", etc.  So what we'll probably do during normal development with
 Analyzers is just change them and note the break in the Changes file.

>>>
>>> So... what if we change up how we develop and release Lucene:
>>>
>>>  * A major release always bumps the major release number (2.x ->
>>>3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
>>>releases along that branch
>>>
>>>  * There is no back compat across major releases (index nor APIs),
>>>but full back compat within branches.
>>>
>>> This would match how many other projects work (KS/Lucy, as Marvin
>>> describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).
>>>
>>> The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
>>> if any devs have the itch, they could freely back-port improvements
>>> from trunk as long as they kept back-compat within the branch.
>>>
>>> I think in such a future world, we could:
>>>
>>>  * Remove Version entirely!
>>>
>>>  * Not worry at all about back-compat when developing on trunk
>>>
>>>  * Give proper names to new improved classes instead of
>>>StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
>>>today; rename existing classes.
>>>
>>>  * Let analyzers freely, incrementally improve
>>>
>>>  * Use interfaces without fear
>>>
>>>  * Stop spending the truly substantial time (look @ Uwe's awesome
>>>back-compat layer for analyzers!) that we now must spend when
>>>adding new features, for back-compat
>>>
>>>  * Be more free to introduce very new not-fully-baked features/APIs,
>>>marked as experimental, on the expectation that once they are used
>>>(in trunk) they will iterate/change/improve vs trying so hard to
>>>get things right on the first go for fear of future back compat
>>>horrors.
>>>
>>> Thoughts...?
>>>
>>> Mike
>>>
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>>
>>>
>>
>>
>> --
>> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
>>
>> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
>> ICQ: 104465785
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>>
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-14 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857164#action_12857164
 ] 

Michael Busch commented on LUCENE-2324:
---

{quote}
It's for performance. I expect there are apps where a given
thread/pool indexes certain kind of docs, ie, the app threads
themselves have "affinity" for docs with similar term distributions.
In which case, it's best (most RAM efficient) if those docs w/
presumably similar term stats are sent back to the same DW. If you
mix in different term stats into one buffer you get worse RAM
efficiency.
{quote}

I do see your point, but I feel like we shouldn't optimize/make compromises for 
this use case.  Mainly, because I think apps with such an affinity that you 
describe are very rare?  The usual design is a queued ingestion pipeline, where 
a pool of indexer threads take docs out of a queue and feed them to an 
IndexWriter, I think?  In such a world the threads wouldn't have an affinity 
for similar docs.

And if a user really has such different docs, maybe the right answer would be to 
have more than one index?  Even if an app utilizes the thread affinity today, 
this only results in maybe somewhat faster indexing performance, and the 
benefits would be lost after flushing/merging.

If we assign docs randomly to available DocumentsWriterPerThreads, then we 
should on average make good use of the overall memory?  Alternatively we could 
also select the DWPT from the pool of available DWPTs that has the highest 
amount of free memory?  

Having fully decoupled memory management is compelling, I think, mainly 
because it makes everything so much simpler.  A DWPT could decide itself when 
it's time to flush, and the other ones can keep going independently.  

If you do have global RAM management, how would the flushing work?  E.g. when 
a global flush is triggered because all RAM is consumed, and we pick the DWPT 
with the highest amount of allocated memory for flushing, what will the other 
DWPTs do during that flush?  Wouldn't we have to pause the other DWPTs to make 
sure we don't exceed the maxRAMBufferSize?
Of course we could say "always flush when 90% of the overall memory is 
consumed", but how would we know that the remaining 10% won't fill up during 
the time the flush takes?  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857158#action_12857158
 ] 

Grant Ingersoll commented on LUCENE-2359:
-

Reverted the last patch and the other related ones.  Let's have a discussion on 
the mailing list about coordinating all of this.  I'd like to see the patches 
be focused on solving the specific issues and then we can open up a new issue 
for refactoring this to make for pluggable best fit, etc.

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.
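> For illustration only (this is not the committed fix), un-transposing the 
> arguments so that the 180th-longitude branch calls getShapeLoop in the same 
> order as the prime-meridian branch would look roughly like:
> {code}
>   } else {//we are around the 180th longitude
>       longX = longX2;
>       longY = -180.0;
>       // pass the corners in the same order as the prime meridian branch above
>       shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   }
> {code}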

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857155#action_12857155
 ] 

Mark Miller commented on LUCENE-2393:
-

Perhaps this should be combined with the high freq terms tool ... we could make 
a ton of these little guys, so prob best to consolidate them.

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands
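> For illustration, a rough sketch of what such a tool can look like against the 
> 3.0-era API (the class name and argument handling here are made up, not the 
> attached patch):
> {code}
> import java.io.File;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.TermDocs;
> import org.apache.lucene.store.FSDirectory;
>
> public class TermStats {
>   public static void main(String[] args) throws Exception {
>     String field = args[0], text = args[1], indexDir = args[2];
>     IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexDir)), true);
>     try {
>       Term term = new Term(field, text);
>       int df = reader.docFreq(term);     // number of docs containing the term
>       long totalTf = 0;
>       TermDocs td = reader.termDocs(term);
>       while (td.next()) {
>         totalTf += td.freq();            // sum of tf over all matching docs
>       }
>       td.close();
>       System.out.println(field + ":" + text + " df=" + df + " totalTf=" + totalTf);
>     } finally {
>       reader.close();
>     }
>   }
> }
> {code}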

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Andi Vajda


On Thu, 15 Apr 2010, Earwin Burrfoot wrote:


Can't believe my eyes.

+1


Likewise. +1 !

Andi..



On Thu, Apr 15, 2010 at 01:22, Michael McCandless
 wrote:

On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
 wrote:


Essentially, we're free to break back compat within "Lucy" at any time, but
we're not able to break back compat within a stable fork like "Lucy1",
"Lucy2", etc.  So what we'll probably do during normal development with
Analyzers is just change them and note the break in the Changes file.


So... what if we change up how we develop and release Lucene:

 * A major release always bumps the major release number (2.x ->
   3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
   releases along that branch

 * There is no back compat across major releases (index nor APIs),
   but full back compat within branches.

This would match how many other projects work (KS/Lucy, as Marvin
describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
if any devs have the itch, they could freely back-port improvements
from trunk as long as they kept back-compat within the branch.

I think in such a future world, we could:

 * Remove Version entirely!

 * Not worry at all about back-compat when developing on trunk

 * Give proper names to new improved classes instead of
    StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
   today; rename existing classes.

 * Let analyzers freely, incrementally improve

 * Use interfaces without fear

 * Stop spending the truly substantial time (look @ Uwe's awesome
   back-compat layer for analyzers!) that we now must spend when
   adding new features, for back-compat

 * Be more free to introduce very new not-fully-baked features/APIs,
   marked as experimental, on the expectation that once they are used
   (in trunk) they will iterate/change/improve vs trying so hard to
   get things right on the first go for fear of future back compat
   horrors.

Thoughts...?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org






--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857124#action_12857124
 ] 

Michael McCandless commented on LUCENE-2324:


This is awesome Michael!  Much simpler... no more FreqProxMergeState, nor the 
logic to interleave/synchronize writing to the doc stores.  I like it!

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857121#action_12857121
 ] 

Michael McCandless commented on LUCENE-2393:


Programmatically indexing those docs is fine -- most tests make a MockRAMDir, 
index a few docs into it, and test against that.
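
For reference, a minimal sketch of that pattern using a plain RAMDirectory 
(MockRAMDir is the test framework's wrapper around the same idea; the analyzer 
and field names below are just placeholders):

{code}
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.RAMDirectory;

public class TinyIndexExample {
  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
    for (String text : new String[] {"foo bar", "foo foo", "bar"}) {
      Document doc = new Document();
      doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
      writer.addDocument(doc);
    }
    writer.close();

    // the tool's df / total-tf output can then be checked against these known docs
    IndexReader reader = IndexReader.open(dir, true);
    System.out.println("df(body:foo) = " + reader.docFreq(new Term("body", "foo")));
    reader.close();
  }
}
{code}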

This tool looks useful, thanks Tom!

Note that with flex scoring (LUCENE-2392) we are planning on storing this 
statistic (sum of tf for the term across all docs) in the terms dict, for 
fields that enable statistics.  So when that lands, this tool can pull from 
that, or regenerate it if the field didn't store stats.

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Earwin Burrfoot
Can't believe my eyes.

+1

On Thu, Apr 15, 2010 at 01:22, Michael McCandless
 wrote:
> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
>  wrote:
>
>> Essentially, we're free to break back compat within "Lucy" at any time, but
>> we're not able to break back compat within a stable fork like "Lucy1",
>> "Lucy2", etc.  So what we'll probably do during normal development with
>> Analyzers is just change them and note the break in the Changes file.
>
> So... what if we change up how we develop and release Lucene:
>
>  * A major release always bumps the major release number (2.x ->
>    3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
>    releases along that branch
>
>  * There is no back compat across major releases (index nor APIs),
>    but full back compat within branches.
>
> This would match how many other projects work (KS/Lucy, as Marvin
> describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).
>
> The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
> if any devs have the itch, they could freely back-port improvements
> from trunk as long as they kept back-compat within the branch.
>
> I think in such a future world, we could:
>
>  * Remove Version entirely!
>
>  * Not worry at all about back-compat when developing on trunk
>
>  * Give proper names to new improved classes instead of
>    StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
>    today; rename existing classes.
>
>  * Let analyzers freely, incrementally improve
>
>  * Use interfaces without fear
>
>  * Stop spending the truly substantial time (look @ Uwe's awesome
>    back-compat layer for analyzers!) that we now must spend when
>    adding new features, for back-compat
>
>  * Be more free to introduce very new not-fully-baked features/APIs,
>    marked as experimental, on the expectation that once they are used
>    (in trunk) they will iterate/change/improve vs trying so hard to
>    get things right on the first go for fear of future back compat
>    horrors.
>
> Thoughts...?
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Robert Muir
+1

On Wed, Apr 14, 2010 at 5:22 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
>  wrote:
>
> > Essentially, we're free to break back compat within "Lucy" at any time,
> but
> > we're not able to break back compat within a stable fork like "Lucy1",
> > "Lucy2", etc.  So what we'll probably do during normal development with
> > Analyzers is just change them and note the break in the Changes file.
>
> So... what if we change up how we develop and release Lucene:
>
>  * A major release always bumps the major release number (2.x ->
>3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
>releases along that branch
>
>  * There is no back compat across major releases (index nor APIs),
>but full back compat within branches.
>
> This would match how many other projects work (KS/Lucy, as Marvin
> describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).
>
> The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
> if any devs have the itch, they could freely back-port improvements
> from trunk as long as they kept back-compat within the branch.
>
> I think in such a future world, we could:
>
>  * Remove Version entirely!
>
>  * Not worry at all about back-compat when developing on trunk
>
>  * Give proper names to new improved classes instead of
>StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
>today; rename existing classes.
>
>  * Let analyzers freely, incrementally improve
>
>  * Use interfaces without fear
>
>  * Stop spending the truly substantial time (look @ Uwe's awesome
>back-compat layer for analyzers!) that we now must spend when
>adding new features, for back-compat
>
>  * Be more free to introduce very new not-fully-baked features/APIs,
>marked as experimental, on the expectation that once they are used
>(in trunk) they will iterate/change/improve vs trying so hard to
>get things right on the first go for fear of future back compat
>horrors.
>
> Thoughts...?
>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-14 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857112#action_12857112
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Michael, nice!  I guess I should've spent more time removing the PerThread 
classes, but now we're pretty much there.  Indeed the simplification should 
make things a lot better.  I guess I'll wait for the next patch to work on 
something.

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857107#action_12857107
 ] 

Otis Gospodnetic commented on LUCENE-2393:
--

I think creating a small index with a couple of docs would be the way to go.

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Assigned: (LUCENE-1698) Change backwards-compatibility policy

2010-04-14 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-1698:
-

Assignee: (was: Michael Busch)

:)

> Change backwards-compatibility policy
> -
>
> Key: LUCENE-1698
> URL: https://issues.apache.org/jira/browse/LUCENE-1698
> Project: Lucene - Java
>  Issue Type: Task
>Reporter: Michael Busch
>Priority: Minor
> Fix For: 3.0
>
>
> These proposed changes might still change slightly:
> I'll call X.Y -> X+1.0 a 'major release', X.Y -> X.Y+1 a
> 'minor release' and X.Y.Z -> X.Y.Z+1 a 'bugfix release'. (we can later
> use different names; just for convenience here...)
> 1. The file format backwards-compatibility policy will remain unchanged;
>i.e. Lucene X.Y supports reading all indexes written with Lucene
>X-1.Y. That means Lucene 4.0 will not have to be able to read 2.x
>indexes.
> 2. Deprecated public and protected APIs can be removed if they have
>been released in at least one major or minor release. E.g. a 3.1
>API can be released as deprecated in 3.2 and removed in 3.3 or 4.0
>(if 4.0 comes after 3.2).
> 3. No public or protected APIs are changed in a bugfix release; except
>if a severe bug can't be fixed otherwise.
> 4. Each release will have release notes with a new section
>"Incompatible changes", which lists, as the names says, all changes that
>break backwards compatibility. The list should also have information
>about how to convert to the new API. I think the eclipse releases
>have such a release notes section. Furthermore, the Deprecation tag 
>comment will state the minimum version when this API is to be removed,  
> e.g.
>@deprecated See #fooBar().  Will be removed in 3.3 
>or
>@deprecated See #fooBar().  Will be removed in 3.3 or later.
> I'd suggest treating a runtime change like an API change (unless it's fixing 
> a bug of course),
> i.e. giving a warning, providing a switch, switching the default behavior 
> only after a major 
> or minor release was around that had the warning/switch. 
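> A minimal illustration of the deprecation tag from item 4 (fooBar() is just the 
> placeholder name used in the example above):
> {code}
> /**
>  * Does foo.
>  *
>  * @deprecated See {@link #fooBar()}.  Will be removed in 3.3 or later.
>  */
> @Deprecated
> public void foo() {
>   fooBar();
> }
>
> /** The replacement API. */
> public void fooBar() {
>   // ...
> }
> {code}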

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Chris Male
On Wed, Apr 14, 2010 at 11:22 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
>  wrote:
>
> > Essentially, we're free to break back compat within "Lucy" at any time,
> but
> > we're not able to break back compat within a stable fork like "Lucy1",
> > "Lucy2", etc.  So what we'll probably do during normal development with
> > Analyzers is just change them and note the break in the Changes file.
>
> So... what if we change up how we develop and release Lucene:
>
>  * A major release always bumps the major release number (2.x ->
>3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
>releases along that branch
>
>  * There is no back compat across major releases (index nor APIs),
>but full back compat within branches.
>
> This would match how many other projects work (KS/Lucy, as Marvin
> describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).
>
> The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
> if any devs have the itch, they could freely back-port improvements
> from trunk as long as they kept back-compat within the branch.
>
> I think in such a future world, we could:
>
>  * Remove Version entirely!
>
>  * Not worry at all about back-compat when developing on trunk
>
>  * Give proper names to new improved classes instead of
>StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
>today; rename existing classes.
>
>  * Let analyzers freely, incrementally improve
>
>  * Use interfaces without fear
>
>  * Stop spending the truly substantial time (look @ Uwe's awesome
>back-compat layer for analyzers!) that we now must spend when
>adding new features, for back-compat
>
>  * Be more free to introduce very new not-fully-baked features/APIs,
>marked as experimental, on the expectation that once they are used
>(in trunk) they will iterate/change/improve vs trying so hard to
>get things right on the first go for fear of future back compat
>horrors.
>
> Thoughts...?
>

+1


>
> Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl


Re: Proposal about Version API "relaxation"

2010-04-14 Thread Michael McCandless
On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey
 wrote:

> Essentially, we're free to break back compat within "Lucy" at any time, but
> we're not able to break back compat within a stable fork like "Lucy1",
> "Lucy2", etc.  So what we'll probably do during normal development with
> Analyzers is just change them and note the break in the Changes file.

So... what if we change up how we develop and release Lucene:

  * A major release always bumps the major release number (2.x ->
3.0), and, starts a new branch for all minor (3.1, 3.2, 3.3)
releases along that branch

  * There is no back compat across major releases (index nor APIs),
but full back compat within branches.

This would match how many other projects work (KS/Lucy, as Marvin
describes above; Apache Tomcat; Hibernate; log4J; FreeBSD; etc.).

The 'stable' branch (say 3.x now for Lucene) would get bug fixes, and,
if any devs have the itch, they could freely back-port improvements
from trunk as long as they kept back-compat within the branch.

I think in such a future world, we could:

  * Remove Version entirely!

  * Not worry at all about back-compat when developing on trunk

  * Give proper names to new improved classes instead of
StandardAnalyzer2, or SmartStandardAnalyzer, that we end up doing
today; rename existing classes.

  * Let analyzers freely, incrementally improve

  * Use interfaces without fear

  * Stop spending the truly substantial time (look @ Uwe's awesome
back-compat layer for analyzers!) that we now must spend when
adding new features, for back-compat

  * Be more free to introduce very new not-fully-baked features/APIs,
marked as experimental, on the expectation that once they are used
(in trunk) they will iterate/change/improve vs trying so hard to
get things right on the first go for fear of future back compat
horrors.

Thoughts...?

Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-04-14 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: lucene-2324.patch

The patch removes all *PerThread classes downstream of DocumentsWriter.

This simplifies a lot of the flushing logic in the different consumers.  The 
patch also removes FreqProxMergeState, because we don't have to interleave 
posting lists from different threads anymore of course.  I really like these 
simplifications!

There is still a lot to do:  The changes in DocumentsWriter and IndexWriter are 
currently just experimental to make everything compile.  Next I will introduce 
DocumentsWriterPerThread and implement the sequenceID logic (which was 
discussed here in earlier comments) and the new RAM management.  I also want to 
go through the indexing chain once again - there are probably a few more things 
to clean up or simplify.

The patch compiles and actually a surprising number of tests pass.  Only 
multi-threaded tests seem to fail,
which is not very surprising, considering I removed all thread-handling logic 
from DocumentsWriter. :) 

So this patch isn't working yet - just wanted to post my current progress.  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324.patch, LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857081#action_12857081
 ] 

Nicolas Helleringer commented on LUCENE-2359:
-

Edit done.

I am currently browsing to find a good reference.

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856954#action_12856954
 ] 

Nicolas Helleringer edited comment on LUCENE-2359 at 4/14/10 4:40 PM:
--

Summary tables :

||Tile Level||TierLength||TierBoxes||TileXLength (miles)||
|0|1|1|24902|
|1|2|4|12451|
|2|4|16|6225,5|
|3|8|64|3112,75|
|4|16|256|1556,375|
|5|32|1024|778,1875|
|6|64|4096|389,09375|
|7|128|16384|194,546875|
|8|256|65536|97,2734375|
|9|512|262144|48,63671875|
|10|1024|1048576|24,31835938|
|11|2048|4194304|12,15917969|
|12|4096|16777216|6,079589844|
|13|8192|67108864|3,039794922|
|14|16384|268435456|1,519897461|
|15|32768|1073741824|0,75994873|

||Radius (miles)||legacy bestFit||legacy bestFit TileLength||legacy bestFit max 
number of Box to fetch||new bestFit||new bestFit TileLength||new bestFit number 
of Box to fetch||
|1|18|0,75994873|9|14|1,519897461|4|
|5|16|0,75994873|64|12|6,079589844|4|
|10|15|0,75994873|225|11|12,15917969|4|
|25|13|3,039794922|100|9|48,63671875|4|
|50|12|6,079589844|100|8|97,2734375|4|
|100|11|12,15917969|100|7|194,546875|4|
|250|10|24,31835938|144|6|389,09375|4|
|500|9|48,63671875|144|5|778,1875|4|
|1000|8|97,2734375|144|4|1556,375|4|
|2500|7|194,546875|196|3|3112,75|4|
|5000|6|389,09375|196|2|6225,5|4|
|1|5|778,1875|196|1|12451|4|
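
For reference, the TileXLength column above is just the Earth's circumference 
(roughly 24902 miles) halved at each tier level; a quick sketch of how the first 
table's numbers are derived (not code from the patch):

{code}
// Reproduces the Tile Level / TierLength / TierBoxes / TileXLength table above.
public class TileLengthTable {
  public static void main(String[] args) {
    double earthCircumferenceMiles = 24902.0;
    for (int level = 0; level <= 15; level++) {
      long tierLength = 1L << level;              // boxes along one axis
      long tierBoxes = tierLength * tierLength;   // total boxes at this level
      double tileXLength = earthCircumferenceMiles / tierLength;
      System.out.println(level + "\t" + tierLength + "\t" + tierBoxes + "\t" + tileXLength);
    }
  }
}
{code}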


  was (Author: nicolas.helleringer):
Summary tables :

||Tile Level||TierLength||TierBoxes||TileXLength (miles)||
|0|1|1|24902|
|1|2|4|12451|
|2|4|16|6225,5|
|3|8|64|3112,75|
|4|16|256|1556,375|
|5|32|1024|778,1875|
|6|64|4096|389,09375|
|7|128|16384|194,546875|
|8|256|65536|97,2734375|
|9|512|262144|48,63671875|
|10|1024|1048576|24,31835938|
|11|2048|4194304|12,15917969|
|12|4096|16777216|6,079589844|
|13|8192|67108864|3,039794922|
|14|16384|268435456|1,519897461|
|15|32768|1073741824|0,75994873|

||Radius (miles)||legacy bestFit||legacy bestFit TileLength||legacy bestFit max 
number of Box to fetch||new bestFit||new bestFit TileLength||new bestFit number 
of Box to fetch||
|1|18|0,75994873|9|14|1,519897461|4|
|5|16|0,75994873|64|12|6,079589844|4|
|10|15|0,75994873|225|11|12,15917969|4|
|25|13|3,039794922|100|9|24,31835938|9|
|50|12|6,079589844|100|8|97,2734375|4|
|100|11|12,15917969|100|7|194,546875|4|
|250|10|24,31835938|144|6|389,09375|4|
|500|9|48,63671875|144|5|778,1875|4|
|1000|8|97,2734375|144|4|1556,375|4|
|2500|7|194,546875|196|3|3112,75|4|
|5000|6|389,09375|196|2|6225,5|4|
|1|5|778,1875|196|1|12451|4|

  
> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2394) Factories for cache creation

2010-04-14 Thread Oswaldo Dantas (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oswaldo Dantas updated LUCENE-2394:
---

Comment: was deleted

(was: By the way, in http://wiki.apache.org/lucene-java/HowToContribute it is said 
that 2.X releases should be compatible with Java 1.4, but I've found some 
autoboxing, annotations and generics in FieldCacheImpl that were being caught by 
ant, which enforces 1.4 source compatibility, so those things are removed in the 
patch; not that it has anything to do specifically with the improvement.)

> Factories for cache creation
> 
>
> Key: LUCENE-2394
> URL: https://issues.apache.org/jira/browse/LUCENE-2394
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Oswaldo Dantas
> Fix For: 2.9.3, 3.0.2
>
> Attachments: factoriesPatch.patch
>
>
> Hello all,
> I've seen LUCENE-831 (Complete overhaul of FieldCache API/Implementation) 
> targeted for version 3.1 and I think that maybe, before this overhaul, it 
> would be good to have a more surgical change, one that would need less effort 
> in new unit tests, without behavior changes and almost no performance impact.
> One way to achieve that is inserting strategically positioned calls to a 
> factory structure that would allow all existing code to continue working 
> without changes, while at the same time giving the opportunity to put 
> alternative factories to work.
> Focusing on the cache idea (not specifically the FieldCache, which has its own 
> specific responsibilities, but the key/value structure that will ultimately 
> hold the cached objects), I've done the small change contained in the patch 
> I'm attaching to this.
> It has default implementations that encapsulate what was originally used in 
> FieldCache, so all current test cases pass, and it creates the possibility to 
> create an EHCacheFactory or InfinispanCacheFactory, or even 
> MyOwnCachingStructureFactory.
> With this, it would be easy to take advantage of the features provided by 
> this kind of project in a uniform way, rapidly allowing new possibilities in 
> scalability and tuning.
> The code in the patch is small (a 16kb file is small compared to the hundreds 
> of kbs in other patches), and even though it doesn't have javadoc right now 
> (sorry) I hope it can be easily understood. So, if the Lucene maintainers see 
> that this contribution could be used (in a 2.9.n+1 and 3.0.n+1 release, and 
> maybe influencing future versions) we could put some more effort into it, 
> documenting, adding the necessary unit tests and maybe contributing other 
> factory implementations.
> What do you think?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2394) Factories for cache creation

2010-04-14 Thread Oswaldo Dantas (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857069#action_12857069
 ] 

Oswaldo Dantas commented on LUCENE-2394:


By the way, in http://wiki.apache.org/lucene-java/HowToContribute it is said that 
2.X releases should be compatible with Java 1.4, but I've found some autoboxing, 
annotations and generics in FieldCacheImpl that were being caught by ant, which 
enforces 1.4 source compatibility, so those things are removed in the patch; not 
that it has anything to do specifically with the improvement.

> Factories for cache creation
> 
>
> Key: LUCENE-2394
> URL: https://issues.apache.org/jira/browse/LUCENE-2394
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Oswaldo Dantas
> Fix For: 2.9.3, 3.0.2
>
> Attachments: factoriesPatch.patch
>
>
> Hello all,
> I've seen LUCENE-831 (Complete overhaul of FieldCache API/Implementation) 
> targeted for version 3.1 and I think that maybe, before this overhaul, it 
> would be good to have a more surgical change, one that would need less effort 
> in new unit tests, without behavior changes and almost no performance impact.
> One way to achieve that is inserting strategically positioned calls to a 
> factory structure that would allow all existing code to continue working 
> without changes, while at the same time giving the opportunity to put 
> alternative factories to work.
> Focusing on the cache idea (not specifically the FieldCache, which has its own 
> specific responsibilities, but the key/value structure that will ultimately 
> hold the cached objects), I've done the small change contained in the patch 
> I'm attaching to this.
> It has default implementations that encapsulate what was originally used in 
> FieldCache, so all current test cases pass, and it creates the possibility to 
> create an EHCacheFactory or InfinispanCacheFactory, or even 
> MyOwnCachingStructureFactory.
> With this, it would be easy to take advantage of the features provided by 
> this kind of project in a uniform way, rapidly allowing new possibilities in 
> scalability and tuning.
> The code in the patch is small (a 16kb file is small compared to the hundreds 
> of kbs in other patches), and even though it doesn't have javadoc right now 
> (sorry) I hope it can be easily understood. So, if the Lucene maintainers see 
> that this contribution could be used (in a 2.9.n+1 and 3.0.n+1 release, and 
> maybe influencing future versions) we could put some more effort into it, 
> documenting, adding the necessary unit tests and maybe contributing other 
> factory implementations.
> What do you think?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857054#action_12857054
 ] 

Grant Ingersoll commented on LUCENE-2359:
-

So, you're saying then that your approach only ever has to retrieve 4 boxes no 
matter the radius?  Do you have a reference URL to where we can read more about 
it?

Also, please edit your table to reflect the error

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2394) Factories for cache creation

2010-04-14 Thread Oswaldo Dantas (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oswaldo Dantas updated LUCENE-2394:
---

Attachment: factoriesPatch.patch

Attaching factory suggestion (patch for changes to 
https://svn.apache.org/repos/asf/lucene/java/tags/lucene_2_9_2) focusing on 
caching, used specifically in FieldCacheImpl and TermInfosReader as 
examples.

> Factories for cache creation
> 
>
> Key: LUCENE-2394
> URL: https://issues.apache.org/jira/browse/LUCENE-2394
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Reporter: Oswaldo Dantas
> Fix For: 2.9.3, 3.0.2
>
> Attachments: factoriesPatch.patch
>
>
> Hello all,
> I've seen LUCENE-831 (Complete overhaul of FieldCache API/Implementation) 
> targeted for version 3.1 and I think that maybe, before this overhaul, it 
> would be good to have a more surgical change, one that would need less effort 
> in new unit tests, without behavior changes and almost no performance impact.
> One way to achieve that is inserting strategically positioned calls to a 
> factory structure that would allow all existing code to continue working 
> without changes, while at the same time giving the opportunity to put 
> alternative factories to work.
> Focusing on the cache idea (not specifically the FieldCache, which has its own 
> specific responsibilities, but the key/value structure that will ultimately 
> hold the cached objects), I've done the small change contained in the patch 
> I'm attaching to this.
> It has default implementations that encapsulate what was originally used in 
> FieldCache, so all current test cases pass, and it creates the possibility to 
> create an EHCacheFactory or InfinispanCacheFactory, or even 
> MyOwnCachingStructureFactory.
> With this, it would be easy to take advantage of the features provided by 
> this kind of project in a uniform way, rapidly allowing new possibilities in 
> scalability and tuning.
> The code in the patch is small (a 16kb file is small compared to the hundreds 
> of kbs in other patches), and even though it doesn't have javadoc right now 
> (sorry) I hope it can be easily understood. So, if the Lucene maintainers see 
> that this contribution could be used (in a 2.9.n+1 and 3.0.n+1 release, and 
> maybe influencing future versions) we could put some more effort into it, 
> documenting, adding the necessary unit tests and maybe contributing other 
> factory implementations.
> What do you think?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Bug in contrib/misc/HighFreqTerms.java?

2010-04-14 Thread Michael McCandless
OK I committed the fix.  I ran it on a flex wikipedia index I had...
it produces output like this:

body:[3c 21 2d 2d] 509050
body:[73 68 6f 75 6c 64] 515495
body:[74 68 65 6e] 525176
body:[74 69 74 6c 65] 525361
body:[5b 5b 55 6e 69 74 65 64] 532586
body:[6b 6e 6f 77 6e] 533558
body:[75 6e 64 65 72] 536480
body:[55 6e 69 74 65 64] 543746

Which is not very readable, but it does this because flex terms are
arbitrary byte[], not necessarily UTF-8... maybe we should fix it to
print both hex and String if we assume the bytes are UTF-8?
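
One way that could look (a sketch only: the helper name is made up, and it
simply assumes the bytes happen to be valid UTF-8):

{code}
// Hypothetical helper: render a term's bytes both as zero-padded hex and,
// assuming they are valid UTF-8, as a String. Invalid sequences would come
// out mangled, which is exactly the caveat with arbitrary byte[] terms.
static String hexAndUtf8(byte[] bytes) throws java.io.UnsupportedEncodingException {
  StringBuilder sb = new StringBuilder("[");
  for (int i = 0; i < bytes.length; i++) {
    if (i > 0) sb.append(' ');
    int b = bytes[i] & 0xff;
    if (b < 0x10) sb.append('0');
    sb.append(Integer.toHexString(b));
  }
  return sb.append("] ").append(new String(bytes, "UTF-8")).toString();
}
{code}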

Mike

On Wed, Apr 14, 2010 at 3:25 PM, Michael McCandless
 wrote:
> Ugh, I'll fix this.
>
> With the new flex API, you can't ask a composite (Multi/DirReader) for
> its postings -- you have to go through the static methods on
> MultiFields.  I'm trying to put some distance b/w IndexReader and
> composite readers... because I'd like to eventually deprecate them.
> Ie, the composite readers should "hold" an ordered collection of
> sub-readers, but should not themselves implement IndexReader's API, I
> think.
>
> Thanks for raising this Tom,
>
> Mike
>
> On Wed, Apr 14, 2010 at 2:14 PM, Burton-West, Tom  wrote:
> When I try to run HighFreqTerms.java in Lucene Revision: 933722 I get
>> the exception appended below.  I believe the line of code involved is a
>> result of the flex indexing merge. Should I post this as a comment to
>> LUCENE-2370 (Reintegrate flex branch into trunk)?
>>
>> Or is there simply something wrong with my configuration?
>>
>> Exception in thread "main" java.lang.UnsupportedOperationException: please
>> use MultiFields.getFields if you really need a top level Fields (NOTE that
>> it's usually better to work per segment instead)
>>     at
>> org.apache.lucene.index.DirectoryReader.fields(DirectoryReader.java:762)
>>     at org.apache.lucene.misc.HighFreqTerms.main(HighFreqTerms.java:71)
>>
>> Tom Burton-West
>>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2394) Factories for cache creation

2010-04-14 Thread Oswaldo Dantas (JIRA)
Factories for cache creation


 Key: LUCENE-2394
 URL: https://issues.apache.org/jira/browse/LUCENE-2394
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Reporter: Oswaldo Dantas
 Fix For: 2.9.3, 3.0.2


Hello all,

I've seen LUCENE-831 (Complete overhaul of FieldCache API/Implementation) targeted 
for version 3.1 and I think that maybe, before this overhaul, it would be good to 
have a more surgical change, one that would need less effort in new unit tests, 
without behavior changes and almost no performance impact.
One way to achieve that is inserting strategically positioned calls to a factory 
structure that would allow all existing code to continue working without changes, 
while at the same time giving the opportunity to put alternative factories to work.
Focusing on the cache idea (not specifically the FieldCache, which has its own 
specific responsibilities, but the key/value structure that will ultimately hold 
the cached objects), I've done the small change contained in the patch I'm 
attaching to this.
It has default implementations that encapsulate what was originally used in 
FieldCache, so all current test cases pass, and it creates the possibility to 
create an EHCacheFactory or InfinispanCacheFactory, or even 
MyOwnCachingStructureFactory.
With this, it would be easy to take advantage of the features provided by this 
kind of project in a uniform way, rapidly allowing new possibilities in 
scalability and tuning.
The code in the patch is small (a 16kb file is small compared to the hundreds of 
kbs in other patches), and even though it doesn't have javadoc right now (sorry) I 
hope it can be easily understood. So, if the Lucene maintainers see that this 
contribution could be used (in a 2.9.n+1 and 3.0.n+1 release, and maybe 
influencing future versions) we could put some more effort into it, documenting, 
adding the necessary unit tests and maybe contributing other factory 
implementations.
What do you think?
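
To make the factory idea concrete, here is a rough sketch of the kind of seam
being described, kept free of generics to stay 1.4-friendly for the 2.9 line
(all names are invented for illustration; this is not the attached patch):

{code}
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical factory seam: callers ask the factory for their key/value store
// instead of newing up a map directly, so an EHCache- or Infinispan-backed
// implementation could be swapped in without touching the calling code.
interface CacheFactory {
  Map newCache();
}

class DefaultCacheFactory implements CacheFactory {
  // The default mirrors a plain in-heap WeakHashMap; locking stays with the caller.
  public Map newCache() {
    return new WeakHashMap();
  }
}
{code}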

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Bug in contrib/misc/HighFreqTerms.java?

2010-04-14 Thread Michael McCandless
Ugh, I'll fix this.

With the new flex API, you can't ask a composite (Multi/DirReader) for
its postings -- you have to go through the static methods on
MultiFields.  I'm trying to put some distance b/w IndexReader and
composite readers... because I'd like to eventually deprecate them.
Ie, the composite readers should "hold" an ordered collection of
sub-readers, but should not themselves implement IndexReader's API, I
think.
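
For anyone hitting the same exception, a minimal sketch of the MultiFields
route (based on the flex trunk API of the time; exact method names may have
shifted since, so treat it as an illustration, null checks omitted):

{code}
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TopLevelTermsExample {
  // Instead of reader.fields(), which composite (Multi/Dir) readers refuse:
  static void dumpDocFreqs(IndexReader reader, String field) throws Exception {
    Fields fields = MultiFields.getFields(reader); // merged view over sub-readers
    Terms terms = fields.terms(field);
    TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
      System.out.println(field + ":" + term + " docFreq=" + termsEnum.docFreq());
    }
  }
}
{code}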

Thanks for raising this Tom,

Mike

On Wed, Apr 14, 2010 at 2:14 PM, Burton-West, Tom  wrote:
> When I try to run HighFreqTerms.java in Lucene Revision: 933722 I get
> the exception appended below.  I believe the line of code involved is a
> result of the flex indexing merge. Should I post this as a comment to
> LUCENE-2370 (Reintegrate flex branch into trunk)?
>
> Or is there simply something wrong with my configuration?
>
> Exception in thread "main" java.lang.UnsupportedOperationException: please
> use MultiFields.getFields if you really need a top level Fields (NOTE that
> it's usually better to work per segment instead)
>     at
> org.apache.lucene.index.DirectoryReader.fields(DirectoryReader.java:762)
>     at org.apache.lucene.misc.HighFreqTerms.main(HighFreqTerms.java:71)
>
> Tom Burton-West
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Robert Muir
On Wed, Apr 14, 2010 at 2:49 PM, Uwe Schindler  wrote:

> > And 2.9's backwards compatibility layer in
> > TokenStream
> > was significantly slower.
>
> I protest! No, it was not slower, only at the beginning because of missing
> reflection caching! But this also affected the *new* API. With 2.9.x and old
> TokenStreams there is no speed difference, really.
>

But it wasn't like this initially: it only got there after you put even more work
into the backwards compatibility layer, after discovering performance issues with
Solr, all of it happening in a minor release driven by major changes.

I guess Marvin is hinting that perhaps major changes could be associated
with major versions. For that example, perhaps more time could instead have been
spent upgrading Solr's tokenstreams so it could move to 3.0 (rather
than almost a year later).

And I do think it's a good example: you put a ton of work into this, but not
all backwards compatibility can be handled like this, and what if somehow
this one had slipped through without this caching? I think most users would
consider it strange to experience a performance degradation in a minor
release caused by major changes...

-- 
Robert Muir
rcm...@gmail.com


RE: Proposal about Version API "relaxation"

2010-04-14 Thread Uwe Schindler
> And 2.9's backwards compatibility layer in
> TokenStream
> was significantly slower.

I protest! No, it was not slower, only at the beginning because of missing 
reflection caching! But this also affected the *new* API. With 2.9.x and old 
TokenStreams there is no speed difference, really.

Uwe


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: Proposal about Version API "relaxation"

2010-04-14 Thread Uwe Schindler
+1, thanks for this detailed explanation! In my apps I have no problem 
defining a static default myself. And passing this to every ctor is easy, so 
where is the problem? Look at Solr: since we introduced the version param to 
solrconfig, you have exactly that behavior, but it's limited to the Solr 
installation using that solr config. And you can still override it.

Lucene is a library, not an application, so it's not Lucene's responsibility to 
handle such things. Configuration, and passing configuration objects around, is 
an application responsibility.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Wednesday, April 14, 2010 6:58 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Proposal about Version API "relaxation"
> 
> On 04/14/2010 12:29 PM, Marvin Humphrey wrote:
> > On Wed, Apr 14, 2010 at 08:30:14AM -0400, Grant Ingersoll wrote:
> >
> >> The thing I keep going back to is that somehow Lucene has managed for years
> >> (and I mean lots of years) w/o stuff like Version and all this massive back
> >> compatibility checking.
> >>
> > Non-constant global variables are an anti-pattern.
> >
> 
> I think clinging to such rules in the face of all situations is an
> anti-pattern :) I take it as a rule of thumb.
> 
> In regards to this discussion:
> 
> I agree that the Version stuff is a bit of a mess. I also agree that
> many users will want to just use one version across their app that is
> easy to change.
> 
> I disagree that we should allow that behavior by just using a
> constructor without the Version param - or that you would be forced to
> set the static Version setting by trying to run your app and seeing an
> exception happen. That is all a bit ugly.
> 
> Too many users will not understand Version or care to if they see they
> can skip passing it. IMO, you should have to specify that you are
> looking for this behavior. In which case, why not just specify it using
> the version param itself :) E.g. if a user wants to get this kind of
> static behavior, they can just choose to do it on their own, and pass
> their *own* static Version constant to all the constructors.
> 
> I don't think we need to go through this hassle and introduce a less
> than ideal solution just so that users can pass one less param -
> especially when I think you should explicitly choose this behavior
> rather than get it by ignoring the Version param.
> 
> --
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Marvin Humphrey
On Wed, Apr 14, 2010 at 12:49:52AM -0400, Robert Muir wrote:

> it's very unnatural for release 3.0 to be almost a no-op and for release 3.1
> to provide a new default index format and support for customizing how the
> index is stored. And now we are looking at providing flexibility in scoring
> that will hopefully redefine lucene from being a vector-space search engine
> library to something much more flexible?  This is a minor release?!

I agree, but what really bothers me are the X.9 releases.  

2.9 changed performance characteristics dramatically enough that it was a
backwards-break in all but name for many users -- most prominently, Solr[1].
Solr's FieldCache RAM requirements doubled because of the transition to
per-segment search.  And 2.9's backwards compatibility layer in TokenStream
was significantly slower.

In my opinion, the transition to per-segment search and new-style TokenStreams
should have triggered a major version break.  Had that been the case, less
effort could have been spent on backwards compatibility shims and fewer API
design compromises would have been necessary.

To avoid such costs in the future, and to communicate disruptions in the
library to users via version numbers more accurately...

  * There should not be a Lucene 3.9.  
  * Lucene 4.0 should do more than remove deprecations.

Marvin Humphrey

[1] Thanks to Robert and Mark Miller for reminding me just what the
Solr/Lucene-2.9 problems were via IRC.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857010#action_12857010
 ] 

Nicolas Helleringer commented on LUCENE-2359:
-

Yonik, 

It is the case, but the points left out are for sure not in the search area.

Grant,

You were right! It was a c&p error! As you can see in the above table, 
'TileLength' for Tier 9 is 48,63671875, not 24,31835938, and then the 'new 
bestFit number of Box to fetch' becomes ... 4! =)

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



RE: issues.apache.org compromised: please update your passwords

2010-04-14 Thread Chris Hostetter

: I disabled the account by assigning a dummy eMail and gave it a random 
password.
: 
: I was not able to unassign the issues, as most issues were "Closed", 
: where no modifications can be done anymore. Reopening and changing 

Uwe: it may be too late (depending on whether you remember the dummy 
password) but an alternate course of action would have been to change the 
email address to the PMC list (priv...@lucene) which is not publicly 
archived.


-Hoss


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857006#action_12857006
 ] 

Yonik Seeley commented on LUCENE-2359:
--

Perhaps I'm misreading the table?  I had assumed that your new algorithm was 
often less selective (allowed more points through the filter) than the old.  Is 
this not the case?

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Bug in contrib/misc/HighFreqTerms.java?

2010-04-14 Thread Burton-West, Tom
When I try to run HighFreqTerms.java in Lucene Revision: 933722 I get the 
exception appended below.  I believe the line of code involved is a result of 
the flex indexing merge. Should I post this as a comment to LUCENE-2370 
(Reintegrate flex branch into trunk)?

Or is there simply something wrong with my configuration?

Exception in thread "main" java.lang.UnsupportedOperationException: please use 
MultiFields.getFields if you really need a top level Fields (NOTE that it's 
usually better to work per segment instead)
at 
org.apache.lucene.index.DirectoryReader.fields(DirectoryReader.java:762)
at org.apache.lucene.misc.HighFreqTerms.main(HighFreqTerms.java:71)

Tom Burton-West



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857001#action_12857001
 ] 

Nicolas Helleringer commented on LUCENE-2359:
-

Hi Yonik,

I do not agree: as the 4 tiles requested cover all of the search area, there is 
no gain in being less accurate.

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856984#action_12856984
 ] 

Yonik Seeley commented on LUCENE-2359:
--

Hi Nicolas, I like the idea of reducing the number of tiles that need to be 
queried, but it does look like the current reduction might be a little 
aggressive for the default.  Perhaps we could have some sort of filtering 
accuracy parameter that could give more precise control over the trade-off?


> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856980#action_12856980
 ] 

Nicolas Helleringer commented on LUCENE-2359:
-

I do agree, it is odd.

I shall go through the process again and watch where it comes from.

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Issue Comment Edited: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856965#action_12856965
 ] 

Tom Burton-West edited comment on LUCENE-2393 at 4/14/10 1:26 PM:
--

Patch against recent trunk.   Can someone please suggest an appropriate 
existing unit test to use as a model for creating a unit test for this?   Would 
it be appropriate to include a small index file for testing or is it better to 
programmatically create the index file?

  was (Author: tburtonwest):
Patch against recent trunk
  
> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856967#action_12856967
 ] 

Tom Burton-West commented on LUCENE-2393:
-

For an example of how this utility can be used please see: 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Burton-West updated LUCENE-2393:


Attachment: LUCENE-2393.patch

Patch against recent trunk

> Utility to output total term frequency and df from a lucene index
> -
>
> Key: LUCENE-2393
> URL: https://issues.apache.org/jira/browse/LUCENE-2393
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Reporter: Tom Burton-West
>Priority: Trivial
> Attachments: LUCENE-2393.patch
>
>
> This is a command line utility that takes a field name, term, and index 
> directory and outputs the document frequency for the term and the total 
> number of occurrences of the term in the index (i.e. the sum of the tf of the 
> term for each document).  It is useful for estimating the size of the term's 
> entry in the *prx files and consequent Disk I/O demands

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2393) Utility to output total term frequency and df from a lucene index

2010-04-14 Thread Tom Burton-West (JIRA)
Utility to output total term frequency and df from a lucene index
-

 Key: LUCENE-2393
 URL: https://issues.apache.org/jira/browse/LUCENE-2393
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Tom Burton-West
Priority: Trivial


This is a command line utility that takes a field name, term, and index 
directory and outputs the document frequency for the term and the total number 
of occurrences of the term in the index (i.e. the sum of the tf of the term for 
each document).  It is useful for estimating the size of the term's entry in 
the *prx files and consequent Disk I/O demands
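
For reference, the core of such a utility is only a few lines against the pre-flex 
API (a sketch with an invented class name, not the attached patch, which targets 
trunk):

{code}
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

public class TermStats {
  // args: field term indexDir
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[2])), true);
    try {
      Term term = new Term(args[0], args[1]);
      long totalTf = 0;
      TermDocs termDocs = reader.termDocs(term);
      while (termDocs.next()) {
        totalTf += termDocs.freq();  // sum of tf over all docs containing the term
      }
      System.out.println("docFreq=" + reader.docFreq(term) + " totalTermFreq=" + totalTf);
    } finally {
      reader.close();
    }
  }
}
{code}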

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Grant Ingersoll
Thanks.  I added my comment on the issue.  I think we should revert and then 
someone can put up a patch to make this pluggable.  As it stands, this Best Fit 
calculation has nothing to do with the CartesianTierPlotter anyway, so we could 
refactor it pretty easily.

-Grant

On Apr 14, 2010, at 12:42 PM, Helleringer, Nicolas wrote:

> Tables are well on JIRA : https://issues.apache.org/jira/browse/LUCENE-2359
> 
> Nicolas
> 
> 2010/4/14 Helleringer, Nicolas 
> Here are the summary tables :
> 
> First, a table to remind metrics on the Tiers:
> ||Tile Level||TierLength||TierBoxes||TileLength (miles)||
> |0|1|1|24902|
> |1|2|4|12451|
> |2|4|16|6225,5|
> |3|8|64|3112,75|
> |4|16|256|1556,375|
> |5|32|1024|778,1875|
> |6|64|4096|389,09375|
> |7|128|16384|194,546875|
> |8|256|65536|97,2734375|
> |9|512|262144|48,63671875|
> |10|1024|1048576|24,31835938|
> |11|2048|4194304|12,15917969|
> |12|4096|16777216|6,079589844|
> |13|8192|67108864|3,039794922|
> |14|16384|268435456|1,519897461|
> |15|32768|1073741824|0,75994873|
> 
> 
> Then the comparison table between legacy and new bestFit:
> ||Radius (miles)||legacy bestFit||legacy bestFit TileLength||legacy bestFit max number of Box to fetch||new bestFit||new bestFit TileLength||new bestFit number of Box to fetch||
> |1|18|0,75994873|9|14|1,519897461|4|
> |5|16|0,75994873|64|12|6,079589844|4|
> |10|15|0,75994873|225|11|12,15917969|4|
> |25|13|3,039794922|100|9|24,31835938|9|
> |50|12|6,079589844|100|8|97,2734375|4|
> |100|11|12,15917969|100|7|194,546875|4|
> |250|10|24,31835938|144|6|389,09375|4|
> |500|9|48,63671875|144|5|778,1875|4|
> |1000|8|97,2734375|144|4|1556,375|4|
> |2500|7|194,546875|196|3|3112,75|4|
> |5000|6|389,09375|196|2|6225,5|4|
> |10000|5|778,1875|196|1|12451|4|
> 
> I hope mailers will keep the formatting ...
> 
> 
> 
> 
> 
> If not I shall post on JIRA.
> 
> Formulas :
> TileLength is 24902 (earth circumference) / TierLength
> bestFit formulas as summarized by Grant in his email.
> number of boxes to fetch: pow(ceil(Radius/TileLength)+1,2) => 
> Radius/TileLength is how many tiles are needed to cover the radius, +1 is 
> because you are not always well aligned, and the pow(X,2) is because there are 
> two directions/axes
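(As a worked check of that box-count formula against the table above: for a 
25-mile radius the legacy bestFit picks tier 13, whose TileLength is about 3.04 
miles, giving (ceil(25 / 3.04) + 1)^2 = (9 + 1)^2 = 100 boxes, which matches the 
'legacy bestFit max number of Box to fetch' column.)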
> 
> Best regards,
> 
> Nicolas
> 
> 2010/4/14 Chris Male 
> 
> Hi,
> 
> On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll  wrote:
> 
> On Apr 14, 2010, at 11:06 AM, Chris Male wrote:
> 
> > Hi,
> >
> > My understanding of the benefits of the new algorithm is that it means a 
> > lower tier level resulting in fewer boxes, but more documents inside those 
> > boxes that are outside of the search radius.
> >
> > While having fewer boxes means fewer term queries to make against the 
> > index, more documents means more costly calculations to filter out those 
> > extraneous documents.
> >
> > For those doing just Cartesian Tier filtering it seems like the new 
> > approach is a win, but for those doing distance calculations on those 
> > documents passing the filter, it seems to come at a cost.
> 
> Currently, this is only used for filtering.  AIUI, Tiers aren't really that 
> useful for distance calculations, are they?  After all, all you have is a box 
> id and you'd have to reverse out the calc of that to be able to calc a 
> distance, no?  Perhaps I'm missing something.
> 
> 
> How Spatial Lucene currently works (or at least one of the ways it was 
> designed to work) is via a 2-step filtering process.  Step 1 is the 
> Cartesian Tier filtering.  The resulting set of Documents is then passed on 
> through to Step 2 which then calculates the distance from each Document to 
> the search centre.  If the distance is greater than the radius, the Document 
> is filtered out.  This means that after both filtering steps you have only 
> those Documents that are in the search radius.
> 
> How this impacts this algorithm choice is that the more Documents that pass 
> through Step 1, the more calculations that have to be done in Step 2.
>  
> I'm not sure, however, that it is a win for filtering.  It seems like you end 
> up including docs in the result set that shouldn't be in there.
> 
> I'll wait for Nicolas' summary table, but I'm inclined to revert and then 
> someone can refactor if they want to offer alternate implementations.
> 
> -Grant
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> 
> 
> 
> 
> -- 
> Chris Male | Software Developer | JTeam BV.| www.jteam.nl
> 
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856963#action_12856963
 ] 

Grant Ingersoll commented on LUCENE-2359:
-

Thanks, Nicolas.  To me, based on these values, the answer is to revert and 
then refactor.  

Also, is the 9 in the last column of the second table (radius 25) an outlier or 
a c & p error?  That seems really odd.

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Grant Ingersoll

On Apr 14, 2010, at 12:12 PM, Chris Male wrote:

> Hi,
> 
> On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll  wrote:
> 
> On Apr 14, 2010, at 11:06 AM, Chris Male wrote:
> 
> > Hi,
> >
> > My understanding of the benefits of the new algorithm is that it means a 
> > lower tier level resulting in fewer boxes, but more documents inside those 
> > boxes that are outside of the search radius.
> >
> > While having fewer boxes means fewer term queries to make against the 
> > index, more documents means more costly calculations to filter out those 
> > extraneous documents.
> >
> > For those doing just Cartesian Tier filtering it seems like the new 
> > approach is a win, but for those doing distance calculations on those 
> > documents passing the filter, it seems to come at a cost.
> 
> Currently, this is only used for filtering.  AIUI, Tiers aren't really that 
> useful for distance calculations, are they?  After all, all you have is a box 
> id and you'd have to reverse out the calc of that to be able to calc a 
> distance, no?  Perhaps I'm missing something.
> 
> 
> How Spatial Lucene currently works (or at least one of the ways it was 
> designed to work) is via a 2-step filtering process.  Step 1 is the 
> Cartesian Tier filtering.  The resulting set of Documents is then passed on 
> through to Step 2 which then calculates the distance from each Document to 
> the search centre.  If the distance is greater than the radius, the Document 
> is filtered out.  This means that after both filtering steps you have only 
> those Documents that are in the search radius.
> 
> How this impacts this algorithm choice is that the more Documents that pass 
> through Step 1, the more calculations that have to be done in Step 2.

OK, I see what you mean now.  I thought you were implying the box id would be 
used for calculating a distance, too.
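
For context, the per-document work in Step 2 is typically an exact great-circle 
distance check. A minimal, self-contained form of that check (haversine; an 
illustration, not contrib/spatial's actual code):

{code}
public final class HaversineCheck {
  private static final double EARTH_RADIUS_MILES = 3958.8;

  // Exact great-circle distance between two lat/long points, in miles.
  public static double miles(double lat1, double lon1, double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
               * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
  }

  public static void main(String[] args) {
    // The two test points from LUCENE-2359, one on each side of the 180 meridian:
    // they are ~13.8 miles apart, so Step 2 keeps them for any radius >= that.
    System.out.println(miles(0, 179.9, 0, -179.9));
  }
}
{code}

Every extra document that survives Step 1 costs one such call per search, which 
is the trade-off being weighed here.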

Re: Proposal about Version API "relaxation"

2010-04-14 Thread Andi Vajda


On Apr 14, 2010, at 7:45, Yonik Seeley   
wrote:


On Wed, Apr 14, 2010 at 10:39 AM, DM Smith   
wrote:
Maybe have the index store the version(s) and use that when  
constructing a

reader or writer?


That would cause a reindex to change behavior (among other problems).


If the index contained this information it could prevent mistakes  
where one adds documents or queries them with a different analyzer  
version setting than used when the index was created, leading to  
subtle bugs...


It seems to me, then, that the only time an analyzer version would be  
required is at index (re)creation time.


Andi..



-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Mark Miller

On 04/14/2010 12:29 PM, Marvin Humphrey wrote:

> On Wed, Apr 14, 2010 at 08:30:14AM -0400, Grant Ingersoll wrote:
>> The thing I keep going back to is that somehow Lucene has managed for years
>> (and I mean lots of years) w/o stuff like Version and all this massive back
>> compatibility checking.
>
> Non-constant global variables are an anti-pattern.

I think clinging to such rules in the face of all situations is an 
anti-pattern :) I take it as a rule of thumb.


In regards to this discussion:

I agree that the Version stuff is a bit of a mess. I also agree that 
many users will want to just use one version across their app that is 
easy to change.


I disagree that we should allow that behavior by just using a 
constructor without the Version param - or that you would be forced to 
set the static Version setting by trying to run your app and seeing an 
exception happen. That is all a bit ugly.


Too many users will not understand Version or care to if they see they 
can skip passing it. IMO, you should have to specify that you are 
looking for this behavior. In which case, why not just specify it using 
the version param itself :) E.g. if a user wants to get this kind of 
static behavior, they can just choose to do it on their own, and pass 
their *own* static Version constant to all the constructors.
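
A concrete sketch of that roll-your-own pattern (class and constant names are 
invented; assumes the 3.0-era StandardAnalyzer/QueryParser constructors):

{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

public final class MyLuceneDefaults {
  // One app-wide constant: every ctor call still names a Version explicitly,
  // and upgrading the whole app means editing this single line.
  public static final Version MATCH_VERSION = Version.LUCENE_30;

  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer(MATCH_VERSION);
    QueryParser parser = new QueryParser(MATCH_VERSION, "body", analyzer);
    System.out.println(parser.parse("hello world"));
  }
}
{code}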


I don't think we need to go through this hassle and introduce a less 
than ideal solution just so that users can pass one less param - 
especially when I think you should explicitly choose this behavior 
rather than get it by ignoring the Version param.


--
- Mark

http://www.lucidimagination.com




-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Robert Muir
On Wed, Apr 14, 2010 at 12:29 PM, Marvin Humphrey wrote:
>
> > I also am not sure whether in the past we just missed/ignored more back
> > compatibility issues or whether now we are creating more back compat. issues
> > due to more rapid change.
>
> It would be hard to search the archives to confirm my recollection, but I seem
> to remember back compat for Analyzers coming up every once in a while -- say,
> in the context of modifying StandardAnalyzer's stoplist -- and changes not
> being made because they would change search results.
>

I think even things considered bugs were not actually "fixed" by default
because of this, until Version?

-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856954#action_12856954
 ] 

Nicolas Helleringer commented on LUCENE-2359:
-

Summary tables :

||Tile Level||TierLength||TierBoxes||TileLength (miles)||
|0|1|1|24902|
|1|2|4|12451|
|2|4|16|6225,5|
|3|8|64|3112,75|
|4|16|256|1556,375|
|5|32|1024|778,1875|
|6|64|4096|389,09375|
|7|128|16384|194,546875|
|8|256|65536|97,2734375|
|9|512|262144|48,63671875|
|10|1024|1048576|24,31835938|
|11|2048|4194304|12,15917969|
|12|4096|16777216|6,079589844|
|13|8192|67108864|3,039794922|
|14|16384|268435456|1,519897461|
|15|32768|1073741824|0,75994873|

||Radius (miles)||legacy bestFit||legacy bestFit TileLength||legacy bestFit max number of Box to fetch||new bestFit||new bestFit TileLength||new bestFit number of Box to fetch||
|1|18|0,75994873|9|14|1,519897461|4|
|5|16|0,75994873|64|12|6,079589844|4|
|10|15|0,75994873|225|11|12,15917969|4|
|25|13|3,039794922|100|9|24,31835938|9|
|50|12|6,079589844|100|8|97,2734375|4|
|100|11|12,15917969|100|7|194,546875|4|
|250|10|24,31835938|144|6|389,09375|4|
|500|9|48,63671875|144|5|778,1875|4|
|1000|8|97,2734375|144|4|1556,375|4|
|2500|7|194,546875|196|3|3112,75|4|
|5000|6|389,09375|196|2|6225,5|4|
|10000|5|778,1875|196|1|12451|4|


> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.
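
A hedged sketch of the fix that description points at (it assumes getShapeLoop 
keeps the (latX, longX, latY, longY) argument order used in the prime-meridian 
branch; an illustration only, not the committed patch):

{code}
if (longX2 != 0.0) {
  if (longX == 0.0) {
    // around the prime meridian: sweep from the lower left longitude to 0.0
    longX = longX2;
    longY = 0.0;
  } else {
    // around the 180th meridian: sweep from the lower left longitude to -180
    longX = longX2;
    longY = -180.0;
  }
  // same argument order in both branches, so the X and Y values are no longer transposed
  shape = getShapeLoop(shape, ctp, latX, longX, latY, longY);
}
{code}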

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Helleringer, Nicolas
Tables are now up on JIRA: https://issues.apache.org/jira/browse/LUCENE-2359

Nicolas

2010/4/14 Helleringer, Nicolas 

> Here are the summary tables :
>
> First a table to remind metrics on the Tiers :
> Tile Level TierLegnth TierBoxes TileLength (miles) 0 1 1 24902 1 2 4 12451
> 2 4 16 6225,5 3 8 64 3112,75 4 16 256 1556,375 5 32 1024 778,1875 6 64 4096
> 389,09375 7 128 16384 194,546875 8 256 65536 97,2734375 9 512 262144
> 48,63671875 10 1024 1048576 24,31835938 11 2048 4194304 12,15917969 12 4096
> 16777216 6,079589844 13 8192 67108864 3,039794922 14 16384 268435456
> 1,519897461 15 32768 1073741824 0,75994873
>
>
> Then the comparaison table between legacy and new bestFit :
> Radius (miles) legacy bestFit legacy bestFit TileLength legacy bestFit max
> number of Box to fetch new bestFit new bestFit TileLength new bestFit number
> of Box to fetch 1 18 0,75994873 9 14 1,519897461 4 5 16 0,75994873 64 12
> 6,079589844 4 10 15 0,75994873 225 11 12,15917969 4 25 13 3,039794922 100 9
> 24,31835938 9 50 12 6,079589844 100 8 97,2734375 4 100 11 12,15917969 100 7
> 194,546875 4 250 10 24,31835938 144 6 389,09375 4 500 9 48,63671875 144 5
> 778,1875 4 1000 8 97,2734375 144 4 1556,375 4 2500 7 194,546875 196 3
> 3112,75 4 5000 6 389,09375 196 2 6225,5 4 1 5 778,1875 196 1 12451 4
>
> I hope mailers will keep the formating ...
>
>
>
>
>
> If not I shall post on JIRA.
>
> Formulas :
> TileLength is 24902 (earth circumference) / TierLength
> bestFit formulas as summarized by Grant in his email.
> number of boxes to fetch : pow(ceil(Radius/TileLength)+1,2) =>
> Radius/TileLength is how many tiles are needed to cover the radius, +1
> is because you are not always well aligned, and the pow(X,2) because there are
> two directions/axes
>
> Best regards,
>
> Nicolas
>
> 2010/4/14 Chris Male 
>
> Hi,
>>
>> On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll wrote:
>>
>>>
>>> On Apr 14, 2010, at 11:06 AM, Chris Male wrote:
>>>
>>> > Hi,
>>> >
>>> > My understanding of the benefits of the new algorithm is that it means
>>> a lower tier level resulting in fewer boxes, but more documents inside those
>>> boxes that are outside of the search radius.
>>> >
>>> > While having fewer boxes means fewer term queries to make against the
>>> index, more documents means more costly calculations to filter out those
>>> extraneous documents.
>>> >
>>> > For those doing just Cartesian Tier filtering it seems like the new
>>> approach is a win, but for those doing distance calculations on those
>>> documents passing the filter, it seems to come at a cost.
>>>
>>> Currently, this is only used for filtering.  AIUI, Tiers aren't really
>>> that useful for distance calculations, are they?  After all, all you have is
>>> a box id and you'd have to reverse out the calc of that to be able to calc a
>>> distance, no?  Perhaps I'm missing something.
>>>
>>>
>> How Spatial Lucene currently works (or at least one of the ways it was
>> designed to work), is using a 2 step filtering process.  Step 1 is the
>> Cartesian Tier filtering.  The resulting set of Documents is then passed on
>> through to Step 2 which then calculates the distance from each Document to
>> the search centre.  If the distance is greater than the radius, the Document
>> is filtered out.  This means that after both filtering steps you have only
>> those Documents that are in the search radius.
>>
>> How this impacts this algorithm choice is that the more Documents the pass
>> through Step 1, the more calculations that have to be done in Step 2.
>>
>>
>>> I'm not sure, however, that it is a win for filtering.  It seems like you
>>> end up including docs in the result set that should be in there.
>>>
>>> I'll wait for Nicolas' summary table, but I'm inclined to revert and then
>>> someone can refactor if they want to offer alternate implementations.
>>>
>>> -Grant
>>> -
>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>
>>>
>>
>>
>> --
>> Chris Male | Software Developer | JTeam BV.| www.jteam.nl
>>
>
>


Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Helleringer, Nicolas
Here are the summary tables :

First a table to remind metrics on the Tiers :
||Tile Level||TierLength||TierBoxes||TileLength (miles)||
|0|1|1|24902|
|1|2|4|12451|
|2|4|16|6225,5|
|3|8|64|3112,75|
|4|16|256|1556,375|
|5|32|1024|778,1875|
|6|64|4096|389,09375|
|7|128|16384|194,546875|
|8|256|65536|97,2734375|
|9|512|262144|48,63671875|
|10|1024|1048576|24,31835938|
|11|2048|4194304|12,15917969|
|12|4096|16777216|6,079589844|
|13|8192|67108864|3,039794922|
|14|16384|268435456|1,519897461|
|15|32768|1073741824|0,75994873|


Then the comparison table between legacy and new bestFit :
||Radius (miles)||legacy bestFit||legacy bestFit TileLength||legacy bestFit max number of Box to fetch||new bestFit||new bestFit TileLength||new bestFit number of Box to fetch||
|1|18|0,75994873|9|14|1,519897461|4|
|5|16|0,75994873|64|12|6,079589844|4|
|10|15|0,75994873|225|11|12,15917969|4|
|25|13|3,039794922|100|9|24,31835938|9|
|50|12|6,079589844|100|8|97,2734375|4|
|100|11|12,15917969|100|7|194,546875|4|
|250|10|24,31835938|144|6|389,09375|4|
|500|9|48,63671875|144|5|778,1875|4|
|1000|8|97,2734375|144|4|1556,375|4|
|2500|7|194,546875|196|3|3112,75|4|
|5000|6|389,09375|196|2|6225,5|4|
|10000|5|778,1875|196|1|12451|4|

I hope mailers will keep the formatting ...





If not I shall post on JIRA.

Formulas :
TileLength is 24902 (earth circumference) / TierLength
bestFit formulas as summarized by Grant in his email.
number of boxes to fetch : pow(ceil(Radius/TileLength)+1,2) =>
Radius/TileLength is how many tiles are needed to cover the radius, +1
is because you are not always well aligned, and the pow(X,2) because there are
two directions/axes
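
A small sketch that reproduces a row of the table above from these formulas 
(plain Java; the class and method names are made up for illustration):

{code}
public class BoxCountSketch {
  static final double EARTH_CIRCUMFERENCE_MILES = 24902.0;

  // TileLength at a tier = earth circumference / 2^tier (TierLength column above)
  static double tileLength(int tier) {
    return EARTH_CIRCUMFERENCE_MILES / Math.pow(2, tier);
  }

  // boxes to fetch = (ceil(Radius / TileLength) + 1)^2
  static int boxesToFetch(double radiusMiles, int tier) {
    int perAxis = (int) Math.ceil(radiusMiles / tileLength(tier)) + 1;
    return perAxis * perAxis;
  }

  public static void main(String[] args) {
    // Radius 50 miles: legacy bestFit 12 -> 100 boxes, new bestFit 8 -> 4 boxes
    System.out.println(boxesToFetch(50, 12)); // 100
    System.out.println(boxesToFetch(50, 8));  // 4
  }
}
{code}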

Best regards,

Nicolas

2010/4/14 Chris Male 

> Hi,
>
> On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll wrote:
>
>>
>> On Apr 14, 2010, at 11:06 AM, Chris Male wrote:
>>
>> > Hi,
>> >
>> > My understanding of the benefits of the new algorithm is that it means a
>> lower tier level resulting in fewer boxes, but more documents inside those
>> boxes that are outside of the search radius.
>> >
>> > While having fewer boxes means fewer term queries to make against the
>> index, more documents means more costly calculations to filter out those
>> extraneous documents.
>> >
>> > For those doing just Cartesian Tier filtering it seems like the new
>> approach is a win, but for those doing distance calculations on those
>> documents passing the filter, it seems to come at a cost.
>>
>> Currently, this is only used for filtering.  AIUI, Tiers aren't really
>> that useful for distance calculations, are they?  After all, all you have is
>> a box id and you'd have to reverse out the calc of that to be able to calc a
>> distance, no?  Perhaps I'm missing something.
>>
>>
> How Spatial Lucene currently works (or at least one of the ways it was
> designed to work), is using a 2 step filtering process.  Step 1 is the
> Cartesian Tier filtering.  The resulting set of Documents is then passed on
> through to Step 2 which then calculates the distance from each Document to
> the search centre.  If the distance is greater than the radius, the Document
> is filtered out.  This means that after both filtering steps you have only
> those Documents that are in the search radius.
>
> How this impacts this algorithm choice is that the more Documents the pass
> through Step 1, the more calculations that have to be done in Step 2.
>
>
>> I'm not sure, however, that it is a win for filtering.  It seems like you
>> end up including docs in the result set that should be in there.
>>
>> I'll wait for Nicolas' summary table, but I'm inclined to revert and then
>> someone can refactor if they want to offer alternate implementations.
>>
>> -Grant
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>
>
> --
> Chris Male | Software Developer | JTeam BV.| www.jteam.nl
>


Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Chris Male
On Wed, Apr 14, 2010 at 6:24 PM, Yonik Seeley wrote:

> On Wed, Apr 14, 2010 at 12:12 PM, Chris Male  wrote:
> > On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll 
> >> On Apr 14, 2010, at 11:06 AM, Chris Male wrote:
> >> > For those doing just Cartesian Tier filtering it seems like the new
> >> > approach is a win, but for those doing distance calculations on those
> >> > documents passing the filter, it seems to come at a cost.
> >>
> >> Currently, this is only used for filtering.  AIUI, Tiers aren't really
> >> that useful for distance calculations, are they?  After all, all you
> have is
> >> a box id and you'd have to reverse out the calc of that to be able to
> calc a
> >> distance, no?  Perhaps I'm missing something.
> >>
> >
> > How Spatial Lucene currently works (or at least one of the ways it was
> > designed to work), is using a 2 step filtering process.  Step 1 is the
> > Cartesian Tier filtering.  The resulting set of Documents is then passed
> on
> > through to Step 2 which then calculates the distance from each Document
> to
> > the search centre.
>
> IMO, being able to just do a tier or bounding box  filter is also
> useful (step 1).
> One example is if someone is going to sort by distance anyway... they
> may want to do only a bounding-box type filter for greater
> performance.
>
> We should keep both concepts (bounding box filter and distance filter)
> regardless of how the distance filter is implemented.
>

Definitely.


>
> -Yonik
> Apache Lucene Eurocon 2010
> 18-21 May 2010 | Prague
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl


Re: Proposal about Version API "relaxation"

2010-04-14 Thread Marvin Humphrey
On Wed, Apr 14, 2010 at 08:30:14AM -0400, Grant Ingersoll wrote:
> The thing I keep going back to is that somehow Lucene has managed for years
> (and I mean lots of years) w/o stuff like Version and all this massive back
> compatibility checking.

Non-constant global variables are an anti-pattern.  Having a non-constant
global determine library behavior which results in silent failure (search
results degrade subtly, as opposed to e.g. an exception being thrown) is a
particularly insidious anti-pattern. 

In the Perl world, where modules are very heavily used thanks to CPAN, you're
more likely to come across the action-at-a-distance bugs spawned by this
anti-pattern.  I have direct experience debugging such usage of global vars.
It is extremely costly and frustrating.

For instance, there was one time when some module set the global variable
$YAML::Syck::ImplicitUnicode to a true value.  Whether or not that module was
loaded affected how YAML::Syck's Load() function would interpret character
data in completely unrelated portions of the code.  As with subtly degraded
search results, the result was silent failure (incorrect text stored in a
database).  It took many hours to hunt down what was going wrong because the
code that was causing the problem was nowhere near the code where the problem
manifested.  The authors of the affected code had done nothing wrong, aside
from using a poorly designed module like YAML::Syck.

I am strongly opposed to using a global variable for versioning because I do
not wish to impose such maddening debugging sessions on a handful of unlucky
duckies who have done nothing wrong other than to choose Lucene as their
search engine library.  

This shouldn't be controversial.  The temptations of global variables are
obvious, but their flaws are well understood:

http://www.google.com/search?q=global+variables+evil

It is to be expected that the global would work most of the time.  This design
flaw, by nature, disproportionately afflicts a small number of users with
action-at-a-distance bugs.  Knowingly choosing to impose such costs on a
random few is deeply unfair.

> I also am not sure whether it in the past we just missed/ignored more back
> compatibility issues or whether now we are creating more back compat. issues
> due to more rapid change.  

It would be hard to search the archives to confirm my recollection, but I seem
to remember back compat for Analyzers coming up every once in a while -- say,
in the context of modifying StandardAnalyzer's stoplist -- and changes not
being made because they would change search results.

Marvin Humphrey


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Yonik Seeley
On Wed, Apr 14, 2010 at 12:12 PM, Chris Male  wrote:
> On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll 
>> On Apr 14, 2010, at 11:06 AM, Chris Male wrote:
>> > For those doing just Cartesian Tier filtering it seems like the new
>> > approach is a win, but for those doing distance calculations on those
>> > documents passing the filter, it seems to come at a cost.
>>
>> Currently, this is only used for filtering.  AIUI, Tiers aren't really
>> that useful for distance calculations, are they?  After all, all you have is
>> a box id and you'd have to reverse out the calc of that to be able to calc a
>> distance, no?  Perhaps I'm missing something.
>>
>
> How Spatial Lucene currently works (or at least one of the ways it was
> designed to work), is using a 2 step filtering process.  Step 1 is the
> Cartesian Tier filtering.  The resulting set of Documents is then passed on
> through to Step 2 which then calculates the distance from each Document to
> the search centre.

IMO, being able to just do a tier or bounding box  filter is also
useful (step 1).
One example is if someone is going to sort by distance anyway... they
may want to do only a bounding-box type filter for greater
performance.

We should keep both concepts (bounding box filter and distance filter)
regardless of how the distance filter is implemented.

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Chris Male
Hi,

On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll wrote:

>
> On Apr 14, 2010, at 11:06 AM, Chris Male wrote:
>
> > Hi,
> >
> > My understanding of the benefits of the new algorithm is that it means a
> lower tier level resulting in fewer boxes, but more documents inside those
> boxes that are outside of the search radius.
> >
> > While having fewer boxes means fewer term queries to make against the
> index, more documents means more costly calculations to filter out those
> extraneous documents.
> >
> > For those doing just Cartesian Tier filtering it seems like the new
> approach is a win, but for those doing distance calculations on those
> documents passing the filter, it seems to come at a cost.
>
> Currently, this is only used for filtering.  AIUI, Tiers aren't really that
> useful for distance calculations, are they?  After all, all you have is a
> box id and you'd have to reverse out the calc of that to be able to calc a
> distance, no?  Perhaps I'm missing something.
>
>
How Spatial Lucene currently works (or at least one of the ways it was
designed to work), is using a 2 step filtering process.  Step 1 is the
Cartesian Tier filtering.  The resulting set of Documents is then passed on
through to Step 2 which then calculates the distance from each Document to
the search centre.  If the distance is greater than the radius, the Document
is filtered out.  This means that after both filtering steps you have only
those Documents that are in the search radius.

How this impacts the algorithm choice is that the more Documents that pass
through Step 1, the more calculations have to be done in Step 2.
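
A rough sketch of that two-step flow (hypothetical class and method names, not 
the actual contrib/spatial code; it assumes Step 1 has already produced the 
candidate doc ids and that per-document lat/lon values are available):

{code}
import java.util.ArrayList;
import java.util.List;

class TwoStepSpatialFilter {
  // Step 2: drop candidates that passed the coarse tier filter but lie outside the radius.
  static List<Integer> filterByDistance(List<Integer> tierCandidates,
                                        double[] docLat, double[] docLon,
                                        double centerLat, double centerLon,
                                        double radiusMiles) {
    List<Integer> hits = new ArrayList<Integer>();
    for (int doc : tierCandidates) {
      if (haversineMiles(centerLat, centerLon, docLat[doc], docLon[doc]) <= radiusMiles) {
        hits.add(doc);
      }
    }
    return hits;
  }

  // great-circle distance in miles (assumes a mean earth radius of 3958.76 miles)
  static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
        + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
          * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * 3958.76 * Math.asin(Math.sqrt(a));
  }
}
{code}

The cost in question is the per-candidate distance call in Step 2, which is why a 
looser Step 1 (bigger tiles, more candidates) translates directly into more work here.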


> I'm not sure, however, that it is a win for filtering.  It seems like you
> end up including docs in the result set that should be in there.
>
> I'll wait for Nicolas' summary table, but I'm inclined to revert and then
> someone can refactor if they want to offer alternate implementations.
>
> -Grant
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl


Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Grant Ingersoll

On Apr 14, 2010, at 11:28 AM, Helleringer, Nicolas wrote:

>> That minTile param allows you to trade off between filtering accuracy
>> and faster tile filtering.  Without the param (or until it can be
>> implemented) the correct approach seems like the above, without a
>> minTile.  This sounds to me like the old approach is correct.
>
> minTier and maxTier at CartesianTierPlotter  level have been commited into 
> the trunk today : see https://issues.apache.org/jira/browse/LUCENE-2184

Actually, I changed those, as the CartTierPlotter doesn't need them.

> 
> Regards,
> 
> Nicolas 
>  
> 
> -Yonik
> Apache Lucene Eurocon 2010
> 18-21 May 2010 | Prague
> 
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
> 
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Grant Ingersoll

On Apr 14, 2010, at 11:06 AM, Chris Male wrote:

> Hi,
> 
> My understanding of the benefits of the new algorithm is that it means a 
> lower tier level resulting in fewer boxes, but more documents inside those 
> boxes that are outside of the search radius.
> 
> While having fewer boxes means fewer term queries to make against the index, 
> more documents means more costly calculations to filter out those extraneous 
> documents.
> 
> For those doing just Cartesian Tier filtering it seems like the new approach 
> is a win, but for those doing distance calculations on those documents 
> passing the filter, it seems to come at a cost.

Currently, this is only used for filtering.  AIUI, Tiers aren't really that 
useful for distance calculations, are they?  After all, all you have is a box 
id and you'd have to reverse out the calc of that to be able to calc a 
distance, no?  Perhaps I'm missing something.

I'm not sure, however, that it is a win for filtering.  It seems like you end 
up including docs in the result set that shouldn't be in there.

I'll wait for Nicolas' summary table, but I'm inclined to revert and then 
someone can refactor if they want to offer alternate implementations.

-Grant
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Helleringer, Nicolas
>
> That minTile param allows you to trade off between filtering accuracy
> and faster tile filtering.  Without the param (or until it can be
> implemented) the correct approach seems like the above, without a
> minTile.  This sounds to me like the old approach is correct.
>
minTier and maxTier at CartesianTierPlotter level have been committed into
the trunk today: see https://issues.apache.org/jira/browse/LUCENE-2184

Regards,

Nicolas


>
> -Yonik
> Apache Lucene Eurocon 2010
> 18-21 May 2010 | Prague
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856929#action_12856929
 ] 

Nicolas Helleringer commented on LUCENE-2359:
-

What my code does:

It looks at how many times the search diameter (2.0d * range) fits into the 
distance that will be split into longitude ranges (i.e. 
distanceUnit.earthCircumference()).

It then takes the deepest Tier level whose range is still just above the search 
diameter (int bestFit = (int) Math.ceil(log2(times));)

This way you'll get the best compromise between fetching the fewest boxes and 
not fetching boxes that are too big, with too many documents in them.
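
In plain Java that calculation is roughly (a sketch, not the committed code; 
log2 is spelled out because java.lang.Math has no log2):

{code}
// times = how often the search diameter (2.0d * range) fits into the earth circumference
static int bestFit(double rangeMiles, double earthCircumferenceMiles) {
  double times = earthCircumferenceMiles / (2.0d * rangeMiles);
  return (int) Math.ceil(Math.log(times) / Math.log(2.0d)); // ceil(log2(times))
}
{code}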

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Helleringer, Nicolas
I'll try to find a little bit of time tonight to run a sample data set through 
the two calculations to see the differences.
I'll make a summary table.

I'll comment on the issue right now with some notes on 'my' version of the 
algorithm.

Nicolas

2010/4/14 Chris Male 

> Hi,
>
> My understanding of the benefits of the new algorithm is that it means a
> lower tier level resulting in fewer boxes, but more documents inside those
> boxes that are outside of the search radius.
>
> While having fewer boxes means fewer term queries to make against the
> index, more documents means more costly calculations to filter out those
> extraneous documents.
>
> For those doing just Cartesian Tier filtering it seems like the new
> approach is a win, but for those doing distance calculations on those
> documents passing the filter, it seems to come at a cost.
>
> Cheers
> Chris
>
>
> On Wed, Apr 14, 2010 at 5:00 PM, Grant Ingersoll wrote:
>
>> LUCENE-2359 changed the best fit calculation.  I admit, I'm not entirely
>> certain which one is right, so I thought we should step back and talk about
>> what we are trying to achieve.
>>
>> Please correct me if/where I am wrong.
>>
>> Looking at the problem of tiers/tiles/grids in general, we are taking a
>> sphere, projecting it into a 2D plane.  Next, we are dividing up the plane
>> into nested grids/tiers.  Each tier contains 2^tier id boxes.  Thus, tier
>> level 2 divides the earth up into 4 boxes.  2^15 = 32,768 boxes.  We then,
>> for each box, give it a unique label which then becomes the token that we
>> index.  During indexing, we typically will index many tiers, i.e. tiers 4
>> through 15.
>>
>> During search, we take in a lat/lon and a radius.  The goal is to do a
>> search using the fewest terms possible.  Thus, we need to pick the tier that
>> contains/covers the radius with the fewest number of boxes so that we can
>> enumerate a very small number of documents.  Thus, we need to calculate the
>> best fit, which is a method inside of the CartesianTierPlotter.
>>
>> In the old way, we did:
>>
>> bf = min( 15, ceil(log2(  earth_circumference /  ( ( miles/2) - sqrt(
>> (miles/2)^2  /  2 ) ) ) + 1 )   // we won't go higher than 15 for accuracy
>> reasons
>>
>> The new way is:
>>
>> bf' = ceil ( log2( earth_circumference / ( 2 * miles ) ) )
>>
>> These are obviously two different calculations, never mind the min(15)
>> issue, we can easily resolve that one.
>>
>> AFAICT, the new way is much less accurate, but will likely be faster.
>>
>> So, which is right?
>>
>> Unfortunately, I find almost zero documentation on this, probably b/c the
>> nomenclature is off, but...
>>
>> -Grant
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>>
>
>
> --
> Chris Male | Software Developer | JTeam BV.| www.jteam.nl
>


Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Yonik Seeley
On Wed, Apr 14, 2010 at 11:06 AM, Chris Male  wrote:
> While having fewer boxes means fewer term queries to make against the index,
> more documents means more costly calculations to filter out those extraneous
> documents.

Filtering out documents (greater selectivity) seems like it should be
the primary goal.
But perhaps the problem could be parameterized?  What if you gave a
minimum tile size that you wanted to use?  Then there would be only
one correct answer I believe?

So the problem would essentially be boiled down to this:
- find the minimum area that still encompasses the entire circle
- minimize the number of tiles
- don't use tiles smaller than minTile

That minTile param allows you to trade off between filtering accuracy
and faster tile filtering.  Without the param (or until it can be
implemented) the correct approach seems like the above, without a
minTile.  This sounds to me like the old approach is correct.


-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [SPATIAL] Best Fit Calculation

2010-04-14 Thread Chris Male
Hi,

My understanding of the benefits of the new algorithm is that it means a
lower tier level resulting in fewer boxes, but more documents inside those
boxes that are outside of the search radius.

While having fewer boxes means fewer term queries to make against the index,
more documents means more costly calculations to filter out those extraneous
documents.

For those doing just Cartesian Tier filtering it seems like the new approach
is a win, but for those doing distance calculations on those documents
passing the filter, it seems to come at a cost.

Cheers
Chris

On Wed, Apr 14, 2010 at 5:00 PM, Grant Ingersoll wrote:

> LUCENE-2359 changed the best fit calculation.  I admit, I'm not entirely
> certain which one is right, so I thought we should step back and talk about
> what we are trying to achieve.
>
> Please correct me if/where I am wrong.
>
> Looking at the problem of tiers/tiles/grids in general, we are taking a
> sphere, projecting it into a 2D plane.  Next, we are dividing up the plane
> into nested grids/tiers.  Each tier contains 2^tier id boxes.  Thus, tier
> level 2 divides the earth up into 4 boxes.  2^15 = 32,768 boxes.  We then,
> for each box, give it a unique label which then becomes the token that we
> index.  During indexing, we typically will index many tiers, i.e. tiers 4
> through 15.
>
> During search, we take in a lat/lon and a radius.  The goal is to do a
> search using the fewest terms possible.  Thus, we need to pick the tier that
> contains/covers the radius with the fewest number of boxes so that we can
> enumerate a very small number of documents.  Thus, we need to calculate the
> best fit, which is a method inside of the CartesianTierPlotter.
>
> In the old way, we did:
>
> bf = min( 15, ceil(log2(  earth_circumference /  ( ( miles/2) - sqrt(
> (miles/2)^2  /  2 ) ) ) + 1 )   // we won't go higher than 15 for accuracy
> reasons
>
> The new way is:
>
> bf' = ceil ( log2( earth_circumference / ( 2 * miles ) ) )
>
> These are obviously two different calculations, never mind the min(15)
> issue, we can easily resolve that one.
>
> AFAICT, the new way is much less accurate, but will likely be faster.
>
> So, which is right?
>
> Unfortunately, I find almost zero documentation on this, probably b/c the
> nomenclature is off, but...
>
> -Grant
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Chris Male | Software Developer | JTeam BV.| www.jteam.nl


[SPATIAL] Best Fit Calculation

2010-04-14 Thread Grant Ingersoll
LUCENE-2359 changed the best fit calculation.  I admit, I'm not entirely 
certain which one is right, so I thought we should step back and talk about 
what we are trying to achieve.

Please correct me if/where I am wrong.

Looking at the problem of tiers/tiles/grids in general, we are taking a sphere, 
projecting it into a 2D plane.  Next, we are dividing up the plane into nested 
grids/tiers.  Each tier contains 2^tier id boxes.  Thus, tier level 2 divides 
the earth up into 4 boxes.  2^15 = 32,768 boxes.  We then, for each box, give 
it a unique label which then becomes the token that we index.  During indexing, 
we typically will index many tiers, i.e. tiers 4 through 15.

During search, we take in a lat/lon and a radius.  The goal is to do a search 
using the fewest terms possible.  Thus, we need to pick the tier that 
contains/covers the radius with the fewest number of boxes so that we can 
enumerate a very small number of documents.  Thus, we need to calculate the 
best fit, which is a method inside of the CartesianTierPlotter.

In the old way, we did:

bf = min( 15, ceil(log2(  earth_circumference /  ( ( miles/2) - sqrt( 
(miles/2)^2  /  2 ) ) ) + 1 )   // we won't go higher than 15 for accuracy 
reasons

The new way is:

bf' = ceil ( log2( earth_circumference / ( 2 * miles ) ) )

These are obviously two different calculations, never mind the min(15) issue, 
we can easily resolve that one.

AFAICT, the new way is much less accurate, but will likely be faster.
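
Read literally, the two calculations look roughly like this in Java (a sketch for 
comparison only, not the committed CartesianTierPlotter code; the parentheses in 
the old formula as written above don't quite balance, so the placement of the 
"+ 1" below is a guess):

{code}
public class BestFitComparison {
  static final double EARTH_CIRCUMFERENCE = 24902.0; // miles

  static double log2(double x) {
    return Math.log(x) / Math.log(2.0);
  }

  // old way: bf = min(15, ceil(log2(C / ((miles/2) - sqrt((miles/2)^2 / 2)))) + 1)
  static int bestFitOld(double miles) {
    double half = miles / 2.0;
    double corner = half - Math.sqrt(half * half / 2.0);
    return (int) Math.min(15, Math.ceil(log2(EARTH_CIRCUMFERENCE / corner)) + 1);
  }

  // new way: bf' = ceil(log2(C / (2 * miles)))
  static int bestFitNew(double miles) {
    return (int) Math.ceil(log2(EARTH_CIRCUMFERENCE / (2.0 * miles)));
  }

  public static void main(String[] args) {
    for (double miles : new double[] {1, 10, 100, 1000}) {
      System.out.println(miles + " mi -> old: " + bestFitOld(miles) + ", new: " + bestFitNew(miles));
    }
  }
}
{code}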

So, which is right?  

Unfortunately, I find almost zero documentation on this, probably b/c the 
nomenclature is off, but...

-Grant
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



pseudo document ids in my own indexreader/writer

2010-04-14 Thread Thomas Koch
Hi,

there are currently two projects, porting lucandra to HBase:
http://github.com/akkumar/hbasene
http://github.com/thkoch2001/lucehbase

hbasene stores a unique integer with each stored document, while lucehbase 
directly stores the user's primary key in the termVector table. Every 
lucehbase indexreader creates an internal map of integers to primary keys in 
ram.
This means with lucehbase lucene will see new document ids with every new 
indexreader, while the document ids remain constant with hbasene.
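
A rough sketch of that per-reader mapping (hypothetical code, not taken from 
either project):

{code}
import java.util.List;

class DocIdMapping {
  private final String[] docIdToPrimaryKey; // index = Lucene doc id for THIS reader only

  DocIdMapping(List<String> primaryKeysInIndexOrder) {
    this.docIdToPrimaryKey = primaryKeysInIndexOrder.toArray(new String[0]);
  }

  String primaryKey(int docId) {
    return docIdToPrimaryKey[docId];
  }

  // Because the array is rebuilt for every new reader, doc ids are only stable
  // for the lifetime of that reader -- which is what the discussions below suggest
  // Lucene itself requires.
}
{code}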

Do you see any problems with the approach of lucehbase? I found the following 
discussions, which seem to prove my point that there isn't any problem:

http://www.mail-archive.com/java-u...@lucene.apache.org/msg01665.html
http://www.mail-archive.com/java-u...@lucene.apache.org/msg12172.html
http://www.mail-archive.com/lucene-net-...@incubator.apache.org/msg00298.html
http://www.mail-archive.com/lucene-...@jakarta.apache.org/msg06165.html

Best regards,

Thomas Koch, http://www.koch.ro

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856917#action_12856917
 ] 

Shai Erera commented on LUCENE-2159:


bq. There is an excellent section on it in LIA2

Indeed !

Ok so to create a task, you just extend PerfTask. You can look under 
contrib/benchmark/src/java/o.a.l/benchmark/byTask/tasks for many examples. 
OptimizeTask seems relevant here (i.e. it calls an IW API and receives a 
parameter).

For writing .alg files, that's SUPER simple, just look under 
contrib/benchmark/conf for many existing examples. You can post a patch once 
you feel comfortable enough with it and I can help you with the struggles (if 
you'll run into any). Another great source (besides LIA2) on writing .alg files 
is the package.html under 
contrib/benchmark/src/java/org/apache/lucene/benchmark/byTask.

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Yonik Seeley
On Wed, Apr 14, 2010 at 10:39 AM, DM Smith  wrote:
> Maybe have the index store the version(s) and use that when constructing a
> reader or writer?

That would cause a reindex to change behavior (among other problems).

-Yonik
Apache Lucene Eurocon 2010
18-21 May 2010 | Prague

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856916#action_12856916
 ] 

Mark Miller commented on LUCENE-2159:
-

There is an excellent section on it in LIA2 :)

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread DM Smith

On 04/14/2010 09:13 AM, Robert Muir wrote:
Its not sidetracked at all. there seem to be more compelling 
alternatives to achieve the same thing, so we should consider 
alternative solutions, too.
Maybe have the index store the version(s) and use that when constructing 
a reader or writer?
Given enough minor releases, it is likely that different analyzers would 
use different versions. So each feature would need to be represented.




On Wed, Apr 14, 2010 at 8:54 AM, Earwin Burrfoot wrote:


The thread somehow got sidetracked. So, let's get this carriage back
on its rails?

Let me remind - we have an API on hands that is mandatory and tends to
be cumbersome.
Proposed solution does indeed have ultrascary word "static" in it. But
if you brace yourself and look closer - the use of said static is
opt-in and heavily guarded.
So even a long-standing hater of everything static like me is tempted.


On Wed, Apr 14, 2010 at 16:30, Grant Ingersoll (gsing...@apache.org) wrote:
>
> On Apr 14, 2010, at 12:49 AM, Robert Muir wrote:
>
>>
>> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey (mar...@rectangular.com) wrote:
>> New class names would work, too.
>>
>> I only mention that for the sake of completeness, though --
it's not a
>> suggestion.
>>
>> Right, to me this is just as bad.
>> In my eyes, the Version thing really shows the problem with the
analysis stuff:
>> * Used by QueryParsers, etc at search and index time, with no
real clean way to do back-compat
>> * Concepts like Version and class-naming push some of the
burden to the user: users decide the back-compat level, but it
still leaves devs with back-compat management hassle.
>>
>> The idea of having a real versioned-module is the same as
Version and class-naming, except it both pushes the burden to the
user in a more natural way (people are used to versioned jar files
and things like that... not Version constants), and it relieves
devs of the back compat
>>
>> In all honesty with the current scheme, release schedules of
Lucene, and Lucene's policy, the analysis stuff will soon deadlock
into being nearly unmaintainable, and to many users, the API is
already unconsumable: its difficult to write reusable analyzers
due to historical relics in the API, methods are named
inappropriately, e.g. Tokenizer.reset(Reader) and
TokenStream.reset(), they don't understand Version, and probably a
few other things I am forgetting that are basically impossible to
fix right now with the current state of affairs.
>
>
> The thing I keep going back to is that somehow Lucene has
managed for years (and I mean lots of years) w/o stuff like
Version and all this massive back compatibility checking.  I'm
still undecided as to whether that is a good thing or not.  I also
am not sure whether it in the past we just missed/ignored more
back compatibility issues or whether now we are creating more back
compat. issues due to more rapid change.  I agree, though, that
all of this stuff is making it harder and harder to develop (and I
don't mean for us committers, I mean for end consumers.)
>
> I also agree about Robert's point about the incorrectness of
naming something 3.0 versus 3.1 when 3.1 is the thing that has all
the new features and is really the "major" release.
>
> -Grant
>
-
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org

> For additional commands, e-mail: java-dev-h...@lucene.apache.org

>
>



--
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org

For additional commands, e-mail: java-dev-h...@lucene.apache.org





--
Robert Muir
rcm...@gmail.com 




[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856913#action_12856913
 ] 

John Wang commented on LUCENE-2159:
---

Yeah, that sounds great!
I will need to learn how to write .alg files :)

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856911#action_12856911
 ] 

Shai Erera commented on LUCENE-2159:


Which is fine - I think this would be a neat task to add to benchmark, w/ 
specific documentation on how to use it and for what purposes. If you can also 
write a sample .alg file which e.g. creates a small index and then Expand it, 
that'd be great.

I've looked at the different PerfTask implementations in benchmark, and I'm 
wondering whether we should perhaps do the following:
* Create an AddIndexesTask which receives one or more Directories as input and 
calls writer.addIndexesNoOptimize (a rough sketch follows below)
* If one wants, an OptimizeTask call can be added afterwards.
* Write an expandIndex.alg which initially creates an index of size N from one 
content source and then calls the AddIndexesTask several times. The .alg file 
is meant to be an example; people can change it to create bigger or smaller 
indexes, use other content sources and switch between RAM/FS directories.

How's that sound?
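
A rough sketch of such a task (hypothetical; it assumes the contrib/benchmark 
PerfTask/PerfRunData API and the 3.x addIndexesNoOptimize signature, with a 
hard-coded source path just for illustration):

{code}
import java.io.File;

import org.apache.lucene.benchmark.byTask.PerfRunData;
import org.apache.lucene.benchmark.byTask.tasks.PerfTask;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class AddIndexesTask extends PerfTask {

  public AddIndexesTask(PerfRunData runData) {
    super(runData);
  }

  @Override
  public int doLogic() throws Exception {
    IndexWriter writer = getRunData().getIndexWriter();
    // add the same source index again, roughly doubling the number of docs/segments
    Directory input = FSDirectory.open(new File("work/source-index")); // hypothetical path
    writer.addIndexesNoOptimize(new Directory[] { input });
    return 1; // number of work units done
  }
}
{code}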

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856908#action_12856908
 ] 

John Wang commented on LUCENE-2159:
---

Shai:

I am just stating our experiences. I am not commenting on how it should affect 
the benchmark proposal at all.

Whether it should be in bench or contrib/misc, this would be a call for the 
committers.

Thanks

-John

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2359) CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian

2010-04-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856902#action_12856902
 ] 

Grant Ingersoll commented on LUCENE-2359:
-

Nicolas,

Why the change in the best fit algorithm?  Do you have a reference to 
calculation of this?

> CartesianPolyFilterBuilder doesn't handle edge case around the 180 meridian
> ---
>
> Key: LUCENE-2359
> URL: https://issues.apache.org/jira/browse/LUCENE-2359
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 2.9.2, 3.0, 3.0.1
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: LUCENE-2359.patch, LUCENE-2359.patch, LUCENE-2359.patch, 
> TEST-2359.patch
>
>
> Test case:  
> Points all around the globe, plus two points at 0, 179.9 and 0,-179.9 (on 
> each side of the meridian).  Then, do a Cartesian Tier filter on a point 
> right near those two.  It will return all the points when it should just 
> return those two.
> The flawed logic is in the else clause below:
> {code}
> if (longX2 != 0.0) {
>   //We are around the prime meridian
>   if (longX == 0.0) {
>   longX = longX2;
>   longY = 0.0;
>   shape = getShapeLoop(shape,ctp,latX,longX,latY,longY);
>   } else {//we are around the 180th longitude
>   longX = longX2;
>   longY = -180.0;
>   shape = getShapeLoop(shape,ctp,latY,longY,latX,longX);
>   }
> {code}
> Basically, the Y and X values are transposed.  This currently says go from 
> longY (-180) all the way around  to longX which is the lower left longitude 
> of the box formed.  Instead, it should go from the lower left long to -180.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



LatLng rework

2010-04-14 Thread Helleringer, Nicolas
Hi,

I will be working on the LatLng point implementation now that LUCENE-2359,
LUCENE-2366, LUCENE-2367, LUCENE-1777 and LUCENE-1921 are ok.

I will propose small patches inside LUCENE-1934 but it will solve also part
or total of LUCENE-2149 and LUCENE-2148.

Next step after that will be to address (finally) LUCENE-1930 in current
code state.

Hope it helps.

Nicolas


[jira] Commented: (LUCENE-1777) Error on distance query where miles < 1.0

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856887#action_12856887
 ] 

Nicolas Helleringer commented on LUCENE-1777:
-

Can someone confirm my analysis and mark this issue as resolved ?

> Error on distance query where miles < 1.0
> -
>
> Key: LUCENE-1777
> URL: https://issues.apache.org/jira/browse/LUCENE-1777
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9
>Reporter: Glen Stampoultzis
>Assignee: Chris Male
> Attachments: LUCENE-1777.patch
>
>
> If miles is under 1.0 distance query will break.
> To reproduce modify the file
> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/spatial/src/test/org/apache/lucene/spatial/tier/TestCartesian.java?revision=794721
> And set the line:
> final double miles = 6.0;
> to 
> final double miles = 0.5;

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1921) Absurdly large radius (miles) search fails to include entire earth

2010-04-14 Thread Nicolas Helleringer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856885#action_12856885
 ] 

Nicolas Helleringer commented on LUCENE-1921:
-

I did re-check after Grant's commit of LUCENE-2184.2.patch with TEST-1921.patch 
and this issue is now resolved.

To be counter-tested by someone else and set to resolved status: Grant, if you 
read me =)

> Absurdly large radius (miles) search fails to include entire earth
> --
>
> Key: LUCENE-1921
> URL: https://issues.apache.org/jira/browse/LUCENE-1921
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Chris Male
>Priority: Minor
> Fix For: 3.1
>
> Attachments: TEST-1921.patch
>
>
> Spinoff from LUCENE-1781.
> If you do a very large (eg 10 miles) radius search then the
> lat/lng bound box wraps around the entire earth and all points should
> be accepted.  But this fails today (many points are rejected).  It's
> easy to see the issue: edit TestCartesian, and insert a very large
> miles into either testRange or testGeoHashRange.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Robert Muir
Its not sidetracked at all. there seem to be more compelling alternatives to
achieve the same thing, so we should consider alternative solutions, too.

On Wed, Apr 14, 2010 at 8:54 AM, Earwin Burrfoot  wrote:

> The thread somehow got sidetracked. So, let's get this carriage back
> on its rails?
>
> Let me remind - we have an API on hands that is mandatory and tends to
> be cumbersome.
> Proposed solution does indeed have ultrascary word "static" in it. But
> if you brace yourself and look closer - the use of said static is
> opt-in and heavily guarded.
> So even a long-standing hater of everything static like me is tempted.
>
>
> On Wed, Apr 14, 2010 at 16:30, Grant Ingersoll 
> wrote:
> >
> > On Apr 14, 2010, at 12:49 AM, Robert Muir wrote:
> >
> >>
> >> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey <
> mar...@rectangular.com> wrote:
> >> New class names would work, too.
> >>
> >> I only mention that for the sake of completeness, though -- it's not a
> >> suggestion.
> >>
> >> Right, to me this is just as bad.
> >> In my eyes, the Version thing really shows the problem with the analysis
> stuff:
> >> * Used by QueryParsers, etc at search and index time, with no real clean
> way to do back-compat
> >> * Concepts like Version and class-naming push some of the burden to the
> user: users decide the back-compat level, but it still leaves devs with
> back-compat management hassle.
> >>
> >> The idea of having a real versioned-module is the same as Version and
> class-naming, except it both pushes the burden to the user in a more natural
> way (people are used to versioned jar files and things like that... not
> Version constants), and it relieves devs of the back compat
> >>
> >> In all honesty with the current scheme, release schedules of Lucene, and
> Lucene's policy, the analysis stuff will soon deadlock into being nearly
> unmaintainable, and to many users, the API is already unconsumable: its
> difficult to write reusable analyzers due to historical relics in the API,
> methods are named inappropriately, e.g. Tokenizer.reset(Reader) and
> TokenStream.reset(), they don't understand Version, and probably a few other
> things I am forgetting that are basically impossible to fix right now with
> the current state of affairs.
> >
> >
> > The thing I keep going back to is that somehow Lucene has managed for
> years (and I mean lots of years) w/o stuff like Version and all this massive
> back compatibility checking.  I'm still undecided as to whether that is a
> good thing or not.  I also am not sure whether it in the past we just
> missed/ignored more back compatibility issues or whether now we are creating
> more back compat. issues due to more rapid change.  I agree, though, that
> all of this stuff is making it harder and harder to develop (and I don't
> mean for us committers, I mean for end consumers.)
> >
> > I also agree about Robert's point about the incorrectness of naming
> something 3.0 versus 3.1 when 3.1 is the thing that has all the new features
> and is really the "major" release.
> >
> > -Grant
> > -
> > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-dev-h...@lucene.apache.org
> >
> >
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


-- 
Robert Muir
rcm...@gmail.com


[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856877#action_12856877
 ] 

Shai Erera commented on LUCENE-2159:


bq. I understand having a general performance suite to test regression is a 
good thing. But we found having a more focused test for segmentation and merge 
is important.

Are you saying that because of the benchmark proposal? I still think that an 
ExpandIndexTask would be useful in benchmark and fits better there than in 
contrib/misc. We can ship that task together w/ a predefined .alg for using it 
...

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Earwin Burrfoot
The thread somehow got sidetracked. So, let's get this carriage back
on its rails?

Let me remind you - we have an API on our hands that is mandatory and
tends to be cumbersome.
The proposed solution does indeed have the ultra-scary word "static" in
it. But if you brace yourself and look closer - the use of said static
is opt-in and heavily guarded.
So even a long-standing hater of everything static, like me, is tempted.
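
For context, here is a minimal sketch of the mandatory API under discussion,
as it looks to a Lucene 3.0 user; the analyzer, parser and field name are
only illustrative:

{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class VersionExample {
  public static void main(String[] args) throws Exception {
    // Every analyzer/QueryParser call site must pick a back-compat level up front.
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    QueryParser parser = new QueryParser(Version.LUCENE_30, "body", analyzer);
    Query q = parser.parse("version api relaxation");
    System.out.println(q);
  }
}
{code}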


On Wed, Apr 14, 2010 at 16:30, Grant Ingersoll  wrote:
>
> On Apr 14, 2010, at 12:49 AM, Robert Muir wrote:
>
>>
>> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey  
>> wrote:
>> New class names would work, too.
>>
>> I only mention that for the sake of completeness, though -- it's not a
>> suggestion.
>>
>> Right, to me this is just as bad.
>> In my eyes, the Version thing really shows the problem with the analysis 
>> stuff:
>> * Used by QueryParsers, etc at search and index time, with no real clean way 
>> to do back-compat
>> * Concepts like Version and class-naming push some of the burden to the 
>> user: users decide the back-compat level, but it still leaves devs with 
>> back-compat management hassle.
>>
>> The idea of having a real versioned-module is the same as Version and 
>> class-naming, except it both pushes the burden to the user in a more natural 
>> way (people are used to versioned jar files and things like that... not 
>> Version constants), and it relieves devs of the back compat
>>
>> In all honesty with the current scheme, release schedules of Lucene, and 
>> Lucene's policy, the analysis stuff will soon deadlock into being nearly 
>> unmaintainable, and to many users, the API is already unconsumable: its 
>> difficult to write reusable analyzers due to historical relics in the API, 
>> methods are named inappropriately, e.g. Tokenizer.reset(Reader) and 
>> TokenStream.reset(), they don't understand Version, and probably a few other 
>> things I am forgetting that are basically impossible to fix right now with 
>> the current state of affairs.
>
>
> The thing I keep going back to is that somehow Lucene has managed for years 
> (and I mean lots of years) w/o stuff like Version and all this massive back 
> compatibility checking.  I'm still undecided as to whether that is a good 
> thing or not.  I also am not sure whether it in the past we just 
> missed/ignored more back compatibility issues or whether now we are creating 
> more back compat. issues due to more rapid change.  I agree, though, that all 
> of this stuff is making it harder and harder to develop (and I don't mean for 
> us committers, I mean for end consumers.)
>
> I also agree about Robert's point about the incorrectness of naming something 
> 3.0 versus 3.1 when 3.1 is the thing that has all the new features and is 
> really the "major" release.
>
> -Grant
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread John Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856869#action_12856869
 ] 

John Wang commented on LUCENE-2159:
---

Shai:

  You are right - we found this tool useful for testing the performance 
implications of index segmentation. I understand that having a general 
performance suite to test regressions is a good thing, but we found that 
having a more focused test for segmentation and merging is important.

-John

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Proposal about Version API "relaxation"

2010-04-14 Thread Grant Ingersoll

On Apr 14, 2010, at 12:49 AM, Robert Muir wrote:

> 
> On Wed, Apr 14, 2010 at 12:06 AM, Marvin Humphrey  
> wrote:
> New class names would work, too.
> 
> I only mention that for the sake of completeness, though -- it's not a
> suggestion.
> 
> Right, to me this is just as bad. 
> In my eyes, the Version thing really shows the problem with the analysis 
> stuff:
> * Used by QueryParsers, etc at search and index time, with no real clean way 
> to do back-compat
> * Concepts like Version and class-naming push some of the burden to the user: 
> users decide the back-compat level, but it still leaves devs with back-compat 
> management hassle.
> 
> The idea of having a real versioned-module is the same as Version and 
> class-naming, except it both pushes the burden to the user in a more natural 
> way (people are used to versioned jar files and things like that... not 
> Version constants), and it relieves devs of the back compat
> 
> In all honesty with the current scheme, release schedules of Lucene, and 
> Lucene's policy, the analysis stuff will soon deadlock into being nearly 
> unmaintainable, and to many users, the API is already unconsumable: its 
> difficult to write reusable analyzers due to historical relics in the API, 
> methods are named inappropriately, e.g. Tokenizer.reset(Reader) and 
> TokenStream.reset(), they don't understand Version, and probably a few other 
> things I am forgetting that are basically impossible to fix right now with 
> the current state of affairs.


The thing I keep going back to is that somehow Lucene has managed for years 
(and I mean lots of years) w/o stuff like Version and all this massive back 
compatibility checking.  I'm still undecided as to whether that is a good thing 
or not.  I am also not sure whether, in the past, we just missed/ignored more 
back compatibility issues, or whether we are now creating more back compat. 
issues due to more rapid change.  I agree, though, that all of this stuff is 
making it harder and harder to develop (and I don't mean for us committers, I 
mean for end consumers).

I also agree with Robert's point about the incorrectness of naming something 
3.0 versus 3.1 when 3.1 is the thing that has all the new features and is 
really the "major" release.

-Grant
-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index

2010-04-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856857#action_12856857
 ] 

Grant Ingersoll commented on LUCENE-2184:
-

Thanks, Nicolas.  Applied.

> CartesianPolyFilterBuilder doesn't properly account for which tiers actually 
> exist in the index 
> 
>
> Key: LUCENE-2184
> URL: https://issues.apache.org/jira/browse/LUCENE-2184
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 3.1
>
> Attachments: LUCENE-2184.2.patch, LUCENE-2184.patch
>
>
> In the CartesianShapeFilterBuilder, there is logic that determines the "best 
> fit" tier to create the Filter against.  However, it does not account for 
> which fields actually exist in the index when doing so.  For instance, if you 
> index tiers 1 through 10, but then choose a very small radius to restrict the 
> space to, it will likely choose a tier like 15 or 16, which of course does 
> not exist.
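
A hypothetical sketch of the kind of guard the issue calls for - clamping the 
geometric "best fit" tier to the tier levels that were actually indexed; the 
method and variable names are illustrative and are not taken from the attached 
patches:

{code}
public class TierClampSketch {
  /** Clamp the geometric "best fit" tier to the range of tiers present in the index. */
  static int chooseTier(int bestFitTier, int minIndexedTier, int maxIndexedTier) {
    // e.g. a tiny radius may suggest tier 15, but only tiers 1..10 were indexed -> use 10
    return Math.max(minIndexedTier, Math.min(maxIndexedTier, bestFitTier));
  }

  public static void main(String[] args) {
    System.out.println(chooseTier(15, 1, 10)); // prints 10
  }
}
{code}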

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2159) Tool to expand the index for perf/stress testing.

2010-04-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856845#action_12856845
 ] 

Shai Erera commented on LUCENE-2159:


This looks like a nice tool. But all it does is create multiple copies of the 
same segment(s), right? So what exactly do you want to test with it? What 
worries me is that we'll be multiplying the lexicon, posting lists, statistics 
etc., so I'm not sure how reliable the tests will be (whatever they are), 
except for measuring things related to a large number of segments (like merge 
performance). Am I right?

I also think this class fits better in benchmark than in misc, as it's really 
for perf. testing/measurements and not a generic utility ... You can create a 
Task out of it, like an ExpandIndexTask which one can include in an algorithm.
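
For illustration, here is a minimal sketch of the kind of expansion being 
discussed - copying a small source index into a target index K times so the 
result has many segments for merge/stress tests. This is not the attached 
ExpandIndex.java; the paths, analyzer and merge settings are assumptions 
against the Lucene 3.0 API:

{code}
import java.io.File;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ExpandIndexSketch {
  public static void main(String[] args) throws Exception {
    Directory source = FSDirectory.open(new File("small-index"));
    Directory target = FSDirectory.open(new File("expanded-index"));
    IndexWriter writer = new IndexWriter(target, new WhitespaceAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.setMergeFactor(1000); // keep merges from collapsing the copies right away
    int k = 20;                  // roughly K copies of the source segments
    for (int i = 0; i < k; i++) {
      writer.addIndexesNoOptimize(new Directory[] { source });
    }
    writer.close();
  }
}
{code}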

> Tool to expand the index for perf/stress testing.
> -
>
> Key: LUCENE-2159
> URL: https://issues.apache.org/jira/browse/LUCENE-2159
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: contrib/*
>Affects Versions: 3.0
>Reporter: John Wang
> Attachments: ExpandIndex.java
>
>
> Sometimes it is useful to take a small-ish index and expand it into a large 
> index with K segments for perf/stress testing. 
> This tool does that. See attached class.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2387) IndexWriter retains references to Readers used in Fields (memory leak)

2010-04-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-2387:
---

Attachment: LUCENE-2387-29x.patch

2.9.x version of this patch.

> IndexWriter retains references to Readers used in Fields (memory leak)
> --
>
> Key: LUCENE-2387
> URL: https://issues.apache.org/jira/browse/LUCENE-2387
> Project: Lucene - Java
>  Issue Type: Bug
>Affects Versions: 3.0.1
>Reporter: Ruben Laguna
>Assignee: Michael McCandless
> Fix For: 3.1
>
> Attachments: LUCENE-2387-29x.patch, LUCENE-2387.patch
>
>
> As described in [1] IndexWriter retains references to Reader used in Fields 
> and that can lead to big memory leaks when using tika's ParsingReaders (as 
> those can take 1MB per ParsingReader). 
> [2] shows a screenshot of the reference chain to the Reader from the 
> IndexWriter taken with Eclipse MAT (Memory Analysis Tool) . The chain is the 
> following:
> IndexWriter -> DocumentsWriter -> DocumentsWriterThreadState -> 
> DocFieldProcessorPerThread  -> DocFieldProcessorPerField -> Fieldable -> 
> Field (fieldsData) 
> -
> [1] http://markmail.org/thread/ndmcgffg2mnwjo47
> [2] http://skitch.com/ecerulm/n7643/eclipse-memory-analyzer
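
For readers following along, a hypothetical snippet of the usage pattern the 
report describes - a large Reader handed to a Field. Per the reference chain 
above, the writer's per-thread state keeps the last Field (and therefore its 
Reader) alive after the document has been indexed. The file name stands in for 
Tika's ParsingReader, and the writer setup is illustrative:

{code}
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class ReaderLeakSketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(FSDirectory.open(new File("index")),
        new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    Reader parsed = new BufferedReader(new FileReader("big-document.txt"));
    Document doc = new Document();
    doc.add(new Field("body", parsed)); // Field(String, Reader): tokenized, not stored
    writer.addDocument(doc);            // the Reader is consumed here, but per the report the
                                        // Field (and its fieldsData) remains referenced by the
                                        // writer's per-thread state until it is reused or closed
    writer.close();
  }
}
{code}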

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Google-developed posting list encoding

2010-04-14 Thread Michael McCandless
Flex has already landed (in trunk, for 3.1), so this is "just" a
matter of someone creating a codec using Group VarInt.

Mike
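
For anyone who wants to experiment before wiring it into a codec, here is a 
minimal, standalone sketch of Group VarInt itself (one tag byte carries the 
2-bit byte-lengths of the next four ints); it is not tied to the flex codec 
API and the class name is illustrative:

{code}
import java.nio.ByteBuffer;

public final class GroupVarInt {

  /** Encodes four ints; returns the number of bytes written (5-17). */
  public static int encode(int[] values, ByteBuffer buf) {
    int tagPos = buf.position();
    buf.put((byte) 0);                    // reserve the tag byte
    int tag = 0;
    for (int i = 0; i < 4; i++) {
      int v = values[i];
      int numBytes = bytesNeeded(v);
      tag |= (numBytes - 1) << (i * 2);   // 2 bits per value: length 1..4 stored as 0..3
      for (int b = 0; b < numBytes; b++) {
        buf.put((byte) (v >>> (8 * b)));  // little-endian payload
      }
    }
    buf.put(tagPos, (byte) tag);
    return buf.position() - tagPos;
  }

  /** Decodes four ints written by encode(), starting at the tag byte. */
  public static void decode(ByteBuffer buf, int[] out) {
    int tag = buf.get() & 0xFF;
    for (int i = 0; i < 4; i++) {
      int numBytes = ((tag >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < numBytes; b++) {
        v |= (buf.get() & 0xFF) << (8 * b);
      }
      out[i] = v;
    }
  }

  private static int bytesNeeded(int v) {
    if ((v >>> 8) == 0) return 1;
    if ((v >>> 16) == 0) return 2;
    if ((v >>> 24) == 0) return 3;
    return 4;
  }
}
{code}

Real decoders typically replace the inner loops with a 256-entry lookup table 
keyed on the tag byte; the loops above just keep the sketch short.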

On Wed, Apr 14, 2010 at 4:58 AM, John Wang  wrote:
> This would be something that's excellent for contribution after the
> Flex-Indexing support is added.
> -John
>
> On Wed, Apr 14, 2010 at 12:22 AM, Mike Klaas  wrote:
>>
>> Can be quite a bit faster than vInt in some cases:
>> http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html
>>
>> -Mike
>>
>> -
>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>
>
>

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: Google-developed posting list encoding

2010-04-14 Thread John Wang
This would be something that's excellent for contribution after the
Flex-Indexing support is added.

-John

On Wed, Apr 14, 2010 at 12:22 AM, Mike Klaas  wrote:

> Can be quite a bit faster than vInt in some cases:
> http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html
>
> -Mike
>
> -
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>


[jira] Resolved: (LUCENE-2316) Define clear semantics for Directory.fileLength

2010-04-14 Thread Shai Erera (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shai Erera resolved LUCENE-2316.


Lucene Fields: [New, Patch Available]  (was: [New])
 Assignee: Shai Erera
   Resolution: Fixed

Committed revision 933879.

> Define clear semantics for Directory.fileLength
> ---
>
> Key: LUCENE-2316
> URL: https://issues.apache.org/jira/browse/LUCENE-2316
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Shai Erera
>Assignee: Shai Erera
>Priority: Minor
> Fix For: 3.1
>
> Attachments: LUCENE-2316.patch
>
>
> On this thread: 
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/201003.mbox/%3c126142c1003121525v24499625u1589bbef4c079...@mail.gmail.com%3e
>  it was mentioned that Directory's fileLength behavior is not consistent 
> between Directory implementations if the given file name does not exist. 
> FSDirectory returns a 0 length while RAMDirectory throws FNFE.
> The problem is that the semantics of fileLength() are not defined. As 
> proposed in the thread, we'll define the following semantics:
> * Returns the length of the file denoted by name if the file 
> exists. The return value may be anything between 0 and Long.MAX_VALUE.
> * Throws FileNotFoundException if the file does not exist. Note that you can 
> call dir.fileExists(name) if you are not sure whether the file exists or not.
> For backwards we'll create a new method w/ clear semantics. Something like:
> {code}
> /**
>  * @deprecated the method will become abstract when #fileLength(name)
>  *             has been removed.
>  */
> public long getFileLength(String name) throws IOException {
>   long len = fileLength(name);
>   if (len == 0 && !fileExists(name)) {
>     throw new FileNotFoundException(name);
>   }
>   return len;
> }
> {code}
> The first line just calls the current impl. If it throws exception for a 
> non-existing file, we're ok. The second line verifies whether a 0 length is 
> for an existing file or not and throws an exception appropriately.
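
For anyone reproducing the original inconsistency, a small sketch (the 
directory path is illustrative; the expected outcomes are the ones described 
in the issue, not verified here):

{code}
import java.io.File;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class FileLengthSemantics {
  public static void main(String[] args) throws Exception {
    Directory fs = FSDirectory.open(new File("some-empty-dir"));
    System.out.println(fs.fileLength("no-such-file"));  // 0 on FSDirectory (per the issue)

    Directory ram = new RAMDirectory();
    System.out.println(ram.fileLength("no-such-file")); // throws FileNotFoundException (per the issue)
  }
}
{code}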

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1921) Absurdly large radius (miles) search fails to include entire earth

2010-04-14 Thread Nicolas Helleringer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Helleringer updated LUCENE-1921:


Attachment: TEST-1921.patch

Once LUCENE-2184 has been re-resolved, this test (TEST-1921.patch) can validate 
this issue and it can be closed.

> Absurdly large radius (miles) search fails to include entire earth
> --
>
> Key: LUCENE-1921
> URL: https://issues.apache.org/jira/browse/LUCENE-1921
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9
>Reporter: Michael McCandless
>Assignee: Chris Male
>Priority: Minor
> Fix For: 3.1
>
> Attachments: TEST-1921.patch
>
>
> Spinoff from LUCENE-1781.
> If you do a very large (eg 10 miles) radius search then the
> lat/lng bound box wraps around the entire earth and all points should
> be accepted.  But this fails today (many points are rejected).  It's
> easy to see the issue: edit TestCartesian, and insert a very large
> miles into either testRange or testGeoHashRange.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index

2010-04-14 Thread Nicolas Helleringer (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Helleringer updated LUCENE-2184:


Attachment: LUCENE-2184.2.patch

My work on LUCENE-2359 did break Grant's work here :s

Here is a patch to correct this.

I put the logic into the CartesianTierPlotter instead of the 
CartesianPolyFilterBuilder, as there was already code there to handle 
tier-level borders.

> CartesianPolyFilterBuilder doesn't properly account for which tiers actually 
> exist in the index 
> 
>
> Key: LUCENE-2184
> URL: https://issues.apache.org/jira/browse/LUCENE-2184
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: contrib/spatial
>Affects Versions: 2.9, 2.9.1, 3.0
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
> Fix For: 3.1
>
> Attachments: LUCENE-2184.2.patch, LUCENE-2184.patch
>
>
> In the CartesianShapeFilterBuilder, there is logic that determines the "best 
> fit" tier to create the Filter against.  However, it does not account for 
> which fields actually exist in the index when doing so.  For instance, if you 
> index tiers 1 through 10, but then choose a very small radius to restrict the 
> space to, it will likely choose a tier like 15 or 16, which of course does 
> not exist.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Google-developed posting list encoding

2010-04-14 Thread Mike Klaas
Can be quite a bit faster than vInt in some cases:
http://www.ir.uwaterloo.ca/book/addenda-06-index-compression.html

-Mike

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org