Re: [VOTE] Release PyLucene 8.8.1

2021-03-08 Thread Michael McCandless
+1

I ran my usual smoke test: install JCC and PyLucene, then index and optimize
the first 100K documents from a Wikipedia English snapshot, and run a couple
of queries.

Sorry for being late to the party too!

Mike McCandless

http://blog.mikemccandless.com


On Mon, Mar 1, 2021 at 9:35 PM Andi Vajda  wrote:

>
> The PyLucene 8.8.1 (rc1) release tracking the recent release of
> Apache Lucene 8.8.1 is ready.
>
> A release candidate is available from:
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/8.8.1-rc1/
>
> PyLucene 8.8.1 is built with JCC 3.9, included in these release artifacts.
>
> JCC 3.9 supports Python 3.3 up to Python 3.9 (in addition to Python 2.3+).
> PyLucene may be built with Python 2 or Python 3.
>
> Please vote to release these artifacts as PyLucene 8.8.1.
> Anyone interested in this release can and should vote !
>
> Thanks !
>
> Andi..
>
> ps: the KEYS file for PyLucene release signing is at:
> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>
> pps: here is my +1
>


Re: [VOTE] Release PyLucene 8.6.1

2020-08-25 Thread Michael McCandless
+1 to release.

I ran my usual smoke test to index, forceMerge and search the first 100K
documents from an English Wikipedia export, on Arch Linux, Java 1.11.06,
Python 3.8.1 -- the test ran fine!

Thanks Andi.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Aug 24, 2020 at 7:56 PM Andi Vajda  wrote:

>
> The PyLucene 8.6.1 (rc1) release tracking the recent release of
> Apache Lucene 8.6.1 is ready.
>
> A release candidate is available from:
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/8.6.1-rc1/
>
> PyLucene 8.6.1 is built with JCC 3.8, included in these release artifacts.
>
> JCC 3.8 supports Python 3.3 up to Python 3.8 (in addition to Python 2.3+).
> PyLucene may be built with Python 2 or Python 3.
>
> Please vote to release these artifacts as PyLucene 8.6.1.
> Anyone interested in this release can and should vote !
>
> Thanks !
>
> Andi..
>
> ps: the KEYS file for PyLucene release signing is at:
> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>
> pps: here is my +1
>


Re: Memory usage

2019-11-07 Thread Michael McCandless
Hi Siddharth,

Your understanding of MMapDirectory is correct -- only give your JVM enough
heap to not spend too much CPU on GC, and then let the OS use all available
remaining RAM to cache hot pages from your index.

There are some structures Lucene loads into the JVM heap, but even those have
recently been moving off-heap (accessed via the Directory), such as the FSTs
used for the terms index and the BKD index (for dimensional points).  I'm not
sure exactly which structures are still on heap ... maybe the live-documents
bitset?

During indexing, recently indexed documents are buffered in the JVM heap, up
to the IndexWriterConfig.setRAMBufferSizeMB limit, and are then written to the
Directory as new segments.
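
Roughly, something like this (an untested sketch against recent Lucene
APIs; the index path and the 256 MB buffer are just placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.MMapDirectory;

    public class IndexingSetup {
      public static void main(String[] args) throws Exception {
        // MMapDirectory lets the OS page cache hold the hot parts of the index;
        // keep the JVM heap small and leave the remaining RAM to the OS.
        Directory dir = new MMapDirectory(Paths.get("/path/to/index"));

        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // Recently indexed documents are buffered on the JVM heap up to this
        // limit, then flushed to the Directory as a new segment.
        iwc.setRAMBufferSizeMB(256.0);

        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
          writer.addDocument(new Document());  // placeholder document
        }
      }
    }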

Mike McCandless

http://blog.mikemccandless.com


On Wed, Nov 6, 2019 at 11:27 PM siddharth teotia 
wrote:

> Hi All
>
> I have some questions about the memory usage. I would really appreciate if
> someone can help answer these.
>
> I understand from the docs that during reading/querying, Lucene uses
> MMapDirectory (assuming it is supported on the platform). So the Java heap
> overhead in this case will purely come from the objects that are
> allocated/instantiated on the query path to process the query and build
> results etc.  But the whole index itself will not be loaded into memory
> because we memory mapped the file. Is my understanding correct? In this
> case, we are better off not increasing the Java heap and keep as much
> as possible available for the file system cache for mmap to do its job
> efficiently.
>
> However, are there any portions of index structures that are completely
> loaded in memory regardless of whether it is MMapDirectory or not? If so,
> are they loaded in Java heap or do we use off-heap (direct buffers) in
> such cases?
>
> Secondly, on the write path I think even though the writer opens a
> MMapDirectory, the writes are gathered/buffered in memory upto a flush
> threshold controlled by IndexWriterConfig. Is this buffering done in Java
> heap or direct memory?
>
> Thanks a lot for help
> Siddharth
>


Re: [VOTE] Release PyLucene 7.6.0 (rc1)

2019-01-07 Thread Michael McCandless
+1 to release!

I ran my usual simple test indexing the first 100K docs from an old
wikipedia export, force merging, and running a few searches.

Thank you for continuing to release PyLucene Andi!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jan 4, 2019 at 4:59 PM Andi Vajda  wrote:

>
> The PyLucene 7.6.0 (rc1) release tracking the recent release of
> Apache Lucene 7.6.0 is ready.
>
> A release candidate is available from:
>https://dist.apache.org/repos/dist/dev/lucene/pylucene/7.6.0-rc1/
>
> PyLucene 7.6.0 is built with JCC 3.4 included in these release artifacts.
>
> JCC 3.4 supports Python 3.3+ (in addition to Python 2.3+).
> PyLucene may be built with Python 2 or Python 3.
>
> Please vote to release these artifacts as PyLucene 7.6.0.
> Anyone interested in this release can and should vote !
>
> Thanks !
>
> Andi..
>
> ps: the KEYS file for PyLucene release signing is at:
> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>
> pps: here is my +1
>


Re: [VOTE] Release PyLucene 6.5.0 (rc1) (now with Python 3 support)

2017-03-29 Thread Michael McCandless
+1 to release.

I tested on Ubuntu 16.04 with Python 3.5.2 and Java 1.8.0_121.

I ran my usual smoke test of indexing the first 100K docs from a Wikipedia
English export and running a few searches.  But first I had to run 2to3 on
this ancient script!

I had to apply Ruediger's patch to JCC's setup.py else it was trying to
link with -lpython3.5 but I have -lpython3.5m.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Mar 27, 2017 at 6:12 PM, Andi Vajda  wrote:

>
> The PyLucene 6.5.0 (rc1) release tracking today's release of
> Apache Lucene 6.5.0 is ready.
>
> A release candidate is available from:
>   https://dist.apache.org/repos/dist/dev/lucene/pylucene/6.5.0-rc1/
>
> PyLucene 6.5.0 is built with JCC 3.0 included in these release artifacts.
>
> JCC 3.0 now supports Python 3.3+ (in addition to Python 2.3+).
> PyLucene may be built with Python 2 or Python 3.
>
> Please vote to release these artifacts as PyLucene 6.5.0.
> Anyone interested in this release can and should vote !
>
> Thanks !
>
> Andi..
>
> ps: the KEYS file for PyLucene release signing is at:
> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>
> pps: here is my +1
>


Re: [VOTE] Release PyLucene 6.4.1 (rc1)

2017-02-12 Thread Michael McCandless
+1 to release.

I ran my usual smoke test: indexing the first 100K docs from an English
Wikipedia export, optimizing, and running a couple of searches, on Ubuntu
16.04, Java 1.8.0_101, Python 2.7.12.

Mike McCandless

http://blog.mikemccandless.com


On Sun, Feb 12, 2017 at 5:25 AM, Michael McCandless
<luc...@mikemccandless.com> wrote:
> Sorry, I will have a look!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sat, Feb 11, 2017 at 5:23 PM, Andi Vajda <va...@apache.org> wrote:
>>
>> Ping ?
>> Two more PMC votes are needed before this release can happen.
>> Thanks !
>>
>> Andi..
>>
>>> On Feb 6, 2017, at 13:38, Andi Vajda <va...@apache.org> wrote:
>>>
>>>
>>> The PyLucene 6.4.1 (rc1) release tracking today's release of
>>> Apache Lucene 6.4.1 is ready.
>>>
>>> A release candidate is available from:
>>>  https://dist.apache.org/repos/dist/dev/lucene/pylucene/6.4.1-rc1/
>>>
>>> PyLucene 6.4.1 is built with JCC 2.23 included in these release artifacts.
>>>
>>> Please vote to release these artifacts as PyLucene 6.4.1.
>>> Anyone interested in this release can and should vote !
>>>
>>> Thanks !
>>>
>>> Andi..
>>>
>>> ps: the KEYS file for PyLucene release signing is at:
>>> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
>>> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>>>
>>> pps: here is my +1


Re: [VOTE] Release PyLucene 6.4.1 (rc1)

2017-02-12 Thread Michael McCandless
Sorry, I will have a look!

Mike McCandless

http://blog.mikemccandless.com


On Sat, Feb 11, 2017 at 5:23 PM, Andi Vajda  wrote:
>
> Ping ?
> Two more PMC votes are needed before this release can happen.
> Thanks !
>
> Andi..
>
>> On Feb 6, 2017, at 13:38, Andi Vajda  wrote:
>>
>>
>> The PyLucene 6.4.1 (rc1) release tracking today's release of
>> Apache Lucene 6.4.1 is ready.
>>
>> A release candidate is available from:
>>  https://dist.apache.org/repos/dist/dev/lucene/pylucene/6.4.1-rc1/
>>
>> PyLucene 6.4.1 is built with JCC 2.23 included in these release artifacts.
>>
>> Please vote to release these artifacts as PyLucene 6.4.1.
>> Anyone interested in this release can and should vote !
>>
>> Thanks !
>>
>> Andi..
>>
>> ps: the KEYS file for PyLucene release signing is at:
>> https://dist.apache.org/repos/dist/release/lucene/pylucene/KEYS
>> https://dist.apache.org/repos/dist/dev/lucene/pylucene/KEYS
>>
>> pps: here is my +1


Re: Doing Range/NUmber queries

2016-08-09 Thread Michael McCandless
No, you must replace the entire document: the old one is removed, and the
new one is indexed in its place.

The one exception to this is updatable doc values fields (e.g. see
IndexWriter.updateNumericDocValue).
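
A minimal sketch of that exception (untested; assumes the field was indexed
as a NumericDocValuesField, and the "id"/"price" names are invented):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class DocValuesUpdateExample {
      static void indexAndUpdate(IndexWriter writer) throws Exception {
        // At index time: a doc-values field whose value can later be
        // updated in place.
        Document doc = new Document();
        doc.add(new StringField("id", "doc1", Field.Store.NO));
        doc.add(new NumericDocValuesField("price", 100L));
        writer.addDocument(doc);

        // Later: rewrite just that numeric doc value; no full document
        // replacement happens, every other field is untouched.
        writer.updateNumericDocValue(new Term("id", "doc1"), "price", 42L);
      }
    }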

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 9, 2016 at 2:49 PM, lukes  wrote:

> Thanks Michael,
>
>  Is there a way to partially update the document ? I know there's a API
> updateDocument on IndexWriter, but that seems to create a new document with
> just a field i am specifying. What i want is delete some fields from
> existing(indexed) document, and then add some new fields(could or not be
> same). Alternatively i tried to search for the document, and then calling
> removeFields and finally updateDocument, but now any search after the above
> process is not able for find that document(I created the new IndexReader).
> Am i missing anything ?
>
> Regards.
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Doing-Range-NUmber-queries-tp4290722p4291023.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
>


Re: Doing Range/NUmber queries

2016-08-09 Thread Michael McCandless
For 1), you need to copy it yourself, i.e. add another Field to the Lucene
Document you are about to index, with the same (string, numeric, etc.)
value from the first field.

For 2), it's best to use points (IntPoint, etc.) for range filtering.

For 3), to search a boolean value, just map your boolean to a token, e.g.
"true" and "false".

Mike McCandless

http://blog.mikemccandless.com

On Mon, Aug 8, 2016 at 1:37 AM, lukes  wrote:

> *Update(Found the answer for point 2).*
>
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Doing-Range-NUmber-queries-tp4290722p4290725.html
> Sent from the Lucene - General mailing list archive at Nabble.com.
>


[ANNOUNCE] Apache Lucene 5.5.0 released

2016-02-23 Thread Michael McCandless
23 February 2016, Apache Lucene™ 5.5.0 available

The Lucene PMC is pleased to announce the release of Apache Lucene
5.5.0, expected to be the last 5.x feature release before Lucene
6.0.0.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below. The release is
available for immediate download at:

  http://lucene.apache.org/core/mirrors-core-latest-redir.html

Please read CHANGES.txt for a full list of new features and changes:

  https://lucene.apache.org/core/5_5_0/changes/Changes.html

Lucene 5.5.0 Release Highlights:

  * JoinUtil.createJoinQuery can now join on numeric doc values fields

  * BlendedInfixSuggester now has an exponential reciprocal scoring
model, to more strongly favor suggestions with matches closer to the
beginning

  * CustomAnalyzer has improved (compile time) type safety

  * DFISimilarity implements the divergence from independence scoring model

  * Fully wrap any other merge policy using MergePolicyWrapper

  * Sandbox geo point queries have graduated into the spatial module,
and now use a more efficient binary term encoding for smaller index
size, faster indexing, and decreased search-time heap usage

  * BooleanQuery performs some new query optimizations

  * TermsQuery constructors are more GC efficient


Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also applies to Maven access.

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Solr 5.5.0 released

2016-02-23 Thread Michael McCandless
23 February 2016, Apache Solr™ 5.5.0 available

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 5.5.0 is available for immediate download at:

  http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Please read CHANGES.txt for a full list of new features and changes:

  https://lucene.apache.org/solr/5_5_0/changes/Changes.html

This is expected to be the last 5.x feature release before Solr 6.0.0.

Solr 5.5 Release Highlights:

  * The schema version has been increased to 1.6, and Solr now returns
non-stored doc values fields along with stored fields

  * The PERSIST CoreAdmin action has been removed

  * The <mergePolicy> element is deprecated in favor of a similar
<mergePolicyFactory> element, in solrconfig.xml

  * CheckIndex now works on HdfsDirectory

  * RuleBasedAuthorizationPlugin now allows wildcards in the role, and
accepts an 'all' permission

  * Users can now choose compression mode in SchemaCodecFactory

  * Solr now supports Lucene's XMLQueryParser

  * Collections APIs now have async support

  * Uninverted field faceting is re-enabled, for higher performance on
rarely changing indices

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also applies to Maven access.

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Lucene 4.10.4 released

2015-03-05 Thread Michael McCandless
March 2015, Apache Lucene™ 4.10.4 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.4

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

The release is available for immediate download at:

http://www.apache.org/dyn/closer.cgi/lucene/java/4.10.4l

Lucene 4.10.4 includes 13 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Solr 4.10.4 released

2015-03-05 Thread Michael McCandless
March 2015, Apache Solr™ 4.10.4 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.4

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.4 is available for immediate download at:

http://www.apache.org/dyn/closer.cgi/lucene/solr/4.10.4

Solr 4.10.4 includes 24 bug fixes, as well as Lucene 4.10.4 and its 13
bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


Re: [ANNOUNCE] Apache Lucene 4.10.4 released

2015-03-05 Thread Michael McCandless
Correction: the download link for Lucene 4.10.4 is:

http://www.apache.org/dyn/closer.cgi/lucene/java/4.10.4

Mike McCandless

http://blog.mikemccandless.com


On Thu, Mar 5, 2015 at 10:26 AM, Michael McCandless
luc...@mikemccandless.com wrote:
 March 2015, Apache Lucene™ 4.10.4 available

 The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.4

 Apache Lucene is a high-performance, full-featured text search engine
 library written entirely in Java. It is a technology suitable for
 nearly any application that requires full-text search, especially
 cross-platform.

 The release is available for immediate download at:

 http://www.apache.org/dyn/closer.cgi/lucene/java/4.10.4l

 Lucene 4.10.4 includes 13 bug fixes.

 See the CHANGES.txt file included with the release for a full list of
 changes and further details.

 Please report any feedback to the mailing lists
 (http://lucene.apache.org/core/discussion.html)

 Note: The Apache Software Foundation uses an extensive mirroring
 network for distributing releases. It is possible that the mirror you
 are using may not have replicated the release yet. If that is the
 case, please try another mirror. This also goes for Maven access.

 Mike McCandless

 http://blog.mikemccandless.com


Re: How can I make better project than Lucene?

2014-11-18 Thread Michael McCandless
On Tue, Nov 18, 2014 at 1:16 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Sat, Nov 15, 2014 at 3:22 AM, Michael McCandless
 luc...@mikemccandless.com wrote:

 The analysis chain (attributes) is overly complex.

 If you were to start from scratch, what would the analysis chain look like?

Hi Marvin, long time no talk!  I like the new Go bindings for Lucy!

Here are some things that bug me about Lucene's analysis APIs:
Lucene's Attributes have a separate interface from the impl, with default
impls, and this causes complex code in oal.util.Attribute*.  It seems
like overkill.  It seems like we should just have concrete core impls for
the attributes Lucene knows how to index.

There are 5 java source files in that package related to attributes
(Attribute.java AttributeFactory.java AttributeImpl.java
AttributeReflector.java AttributeSource.java): too much.

There should not be a global AttributeFactory that owns all attrs
throughout the pipeline: that's too global.  Rather, each stage should
be free to control what the next stage sees (LUCENE-2450) ... the
namespace should be private to that stage, and each stage can
delete/add/replace the incoming bindings it saw.  This may seem more
complex but I think it'd be simpler in the end?  And, the first stage
should not have to be responsible for clearing things that later
stages had inserted: a common source of bugs is for that first Tokenizer
not to call clearAttributes.

Reuse of token streams was an afterthought that took a long time to
work its way down to simpler APIs, but now we have ReuseStrategy,
AnalyzerWrapper, DelegatingAnalyzerWrapper.

Custom analyzers can't be (easily?) serialized, so ES and Solr have
their own layers to parse a custom chain from JSON/XML.  Those layers
could do better error checking...

Can we do something better with offsets, such that TokenFilters (not
just Tokenizers/CharReaders) would also be able to set correct
offsets?

The stuffing of things into analysis that really should have been a
gentle schema is annoying: KeywordAnalyzer, Numeric*.

Token filters that want to create graphs are nearly impossible to write.
E.g. you cannot put a WordDelimiterFilter in front of SynonymFilter today
because SynonymFilter can't handle an incoming graph (LUCENE-5012).

Deleted tokens should still be present, just marked as deleted (so
IW doesn't index them).  This would make it possible (to Rob's horror)
for tokenizers to preserve every single character they saw, but things
that are not tokens (punctuation, whitespace) are marked deleted.
Maybe this makes it possible for all stages to work with offsets
properly?

There is probably more, and probably lots of people disagree that
these are even problems :)

Mike McCandless

http://blog.mikemccandless.com


Re: How can I make better project than Lucene?

2014-11-15 Thread Michael McCandless
Actually I think competing projects are very healthy for open source development.

There are many things you could explore to contrast with Lucene,
e.g. write your new search engine in Go not Java: Java has many
problems, maybe Go fixes them.  Go also has a low-latency garbage
collector in development ... and Java's GC options still can't scale
to the heap sizes that are practical now.

Lucene has many limitations, so your competing engine could focus on
them.  E.g. the schemalessness of Lucene has become a big problem,
and near impossible to fix at this point, and prevents new important
features like LUCENE-5879 from being possible, so you could give your
engine a gentle schema from the start.

The Lucene Filter/Query situation is a mess: one should extend the other.

Lucene has weak support for proximity queries (SpanQuery is slow and
does not get much attention).

Lucene is showing its age, missing some compelling features like a
builtin transaction log, core support for numerics (they are sort of
hacked on top), optimistic concurrency support (sequence ids,
versions, something), distributed support (near real time replication,
etc.), multi-tenancy, an example server implementation, so the search
servers on top of Lucene have had to fill these gaps.  Maybe you could
make your engine distributed from the start (Go is a great match for
that, from what little I know).

All 3 highlighter options have problems.

The analysis chain (attributes) is overly complex.

In your competing engine you can borrow/copy/steal from Lucene's good
parts to get started...


Mike McCandless

http://blog.mikemccandless.com


On Fri, Nov 14, 2014 at 8:43 PM, swsong_dev swsong_...@websqrd.com wrote:
 I’m developing search engine, Fastcatsearch. http://github.com/fastcatsearch/fastcatsearch

 Lucene is widely known and famous project and I cannot beat Lucene for now.

 But is there any chance to beat Lucene?

 Anything like features, performance.

 Please, let me know what to do to make better product than Lucene.

 Thank you.


Re: How can I make better project than Lucene?

2014-11-15 Thread Michael McCandless
Well the Apache Software License is very generous about poaching.

Your ideas will go further if you don't insist on going with them.

Mike McCandless

http://blog.mikemccandless.com


On Sat, Nov 15, 2014 at 6:42 AM, Will Martin wmartin...@gmail.com wrote:
 Btw: SwSong should not steal code; which implies an existing license whose 
 terms he is willing to break. Not a good first step.;-)

 will

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Saturday, November 15, 2014 6:22 AM
 To: general@lucene.apache.org
 Subject: Re: How can I make better project than Lucene?

 Actually I think competing projects is very healthy for open source 
 development.

 There are many things you could explore to contrast with Lucene, e.g. write 
 your new search engine in Go not Java: Java has many problems, maybe Go fixes 
 them.  Go also has a low-latency garbage collector in development ... and 
 Java's GC options still can't scale to the heap sizes that are practical now.

 Lucene has many limitations, so your competing engine could focus on them.  
 E.g. the schemalessness of Lucene has become a big problem, and near 
 impossible to fix at this point, and prevents new important features like 
 LUCENE-5879 from being possible, so you could give your engine a gentle 
 schema from the start.

 The Lucene Filter/Query situation is a mess: one should extend the other.

 Lucene has weak support for proximity queries (SpanQuery is slow and does not 
 get much attention).

 Lucene is showing its age, missing some compelling features like a builtin 
 transaction log, core support for numerics (they are sort of hacked on 
 top), optimistic concurrency support (sequence ids, versions, something), 
 distributed support (near real time replication, etc.), multi-tenancy, an 
 example server implementation, so the search servers on top of Lucene have 
 had to fill these gaps.  Maybe you could make your engine distributed from 
 the start (Go is a great match for that, from what little I know).

 All 3 highlighter options have problems.

 The analysis chain (attributes) is overly complex.

 In your competing engine you can borrow/copy/steal from Lucene's good parts 
 to get started...


 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Nov 14, 2014 at 8:43 PM, swsong_dev swsong_...@websqrd.com wrote:
 I’m developing search engine, Fastcatsearch. http://github.com/fastcatsearch/fastcatsearch

 Lucene is widely known and famous project and I cannot beat Lucene for now.

 But is there any chance to beat Lucene?

 Anything like features, performance.

 Please, let me know what to do to make better product than Lucene.

 Thank you.



Re: How can I make better project than Lucene?

2014-11-15 Thread Michael McCandless
Yes it does.

Mike McCandless

http://blog.mikemccandless.com


On Sat, Nov 15, 2014 at 8:53 AM, Will Martin wmartin...@gmail.com wrote:
 Um, doesn't the Apache license require inclusion of the license? Just sayin'


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Saturday, November 15, 2014 8:47 AM
 To: general@lucene.apache.org
 Subject: Re: How can I make better project than Lucene?

 Well the Apache Software License is very generous about poaching.

 Your ideas will go further if you don't insist on going with them.

 Mike McCandless

 http://blog.mikemccandless.com


 On Sat, Nov 15, 2014 at 6:42 AM, Will Martin wmartin...@gmail.com wrote:
 Btw: SwSong should not steal code; which implies an existing license whose 
 terms he is willing to break. Not a good first step.;-)

 will

 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Saturday, November 15, 2014 6:22 AM
 To: general@lucene.apache.org
 Subject: Re: How can I make better project than Lucene?

 Actually I think competing projects is very healthy for open source 
 development.

 There are many things you could explore to contrast with Lucene, e.g. 
 write your new search engine in Go not Java: Java has many problems, maybe 
 Go fixes them.  Go also has a low-latency garbage collector in development 
 ... and Java's GC options still can't scale to the heap sizes that are 
 practical now.

 Lucene has many limitations, so your competing engine could focus on them.  
 E.g. the schemalessness of Lucene has become a big problem, and near 
 impossible to fix at this point, and prevents new important features like 
 LUCENE-5879 from being possible, so you could give your engine a gentle 
 schema from the start.

 The Lucene Filter/Query situation is a mess: one should extend the other.

 Lucene has weak support for proximity queries (SpanQuery is slow and does 
 not get much attention).

 Lucene is showing its age, missing some compelling features like a builtin 
 transaction log, core support for numerics (they are sort of hacked on 
 top), optimistic concurrency support (sequence ids, versions, something), 
 distributed support (near real time replication, etc.), multi-tenancy, an 
 example server implementation, so the search servers on top of Lucene have 
 had to fill these gaps.  Maybe you could make your engine distributed from 
 the start (Go is a great match for that, from what little I know).

 All 3 highlighter options have problems.

 The analysis chain (attributes) is overly complex.

 In your competing engine you can borrow/copy/steal from Lucene's good parts 
 to get started...


 Mike McCandless

 http://blog.mikemccandless.com


 On Fri, Nov 14, 2014 at 8:43 PM, swsong_dev swsong_...@websqrd.com wrote:
 I’m developing search engine, Fastcatsearch. http://github.com/fastcatsearch/fastcatsearch

 Lucene is widely known and famous project and I cannot beat Lucene for now.

 But is there any chance to beat Lucene?

 Anything like features, performance.

 Please, let me know what to do to make better product than Lucene.

 Thank you.




[ANNOUNCE] Apache Lucene 4.10.2 released

2014-10-31 Thread Michael McCandless
October 2014, Apache Lucene™ 4.10.2 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.2

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

The release is available for immediate download at:

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene 4.10.2 includes 2 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy Halloween,

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Solr 4.10.2 released

2014-10-31 Thread Michael McCandless
October 2014, Apache Solr™ 4.10.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.2 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.10.2 includes 10 bug fixes, as well as Lucene 4.10.2 and its 2 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Happy Halloween,

Mike McCandless

http://blog.mikemccandless.com


Re: [VOTE] Release PyLucene 4.10.1-1

2014-10-03 Thread Michael McCandless
+1 to release

I ran my usual smoke test: indexing, optimizing & searching the first 100K
Wikipedia English docs...

Mike McCandless

http://blog.mikemccandless.com


On Wed, Oct 1, 2014 at 7:13 PM, Andi Vajda va...@apache.org wrote:

 The PyLucene 4.10.1-1 release tracking the recent release of Apache Lucene
 4.10.1 is ready.

 This release candidate fixes the regression found in the previous one,
 4.10.1-0, and is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_10/CHANGES

 PyLucene 4.10.1 is built with JCC 2.21 included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_10_1/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.10.1-1.
 Anyone interested in this release can and should vote !

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1


[ANNOUNCE] Apache Lucene 4.10.1 released

2014-09-29 Thread Michael McCandless
September 2014, Apache Lucene™ 4.10.1 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.1

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

The release is available for immediate download at:

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene 4.10.1 includes 7 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Solr 4.10.1 released

2014-09-29 Thread Michael McCandless
September 2014, Apache Solr™ 4.10.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.10.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.10.1 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.10.1 includes 6 bug fixes, as well as Lucene 4.10.1 and its 7 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Lucene 4.9.1 released

2014-09-22 Thread Michael McCandless
September 2014, Apache Lucene™ 4.9.1 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.9.1

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

The release is available for immediate download at:

http://lucene.apache.org/core/mirrors-core-latest-redir.html

Lucene 4.9.1 includes 7 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/core/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


[ANNOUNCE] Apache Solr 4.9.1 released

2014-09-22 Thread Michael McCandless
September 2014, Apache Solr™ 4.9.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.9.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.9.1 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.9.1 includes 2 bug fixes, as well as Lucene 4.9.1 and its 7 bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.

Mike McCandless

http://blog.mikemccandless.com


Re: [VOTE] Release PyLucene 4.9.0-0

2014-07-14 Thread Michael McCandless
+1

I ran my usual smoke test: index first 100K docs from Wikipedia (en),
do a few searches, run forceMerge.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jul 7, 2014 at 11:14 AM, Andi Vajda va...@apache.org wrote:

 The PyLucene 4.9.0-0 release tracking the recent release of Apache Lucene
 4.9.0 is ready.


 *** ATTENTION ***

 Starting with release 4.8.0, Lucene now requires Java 1.7 at the minimum.
 Using Java 1.6 with Lucene 4.8.0 and newer is not supported.

 On Mac OS X, Java 6 is still a common default, please upgrade if you haven't
 done so already. A common upgrade is Oracle Java 1.7 for Mac OS X:
   http://docs.oracle.com/javase/7/docs/webnotes/install/mac/mac-jdk.html

 On Mac OS X, once installed, a way to make Java 1.7 the default in your bash
 shell is:
   $ export JAVA_HOME=`/usr/libexec/java_home`
 Be sure to verify that this JAVA_HOME value is correct.

 On any system, if you're upgrading your Java installation, please rebuild
 JCC as well. You must use the same version of Java for both JCC and
 PyLucene.

 *** /ATTENTION ***


 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_9/CHANGES

 PyLucene 4.9.0 is built with JCC 2.20 included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_9_0/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.9.0-0.
 Anyone interested in this release can and should vote !

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1


Re: Near real time reader using ControlledRealTimeReopenThread

2014-06-25 Thread Michael McCandless
Don't call IndexWriter.commit with each added document.  Call it only
when you need to ensure durability (all index changes are written to
stable storage).

You spawn CRTRT, passing it your SearcherManager and IndexWriter, and
it periodically reopens for you, with methods to wait for a specific
indexing generation if a given search must be real-time. See
Lucene's test cases for examples on how to use
ControlledRealTimeReopenThread...
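
A rough, untested sketch against the Lucene 4.7-era API (the 60s/0.1s
staleness targets are just illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.TrackingIndexWriter;
    import org.apache.lucene.search.ControlledRealTimeReopenThread;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;

    public class NRTSetup {
      static void setup(IndexWriter writer) throws Exception {
        TrackingIndexWriter trackingWriter = new TrackingIndexWriter(writer);
        SearcherManager manager = new SearcherManager(writer, true, null);

        // Reopens the searcher for you: at most every 60s, and at least
        // every 0.1s when someone is waiting on a specific generation.
        ControlledRealTimeReopenThread<IndexSearcher> reopenThread =
            new ControlledRealTimeReopenThread<>(trackingWriter, manager, 60.0, 0.1);
        reopenThread.setDaemon(true);
        reopenThread.start();

        // Index through the tracking writer; it returns a generation you can
        // wait on if a following search must see this document.  No commit()
        // is needed for visibility, only for durability.
        long gen = trackingWriter.addDocument(new Document());
        reopenThread.waitForGeneration(gen);

        IndexSearcher searcher = manager.acquire();
        try {
          // run "real-time" searches here
        } finally {
          manager.release(searcher);
        }
      }
    }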

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 25, 2014 at 9:44 AM, Arun B C bcarunm...@yahoo.com.invalid wrote:
 Hi General Group,

 Am currently using Lucene 4.7.2.

 I use DirectoryReader directoryreader = DirectoryReader.open(indexWriter, 
 true); to get near real time reader. In order to manage directoryReader 
 instance being shared in a multi-threaded environment, People suggested to 
 use NRTManager.

 But I dont find NRTManager anymore in version 4.7.2. I think it was replaced 
 with ControlledRealTimeReopenThread.

 I am not able to find any information about Near real time in 
 ControlledRealTimeReopenThread java doc. I also found a sample provided by a 
 person 'futuretelematics' in 
 http://stackoverflow.com/questions/17993960/lucene-4-4-0-new-controlledrealtimereopenthread-sample-usage.
  It has indexWriter.commit in create and update index methods. Is that right, 
 will it not slow down search? or point me to a sample/information to achieve 
 near real time search using apache 4.7.2 or later.

 Please do let me know if you require any other information.

 Thanks in advance.


 Br,
 Arun BC


Re: [VOTE] Release PyLucene 4.8.0-1

2014-05-03 Thread Michael McCandless
+1 to release.

I ran my usual smoke test: index first 100K Wikipedia docs,
forceMerge, run a few searches.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Apr 30, 2014 at 5:07 PM, Andi Vajda va...@apache.org wrote:

 The PyLucene 4.8.0-1 release tracking the recent release of Apache Lucene
 4.8.0 is ready.


 *** ATTENTION ***

 Starting with release 4.8.0, Lucene now requires Java 1.7 at the minimum.
 Using Java 1.6 with Lucene 4.8.0 is not supported.

 On Mac OS X, Java 6 is still a common default, please upgrade if you haven't
 done so already. A common upgrade is Oracle Java 1.7 for Mac OS X:
   http://docs.oracle.com/javase/7/docs/webnotes/install/mac/mac-jdk.html

 On Mac OS X, once installed, a way to make Java 1.7 the default in your bash
 shell is:
   $ export JAVA_HOME=`/usr/libexec/java_home`
 Be sure to verify that JAVA_HOME value.

 On any system, if you're upgrading your Java installation, please rebuild
 JCC as well. You must use the same version of Java for both JCC and
 PyLucene.

 *** /ATTENTION ***


 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_8/CHANGES

 PyLucene 4.8.0 is built with JCC 2.19 included in these release artifacts.
 The version of JCC included with PyLucene did not change since the previous
 release.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_8_0/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.8.0-1.
 Anyone interested in this release can and should vote !

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1


Re: [VOTE] Release PyLucene 4.6.1-0

2014-02-07 Thread Michael McCandless
Hmm I see many ._* files in the .tar.gz, e.g.:

mike@vine:~/src/pylucene-4.6.1-0/jcc$ tar tzf pylucene-4.6.1-0-src.tar.gz | head

./._pylucene-4.6.1-0
pylucene-4.6.1-0/
pylucene-4.6.1-0/._CHANGES
pylucene-4.6.1-0/CHANGES
pylucene-4.6.1-0/._CREDITS
pylucene-4.6.1-0/CREDITS
pylucene-4.6.1-0/._extensions.xml
pylucene-4.6.1-0/extensions.xml
pylucene-4.6.1-0/._INSTALL
pylucene-4.6.1-0/INSTALL


Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 5, 2014 at 7:29 PM, Andi Vajda va...@apache.org wrote:

 The PyLucene 4.6.1-0 release tracking the recent release of Apache Lucene
 4.6.1 is ready.

 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_6/CHANGES

 PyLucene 4.6.1 is built with JCC 2.19 included in these release artifacts:
 http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_6_1/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.6.1-0.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1


Re: [nag] [VOTE] Release PyLucene 4.5.0-2

2013-10-18 Thread Michael McCandless
+1 to wait for 4.5.1 instead?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 17, 2013 at 10:43 PM, Andi Vajda va...@apache.org wrote:

 One more PMC vote is needed to finalize this release.
 Then, we could wait for Lucene 4.5.1 to happen instead ?

 Andi..

 -- Forwarded message --
 Date: Mon, 14 Oct 2013 14:06:45 -0400
 From: Steve Rowe sar...@gmail.com
 To: general@lucene.apache.org, Andi Vajda va...@apache.org
 Cc: pylucene-...@lucene.apache.org
 Subject: Re: [VOTE] Release PyLucene 4.5.0-2

 +1

 Having installed setuptools 1.1.6 for the previous RC, I successfully
 installed JCC and pylucene and got 0 failures from 'make test'.

 One small thing about the installation instructions: I had to run 'make
 test' as root because of some permissions issues (since 'make install' was
 run as root and apparently took ownership of some files in the unpacked
 distribution directory) - seems like that shouldn't be necessary.

 -
 running build_ext
 running install
 running bdist_egg
 running egg_info
 writing lucene.egg-info/PKG-INFO
 error: lucene.egg-info/PKG-INFO: Permission denied
 make: *** [install-test] Error 1
 -

 Steve

 On Oct 13, 2013, at 11:04 PM, Andi Vajda va...@apache.org wrote:


 The PyLucene 4.5.0-2 release tracking the recent release of Apache Lucene
 4.5.0 is ready.

 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:

 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_5/CHANGES

 PyLucene 4.5.0 is built with JCC 2.18 included in these release artifacts:
 http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES

 A list of Lucene Java changes can be seen at:

 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_5_0/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.5.0-2.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1




Re: [VOTE] Release PyLucene 4.3.1-1

2013-06-30 Thread Michael McCandless
Hmm I see two test failures, on Linux, Python 2.7.3, Java 1.7.0_07:

ERROR: testCachingWorks (__main__.CachingWrapperFilterTestCase)
--
Traceback (most recent call last):
  File "test/test_CachingWrapperFilter.py", line 53, in testCachingWorks
    strongRef = cacher.getDocIdSet(context, context.reader().getLiveDocs())
AttributeError: 'IndexReader' object has no attribute 'getLiveDocs'


and:

ERROR: testPayloadsPos0 (__main__.PositionIncrementTestCase)
--
Traceback (most recent call last):
  File "test/test_PositionIncrement.py", line 257, in testPayloadsPos0
    pspans = MultiSpansWrapper.wrap(searcher.getTopReaderContext(), snq)
  File "/home/mike/src/pylucene-4.3.1-1/test/MultiSpansWrapper.py", line 49, in wrap
    return query.getSpans(ctx, ctx.reader().getLiveDocs(), termContexts)
AttributeError: 'IndexReader' object has no attribute 'getLiveDocs'

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jun 26, 2013 at 4:07 PM, Andi Vajda va...@apache.org wrote:

 The PyLucene 4.3.1-1 release tracking the recent release of Apache Lucene
 4.3.1 is ready.

 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_3/CHANGES

 PyLucene 4.3.1 is built with JCC 2.16 included in these release artifacts:
 http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_3_1/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.3.1-1.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1


Re: Re[2]: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

2013-06-05 Thread Michael McCandless
On Wed, Jun 5, 2013 at 2:51 AM, Artem Lukanin ice...@mail.ru wrote:
  OK, I will try to do it myself.

Thank you!

 As I understand I have to clone lucene_solr_4_3 from  
 https://github.com/apache/lucene-solr.git  and upload a patch to the issue 
 for review?

I'm not a git user, but that sounds right!  See here for more details:

http://wiki.apache.org/lucene-java/HowToContribute

Mike McCandless

http://blog.mikemccandless.com


Re: IndexWriter.commit() performance

2013-06-05 Thread Michael McCandless
On Tue, Jun 4, 2013 at 7:31 PM, Renata Vaccaro ren...@emailtopia.com wrote:
 Thanks.  I need the documents to be searchable as soon as they are
 added.  I also need the documents added to survive a machine crash.

 Soft commits and NRT gets might work, but from what I've read they are
 only available for Solr?

Commits likely got slower on upgrade because your very, very old
Lucene version did not call fsync, so there was no protection against an
OS/hardware crash to ensure the index stayed intact.

Solr's soft commit uses Lucene's near-real-time APIs, so you can
definitely do this with just Lucene: pass the IndexWriter to
DirectoryReader.open, and then use DirectoryReader.openIfChanged to
reopen (without committing).

This lets you decouple durability to crashes (how often you commit)
from index-to-search latency (how often you reopen the reader).
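
A minimal, untested sketch of that pattern (assuming the 4.x-era API,
matching the NRT open that takes the writer):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;

    public class NRTReopen {
      static DirectoryReader maybeReopen(IndexWriter writer, DirectoryReader reader)
          throws Exception {
        // Cheap "soft commit": see recent changes without fsync'ing anything.
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
        if (newReader != null) {
          reader.close();
          reader = newReader;
        }
        return reader;
      }

      static void example(IndexWriter writer) throws Exception {
        DirectoryReader reader = DirectoryReader.open(writer, true);  // initial NRT reader
        // ... index documents ...
        reader = maybeReopen(writer, reader);           // often: low search latency
        IndexSearcher searcher = new IndexSearcher(reader);
        // ... search ...
        writer.commit();                                // rarely: durability on crash
        reader.close();
      }
    }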

Mike McCandless

http://blog.mikemccandless.com


Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

2013-06-03 Thread Michael McCandless
This unfortunately is a limitation of the current FuzzySuggester
implementation: it computes edits in UTF-8 space instead of Unicode
character (code point) space.

This should be fixable: we'd need to fix TokenStreamToAutomaton to
work in Unicode character space, then fix FuzzySuggester to do the
same steps that FuzzyQuery does: do the LevN expansion in Unicode
character space, then convert that automaton to UTF-8, then intersect
with the suggest FST.

Could you open an issue for this?  I won't have any time soon to work
on this but we should open an issue to discuss / see if someone else
has time / iterate. Thanks!

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 30, 2013 at 8:39 AM, Artem Lukanin ice...@mail.ru wrote:
 BTW, I have to set maxEdits=2 to allow letter transpositions in Russian,
 because there will be actually 2 transpositions of 4 bytes representing 2
 Russian letters in UTF-8.

 The worst case is when one field has both Russian and English letters (or
 e.g. numbers), where I have to use minFuzzyLength=6 and maxEdits=2, which
 will work only for Russian words of more than 2 letters and for English
 words of more than 5 letters!



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-tp4067018p4067026.html
 Sent from the Lucene - General mailing list archive at Nabble.com.


Re: minFuzzyLength in FuzzySuggester behaves differently for English and Russian

2013-06-03 Thread Michael McCandless
Thanks Artem.  If you have time/energy to work out a patch that would
be great :)

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 3, 2013 at 7:17 AM, Artem Lukanin ice...@mail.ru wrote:
 I have opened an issue: https://issues.apache.org/jira/browse/LUCENE-5030



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/minFuzzyLength-in-FuzzySuggester-behaves-differently-for-English-and-Russian-tp4067018p4067774.html
 Sent from the Lucene - General mailing list archive at Nabble.com.


Re: How to convert TermDocs and TermEnum ??

2013-05-24 Thread Michael McCandless
Hi,

Have a look at MIGRATE.txt?
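
For the code quoted below, a rough 4.x replacement for the
TermEnum/TermDocs loop might look like this (untested sketch; the exact
signatures, e.g. Terms.iterator, shifted a bit between 4.x releases):

    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.BytesRef;

    public class TermsWalk {
      // Iterate all terms of one field, and for each term its documents.
      static void walkField(IndexReader reader, String field) throws Exception {
        Terms terms = MultiFields.getTerms(reader, field);  // replaces reader.terms(new Term(field))
        if (terms == null) return;                          // field does not exist
        Bits liveDocs = MultiFields.getLiveDocs(reader);
        TermsEnum termsEnum = terms.iterator(null);
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
          String value = term.utf8ToString();               // replaces term.text()
          DocsEnum docs = termsEnum.docs(liveDocs, null);   // replaces reader.termDocs(term)
          int docID;
          while ((docID = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            // use docID and value, e.g. fill retArray[docID]
          }
        }
      }
    }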

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 20, 2013 at 10:54 AM, A. Lotfi majidna...@yahoo.com wrote:
 Hi,

 I found some difficulties converting from old API to the newest one :

 import org.apache.lucene.index.TermDocs;   // does not exist
 import org.apache.lucene.index.TermEnum;  // does not exist

 I tried the migration doc, but could not figure it out, here my code :

 private List<Short>[] loadFieldValues(IndexReader reader) throws IOException
 {

 List<Short>[] retArray = new List[reader.maxDoc()];
 TermEnum termEnum = reader.terms (new Term (GEO_LOC));
 try {
 do {
 Term term = termEnum.term();
 if (term==null || term.field() != GEO_LOC) break;
 TermDocs td = reader.termDocs(term);
 String value = term.text();
...
...

 td.close();
 } while (termEnum.next());

 } finally {
 termEnum.close();
 }
 return retArray;
 }

 I will appreciate your help.
 thanks


Re: [VOTE] Release PyLucene 4.3.0-1

2013-05-08 Thread Michael McCandless
+1 to release!  Exciting to finally have a PyLucene 4.x :)

I ran my usual smoke test (index the first 100K Wikipedia docs and run a
couple of searches) and it looks great!

The only strangeness was ... I set JDK['linux2'] to my install location
(Oracle JDK), and normally this works fine, but this time setup.py
couldn't find javac or javadoc ... so I had to set those two full
paths as well, and then jcc built fine.

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 6, 2013 at 8:27 PM, Andi Vajda va...@apache.org wrote:

 It looks like the time has finally come for a PyLucene 4.x release !

 The PyLucene 4.3.0-1 release tracking the recent release of Apache Lucene
 4.3.0 is ready.

 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_3/CHANGES

 PyLucene 4.3.0 is built with JCC 2.16 included in these release artifacts:
 http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_3_0/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.3.0-1.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1


Re: [VOTE] Release PyLucene 4.2.1

2013-04-15 Thread Michael McCandless
I'm having trouble on an Ubuntu 12.10 box, using Java 1.7_07 and Python 2.7.3.

I was able to build and install both JCC and PyLucene, apparently successfully.

I can import lucene in Python and print lucene.VERSION and confirm it's 4.2.1.

lucene.initVM(lucene.CLASSPATH) succeeds.

Yet, there are no Lucene classes in the lucene module?  When I print
dir(lucene) I just get this:

['CLASSPATH', 'ConstVariableDescriptor', 'FinalizerClass',
'FinalizerProxy', 'InvalidArgsError', 'JArray', 'JArray_bool',
'JArray_byte', 'JArray_char', 'JArray_double', 'JArray_float',
'JArray_int', 'JArray_long', 'JArray_object', 'JArray_short',
'JArray_string', 'JCCEnv', 'JCC_VERSION', 'JObject', 'JavaError',
'PrintWriter', 'StringWriter', 'VERSION', '__builtins__', '__dir__',
'__doc__', '__file__', '__name__', '__package__', '__path__',
'_lucene', 'findClass', 'getVMEnv', 'initVM', 'makeClass',
'makeInterface', 'os', 'sys']

Am I missing something silly...?  Shouldn't Lucene's classes (eg
FSDirectory) be visible in globals() in the lucene module?

Mike McCandless

http://blog.mikemccandless.com

On Sat, Apr 13, 2013 at 5:51 PM, Andi Vajda va...@apache.org wrote:

 It looks like the time has finally come for a PyLucene 4.x release !

 The PyLucene 4.2.1-0 release tracking the recent release of Apache Lucene
 4.2.1 is ready.

 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_4_2/CHANGES

 PyLucene 4.2.1 is built with JCC 2.16 included in these release artifacts:
 http://svn.apache.org/repos/asf/lucene/pylucene/trunk/jcc/CHANGES

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_2_1/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 4.2.1-0.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1


Re: Welcome Tommaso Teofili to the PMC

2013-03-17 Thread Michael McCandless
Welcome Tommaso!

Mike McCandless

http://blog.mikemccandless.com

On Sun, Mar 17, 2013 at 11:04 AM, Steve Rowe sar...@gmail.com wrote:
 I'm pleased to announce that Tommaso Teofili has accepted the PMC's 
 invitation to join.

 Welcome Tommaso!

 - Steve
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



Re: different result for 'OR'

2013-01-21 Thread Michael McCandless
That is odd.  Can you print the Query.toString of the actual two
queries you are running?  (I think the OR must be capitalized to be
parsed by the classic QueryParser?).

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jan 21, 2013 at 7:34 AM, Jeroen Venderbosch j...@woodwing.com wrote:
 I would expect that the query *q=description:(electronics) or
 description:(usb)* would give the same number of results as
 *q=description:(electronics or usb)*. But the first query returns 9662
 documents and the second one 9493. What is the difference?
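
A small sketch of the check Mike asks for, assuming the classic QueryParser
and StandardAnalyzer on a 4.x-era install (the field name is taken from the
question; the analyzer choice is an assumption):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.util.Version;

    public class OrCaseDemo {
      public static void main(String[] args) throws Exception {
        // Classic QueryParser only treats the uppercase OR as a boolean operator.
        QueryParser parser = new QueryParser(Version.LUCENE_40, "description",
                                             new StandardAnalyzer(Version.LUCENE_40));

        Query a = parser.parse("description:(electronics) OR description:(usb)");
        Query b = parser.parse("description:(electronics or usb)");

        // Printing Query.toString() makes the difference visible: a lowercase
        // "or" is analyzed as just another term (and is typically removed as a
        // stopword) rather than treated as the operator, which can change how
        // the clauses are grouped and therefore how many documents match.
        System.out.println("query a: " + a);
        System.out.println("query b: " + b);
      }
    }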





Re: Is there a way to clear lucene's cache?

2013-01-07 Thread Michael McCandless
Lucene itself doesn't do any caching.  Maybe you are thinking of Solr?

The OS also does caching, so if you want a cold test you'll have to
tell the OS to flush its IO cache in between tests.  EG on Linux do
sudo echo 3 > /proc/sys/vm/drop_caches.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jan 2, 2013 at 10:39 AM, S L sol.leder...@gmail.com wrote:
 I'm doing some performance testing and caching is not helpful for the tests.
 Is there a way to clear lucene's query cache between rounds of tests? I've
 tried restarting tomcat but that doesn't help.

 Thanks.





Re: Welcome Sami Siren to the PMC

2012-12-12 Thread Michael McCandless
Welcome Sami!

Mike McCandless

http://blog.mikemccandless.com


On Wed, Dec 12, 2012 at 3:17 PM, Mark Miller markrmil...@gmail.com wrote:
 I'm please to announce that Sami Siren has accepted the PMC's
 invitation to join.

 Welcome Sami!

 - Mark

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Welcome Alan Woodward as Lucene/Solr committer

2012-10-17 Thread Michael McCandless
Welcome aboard Alan!

Happy Coding,

Mike McCandless

http://blog.mikemccandless.com

On Wed, Oct 17, 2012 at 1:36 AM, Robert Muir rcm...@gmail.com wrote:
 I'm pleased to announce that the Lucene PMC has voted Alan as a
 Lucene/Solr committer.

 Alan has been contributing patches on various tricky stuff: positions
 iterators, span queries, highlighters, codecs, and so on.

 Alan: its tradition that you introduce yourself with your background.

 I think your account is fully working and you should be able to add
 yourself to the who we are page on the website as well.

 Congratulations!


Re: Can Lucene be used where each entity to be ranked is a set of documents?

2012-08-22 Thread Michael McCandless
On Wed, Aug 22, 2012 at 10:36 AM, Robert Muir rcm...@gmail.com wrote:
 On Tue, Aug 21, 2012 at 7:42 AM, shashank shashank91.b...@gmail.com wrote:
 Hello,

 I am working on a project wherein each entity to be ranked is not a single
 document but infact a group of documents.

 So, the ranking not only involves standard search engine scoring parameters
 but also the association of documents within an entity/group i.e.
 association of documents within the group also contributes to the ranking
 score.

 You may want to look at Lucene's block join module
 (http://lucene.apache.org/core/4_0_0-BETA/join/index.html): combined
 with IndexWriter's add/updateDocuments functionality which lets you
 add documents as a 'group'.
 Currently I think the way in which the group is scored is just an enum
 with a fixed set of choices (ScoreMode), so you might have to modify
 the source code at the moment if you have a sophisticated way of
 scoring the group of documents, but this would be nice to fix so that
 its something extensible...

Also look at grouping module.

If you have no parent documents/fields (ie only child docs that must
be grouped/scored according to some criteria) then grouping should
work.

But Robert is right: the scoring of a group is fairly simplistic now
... so you may need to tweak the code to do what you need (and please
send patches back!).

Mike McCandless

http://blog.mikemccandless.com
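
To make the block-join suggestion concrete, here is a rough sketch against the
4.0-era APIs linked above; the field names, the parent-marker convention and
the ScoreMode choice are illustrative, not something prescribed in this thread:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.join.ScoreMode;
    import org.apache.lucene.search.join.ToParentBlockJoinQuery;

    public class GroupOfDocsSketch {

      // Index one entity as a contiguous block: its member documents first,
      // then a marker "parent" document last.
      static void addGroup(IndexWriter writer, String groupId, List<String> memberTexts)
          throws java.io.IOException {
        List<Document> block = new ArrayList<Document>();
        for (String text : memberTexts) {
          Document member = new Document();
          member.add(new TextField("body", text, Store.NO));
          block.add(member);
        }
        Document parent = new Document();
        parent.add(new StringField("docType", "parent", Store.NO));
        parent.add(new StringField("groupId", groupId, Store.YES));
        block.add(parent);
        writer.addDocuments(block);   // keeps the block together as one group
      }

      // Score each group (parent) from the matches of its members; ScoreMode
      // is the fixed enum (None/Avg/Max/Total) Robert refers to above.
      static Query groupQuery(Query memberQuery) {
        Filter parents = new CachingWrapperFilter(
            new QueryWrapperFilter(new TermQuery(new Term("docType", "parent"))));
        return new ToParentBlockJoinQuery(memberQuery, parents, ScoreMode.Avg);
      }
    }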


Re: Is query-time Join actually in Lucene 3.6?

2012-08-07 Thread Michael McCandless
Query-time join lives under Lucene's contrib/join in 3.6:
http://lucene.apache.org/core/3_6_1/lucene-contrib/index.html#join

Mike McCandless

http://blog.mikemccandless.com

On Tue, Aug 7, 2012 at 11:41 AM, Homer Nabble homernab...@gmail.com wrote:
 This page states "New query-time joining is more flexible (but less
 performant) than index-time joins."

 https://wiki.apache.org/lucene-java/Lucene3.6

 However, I download Lucene 3.6.0 (and 3.6.1) and there is no mention of
 query-time join in the CHANGES.TXT.

 Also, I see no binaries for org.apache.lucene.search.join - though the API
 doc in the same download contains information about this package:
 lucene-3.6.0/docs/api/all/org/apache/lucene/search/join/package-frame.html

 Could someone please let me know what the story is with query time joins?

 Thanks!


Re: Is it possible to Lucene with a database managed by external application ?

2012-05-29 Thread Michael McCandless
That should be fine.

You just have to separately pull the added/updated rows from the DB
and index them into your Lucene index.

Mike McCandless

http://blog.mikemccandless.com

On Tue, May 29, 2012 at 3:09 AM, Ievgen Krapyva ykrap...@gmail.com wrote:
 Hi everybody,

 I've just started reading about Lucene and thinking whether I can use it
 in my case.
 My case is that the database content I want to provide the search
 capability for is managed (new entries added / removed / edited) by
 other application (written in PHP).

 Am I right in thinking that to get the best with Lucene I have to
 regulary update indexes, i.e. make a full database indexing ?

 Thanks.
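
A minimal sketch of that pull-and-reindex loop, using the Lucene 3.x field
API; the table, column names and the updated_at bookkeeping are assumptions
made up for illustration:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class IncrementalSync {
      // Re-index only the rows the PHP application touched since the last run.
      public static void sync(Connection db, IndexWriter writer, Timestamp lastRun)
          throws Exception {
        PreparedStatement ps =
            db.prepareStatement("SELECT id, title, body FROM articles WHERE updated_at > ?");
        ps.setTimestamp(1, lastRun);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
          Document doc = new Document();
          doc.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
          doc.add(new Field("title", rs.getString("title"), Field.Store.YES, Field.Index.ANALYZED));
          doc.add(new Field("body", rs.getString("body"), Field.Store.NO, Field.Index.ANALYZED));
          // Adds the row if it is new, or replaces its previous Lucene document.
          writer.updateDocument(new Term("id", rs.getString("id")), doc);
        }
        rs.close();
        ps.close();
        writer.commit();
      }
    }

Rows deleted from the database can be handled the same way, by calling
IndexWriter.deleteDocuments with a Term on the primary-key field.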



Re: How to construct the term frequency vector of all words in dictionary?

2012-05-15 Thread Michael McCandless
You can get a TermEnum (IndexReader.terms()) and then keep calling
.next() to advance to the next term, and then .docFreq() to get the
document frequency (how many documents have the term) for that term...

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 15, 2012 at 1:24 PM, Aoi Morida xu.xum...@gmail.com wrote:
 Hi all,

 I want to create the term frequency vector for all words in the dictionary.
 I find that the function getTermFreqVector() can only give term frequency of
 the words existed in the particular document.

 BTW, I want to extract words in the dictionary and I find that the function
 getWordsIterator()  can do this. But as I import
 org.apache.lucene.search.spell.LuceneDictionary, there is always an error
 message. I wondered what's wrong with it. My lucene version is 2.9.4.

 Thank you.

 Regards,

 Aoi
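
Spelled out, the loop Mike describes looks roughly like this on the 2.9.x/3.x
API the poster is using (error handling trimmed, output format illustrative):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class DocFreqDump {
      public static void dump(IndexReader reader) throws Exception {
        TermEnum termEnum = reader.terms();     // positioned before the first term
        try {
          while (termEnum.next()) {             // advance to the next term
            Term term = termEnum.term();
            int df = termEnum.docFreq();        // how many documents contain it
            System.out.println(term.field() + ":" + term.text() + " -> " + df);
          }
        } finally {
          termEnum.close();
        }
      }
    }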



Re: Lucene index directory on disk: (i) do I need to keep it and (ii) how do I handle encryption?

2012-04-24 Thread Michael McCandless
FSDirectory won't load the index into RAM.

But RAMDirectory can: eg, you can init a RAMDirectory, passing your
FSDir to its ctor, to copy all files into RAM.  Then you can delete
the FSDir, but realize this means once your app shuts down you've lost
the index.

I think you can handle your encrypted case by copying the files
yourself from FSDir into RAMDir, decrypting as you go.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Apr 24, 2012 at 10:36 AM, Ilya Zavorin izavo...@caci.com wrote:
 I have two somewhat related questions. I'm working on an Android app that 
 uses Lucene indexing and search. Currently, I precompute an index on desktop 
 and then copy the resulting index folder and files to the Android device. In 
 my app, I open the index up like this:

 String indexDir = "/mnt/sdcard/MyIndexDir";
 Directory dir = FSDirectory.open(indexDir);

 I have 2 questions:
 1. Does Lucene load the entire index into memory? If so, does it mean that 
 after creating the Directory object, I can delete the index dir from the 
 device? Does this depend on the size of the index? If so, do I have an option 
 of forcing it to load the whole index into memory regardless of its size?
 2. Right now the index folder is unencrypted. This is temporary: we have a 
 requirement to encrypt every single file and folder that is used by the app. 
 The problem with this is that I can't create an unencrypted copy of the 
 folder on the device, i.e. I can't do something like this:

 String indexDirEncr = "/mnt/sdcard/MyIndexDirEncr";
 String indexDirUnencr = "/mnt/sdcard/MyIndexDirUnencr";
 //
 // Decrypt indexDirEncr and store it in indexDirUnencr
 //
 Directory dir = FSDirectory.open(indexDirUnencr);

 Is there a good way to handle this? That is, is it somehow possible to load 
 the encrypted folder into memory, decrypt it and then load the decrypted 
 version from memory to create a Directory object?

 Thanks much!


 Ilya Zavorin
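
For the copy-the-whole-index-into-RAM option Mike mentions, a minimal sketch
against the 3.x API (paths illustrative):

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class LoadIndexIntoRam {
      public static IndexReader openFullyInRam(String indexDir) throws Exception {
        Directory onDisk = FSDirectory.open(new File(indexDir));
        Directory inRam = new RAMDirectory(onDisk);  // copies every index file onto the heap
        onDisk.close();
        // The on-disk folder is no longer read after this point, but deleting
        // it means the index is gone for good once the app shuts down.
        return IndexReader.open(inRam);
      }
    }

For the encrypted case, one way to do the decrypt-while-copying Mike describes
is to start from an empty RAMDirectory and write the decrypted bytes of each
index file into it yourself via Directory.createOutput before opening the
IndexReader.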


Re: Welcome Jan Høydahl to the PMC

2012-02-13 Thread Michael McCandless
Welcome Jan!

Mike McCandless

http://blog.mikemccandless.com

On Mon, Feb 13, 2012 at 9:50 AM, Robert Muir rcm...@gmail.com wrote:
 Hello,

 I'm pleased to announce that Jan has accepted the PMC's invitation to join.

 Congratulations Jan!

 --
 lucidimagination.com

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org



[ANNOUNCE] Apache Lucene 3.4.0 released

2011-09-14 Thread Michael McCandless
September 14 2011, Apache Lucene™ 3.4.0 available

The Lucene PMC is pleased to announce the release of Apache Lucene 3.4.0.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for nearly
any application that requires full-text search, especially cross-platform.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:

   http://www.apache.org/dyn/closer.cgi/lucene/java (see note below).

If you are already using Apache Lucene 3.1, 3.2 or 3.3, we strongly
recommend you upgrade to 3.4.0 because of the index corruption bug on
OS or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0.

See the CHANGES.txt file included with the release for a full list of
details.

Lucene 3.4.0 Release Highlights:

  * Fixed a major bug (LUCENE-3418) whereby a Lucene index could
easily become corrupted if the OS or computer crashed or lost
power.

  * Added a new faceting module (contrib/facet) for computing facet
counts (both hierarchical and non-hierarchical) at search
time (LUCENE-3079).

  * Added a new join module (contrib/join), enabling indexing and
searching of nested (parent/child) documents using
BlockJoinQuery/Collector (LUCENE-3171).

  * It is now possible to index documents with term frequencies
included but without positions (LUCENE-2048); previously
omitTermFreqAndPositions always omitted both.

  * The modular QueryParser (contrib/queryparser) can now create
NumericRangeQuery.

  * Added SynonymFilter, in contrib/analyzers, to apply multi-word
synonyms during indexing or querying, including parsers to read
the wordnet and solr synonym formats (LUCENE-3233).

  * You can now control how documents that don't have a value on the
sort field should sort (LUCENE-3390), using SortField.setMissingValue.

  * Fixed a case where term vectors could be silently deleted from the
index after addIndexes (LUCENE-3402).

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Apache Lucene/Solr Developers


[ANNOUNCE] Apache Solr 3.4.0 released

2011-09-14 Thread Michael McCandless
September 14 2011, Apache Solr™ 3.4.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 3.4.0.

Apache Solr is the popular, blazing fast open source enterprise search
platform from the Apache Lucene project. Its major features include
powerful full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
distributed search and index replication, and it powers the search and
navigation features of many of the world's largest internet sites.

This release contains numerous bug fixes, optimizations, and
improvements, some of which are highlighted below.  The release
is available for immediate download at:

   http://www.apache.org/dyn/closer.cgi/lucene/solr (see note below).

If you are already using Apache Solr 3.1, 3.2 or 3.3, we strongly
recommend you upgrade to 3.4.0 because of the index corruption bug on OS
or computer crash or power loss (LUCENE-3418), now fixed in 3.4.0.

See the CHANGES.txt file included with the release for a full list of
details.

Solr 3.4.0 Release Highlights:

  * Bug fixes and improvements from Apache Lucene 3.4.0, including a
major bug (LUCENE-3418) whereby a Lucene index could
easily become corrupted if the OS or computer crashed or lost
power.

  * SolrJ client can now parse grouped and range facets results
(SOLR-2523).

  * A new XsltUpdateRequestHandler allows posting XML that's
transformed by a provided XSLT into a valid Solr document
(SOLR-2630).

  * Post-group faceting option (group.truncate) can now compute
facet counts for only the highest ranking documents per-group.
(SOLR-2665).

  * Add commitWithin update request parameter to all update handlers
that were previously missing it.  This tells Solr to commit the
change within the specified amount of time (SOLR-2540).

  * You can now specify NIOFSDirectory (SOLR-2670).

  * New parameter hl.phraseLimit speeds up FastVectorHighlighter
(LUCENE-3234).

  * The query cache and filter cache can now be disabled per request
See http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
(SOLR-2429).

  * Improved memory usage, build time, and performance of
SynonymFilterFactory (LUCENE-3233).

  * Added omitPositions to the schema, so you can omit position
information while still indexing term frequencies (LUCENE-2048).

  * Various fixes for multi-threaded DataImportHandler.

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases.  It is possible that the mirror you are using may not
have replicated the release yet.  If that is the case, please try another
mirror.  This also goes for Maven access.

Happy searching,

Apache Lucene/Solr Developers


Re: Caused by: java.io.IOException: read past EOF

2011-09-09 Thread Michael McCandless
Can you post the traceback/exception?

Are you overriding the default LockFactory for your Directory?

Mike McCandless

http://blog.mikemccandless.com

On Fri, Sep 9, 2011 at 6:07 AM, Java_dev abde...@hotmail.com wrote:
 Hi Michael,

 Thx for taking time to help me out.

 We are using Lucene to index the
 titles(maintitle,subtitle,isbnnumber,productavailability...) in our database
 for a faster search on several fields. We are adding every day new titles
 and a lot of titles are being updated.



 The index is stored in a directory on the local filesystem (Linux RedHat).
 In the directory there are +/- 50 files
 (segments_grr4f,segments.gen,_21z41.cfs(847MB),_6zq20.cf(643MB)s ...)

 When the update is in progress lucene creates a 'write.lock' file now and
 then.

 Thx in advance.











Re: [VOTE] Release PyLucene 3.3 (rc3)

2011-07-21 Thread Michael McCandless
+1 to release!

Smoke test passed and I see grouping module classes are visible by
default!  Thanks Andi :)

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 21, 2011 at 12:47 PM, Andi Vajda va...@apache.org wrote:

 A problem was found with rc2. Please, vote on rc3, thanks :-)

 The Apache PyLucene 3.3-3 release closely tracking the recent release of
 Apache Lucene Java 3.3 is ready.

 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 This new release candidate fixes an issue with wrapping the new grouping
 contrib module which is now part of the PyLucene build.

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_3/CHANGES

 PyLucene 3.3 is built with JCC 2.10 included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_3/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 3.3-3.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
    http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
    http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1



Re: [VOTE] Release PyLucene 3.3.0

2011-07-03 Thread Michael McCandless
Everything looks good -- I was able to compile, run all tests
successfully, and run my usual smoke test (indexing & optimizing &
searching on first 100K wikipedia docs), but...

I then tried to enable the grouping module (lucene/contrib/grouping),
by adding a GROUPING_JAR matching all the other contrib jars, and
running make.  This then hit various compilation errors -- is anyone
able to enable the grouping module and compile successfully?

Mike McCandless

http://blog.mikemccandless.com

On Fri, Jul 1, 2011 at 8:24 AM, Andi Vajda va...@apache.org wrote:

 The PyLucene 3.3.0-1 release closely tracking the recent release of Lucene
 Java 3.3 is ready.

 A release candidate is available from:
 http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_3/CHANGES

 PyLucene 3.3.0 is built with JCC 2.9 included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_3/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 3.3.0-1.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
 http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
 http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1



Re: [VOTE] Release PyLucene 3.2.0

2011-06-07 Thread Michael McCandless
+1

I built on OS X 10.6.6, passed all tests (I think?  No overall summary
in the end, but I didn't see any obvious problem), and ran my usual
smoke test indexing first 100K docs from a line file from Wikipedia,
and running a few searches.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jun 6, 2011 at 4:58 PM, Andi Vajda va...@apache.org wrote:

 The PyLucene 3.2.0-1 release closely tracking the recent release of Lucene
 Java 3.2 is ready.

 A release candidate is available from:
  http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
  http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_2/CHANGES

 PyLucene 3.2.0 is built with JCC 2.9 included in these release artifacts.

 A list of Lucene Java changes can be seen at:
  http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_3_2/lucene/CHANGES.txt

 Please vote to release these artifacts as PyLucene 3.2.0-1.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
  http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
  http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1




Re: Lucene: is it possible to search with an error in one letter?

2011-05-30 Thread Michael McCandless
If you want to allow for any single character change, you can use
FuzzyQuery.  EG, pencil~1 allows for 1 character change, pencil~2
allows for 2.

Note that FuzzyQuery is very costly in 3.x, but is substantially (eg
factor of 100 times)  faster in trunk / 4.0.

Mike

http://blog.mikemccandless.com

On Mon, May 30, 2011 at 1:33 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Hi,

 Yes, penc?l should do it.

 Otis
 ---
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/



 - Original Message 
 From: boraldo bora...@mail.ru
 To: general@lucene.apache.org
 Sent: Mon, May 30, 2011 8:08:54 AM
 Subject: Lucene: is it possible to search with an error in one letter?

 I have a document with text pencil. I want to search it using  query
 pencel. Or vica versa.
 Is it possible ?
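
Both suggestions in code, hedged on version: the int max-edits FuzzyQuery
constructor below is the trunk/4.0 form Mike mentions (3.x takes a float
similarity instead), and the field name is made up:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class OneLetterOff {
      public static void main(String[] args) {
        // Allows any single edit, so the query term "pencel" matches a
        // document containing "pencil".
        FuzzyQuery fuzzy = new FuzzyQuery(new Term("body", "pencel"), 1);

        // Otis' suggestion: exactly one unknown character in that position.
        WildcardQuery wildcard = new WildcardQuery(new Term("body", "penc?l"));

        System.out.println(fuzzy);
        System.out.println(wildcard);
      }
    }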





Re: Welcome Chris Male & Andi Vajda as full Solr / Lucene Committers

2011-05-23 Thread Michael McCandless
Welcome!

Mike

http://blog.mikemccandless.com

On Mon, May 23, 2011 at 12:39 PM, Simon Willnauer
simon.willna...@googlemail.com wrote:
 Hi folks,

 I am happy to announce that the Lucene PMC has accepted Chris Male and
 Andi Vajda as Lucene/Solr committers.

  Congratulations & Welcome on board, Chris & Andi!!

 Simon

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: Special Board Report for May 2011

2011-05-07 Thread Michael McCandless
 know where to draw the line. I trust the great
 people of this community to know when it's better to discuss something in
 email. An example, if a new feature is being discussed, then it's ok if two
 people want to hash few things out quickly, before they send a detailed and
 organized proposal to the list -- the details to hash out are the initial
 proposal's details. The rest should be followed on list, even if it means
 slightly slower response time.

 Today's list and JIRA volume always look to me like the response time is
 instantaneous. We have very active people from around the globe, so you have
 a high chance receiving response in no time. In the worse case, it will take
 a couple of hours, but I don't remember when did that happen (which is an
 amazing thing !)

 Cheers,
 Shai

 On Fri, May 6, 2011 at 8:35 PM, Grant Ingersoll gsing...@apache.org wrote:

  More reading (shall I say required reading?).  Benson does a good job of
  explaining some of the concepts around consensus and why we also should be
  primarily using mailing lists:
  https://blogs.apache.org/comdev/entry/how_apache_projects_use_consensus
 
  -Grant
 
  On May 5, 2011, at 10:10 AM, Grant Ingersoll wrote:
 
  
   I'd like to throw out another idea:
  
   I think we should standardize on rotating the PMC Chair every year.  I
  think to date, there have been two Chairs:  Doug and me.  Back when Doug
  left, no one wanted to do it (both Hoss and I said we would if no one else
  wanted to) and so I took it on.  For the most part, it's a thankless task 
  of
  herding cats (albeit low volume, thankfully), despite the important 
  sounding
  name that marketing types love.  I would like us to share the burden 
  across
  the PMC by rotating it on an annual basis.  Many other ASF projects do
  exactly this and I think it removes any political pressure.  Have I sold 
  it
  enough? ;-)  Besides, I just know others are dying to file board reports 
  on
  a quarterly basis!
  
   More inline below...
  
   On May 5, 2011, at 8:27 AM, Michael McCandless wrote:
  
   On Wed, May 4, 2011 at 6:40 PM, Grant Ingersoll gsing...@apache.org
  wrote:
   2. I think we need to prioritize getting patch contributors more
  feedback sooner.  I think some of this can be automated much like what
  Hadoop has done.  This should help identify new committers sooner and
  encourage them to keep contributing.
  
   Big +1.  We should be using automation everywhere we can.
  
   But, really, we (as all projects do) need more devs.  Growing the
   community should be job #1 of all committers.
  
   Agreed, but this dovetails w/ the use of IRC.  I realize live collab is
  nice, but it discourages those who aren't in the know about the channel
  being used from ever contributing.    Say, for instance, I'm interested in
  DWPT (DocWriterPerThread), how am I supposed to know that at 8 am EDT on 
  May
  5th (made up example), three of the committers are going to be talking 
  about
  it on IRC?  If there is email about it, then I can participate.  Nothing 
  we
  do is so important that it can't wait a few hours or a day, besides the
  fact, that email is damn near instantaneous these days anyway.
  
   Also, keep in mind that until about a year ago, most everything was done
  on the mailing list and I think we progressed just fine.  Since then, 
  dev@has almost completely dried up in terms of discussions (factoring out 
  JIRA
  mails which have picked up -- which is good) and the large majority of
  discussion takes place on IRC.  I agree, however, we should have the IRC
  discussion on another thread.
  
  
  
   So, what other ideas do people have?  I'll leave this thread open for 
   a
  week or so and then add what we think are good things to
  https://svn.apache.org/repos/asf/lucene/board-reports/2011/special-board-report-may.txt
   The board meeting is on May 19th.  I plan on attending.
  
   How about also PMC members will be more proactive in tackling issues
   that erode the community?  I think this would start with a thread on
   general@.  We need to get in the habit of discussing even tiny
   elephants as soon as they appear, somehow.
  
   Yeah, I agree.  The hard part for me, is I often feel like people on the
  outside make big deals about this stuff and don't get that even having the
  discussion is a very healthy sign.  Besides the fact, that no one likes
  confrontation and uncomfortable topics.  We also, I think, are all tired 
  of
  endless debates that go on and on w/ no resolution.  It's one of the big
  downsides (and, of course, upsides) to consensus based open source as
  opposed to the dictatorial approach.
  
  
   Here's an example: Is Lucid abusing their too-strong influence over
   Lucene/Solr?  It's a great question, and I personally feel the answer
   today is no, but nevertheless we should be able to discuss it and
   similar could-be-controversial topics.
  
   I hopefully would agree we are good stewards of the fact that we

Re: Special Board Report for May 2011

2011-05-05 Thread Michael McCandless
On Wed, May 4, 2011 at 6:40 PM, Grant Ingersoll gsing...@apache.org wrote:

 At our core, this means we are supporting a set of libraries that can be used 
 for search and related capabilities across a lot of different applications 
 ranging in size and shape, as well as a server that makes those capabilities 
 available and easy to consume without requiring Java programming for those 
 who choose to use it.  Our goal has always been to make the parts we like to 
 work on as fast, efficient and capable as possible.As with all open 
 source projects, anyone should be able to contribute where they see fit and 
 to scratch their itch.  Open source has always been evolutionary in code 
 development, not revolutionary.

+1

 I will throw out some ideas as possibly helpful in continuing to build a 
 strong community, but maybe they aren't.  And, no, I don't think any one of 
 these solves everything.

 1. No more IRC for design decisions (answering user questions is OK, IMO) 
 even if they are captured on JIRA.  Either that or we should make IRC logged 
 and public and part of the public record on Lucene/Solr.The fact is, most 
 mailing list subscribers are not on IRC and IRC discussions/decisions rob 
 many of us of the opportunity to participate in the design and it sometimes 
 come across that everything is done by the time it hits JIRA.  It's also very 
 hard for people who aren't on IRC to get the full gist of the discussion if 
 only a summary is presented in JIRA.  Also, due to time zones, many people 
 are asleep while others are working. IRC also prevents ideas from breathing 
 a bit.  Also, since IRC isn't logged, there is less decorum/respect at times 
 (even if I think the banter keeps things lighter most of the time) and even 
 though most of us committers are friends, outsiders or potential contributors 
 may not see sarcasm or jokes in the same way that the rest of us who know 
 each other do.

-0

Probably we should fork off a separate thread to discuss IRC?  But
here's my quick take:

I feel there are times when it's appropriate and time's when it's not
and we should use the right tool for the job at hand.

EG, the recent landing of the [very large] concurrent flushing (DWPT)
branch was a great example where live collaboration was very
helpful, I think.

I completely agree that no decisions are made on IRC: if it's not
on the list, it didn't happen.  Discussions can happen and if that
results in an idea, an approach, that suggestion gets moved to an
issue / to the dev list for iterating.

 2. I think we need to prioritize getting patch contributors more feedback 
 sooner.  I think some of this can be automated much like what Hadoop has 
 done.  This should help identify new committers sooner and encourage them to 
 keep contributing.

Big +1.  We should be using automation everywhere we can.

But, really, we (as all projects do) need more devs.  Growing the
community should be job #1 of all committers.

 3. As a core principal, design discussions, etc. should not take place on 
 private emails or via IM or phone calls.  I don't know how much of this there 
 is, but I've seen hints of it from a variety of people to know it happens.  
 Obviously, there is no way to enforce this other than people should take it 
 to heart and stop it.

+1

Also, big issues should not be sent via private email to hand-picked
people.  Send it to general@

 4.  I think it goes w/o saying that we all learned our lessons about 
 committing and reverting things.  Reverting someone else's code is for when 
 things break the build, not for political/idealogical reasons.

+1

Add to this no way! list: committing without first resolving
the objections raised by other committers.

And also: 'don't walk away from discussions, especially important
ones'.  Radio silence / silent treatment is not a good approach in the
real world, and it's even worse in the open-source world.  Try always
to bring closure, to heal the community after strong disagreements.

 5. People should commit and do their work where they see fit.  If others have 
 better ideas about refactoring them, then step up and help or do the 
 refactoring afterwards.  It's software.  Not everything need be perfect the 
 first time or in just the right location the first time.  At the same time, 
 if others want to refactor it and it doesn't hurt anything but ends up being 
 better for more people b/c it is reusable and componetized, than the 
 refactoring should not be a problem.

+1, progress not perfection, as long as we are free to refactor.

Freedom to refactor/poach is the bread  butter of open source.

 So, what other ideas do people have?  I'll leave this thread open for a week 
 or so and then add what we think are good things to 
 https://svn.apache.org/repos/asf/lucene/board-reports/2011/special-board-report-may.txt
   The board meeting is on May 19th.  I plan on attending.

How about also PMC members will be more proactive in tackling issues
that erode the community?  I think this would start with a thread on
general@.  We need to get in the habit of discussing even tiny
elephants as soon as they appear, somehow.

Re: Special Board Report for May 2011

2011-05-05 Thread Michael McCandless
On Wed, May 4, 2011 at 7:26 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 The amazing thing to me is that Lucene of all projects is having problems
 like this.  Lucene has always been my primary example of Open Source Done
 Right.

I think with passion comes blowups.  I think it's natural, and, as long
as the community heals, healthy.  We will emerge stronger from this.

 I very much hope that it comes back to those roots.  The people who
 contribute to Lucene are too good a group to have these problems.

We will.  This is a resilient community ;)

In fact I find it very inspiring that despite this storm in the
background, committers were still actively pushing things forward.

EG, Simon  others landed the concurrent flushing (DWPT)
branch... resulting in astounding gains in Lucene's indexing
throughput on concurrent hardware
(http://people.apache.org/~mikemccand/lucenebench/indexing.html).

Mike

http://blog.mikemccandless.com


Re: IndexFiles cmd runs, even when IndexFiles.java is deleted

2011-05-02 Thread Michael McCandless
Likely the .class file is still present?  Javac compiles .java files
into .class files, and then java executes from .class files.

Mike

http://blog.mikemccandless.com

On Mon, May 2, 2011 at 8:13 AM, daniel daniel_pfis...@msn.com wrote:

 I'm new to Lucene and Java,

 I'm trying to modify the source code for the indexing function in
 Lucene-3.0.3; however, when I modified IndexFiles.java nothing happened, it
 simply indexed the files the same way as before.  So I deleted that file
 entirely, and entered java org.apache.lucene.demo.IndexFiles (+ file to be
 index) in the cmd line again, and IT STILL RAN!

 What is going on here?  How can the program run when the file is removed?






Re: [VOTE] Create Solr TLP - bigger picture

2011-04-27 Thread Michael McCandless
Thanks Shane.

I agree we (the PMC) should have stepped in well before things got to
this point.  Hindsight is 20/20, and, I'm still learning here too ;)

Then we could have prevented such extreme non-Apache behavior (invalid
vetos, reverting wars).

Mike

http://blog.mikemccandless.com

On Wed, Apr 27, 2011 at 9:21 AM, Shane Curcuru a...@shanecurcuru.org wrote:
 Michael McCandless luc...@mikemccandless.com wrote:

 ...snip...

 While I agree, out of context, Robert's use of a veto/revert wars is
 inappropriate, and is not how things should be done in a healthy
 Apache project... Lucene/Solr are not healthy right now, and
 desperate times call for desperate measures.

 Apache projects are about community and consensus driven development. When
 the larger community is having serious disagreements about the direction of
 the project, the first place the community (people here) should go is to the
 PMC - that'd be private@lucene in this case.

 PMCs *should* be the place to work these kinds of issues out.  If committers
 start engaging in controversial reverts, the community should *insist* that
 the PMC assist in the matter and help show the community-based way forwards.

 If committers on any Apache project aren't getting answers or help from
 their PMC, then you can always raise the issue up to board@.  Remember:
 we're all volunteers here: it does take time for PMCs or communities to
 really understand the issue and respond to it (even if there isn't
 consensus).  So I certainly wouldn't urge people to email board@ with every
 little issue without letting the PMC discuss it.

 But from a board perspective, we would certainly rather have heard of some
 of the apparent community issues in Solr and Lucene recently from a PMC
 member or committer *first*, before one of the directors was reading through
 some of these threads or JIRA comments this week.  The board welcomes
 reports on community health from our projects - good or bad.

 - Shane (not on Lucene lists)



Re: Welcome Dawid Weiss and Stanislaw Osinski as Lucene/Solr committers

2011-02-08 Thread Michael McCandless
Welcome Dawid and Stanislaw!

Mike

On Tue, Feb 8, 2011 at 1:13 PM, Robert Muir rcm...@gmail.com wrote:
 I'm pleased to announce that the PMC has voted in Dawid Weiss and
 Stanislaw Osinski as Lucene/Solr committers!

 Welcome!

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: [VOTE] Release PyLucene 2.9.4-1 and 3.0.3-1

2010-12-05 Thread Michael McCandless
+1 to both.

I installed both on Linux (Fedora 13) and ran my test python script
that indexes first 100K line docs from wikipedia and runs a few
searches.  No problems!

Mike

On Sun, Dec 5, 2010 at 1:50 AM, Andi Vajda va...@apache.org wrote:

 With the recent releases of Lucene Java 2.9.4 and 3.0.3, the PyLucene
 2.9.4-1 and 3.0.3-1 releases closely tracking them are ready.

 Release candidates are available from:

    http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_2_9/CHANGES
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_0/CHANGES

 All versions of PyLucene are built with the same version of JCC, currently
 version 2.7, included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/CHANGES.txt
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/CHANGES.txt

 Please vote to release these artifacts as PyLucene 2.9.4-1 and 3.0.3-1.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
    http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
    http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1



Re: [VOTE] Release of Apache Lucene 3.0.3 and 2.9.4 artifacts (take 2)

2010-12-01 Thread Michael McCandless
On Wed, Dec 1, 2010 at 3:38 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi,

 Thanks to the PMC for voting on the Lucene 3.0.3 and 2.9.4 artifacts. The
 vote has passed with 3 positive votes:
 - Robert Muir
 - Andi Vajda
 - Uwe Schindler

Excellent!  Thanks everyone :)

 I will start to publish the artifacts to the mirrors today and send the
 announcement message on Friday morning after the website was updated.

 Mike: What do you think are the key facts/bug fixes for this version? I
 will prepare the announcement message today, mostly it is the same as
 always, but we should list some key points, like serious bugs.

How about something like this:

This release contains numerous bug fixes since 2.9.3/3.0.2, including
a memory leak in IndexWriter exacerbated by frequent commits, a file
handle leak in IndexWriter when near-real-time readers are opened with
compound file format enabled, a rare index corruption case on disk
full, and various thread safety issues.

Mike


Re: PMC Additions

2010-11-28 Thread Michael McCandless
Welcome Simon and Koji!

Mike

On Sun, Nov 28, 2010 at 7:30 AM, Grant Ingersoll gsing...@apache.org wrote:
 I'm pleased to announce the addition of Simon Willnauer and Koji Sekiguchi to 
 the Lucene PMC.  Both Simon and Koji have been long time 
 contributors/committers to both Lucene and Solr.

 Congrats!

 -Grant
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: [VOTE] Rename Lucene Java to be Lucene Core

2010-11-10 Thread Michael McCandless
+1

Mike

On Tue, Nov 9, 2010 at 3:57 PM, Grant Ingersoll gsing...@apache.org wrote:
 Per the discuss thread and the fact that Java is TM Oracle, I would like us 
 to change Lucene Java to now be referred to as Lucene Core.  The primary 
 change is on the website where the Java tab will now be the Core tab and 
 other mentions will be adjusted accordingly.  I still expect we will just 
 refer to it informally as Lucene, since that is what it is.

 +1

 -Grant





Re: [DISCUSS] Lucene Java - Lucene Core

2010-11-08 Thread Michael McCandless
+1

Seems prudent given the current Java climate.

Mike

On Mon, Nov 8, 2010 at 10:57 AM, Grant Ingersoll gsing...@apache.org wrote:
 Hi Luceneers, esp. PMC and Committers,

 I'm in the process of reviewing our branding per the Trademarks committee 
 sending out requirements.   So, expect to see some changes to the website and 
 logos in the coming days as well as, potentially, a request for help.

 Per the Branding Requirements at http://www.apache.org/foundation/marks/pmcs, 
 I think we should stop calling our core Java implementation Lucene Java, 
 since Java is an Oracle TM, and move to simply calling it Lucene Core or 
 Lucene for Java.

 I'm inclined to call it Lucene Core or (Core, for short).  Most of us just 
 call it Lucene anyway, so the Core part really is only for navigation 
 purposes on the website.

 I'd like to discuss this for a day or two and then call a vote.

 Thoughts?

 -Grant


Re: Welcome Steven Rowe as Lucene/Solr committer!

2010-09-22 Thread Michael McCandless
Welcome Steven!!

Mike

On Wed, Sep 22, 2010 at 9:19 AM, Robert Muir rcm...@gmail.com wrote:
 I'm pleased to announce that the PMC has accepted Steven Rowe as Lucene/Solr
 committer!
 Welcome Steven!
 --
 Robert Muir
 rcm...@gmail.com



Re: Welcome Robert Muir to the Lucene PMC

2010-07-07 Thread Michael McCandless
Congrats!

Mike

On Wed, Jul 7, 2010 at 2:12 PM, Grant Ingersoll gsing...@apache.org wrote:
 In recognition of Robert's continuing contributions to Lucene and Solr, I'm 
 happy to announce Robert has accepted our invitation to join the Lucene PMC.

 Cheers,
 Grant Ingersoll
 Lucene PMC Chair
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




Re: [VOTE] [Take 2] Release PyLucene 2.9.3-1 and 3.0.2-1

2010-06-30 Thread Michael McCandless
+1

Mike

On Tue, Jun 29, 2010 at 7:47 AM, Andi Vajda va...@apache.org wrote:

 The first vote started on June 18th received two PMC votes and one user
 vote.

 A couple of bugs got fixed in the meantime so I'd like to call for another
 vote hoping for three PMC votes to make this release possible.

 ---

 With the recent - simultaneous - releases of Java Lucene 2.9.3 and 3.0.2,
 the PyLucene 2.9.3-1 and 3.0.2-1 releases closely tracking them are ready.

 Release candidates are available from:

    http://people.apache.org/~vajda/staging_area/

 A list of changes in this release can be seen at:
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_2_9/CHANGES
 http://svn.apache.org/repos/asf/lucene/pylucene/branches/pylucene_3_0/CHANGES

 All versions of PyLucene are now built with the same version of JCC,
 currently version 2.6, included in these release artifacts.

 A list of Lucene Java changes can be seen at:
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/CHANGES.txt
 http://svn.apache.org/repos/asf/lucene/java/branches/lucene_3_0/CHANGES.txt

 Please vote to release these artifacts as PyLucene 2.9.3-1 and 3.0.2-1.

 Thanks !

 Andi..

 ps: the KEYS file for PyLucene release signing is at:
    http://svn.apache.org/repos/asf/lucene/pylucene/dist/KEYS
    http://people.apache.org/~vajda/staging_area/KEYS

 pps: here is my +1



Re: [PMC] [DISCUSS] Lucy

2010-06-13 Thread Michael McCandless
Technically, it's clear that Lucy is taking an innovative and
well-thought-out approach, building a search engine that folds in
what's been learned from all the painful experiences of those before
it.  Marvin gets to chuckle whenever we have one of our massive back
compat discussions...

When it hits its first release it should be a real gem.

Further, there's no question that Marvin's closeness has
substantially strengthened Lucene (java).  The awesome amount of
cross-fertilization, discussing design tradeoffs, etc., has led to
sizable improvements in Lucene (like the switch to per-segment
searching).

That said, yes, Lucy doesn't have a large dev community.  And Lucy
doesn't have any users yet since it has no release (though KS's users
should count here, once Lucy releases).  There's unfortunately (for
both projects) not enough overlap in the dev communities of Lucene
(java) and Lucy.  And, Apache does now (for better or worse) strictly
insist on non-umbrella TLPs.

So net/net I'm +1 for Lucy to move to Apache incubator with the
eventual goal of a separate TLP.

Mike

On Sat, Jun 12, 2010 at 7:10 AM, Grant Ingersoll gsing...@apache.org wrote:
 It's been a while since we've taken a look at Lucy from a PMC standpoint, but 
 I think it is worth us reviewing once again.  And, while this isn't easy to 
 do because I very much value Marvin as a member of the Lucene community, I 
 think we need have a frank discussion about whether Lucy belongs as a Lucene 
 subproject, especially in light of recent Board concerns about Lucene's 
 umbrella status.

 Since the last email discussing Lucy, Marvin has been working on it, AFAICT, 
 which is a good thing.  I still, however, don't think it meets the community 
 standards of the ASF (see 
 http://incubator.apache.org/guides/graduation.html#subproject for instance).  
   For instance, there does not appear to be anyone else who has contributed 
 to it at any level beyond the occasional email here and there.  The last 
 email on the dev list from someone other than Marvin was someone announcing 
 KinoSearch on May 1st.  Before that, it was on April 6.  The last email on 
 the user list was from Marvin in November of 2009.  And, while Marvin 
 participates regularly on d...@l.a.o and we have had many cross pollination 
 talks, it does not, unfortunately, make for a community around Lucy.   There 
 also has yet to be a single release in its time here.  Even if there were an 
 attempt at a release, how many PMC members even follow Lucy enough that you 
 feel comfortable voting for a release?

 If this were a project coming from the Incubator to us via a graduation 
 vote, I would vote to not let it graduate.

 Finally, given that Lucy undoubtedly is a separate community (if it ever 
 exists) with separate goals from Lucene and that it is considered ASF best 
 practice for PMC's to not be umbrella projects, I think we should consider 
 either Lucy going into the Incubator with the goal of growing it's own 
 community and standing on it's own as a TLP in its own right (just as we 
 recommended for CLucene recently) or going to Google Code or some other such 
 hosting service where it will be free to make decisions on its future without 
 the hinderance of a PMC that isn't aligned with its needs and objectives, as 
 I believe is the current case with the Lucene PMC.

 -Grant







Re: [VOTE] #2 Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released

2010-06-12 Thread Michael McCandless
On Fri, Jun 11, 2010 at 11:58 AM, Uwe Schindler u...@thetaphi.de wrote:
 Hi all,

 It is not yet quite clear if we should release take2 or take1 of the
 artifacts. Both are on my people account, please vote:

 [1] Release
 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take2-r
 ev953716/ including LUCENE-2494 as Lucene 3.0.2 and 2.9.2. You only need to
 recheck the 3.0.2 artifacts (if you submitted a vote to the first call), as
 I only rebuilt the 3.0.2 ones. Lucene 2.9.3 has no Java 5 and keeps
 unchanged.

+1 for [1].  All Lucene in Action 2nd edition tests pass with these 3.0.2 JARs.

Mike


Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released

2010-06-11 Thread Michael McCandless
I would argue my 3 cases were borderline bugs -- they weren't just
pure perf improvements.

2135 acts like a mem leak, in that we retain [often very large] memory
for longer than we should.  2161 is a nasty choke point in NRT (getting
a new NRT reader syncs the old one thus blocking any searches, since
searches use sync'd methods like getNorms, I think).  2360 was a
regression: specifically, indexing small docs got slower due to the
fix from another issue.

That said, I don't think we need to be so strict (only bug fixes get
backported).  If someone has the itch/time/energy and is willing to
do the work for backport and the risk is low, back compat is
preserved, etc., I think it's great.

Mike

On Fri, Jun 11, 2010 at 1:04 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 : See 3.0.2: http://s.apache.org/6kf
 : vs. 3.0.1: http://s.apache.org/t5

 Ugh... ok, well I guess the precedent has already been set then.  Hope it
 doesn't bite us in the ass down the road.


 -Hoss




Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released

2010-06-08 Thread Michael McCandless
This looks like something new to me (doesn't ring a bell).

It looks odd -- the assertion that's tripping would seem to indicate
that a file that we are copying into a CFS file (after flushing) is
still changing while we are copying, which is not good.  All files
should be closed before we build the CFS.  Strange... was this just a
local hard drive / NTFS file system, Uwe?

I also can't repro, so far -- I have a while(1) stress test running on
OpenSolaris and Windows Server 2003, but no failures yet...

Can anyone else get this test to fail?

Mike

On Tue, Jun 8, 2010 at 7:54 AM, Uwe Schindler u...@thetaphi.de wrote:
 I ran the tests on my computer and with 2.9.3 I got a failure, which I
 cannot reproduce:

    [junit] Testsuite: org.apache.lucene.index.TestThreadedOptimize
    [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 9,017 sec
    [junit]
    [junit] - Standard Output ---
    [junit] Thread-45: hit exception
    [junit] java.lang.AssertionError
    [junit]     at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:195)
    [junit]     at org.apache.lucene.index.DocumentsWriter.createCompoundFile(DocumentsWriter.java:672)
    [junit]     at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4418)
    [junit]     at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4264)
    [junit]     at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4255)
    [junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2546)
    [junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2500)
    [junit]     at org.apache.lucene.index.TestThreadedOptimize$1.run(TestThreadedOptimize.java:92)
    [junit] -  ---
    [junit] Testcase: testThreadedOptimize(org.apache.lucene.index.TestThreadedOptimize): FAILED
    [junit] null
    [junit] junit.framework.AssertionFailedError
    [junit]     at org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:113)
    [junit]     at org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:154)
    [junit]     at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:221)
    [junit]
    [junit]
    [junit] Test org.apache.lucene.index.TestThreadedOptimize FAILED


 Maybe it's just the bug in the test we already know about; if so, we can
 proceed with releasing. It happened on JDK 1.4.2 when doing a test build of
 2.9.3-src.zip on my Windows machine.

 Mike, maybe it's an already fixed test-only bug (missing volatile on field
 in this test)?

 Uwe

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Uwe Schindler [mailto:u...@thetaphi.de]
 Sent: Monday, June 07, 2010 5:21 PM
 To: general@lucene.apache.org
 Subject: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be
 released

 Hi all,

 I have posted a release candidate for both Lucene Java 2.9.3 and 3.0.2
 (which
 both have the same bug fix level, functionality and release announcement),
 build from revision 951790 of the corresponding branches.
 Thanks for all your help! Please test them and give your votes, the
 scheduled
 release date for both versions is Friday, June 18th, 2010. Only votes from
 Lucene PMC are binding, but everyone is welcome to check the release
 candidate and voice their approval or disapproval. The vote passes if at
 least
 three binding +1 votes are cast.

 We planned the parallel release with one announcement because of their
 parallel development / bug fix level to emphasize that they are equal
 except
 deprecation removal and Java 5 since major version 3. I will post the
 possible
 release announcement soon for corrections.

 Artifacts can be found at:
 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/

 Changes:
 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-2.9.3/Changes.html
 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-2.9.3/Contrib-Changes.html

 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-3.0.2/Changes.html
 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/changes-3.0.2/Contrib-Changes.html

 Maven artifacts:
 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/maven/

 Happy testing!

 P.S.: I already tested the latest 3.0.2 artifacts with pangaea.de :-)

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de






Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released

2010-06-08 Thread Michael McCandless
Alas, I have 4 envs (Windows Server 2003, OpenSolaris 2009.06, CentOS
5.4, OS X 10.6.2, running stress tests for 4+ hours now, and I haven't
hit a single failure...

If nobody else can repro this, I think we should not hold up the release?

Mike

On Tue, Jun 8, 2010 at 8:44 AM, Uwe Schindler u...@thetaphi.de wrote:
 No idea, it's NTFS on Windows 7, 64 bit, JDK 1.4.2_19-32bit. The test now
 works, so I cannot reproduce it.

 No idea what we should do!

 -
 Uwe Schindler
 H.-H.-Meier-Allee 63, D-28213 Bremen
 http://www.thetaphi.de
 eMail: u...@thetaphi.de


 -Original Message-
 From: Michael McCandless [mailto:luc...@mikemccandless.com]
 Sent: Tuesday, June 08, 2010 2:36 PM
 To: general@lucene.apache.org
 Subject: Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be
 released

 This looks like something new to me (doesn't ring a bell).

 It looks odd -- the assertion that's tripping would seem to indicate that
 a file
 that we are copying into a CFS file (after flushing) is still changing
 while we
 are copying, which is not good.  All files should be closed before we
 build the
 CFS.  Strange... was this just a local hard drive / NTFS file system, Uwe?

 I also can't repro, so far -- I have a while(1) stress test running on
 OpenSolaris
 and Windows Server 2003, but no failures yet...

 Can anyone else get this test to fail?

 Mike

 On Tue, Jun 8, 2010 at 7:54 AM, Uwe Schindler u...@thetaphi.de wrote:
  I ran the tests on my computer and with 2.9.3 I got a failure, which I
  cannot reproduce:
 
     [junit] Testsuite: org.apache.lucene.index.TestThreadedOptimize
     [junit] Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 9,017 sec
     [junit]
     [junit] - Standard Output ---
     [junit] Thread-45: hit exception
     [junit] java.lang.AssertionError
     [junit]     at org.apache.lucene.index.CompoundFileWriter.close(CompoundFileWriter.java:195)
     [junit]     at org.apache.lucene.index.DocumentsWriter.createCompoundFile(DocumentsWriter.java:672)
     [junit]     at org.apache.lucene.index.IndexWriter.doFlushInternal(IndexWriter.java:4418)
     [junit]     at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:4264)
     [junit]     at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:4255)
     [junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2546)
     [junit]     at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:2500)
     [junit]     at org.apache.lucene.index.TestThreadedOptimize$1.run(TestThreadedOptimize.java:92)
     [junit] -  ---
     [junit] Testcase: testThreadedOptimize(org.apache.lucene.index.TestThreadedOptimize): FAILED
     [junit] null
     [junit] junit.framework.AssertionFailedError
     [junit]     at org.apache.lucene.index.TestThreadedOptimize.runTest(TestThreadedOptimize.java:113)
     [junit]     at org.apache.lucene.index.TestThreadedOptimize.testThreadedOptimize(TestThreadedOptimize.java:154)
     [junit]     at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:221)
     [junit]
     [junit]
     [junit] Test org.apache.lucene.index.TestThreadedOptimize FAILED
 
 
  Maybe it's just the bug in the test we already know about; if so, we
  can proceed with releasing. It happened in JDK 1.4.2 when doing a test
  build of 2.9.3-src.zip on my Windows machine.
 
  Mike, maybe it's an already fixed test-only bug (missing volatile on
  field in this test)?
 
  Uwe
 
  -
  Uwe Schindler
  H.-H.-Meier-Allee 63, D-28213 Bremen
  http://www.thetaphi.de
  eMail: u...@thetaphi.de
 
 
  -Original Message-
  From: Uwe Schindler [mailto:u...@thetaphi.de]
  Sent: Monday, June 07, 2010 5:21 PM
  To: general@lucene.apache.org
  Subject: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be
  released
 
  Hi all,
 
  I have posted a release candidate for both Lucene Java 2.9.3 and
  3.0.2
  (which
  both have the same bug fix level, functionality and release
  announcement), build from revision 951790 of the corresponding
 branches.
  Thanks for all your help! Please test them and give your votes, the
  scheduled
  release date for both versions is Friday, June 18th, 2010. Only votes
  from Lucene PMC are binding, but everyone is welcome to check the
  release candidate and voice their approval or disapproval. The vote
  passes if at
  least
  three binding +1 votes are cast.
 
  We planned the parallel release with one announcement because of
  their parallel development / bug fix level to emphasize that they are
  equal
  except
  deprecation removal and Java 5 since major version 3. I will post the
  possible
  release announcement soon for corrections.
 
  Artifacts can be found at:
  http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/
 
  Changes:
  http://people.apache.org/~uschindler/staging

Re: [VOTE] Apache Lucene Java 2.9.3 and 3.0.2 artifacts to be released

2010-06-07 Thread Michael McCandless
+1 to release.

ant test passes for both -src.tar.gz downloads, and .asc's check
out, and Lucene in Action 2nd Edition's tests all pass w/ 3.0.2
dropped in.

Mike

On Mon, Jun 7, 2010 at 4:32 PM, Andi Vajda va...@apache.org wrote:

 On Mon, 7 Jun 2010, Uwe Schindler wrote:

 I have posted a release candidate for both Lucene Java 2.9.3 and 3.0.2
 (which both have the same bug fix level, functionality and release
 announcement), build from revision 951790 of the corresponding branches.
 Thanks for all your help! Please test them and give your votes, the
 scheduled release date for both versions is Friday, June 18th, 2010. Only
 votes from Lucene PMC are binding, but everyone is welcome to check the
 release candidate and voice their approval or disapproval. The vote passes
 if at least three binding +1 votes are cast.

 We planned the parallel release with one announcement because of their
 parallel development / bug fix level to emphasize that they are equal
 except
 deprecation removal and Java 5 since major version 3. I will post the
 possible release announcement soon for corrections.

 Artifacts can be found at:

 http://people.apache.org/~uschindler/staging-area/lucene-2.9.3-3.0.2-take1-rev951790/

 PyLucene 2.9.3 and 3.0.2 built from their respective Lucene artifacts pass
 all tests.

 +1

 Andi..



Re: Welcome Uwe Schindler to the Lucene PMC

2010-04-01 Thread Michael McCandless
Welcome Uwe!!

Mike

On Thu, Apr 1, 2010 at 7:05 AM, Grant Ingersoll gsing...@apache.org wrote:
 I'm pleased to announce that the Lucene PMC has voted to add Uwe Schindler to 
 the PMC.  Uwe has been doing a lot of work in Lucene and Solr, including 
 several of the last releases in Lucene.

 Please join me in extending congratulations to Uwe!

 -Grant Ingersoll
 PMC Chair




Re: java.io.IOException: read past EOF

2010-03-24 Thread Michael McCandless
Your index is in serious trouble -- you have 2 segments_N files, both
of which are 0 length.

This won't be easy to recover (CheckIndex won't be able to).
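For readers hitting less drastic corruption, CheckIndex is the usual first
stop.  Below is a minimal sketch of running it programmatically; it assumes
the 2.4.x-era API where CheckIndex wraps a Directory and checkIndex()
returns a CheckIndex.Status (verify the exact signatures against the
javadocs of the release you use), and the helper class and path argument
are purely illustrative.  It cannot help in this particular case because
both segments_N files are zero length.

  import java.io.IOException;
  import org.apache.lucene.index.CheckIndex;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  // Hypothetical helper, not part of Lucene: walks every segment of an
  // index and reports whether it is intact.
  public class InspectIndex {
    public static void main(String[] args) throws IOException {
      Directory dir = FSDirectory.getDirectory(args[0]);  // path to the index
      CheckIndex checker = new CheckIndex(dir);
      checker.setInfoStream(System.out);                  // per-segment details
      CheckIndex.Status status = checker.checkIndex();
      if (status.clean) {
        System.out.println("Index is OK");
      } else {
        // fixIndex(status) would rewrite segments_N to drop the broken
        // segments (losing the documents they contain) -- back up first.
        System.out.println("Index has problems; see output above");
      }
      dir.close();
    }
  }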

Any idea how this happened?  Was this index created using 2.4.x?

Mike

On Tue, Mar 23, 2010 at 5:36 PM, Jean-Michel RAMSEYER
jm.ramse...@greenivory.com wrote:
 Hi there,

  I'm new to Lucene's world and I'm currently running into a problem with an index.
  I'm running Lucene 2.4.1 on a Linux server with a Sun JVM version
  1.6.0.17b04, in which the issue
  http://issues.apache.org/jira/browse/LUCENE-1282 is solved.
  I tried to open the indexes on another computer with Luke but it fails too.
  The segments* files are empty, so is there a way to rebuild the index from the
  cfs files? Is there a way to recover this index?
 Thank you for your answers.

 Exception trace :
 java.io.IOException: read past EOF
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
        at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
        at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:221)
        at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:95)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:653)
        at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:206)
        at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:47)

 ls -lah result :
 total 18G
 drwxr-xr-x   2 tomcat tomcat 4.0K 2010-03-22 16:29 .
 drwxr-xr-x 121 tomcat tomcat  12K 2010-03-23 14:22 ..
 -rw-r--r--   1 tomcat tomcat 1.9G 2010-03-20 13:57 _1gg2.cfs
 -rw-r--r--   1 tomcat tomcat 2.0G 2010-03-20 21:45 _1yhj.cfs
 -rw-r--r--   1 tomcat tomcat 1.9G 2010-03-21 04:16 _2gdz.cfs
 -rw-r--r--   1 tomcat tomcat 2.0G 2010-03-21 15:00 _2y9u.cfs
 -rw-r--r--   1 tomcat tomcat 2.0G 2010-03-22 03:21 _3ghg.cfs
 -rw-r--r--   1 tomcat tomcat 2.0G 2010-03-22 07:09 _3xty.cfs
 -rw-r--r--   1 tomcat tomcat 2.0G 2010-03-22 12:24 _4ekl.cfs
 -rw-r--r--   1 tomcat tomcat 192M 2010-03-22 13:25 _4gn2.cfs
 -rw-r--r--   1 tomcat tomcat 198M 2010-03-22 14:23 _4ief.cfs
 -rw-r--r--   1 tomcat tomcat 195M 2010-03-22 15:14 _4kbm.cfs
 -rw-r--r--   1 tomcat tomcat  21M 2010-03-22 15:18 _4kil.cfs
 -rw-r--r--   1 tomcat tomcat  23M 2010-03-22 15:22 _4kop.cfs
 -rw-r--r--   1 tomcat tomcat  22M 2010-03-22 15:27 _4ku0.cfs
 -rw-r--r--   1 tomcat tomcat  25M 2010-03-22 15:31 _4kzb.cfs
 -rw-r--r--   1 tomcat tomcat  21M 2010-03-22 15:36 _4l56.cfs
 -rw-r--r--   1 tomcat tomcat 1.9M 2010-03-22 15:36 _4l5r.cfs
 -rw-r--r--   1 tomcat tomcat 2.0M 2010-03-22 15:37 _4l6c.cfs
 -rw-r--r--   1 tomcat tomcat 165K 2010-03-22 15:37 _4l6d.cfs
 -rw-r--r--   1 tomcat tomcat  58K 2010-03-22 15:37 _4l6e.cfs
 -rw-r--r--   1 tomcat tomcat  80K 2010-03-22 15:37 _4l6f.cfs
 -rw-r--r--   1 tomcat tomcat 149K 2010-03-22 15:37 _4l6g.cfs
 -rw-r--r--   1 tomcat tomcat 218K 2010-03-22 15:37 _4l6h.cfs
 -rw-r--r--   1 tomcat tomcat 198K 2010-03-22 15:37 _4l6i.cfs
 -rw-r--r--   1 tomcat tomcat  45K 2010-03-22 15:37 _4l6j.cfs
 -rw-r--r--   1 tomcat tomcat  58K 2010-03-22 15:37 _4l6k.cfs
 -rw-r--r--   1 tomcat tomcat 158K 2010-03-22 15:37 _4l6l.cfs
 -rw-r--r--   1 tomcat tomcat 116K 2010-03-22 15:37 _4l6m.cfs
 -rw-r--r--   1 tomcat tomcat 1.1M 2010-03-22 15:37 _4l6n.cfs
 -rw-r--r--   1 tomcat tomcat 128K 2010-03-22 15:37 _4l6o.cfs
 -rw-r--r--   1 tomcat tomcat 1.9G 2010-03-20 04:12 _hnt.cfs
 -rw-r--r--   1 tomcat tomcat    0 2010-03-22 15:37 segments_44o3
 -rw-r--r--   1 tomcat tomcat    0 2010-03-22 15:37 segments_44o4
 -rw-r--r--   1 tomcat tomcat    0 2010-03-22 15:37 segments.gen
 -rw-r--r--   1 tomcat tomcat 1.9G 2010-03-20 07:52 _ywu.cfs




Re: java.io.IOException: read past EOF

2010-03-24 Thread Michael McCandless
It can be tricky; eg if segments share doc stores, I think you
can't always recover that.

But this index seems not to have separate doc stores (no *.cfx), so, I
think in theory one could regenerate the segment metadata
(SegmentInfo) from the index files, but I don't know that anyone has
created this yet.

Also, it could in general result in re-attaching segments that had
been merged away (ie, causing duplicates in the index).

Mike

On Wed, Mar 24, 2010 at 2:39 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 The documentation (
 http://lucene.apache.org/java/2_4_0/fileformats.html#File%20Naming) makes it
 seem that the cfs files could be used to recover most of the information
 from the index.  Is that not so?


 On Tue, Mar 23, 2010 at 11:30 PM, Michael McCandless 
 luc...@mikemccandless.com wrote:

 Your index is in serious trouble -- you have 2 segments_N files, both
 of which are 0 length.

 This won't be easy to recover (CheckIndex won't be able to).




Re: Less drastic ways

2010-03-15 Thread Michael McCandless
On Sun, Mar 14, 2010 at 4:29 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:

 Even if we merge Lucene/Solr and we treat Solr as just another
 Lucene contrib/module, say, contributors who care only about Solr
 will still patch against Solr and Lucene developers or those people
 who have the itch for that functionality being in Lucene, too, will
 still have to poach/refactor and pull that functionality in Lucene
 later on.

Yes, people with their respective itches can still create Solr-only
and Lucene-only functions, after the merge.  We should not block any
feature from going in solely because it's not factored so that both
Lucene & Solr can use it.

But, no, poaching is no longer needed with merged dev -- we are free
to efficiently refactor at that point.  Merged, we don't need to have
full copies of the code in two projects, await releases to de-dup,
etc. -- code can just freely move back and forth within the project.
It's also more likely that someone wearing a Lucene hat will see the
Solr work going on and jump in and help to make it work in Lucene.

Merged dev makes refactoring much more efficient than poaching across
project lines.  Both achieve the same goals with time, it's just that
poaching is a much slower/more wasteful way to achieve it... (but of
course is the only option for disparate projects, eg, pulling stuff
from Nutch down into Lucene).

 Whether Solr is a separate project or a Lucene
 contrib/module that has its own user (and contributor) community
 that is not tightly integrated with Lucene's -dev community, the
 same thing will happen, no?

True, but much less efficiently (if we can only poach across project
lines).

 Maybe it will help if we made things visual for us visual peeps.  Is
 this, roughly, what the plan is:

 trunk/
lucene-core/
modules/
analysis/
wordnet/
spellchecker/
whatever/
...
facets/
...
functions/
solr/
dih/
...

I honestly don't know what module structure we'll come up with!  It's
still TBD.

But this looks like a good start :)

I think we'd also have a queryparser module (we have like 7 of them,
according to Robert ;), and a queries module (I'd think functions would
live inside there).

Mike


Re: Less drastic ways

2010-03-14 Thread Michael McCandless
 Hm, again I'm confused.  If this is how it worked in Solr/Lucene
 land, then there wouldn't be pieces in Solr that we now want to
 refactor and move into Lucene core or modules.  A list of about 4-5
 such pieces of functionality in Solr has already been listed.
 That's really my main question.  Why were/can't things be committed
 to the appropriate place?  Why were they committed to Solr?

Pre-merge:

If someone wants new functionality in Solr they should be free to
create a patch to make it work well, in Solr, alone.

To expect them to also factor it so that it works well for Lucene-only
users is wrong.  They should not need to, nor be expected to, and they
shouldn't feel bad about not having factored it that way.  They use Solr,
they need it working in Solr, that was their itch, and they scratched it;
net/net that was a great step forward for Solr.  We
should not up and reject contributions because they are not well
factored for the two projects.  Beggars can't be choosers...

Someone who later has the itch for this functionality in Lucene should
then be fully free to pick it up, refactor, and make it work in Lucene
alone, by poaching it (pulling it into Lucene).

Poaching is a natural way for code to be pulled across projects... and
while in the short term it'd result in code dup, in the long term this
is how refactoring can happen across projects.  It's completely normal
and fine, in my opinion.

But poaching, while effective, is slow ... Lucene would poach, have
to stabilize & do a release, and Solr would have to upgrade and then fix
its code to cut over to Lucene's sources (assuming the sources hadn't
diverged too much, else Solr would have to wait for Lucene's next
release, etc.)

And we have *a lot* of modules to refactor here, between Solr and
Lucene.

So for these two reasons I vote for merging Solr/Lucene dev over gobs
of poaching.  That gives us complete freedom to quickly move the code
around.

Poaching should still be perfectly fine for other cases, like pulling
analyzers from Nutch, from other projects, etc.

Mike


Re: [VOTE] merge lucene/solr development (take 3)

2010-03-09 Thread Michael McCandless
On Tue, Mar 9, 2010 at 5:10 AM, Andrzej Bialecki a...@getopt.org wrote:

 Re: Nutch components - those that are reusable in Lucene or Solr
 contexts eventually find their way to respective projects, witness
 e.g. CommonGrams.

In fact I think this is a great example -- as far as I can tell,
CommonGrams was poached from Nutch, into Solr, and then was
nurtured/improved in both projects separately right?

So can/should we freely poach across all our sub projects?

It has obvious downsides (it's essentially a fork that will confuse
those users that use both Solr & Lucene, in the short term, until
things stabilize into a clean refactoring; it's double the dev; we
must re-sync with time; etc.).

But it has a massive upside: it means we don't rely only on push
(Solr devs pushing into Lucene or vice versa).  We can also use pull
(Lucene devs can pull pieces from Nutch/Solr into Lucene).  It becomes
a 2-way street for properly factoring our shared code with time.

If we had that freedom (poaching is perfectly fine), then,
interested devs could freely refactor across sub projects.

Not having this freedom today, and not having merged dev, is stunting
both Solr & Lucene's growth.

Mike


Re: [VOTE] merge lucene/solr development (take 3)

2010-03-09 Thread Michael McCandless
I'm still +1 for merging Solr/Lucene dev.

I think poaching, when we have so much that needs to be shared, is
going to cause far more problems than it'll solve.  It's not the right
tool for [this] job.

I do think poaching is a good & legitimate tool when it's less code (eg
the CommonGrams case), so, we should do both ;)

Mike

On Tue, Mar 9, 2010 at 8:49 AM, Grant Ingersoll gsing...@apache.org wrote:

 On Mar 9, 2010, at 8:21 AM, Michael McCandless wrote:

 On Tue, Mar 9, 2010 at 7:21 AM, Grant Ingersoll gsing...@apache.org wrote:

 If we had that freedom (poaching is perfectly fine), then,
 interested devs could freely refactor across sub projects.

 As someone who works on both, I don't think it is fine.  Just look at the 
 function query mess.  Just look at the version mess.  It's very frustrating 
 as a developer and it makes me choose between two projects that I happen to 
 like equally, but for different reasons.  If I worked on Nutch, I would 
 feel the same way.

 But... Lucene should poach from external (eg non-Apache) projects, if
 the license works?

 Ie if some great analyzer is out there, and Robert spots it, and the
 license works, we should poach it?  (In fact he just did this w/
 Andrzej's Polish stemmer ;) ).

 I'd prefer donate to poach, but, realize that isn't always the case.



 So we have something of a double standard...

 And, ironically, I think it's the fact that there's so much committer
 overlap between Solr and Lucene that is causing this antagonism
 towards poaching.

 When in fact I think poaching, at a wider scale (across unrelated
 projects) is a very useful means for any healthy open source software
 to evolve.

 Why should Lucene be prevented from having a useful feature just
 because Solr happened to create it first?

 But why should I be forced to maintain two versions due to some arbitrary 
 code separation?  And why should you force a good chunk of us to do a whole 
 lot of extra work simply because of some arbitrary code separation?  Here, it 
 is the Lucene PMC that releases code and it is just silly that with all of 
 this overlap at the committer level we still have this duplication.   I can't 
 speak for the external projects (I don't believe any of them have even 
 responded here other than Jackrabbit), but if they don't like it, they should 
 get more involved in the community and work to be committers.

 At any rate, this is exactly why merging makes sense.  You would no longer 
 have this issue of first.  I would no longer have to choose where to add my 
 spatial work based on some arbitrary line that someone drew in the sand that 
 isn't all that pertinent anymore given the desires of most in the community 
 to blur that line.  It would be available to everyone.

 For that matter, why do we even need to have this discussion at all?  Most of 
 us Solr committers are Lucene committers.  We can simply start committing 
 Solr code to Lucene such that in 6 months the whole discussion is moot and 
 the three committers on Solr who aren't Lucene committers can earn their 
 Lucene merit very quickly by patching the Solr portion of Lucene.  We can 
 move all the code to it's appropriate place, add a contrib module for the WAR 
 stuff and the response writers and voila, Solr is in Lucene, the dev mailing 
 lists have merged by the fact that Solr dev would be defunct and all of the 
 proposals in this vote are implemented simply by employing our commit 
 privileges in a concerted way.  Yet, somehow, me thinks that isn't a good 
 solution either, right?  Yet it is perfectly legal and is just as valid a 
 solution as the poaching solution and in a lot of ways seems to be what 
 Chris is proposing.

 -Grant









Re: Composing posts for both JIRA and email (was a JIRA post)

2010-03-05 Thread Michael McCandless
Great guidelines Marvin!

I agree w/ most of this, except, I do use Jira's markup (bq., {quote})
when adding comments.  I'm torn between how important the first read
(via the email Jira sends) is vs. the read after I click through to the
issue.  Typically I just click through to the issue unless it's a
smallish comment.

I don't get why Jira can't support email markup (>, >> means nested
levels of quoting) in addition to its own... maybe they are gunning
for some kind of lock-in of their users.  EG I've seen people respond
to normal email threads, but quoting using bq.!

Sometimes I compose with an external editor (in emacs, which wraps)
sometimes directly in the browser.  It's All Text plugin sounds neat
-- what does it gain over simple copy/paste out of your editor?

I can't stand that gmail doesn't do the right thing w/ line wrapping
outgoing email, though -- when I quote a message (like below), the
addition of the >'s causes already wrapped text to be further wrapped,
thus looking hideous (you should see examples below).

And yes I hate that the first line under {code} has no indentation.
Silly.  Sounds like we just need a Jira upgrade @ Apache to fix that
one...

Mike

On Thu, Mar 4, 2010 at 12:28 PM, Marvin Humphrey mar...@rectangular.com wrote:
 (CC to lucy-dev and general, reply-to set to general)

 On Thu, Mar 04, 2010 at 06:18:28AM +, Shai Erera (JIRA) wrote:

 (Warning, this post is long, and is easier to read in JIRA)

 I consume email from many of the Lucene lists, and I hate it when people force
 me to read stuff via JIRA.  It slows me down to have to jump to all those
 forum web pages.  I only go the web page if there are 5 or more posts in a row
 on the same issue that I need to read.

 For what it's worth, I've worked out a few routines that make it possible to
 compose messages which read well in both mediums.

  * Never edit your posts unless absolutely necessary.  If JIRA used diffs,
    things would be different, but instead it sends the whole frikkin' post
    twice (before and after), which makes it very difficult to see what was
    edited.  If you must edit, append an edited: block at the end to
    describe what you changed instead of just making changes inline.
  * Use FireFox and the It's All Text plugin, which makes it possible to edit
    JIRA posts using an external editor such as Vim instead of typing into a
    textarea. http://trac.gerf.org/itsalltext
  * After editing, use the preview button (it's a little monitor icon to the
    upper right of the textarea) to make sure the post looks good in JIRA.
  * Use > for quoting instead of JIRA's bq. and {quote} since JIRA's
    mechanisms look so crappy in email.  This is easy from Vim, because
    rewrapping a long line (by typing gq from visual mode to rewrap the
    current selection) that starts with > causes > to be prepended to
    the wrapped lines.
  * Use asterisk bullet lists liberally, because they look good everywhere.
  * Use asterisks for *emphasis*, because that looks good everywhere.
  * If you wrap lines, use a reasonably short line length.  (I use 78; Mike
    McCandless, who also wraps lines for his Jira posts, uses a smaller
    number).  Otherwise you'll get nasty wrapping in narrow windows, both in
    email clients and web browsers.

 There are still a couple compromises that don't work out well.  For email,
 ideally you want to set off code blocks with indenting:

    int foo = 1;
    int bar = 2;

 To make code look decent in JIRA, you have to wrap that with {code} tags,
 which unfortunately look heinous in email.  Left-justifying the tags but
 indenting the code seems like it would be a rotten-but-salvageable compromise,
 as it at least sets off the tags visually rather than making them appear as
 though they are part of the code fragment.

 {code}
    int foo = 1;
    int bar = 2;
 {code}

 Unfortunately, that's going to look like this in JIRA, because of a bug that
 strips all leading whitespace from the first line.

   |-|
   | int foo;                |
   |     int bar;            |
   |-|

 It seems that this has been fixed by Atlassian in the Confluence wiki
 (http://jira.atlassian.com/browse/CONF-4548), but the issue remains for the
 JIRA installation at issues.apache.org.  So for now, I manually strip
 indentation until the whole block is flush left.

 {code}
 int foo = 1;
 int bar = 2;
 {code}

 (Gag.  I vastly prefer wikis that automatically apply fixed-width styling to
 any indented text.)

 One last tip for Lucy developers (and other non-Java devs).  JIRA has limited
 syntax highlighting support -- Java, JavaScript, ActionScript, XML and SQL
 only -- and defaults to assuming your code is Java.  In general, you want to
 override that and tell JIRA to use none.

 {code:none}
 int foo = 1;
 int bar = 2;
 {code}

 Marvin Humphrey




Re: [VOTE] merge lucene/solr development

2010-03-04 Thread Michael McCandless
On Thu, Mar 4, 2010 at 12:41 PM, Chris Hostetter
hossman_luc...@fucit.org wrote:

 Why don't we just start by attempting to have a common dev list and
 merging committers, in the hopes that it will promote better
 communication about features up and down the stack, and better bug
 fixing/refactoring/modularization -- then see if that leads us to a
 point where it makes sense to more tightly couple the build systems
 and releases?

I'm all for being iterative / taking baby steps, when the problem
naturally can be solved that way -- progress not perfection.  Many
problems decompose like this.

But I don't think this problem does.

In particular, how would the above baby step address the code
duplication (my goal in the original opening) -- eg the 3 places where
concrete analyzers/queries are today.  How would it lead to making
facets work with pure Lucene?  To developing spatial in one place?

Mike


[VOTE] Merge the development of Solr/Lucene (take 2)

2010-03-04 Thread Michael McCandless
A new vote, that slightly changes proposal from last vote (adding only
that Lucene can cut a release even if Solr doesn't):

 * Merging the dev lists into a single list.

 * Merging committers.

 * When any change is committed (to a module that belongs to Solr or
   to Lucene), all tests must pass.

 * Release details will be decided by dev community, but, Lucene may
   release without Solr.

  * Modularize the sources: pull things out of Lucene's core (break
    out query parser, move all core queries & analyzers under their
   contrib counterparts), pull things out of Solr's core (analyzers,
   queries).

These things would not change:

 * Besides modularizing (above), the source code would remain factored
   into separate dirs/modules the way it is now.

 * Issue tracking remains separate (SOLR-XXX and LUCENE-XXX
   issues).

 * User's lists remain separate.

 * Web sites remain separate.

 * Release artifacts/jars remain separate.

Mike


Re: [VOTE] Merge the development of Solr/Lucene (take 2)

2010-03-04 Thread Michael McCandless
I forgot my vote: +1

Mike

On Thu, Mar 4, 2010 at 4:33 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 A new vote, that slightly changes proposal from last vote (adding only
 that Lucene can cut a release even if Solr doesn't):

  * Merging the dev lists into a single list.

  * Merging committers.

  * When any change is committed (to a module that belongs to Solr or
   to Lucene), all tests must pass.

  * Release details will be decided by dev community, but, Lucene may
   release without Solr.

   * Modularize the sources: pull things out of Lucene's core (break
    out query parser, move all core queries & analyzers under their
   contrib counterparts), pull things out of Solr's core (analyzers,
   queries).

 These things would not change:

  * Besides modularizing (above), the source code would remain factored
   into separate dirs/modules the way it is now.

  * Issue tracking remains separate (SOLR-XXX and LUCENE-XXX
   issues).

  * User's lists remain separate.

  * Web sites remain separate.

  * Release artifacts/jars remain separate.

 Mike



Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-03-01 Thread Michael McCandless
If we don't somehow first address the code duplication across the 2
projects, making Solr a TLP will make things worse.

I started here with analysis because I think that's the biggest pain
point: it seemed like an obvious first step to fixing the code
duplication and thus the most likely to reach some consensus.  And
it's also very timely: Robert is right now making all kinds of great
fixes to our collective analyzers (in between bouts of fuzzy DFA
debugging).

But it goes beyond analyzers: I'd like to see other modules, now in
Solr, eventually moved to Lucene, because they really are core
functionality (eg facets, function (and other?) queries, spatial,
maybe improvements to spellchecker/highlighter).  How can we do this?

And how can we do this so that it lasts over time?  If new cool
core things are born in Solr-land (which of course happens a lot --
lots of good healthy usage), how will they find their way back to
Lucene?

Yonik's proposal (merging development of Solr/Lucene, but keeping all
else separate) would achieve this.

If we do the opposite (Solr -> TLP), how could we possibly achieve
this?

I guess one possibility is to just suck it up and duplicate the code.
Meaning, each project will have to manually merge fixes in from the
other project (so long as there's someone around with the itch to do
so).  Lucene would copy in all of Solr's analysis, and vice-versa (and
likewise other dup'd functionality).  I really dislike this
solution... it will confuse the daylights out of users, it's error
prone, it's a waste of dev effort, there will always be little
differences... but maybe it is in fact the lesser evil?

I would much prefer merging Solr/Lucene development...

Mike

On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Grant,

 On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote:

 Hi Robert,

 I think my proposal (Solr-TLP) is sort of orthogonal to the whole analyzers
 issue - I was in favor, at the very least, of having a separate
 module/project/whatever that both Solr/Lucene (and whatever project) can
 depend on for the shared analyzer code...

 Not really.  They are intimately linked.

 Ummm, how so? Making project A called Apache Super Analyzers and then
 making Lucene(-java) and Solr depend on Apache Super Analyzers is separate
 of whether or not Lucene(-java) and Solr are TLPs or not...

 Cheers,
 Chris





 Cheers,
 Chris



 On 3/1/10 9:12 AM, Robert Muir rcm...@gmail.com wrote:

 this will make the analyzers duplication problem even worse

 On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Mark,

 Thanks for your message. I respect your viewpoint, but I respectfully
 disagree. It just seems (to me at least based on the discussion) like a TLP
 for Solr is the way to go.

 Cheers,
 Chris



 On 3/1/10 8:54 AM, Mark Miller markrmil...@gmail.com wrote:

 On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote:
 Hi Mark,


 That would really be no real world change from how things work today.
 The fact
 is, today, Solr already operates essentially as an independent project.

 Well if that's the case, then it would lead me to think that it's more of
 a
 TLP more than anything else per best practices.

 That depends. It could be argued it should be a top level project or
 that it should be closer to the Lucene project. Some people are arguing
 for both approaches right now. There are two directions we could move in.

 The only real difference is that it shares the same PMC with Lucene now
 and
 wouldn't with this change. This would address none of the issues that
 triggered
 the idea for a possible merge.

 I don't agree -- you're looking to bring together two communities that
 are
 fairly separate as you put it. The separation likely didn't spring up
 over
 night and has been this way for a while (as least to my knowledge). This
 is
 exactly the type of situation that typically leads to TLP creation from
 what
 I've seen.

 It also causes negatives between Solr/Lucene that some are looking to
 address. Hence the birth of this proposal. Going TLP with Solr will only
 aggravate those negatives, not help them.

 While the communities operate fairly separately at the moment, the
 people in the communities are not so separate. The committer list has
 huge overlap. Many committers on one project but not the other do a lot
 of work on both projects.

 There is already a strong link with the personal - merging the
 management of the projects addresses many of the concerns that have
 prompted this discussion. TLP'ing Solr only makes those concerns
 multiply. They would diverge further, and incompatible overlap between
 them would increase.

 Cheers,
 Chris






 On 03/01/2010 10:04 AM, Mattmann, Chris A (388J) wrote:

 Hey Grant,

 I'd like to explore this   does this imply that the Lucene
 sub-projects will
 go away and Lucene will turn into Lucene-java and maintain its Apache
 TLP,
 and then 

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-03-01 Thread Michael McCandless
On Mon, Mar 1, 2010 at 12:58 PM, Marvin Humphrey mar...@rectangular.com wrote:
 On Mon, Mar 01, 2010 at 12:44:02PM -0500, Michael McCandless wrote:

 But it goes beyond analyzers: I'd like to see other modules, now in
 Solr, eventually moved to Lucene, because they really are core
 functionality (eg facets, function (and other?) queries, spatial,
 maybe improvements to spellchecker/highlighter).

 I disagree.  Those don't belong in core, and though they are all
 great features, adding them to core constitutes bloat, IMO.

 The Query class belongs in core.  All those other modules should be
 distributed as plugins, which could be used by Solr, Katta, Lucene,
 whatever.

 Note that this is orthogonal to whether Solr and Lucene merge or
 diverge.

I agree with this (sorry I wasn't clear).

By "core functionality" I mean it should be a separate module (plugin)
that direct Lucene users can use, not that whenever you install core
Lucene you get these functions.

Ie, users shouldn't have to install Solr to use facets with Lucene.

Mike


Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-03-01 Thread Michael McCandless
Because the code dup with analyzers is only one of the problems to
solve.  In fact, it's the easiest of the problems to solve (that's why
I proposed it, only, first).

A more differentiating example is a much less mature module...

EG take spatial -- if Solr were its own TLP, how could spatial be
built out in a way that we don't waste effort, and so that both direct
Lucene and Solr users could use it when it's released?

Mike

On Mon, Mar 1, 2010 at 1:07 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Mike,

 I'm not sure I follow this line of thinking: how would Solr being a TLP 
 affect the creation of a separate project/module for Analyzers any more so 
 than it not being a TLP? Both Lucene-java and Solr (as a TLP) could depend on 
 the newly created refactored Analysis project.

 Chris



 On 3/1/10 10:44 AM, Michael McCandless luc...@mikemccandless.com wrote:

 If we don't somehow first address the code duplication across the 2
 projects, making Solr a TLP will make things worse.

 I started here with analysis because I think that's the biggest pain
 point: it seemed like an obvious first step to fixing the code
 duplication and thus the most likely to reach some consensus.  And
 it's also very timely: Robert is right now making all kinds of great
 fixes to our collective analyzers (in between bouts of fuzzy DFA
 debugging).

 But it goes beyond analyzers: I'd like to see other modules, now in
 Solr, eventually moved to Lucene, because they really are core
 functionality (eg facets, function (and other?) queries, spatial,
 maybe improvements to spellchecker/highlighter).  How can we do this?

 And how can we do this so that it lasts over time?  If new cool
 core things are born in Solr-land (which of course happens alot --
 lots of good healthy usage), how will they find their way back to
 Lucene?

 Yonik's proposal (merging development of Solr/Lucene, but keeping all
 else separate) would achieve this.

 If we do the opposite (Solr - TLP), how could we possibly achieve
 this?

 I guess one possibility is to just suck it up and duplicate the code.
 Meaning, each project will have to manually merge fixes in from the
 other project (so long as there's someone around with the itch to do
 so).  Lucene would copy in all of Solr's analysis, and vice-versa (and
 likewise other dup'd functionality).  I really dislike this
 solution... it will confuse the daylights out of users, its error
 proned, it's a waste of dev effort, there will always be little
 differences... but maybe it is in fact the lesser evil?

 I would much prefer merging Solr/Lucene development...

 Mike

 On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Grant,

 On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote:

 Hi Robert,

 I think my proposal (Solr-TLP) is sort of orthogonal to the whole 
 analyzers
 issue - I was in favor, at the very least, of having a separate
 module/project/whatever that both Solr/Lucene (and whatever project) can
 depend on for the shared analyzer code...

 Not really.  They are intimately linked.

 Ummm, how so? Making project A called Apache Super Analyzers and then
 making Lucene(-java) and Solr depend on Apache Super Analyzers is separate
 of whether or not Lucene(-java) and Solr are TLPs or not...

 Cheers,
 Chris





 Cheers,
 Chris



 On 3/1/10 9:12 AM, Robert Muir rcm...@gmail.com wrote:

 this will make the analyzers duplication problem even worse

 On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Mark,

 Thanks for your message. I respect your viewpoint, but I respectfully
 disagree. It just seems (to me at least based on the discussion) like a 
 TLP
 for Solr is the way to go.

 Cheers,
 Chris



 On 3/1/10 8:54 AM, Mark Miller markrmil...@gmail.com wrote:

 On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote:
 Hi Mark,


 That would really be no real world change from how things work today.
 The fact
 is, today, Solr already operates essentially as an independent project.

 Well if that's the case, then it would lead me to think that it's more of
 a
 TLP more than anything else per best practices.

 That depends. It could be argued it should be a top level project or
 that it should be closer to the Lucene project. Some people are arguing
 for both approaches right now. There are two directions we could move in.

 The only real difference is that it shares the same PMC with Lucene now
 and
 wouldn't with this change. This would address none of the issues that
 triggered
 the idea for a possible merge.

 I don't agree -- you're looking to bring together two communities that
 are
 fairly separate as you put it. The separation likely didn't spring up
 over
 night and has been this way for a while (as least to my knowledge). This
 is
 exactly the type of situation that typically leads to TLP creation from
 what
 I've seen.

 It also causes negatives between Solr/Lucene

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-03-01 Thread Michael McCandless
Also, there still seems to be a misconception about what's being
proposed here.

The proposal is to synchronize the development of Solr and Lucene.
Ie, a single dev list, single set of committers, synchronized
releases.

Everything else remains the same.  EG the release artifacts, user's
lists, web sites, branding, all remain separate.

How the source code is modularized is an orthogonal question.  We've
discussed breaking things out of Lucene's core, like the query parser,
queries, and analyzers, into their own modules (and shipping their own
artifacts), which I still think makes great sense.  But it's
independent of synchronizing our development.

Mike

On Mon, Mar 1, 2010 at 1:03 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Mon, Mar 1, 2010 at 12:58 PM, Marvin Humphrey mar...@rectangular.com 
 wrote:
 On Mon, Mar 01, 2010 at 12:44:02PM -0500, Michael McCandless wrote:

 But it goes beyond analyzers: I'd like to see other modules, now in
 Solr, eventually moved to Lucene, because they really are core
 functionality (eg facets, function (and other?) queries, spatial,
 maybe improvements to spellchecker/highlighter).

 I disagree.  Those don't belong in core, and though they are all
 great features, adding them to core constitutes bloat, IMO.

 The Query class belongs in core.  All those other modules should be
 distributed as plugins, which could be used by Solr, Katta, Lucene,
 whatever.

 Note that this is orthogonal to whether Solr and Lucene merge or
 diverge.

 I agree with this (sorry I wasn't clear).

 By core functionality I mean it should be a separate module (plugin)
 that direct Lucene users can use, not whenever you install core
 Lucene you get these functions.

 Ie, users shouldn't have to install Solr to use facets with Lucene.

 Mike



Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-03-01 Thread Michael McCandless
This looks great!

But, the goal is to make a standalone toolkit exposing GIS functions,
right?

My original question (integrating this into Lucene/Solr) remains.

EG there's a lot of good work happening now in Solr to make spatial
search available.  How will that find its way back to Lucene?  Lucene
has its own (now duplicate) spatial package that was already
developed.  Users will now be confused about the two, each will have
different bugs/features, etc.

If we had shared development then the ongoing effort would result in a
spatial package that direct Lucene users and Solr users would be able
to use.

Mike

On Mon, Mar 1, 2010 at 1:28 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 I'm glad that you brought that up! :)

 Check out:

 http://incubator.apache.org/projects/sis.html

 We're just starting to tackle that very issue right 
 now...patches/ideas/contributions welcome.

 Cheers,
 Chris



 On 3/1/10 11:25 AM, Michael McCandless luc...@mikemccandless.com wrote:

 Because the code dup with analyzers is only one of the problems to
 solve.  In fact, it's the easiest of the problems to solve (that's why
 I proposed it, only, first).

 A more differentiating example is a much less mature module

 EG take spatial -- if Solr were its own TLP, how could spatial be
 built out in a way that we don't waste effort, and so that both direct
 Lucene and Solr users could use it when it's released?

 Mike

 On Mon, Mar 1, 2010 at 1:07 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Mike,

 I'm not sure I follow this line of thinking: how would Solr being a TLP 
 affect the creation of a separate project/module for Analyzers any more so 
 than it not being a TLP? Both Lucene-java and Solr (as a TLP) could depend 
 on the newly created refactored Analysis project.

 Chris



 On 3/1/10 10:44 AM, Michael McCandless luc...@mikemccandless.com wrote:

 If we don't somehow first address the code duplication across the 2
 projects, making Solr a TLP will make things worse.

 I started here with analysis because I think that's the biggest pain
 point: it seemed like an obvious first step to fixing the code
 duplication and thus the most likely to reach some consensus.  And
 it's also very timely: Robert is right now making all kinds of great
 fixes to our collective analyzers (in between bouts of fuzzy DFA
 debugging).

 But it goes beyond analyzers: I'd like to see other modules, now in
 Solr, eventually moved to Lucene, because they really are core
 functionality (eg facets, function (and other?) queries, spatial,
 maybe improvements to spellchecker/highlighter).  How can we do this?

 And how can we do this so that it lasts over time?  If new cool
 core things are born in Solr-land (which of course happens alot --
 lots of good healthy usage), how will they find their way back to
 Lucene?

 Yonik's proposal (merging development of Solr/Lucene, but keeping all
 else separate) would achieve this.

 If we do the opposite (Solr - TLP), how could we possibly achieve
 this?

 I guess one possibility is to just suck it up and duplicate the code.
 Meaning, each project will have to manually merge fixes in from the
 other project (so long as there's someone around with the itch to do
 so).  Lucene would copy in all of Solr's analysis, and vice-versa (and
 likewise other dup'd functionality).  I really dislike this
 solution... it will confuse the daylights out of users, its error
 proned, it's a waste of dev effort, there will always be little
 differences... but maybe it is in fact the lesser evil?

 I would much prefer merging Solr/Lucene development...

 Mike

 On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov wrote:
 Hi Grant,

 On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote:

 Hi Robert,

 I think my proposal (Solr-TLP) is sort of orthogonal to the whole 
 analyzers
 issue - I was in favor, at the very least, of having a separate
 module/project/whatever that both Solr/Lucene (and whatever project) can
 depend on for the shared analyzer code...

 Not really.  They are intimately linked.

 Ummm, how so? Making project A called Apache Super Analyzers and then
 making Lucene(-java) and Solr depend on Apache Super Analyzers is separate
 of whether or not Lucene(-java) and Solr are TLPs or not...

 Cheers,
 Chris





 Cheers,
 Chris



 On 3/1/10 9:12 AM, Robert Muir rcm...@gmail.com wrote:

 this will make the analyzers duplication problem even worse

 On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Mark,

 Thanks for your message. I respect your viewpoint, but I respectfully
 disagree. It just seems (to me at least based on the discussion) like a 
 TLP
 for Solr is the way to go.

 Cheers,
 Chris



 On 3/1/10 8:54 AM, Mark Miller markrmil...@gmail.com wrote:

 On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote:
 Hi Mark,


 That would really be no real world change from how

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-03-01 Thread Michael McCandless
The possibility of slowing down releases is the only real concern I
also share...

But, I think release frequency is largely a matter of discipline :)

But, digging into it, I think as long as the project keeps a stable
trunk (something Lucene has always tried to do -- does Solr?)... then
release frequency is really a matter of discipline.

I mean in Lucene we keep saying we want faster releases, but why
doesn't it happen?  Couldn't we have done 2X as many releases in the
past few years?  Did we really want to release more frequently?

If we really want to take it seriously I think we should have someone
unofficially be the next release czar.  As soon as a release is
finished, this czar is responsible for roughly planning the next one.
This means making a tentative schedule, tracking big features and
making sure they land early enough to bake fully on trunk, etc.

New modules (eg spatial) need not gate the release -- that module's
docs would call out clearly that it's not fully baked yet...

Mike

On Mon, Mar 1, 2010 at 1:13 PM, Michael Busch busch...@gmail.com wrote:
 It seems like most of the people agree with these good goals but are
 concerned about the release cycles (including me). How can we achieve these
 goals without making releases more difficult?

  Michael

 On 3/1/10 9:44 AM, Michael McCandless wrote:

 If we don't somehow first address the code duplication across the 2
 projects, making Solr a TLP will make things worse.

 I started here with analysis because I think that's the biggest pain
 point: it seemed like an obvious first step to fixing the code
 duplication and thus the most likely to reach some consensus.  And
 it's also very timely: Robert is right now making all kinds of great
 fixes to our collective analyzers (in between bouts of fuzzy DFA
 debugging).

 But it goes beyond analyzers: I'd like to see other modules, now in
 Solr, eventually moved to Lucene, because they really are core
 functionality (eg facets, function (and other?) queries, spatial,
 maybe improvements to spellchecker/highlighter).  How can we do this?

 And how can we do this so that it lasts over time?  If new cool
 core things are born in Solr-land (which of course happens alot --
 lots of good healthy usage), how will they find their way back to
 Lucene?

 Yonik's proposal (merging development of Solr/Lucene, but keeping all
 else separate) would achieve this.

 If we do the opposite (Solr -  TLP), how could we possibly achieve
 this?

 I guess one possibility is to just suck it up and duplicate the code.
 Meaning, each project will have to manually merge fixes in from the
 other project (so long as there's someone around with the itch to do
 so).  Lucene would copy in all of Solr's analysis, and vice-versa (and
 likewise other dup'd functionality).  I really dislike this
 solution... it will confuse the daylights out of users, its error
 proned, it's a waste of dev effort, there will always be little
 differences... but maybe it is in fact the lesser evil?

 I would much prefer merging Solr/Lucene development...

 Mike

 On Mon, Mar 1, 2010 at 12:01 PM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov  wrote:


 Hi Grant,



 On Mar 1, 2010, at 8:20 AM, Mattmann, Chris A (388J) wrote:



 Hi Robert,

 I think my proposal (Solr-TLP) is sort of orthogonal to the whole
 analyzers
 issue - I was in favor, at the very least, of having a separate
 module/project/whatever that both Solr/Lucene (and whatever project)
 can
 depend on for the shared analyzer code...


 Not really.  They are intimately linked.


 Ummm, how so? Making project A called Apache Super Analyzers and then
 making Lucene(-java) and Solr depend on Apache Super Analyzers is
 separate
 of whether or not Lucene(-java) and Solr are TLPs or not...

 Cheers,
 Chris






 Cheers,
 Chris



 On 3/1/10 9:12 AM, Robert Muirrcm...@gmail.com  wrote:

 this will make the analyzers duplication problem even worse

 On Mon, Mar 1, 2010 at 11:06 AM, Mattmann, Chris A (388J)
 chris.a.mattm...@jpl.nasa.gov  wrote:



 Hi Mark,

 Thanks for your message. I respect your viewpoint, but I respectfully
 disagree. It just seems (to me at least based on the discussion) like
 a TLP
 for Solr is the way to go.

 Cheers,
 Chris



 On 3/1/10 8:54 AM, Mark Millermarkrmil...@gmail.com  wrote:

 On 03/01/2010 10:40 AM, Mattmann, Chris A (388J) wrote:


 Hi Mark,




 That would really be no real world change from how things work
 today.


 The fact


 is, today, Solr already operates essentially as an independent
 project.



 Well if that's the case, then it would lead me to think that it's
 more of


 a


 TLP more than anything else per best practices.



 That depends. It could be argued it should be a top level project or
 that it should be closer to the Lucene project. Some people are
 arguing
 for both approaches right now. There are two directions we could move
 in.




 The only real difference is that it shares the same PMC with Lucene
 now

Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-02-28 Thread Michael McCandless
To make this more concrete, I think this is roughly what's being
proposed:

  * Merging the dev lists into a single list.

  * Merging committers.

   * When a change is committed to Lucene, it must pass all Solr
tests.

  * Release both at once.

These things would not change:

  * Most importantly, the source code would remain factored into
separate dirs/modules.

  * User's lists should remain separate.

  * Web sites would remain separate.

   * Solr & Lucene are still separate downloads, separate JARs,
separate subdirs in the source tree, etc.

The outside world still sees Solr & Lucene as separate entities.  It's
only that they would now be developed/released in synchrony.

There are some important gains by doing this:

  * Single source for all the code dup we now have across the
projects (my original reason, specifically on analyzers, for
starting this).

  * Whenever a new feature is added to Lucene, we'd work through what
the impact is to Solr.  This can still mean we separately develop
exposure in Solr, but it'd get us to at least more immediately
think about it.

  * Solr is Lucene's biggest direct user -- most people who use Lucene
use it through Solr -- so having it more closely integrated means
we know sooner if we broke something.

  * Right now I could test whether flex breaks anything in Solr.  I
can't do that now since Solr isn't upgraded to 3.1.

Recent big changes (eg segment based searching, Version, attr based
tokenstream api) caused a lot of work in Solr that could've been much
smoother had Solr been there as we were working through them.

Recent new features, eg near-real-time search, which are unavailable
in Solr still, would have at least had some discussion about how to
expose this in Solr.

Over time (and we don't have to do this right on day 1) we can make
core capabilities available to pure Lucene.  EG core Lucene users
should be able to use faceting, use a schema, etc.

I think this idea makes alot of sense and I think now is a good time
to do it.  Yes, this a big change, but I think the gains are sizable.
As Lucene & Solr diverge more, it'll only become harder and harder to
merge.

Robert's massive patch on SOLR-1657, upgrading most of Solr's analyzers
to 3.0, is aging... while other changes to analyzers are being
proposed (SOLR-1799).  If we were integrated (or at least had a single
source for analyzers), Robert would already have committed it.

Mike

On Fri, Feb 26, 2010 at 5:20 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Fri, Feb 26, 2010 at 5:15 PM, Steven A Rowe sar...@syr.edu wrote:
 On 02/24/2010 at 2:20 PM, Yonik Seeley wrote:
 I've started to think that a merge of Solr and Lucene would be in the
 best interest of both projects.

 The Sorlucene :) merger could be achieved virtually, i.e. via policy, rather 
 than physically merging:

 Everything is virtual here anyway :-)
 I agree with Mike that a single dev list is highly desirable.  There
 would still be separate downloads.  What to do with some of the other
 stuff is unspecified.

 Committers would need to be merged though - that's the only way to
 make a change across projects w/o breaking stuff.

 -Yonik



Re: Factor out a standalone, shared analysis package for Nutch/Solr/Lucene?

2010-02-26 Thread Michael McCandless
I think this is a good idea!  LuSolr ;)  (kidding)

I agree with all of your points Yonik.

What do other people think...?

Mike

On Wed, Feb 24, 2010 at 2:20 PM, Yonik Seeley yo...@apache.org wrote:
 I've started to think that a merge of Solr and Lucene would be in the
 best interest of both projects.

 Recently, Solr has pulled back from using Lucene trunk (or even the
 latest version), as the increased amount of change between releases
 (and in-between releases) made it impractical to deal with. This is a
 pretty big negative for Lucene, since Solr is the biggest Lucene user
 (where people are directly exposed to lucene for the express purpose
 of developing search features).  I know Solr development has always
 benefited hugely from users using trunk, and Lucene trunk has now lost
 all the solr users.

 Some in Lucene development have expressed a desire to make Lucene more
 of a complete solution, rather than just a core full-text search
 library... things like a data schema, faceting, etc.  The Lucene
 project already has an enterprise search platform with these
 features... that's Solr.  Trying to pull popular pieces out of Solr
 makes life harder for Solr developers, brings our projects into
 conflict, and is often unsuccessful (witness the largely failed
 migration of FunctionQueries from Solr to Lucene).  For Lucene to
 achieve the ultimate in usability for users, it can't require Java
 experience... it needs higher level abstractions provided by Solr.

 The other benefit to Lucene would be to bring features to developers
 much sooner... Solr has had features years before they were developed
 in Lucene, and currently has more developers working with it.  Esp
 with Solr not using Lucene trunk, if a Solr developer wants a feature
 quickly, they cannot add it to Lucene (even if it might make sense
 there) since that introduces a big unpredictable lag - when that
 version of Lucene makes its way into Solr.

 The current divide is a bit unnatural.  For maximum benefit of both
 projects, it seems like Solr and Lucene should essentially merge.
 Lucene core would essentially remain as it is, but:
 1) Solr would go back to using Lucene's trunk
 2) For new Solr features, there would be an effort to abstract them such
 that non-Solr users could use the functionality (faceting, field
 collapsing, etc)
 3) For new Lucene features, there would be an effort to integrate them into 
 Solr.
 4) Releases would be synchronized... Lucene and Solr would release at
 the same time.

 -Yonik



Re: Stale NFS file handle Exception

2010-01-14 Thread Michael McCandless
This is a known limitation of Lucene over NFS.

It's because NFS makes no effort to protect open files from deletion.

Other filesystems prevent (or delay) deletion of still-open files: eg
on Unix the "delete on last close" semantics apply, while on Windows the
file cannot be deleted until no process has it open anymore.

One way to work around this is to make a custom IndexDeletionPolicy, so
that your app defers deletion of old commits until you know all
current readers have reopened.  Another workaround is to simply catch
that exception (best to screen for "Stale NFS file handle" in the
message, so you don't mask other IOException cases) and reopen your
reader right then -- but this is only viable if it's acceptable that a
random query will be forced to wait while reopen/warming takes place.
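
A minimal sketch of the first workaround -- a deletion policy that
keeps the last N commits alive instead of only the newest one -- could
look roughly like this.  Class and method names are from memory of the
2.4-era API (IndexDeletionPolicy, IndexCommit), so treat the exact
signatures as an assumption and double-check them against the javadocs
of the release you're on:

  import java.io.IOException;
  import java.util.List;

  import org.apache.lucene.index.IndexCommit;
  import org.apache.lucene.index.IndexDeletionPolicy;

  // Keeps the newest N commits around so that files readers on other
  // NFS clients still have open are not deleted out from under them.
  // N (really, the resulting time window) must be large enough for all
  // searchers to reopen onto a newer commit.
  public class KeepLastNCommitsDeletionPolicy implements IndexDeletionPolicy {

    private final int numToKeep;

    public KeepLastNCommitsDeletionPolicy(int numToKeep) {
      this.numToKeep = numToKeep;
    }

    public void onInit(List commits) throws IOException {
      onCommit(commits);
    }

    public void onCommit(List commits) throws IOException {
      // Commits are passed oldest first; delete all but the newest numToKeep.
      for (int i = 0; i < commits.size() - numToKeep; i++) {
        ((IndexCommit) commits.get(i)).delete();
      }
    }
  }

You'd hand an instance of this to one of the IndexWriter constructors
that accepts an IndexDeletionPolicy.  The second workaround is just a
try/catch around the search call that screens the IOException message
for "Stale NFS file handle" and then reopens the reader (eg via
IndexReader.reopen(), available as of 2.4) before retrying the query.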

Mike

On Thu, Jan 14, 2010 at 1:25 AM, Claudio Deluca decl...@gmail.com wrote:
 Hi,
 We are using Lucene 2.4.1 in a load-balanced environment.
 The Lucene index is stored on server A, while server B accesses the index
 through an NFS share.

 After creating the instance of IndexWriter, the documents are being added
 and the index gets optimized and closed:

   IndexWriter theIndexWriter = new IndexWriter(new File(indexerPath),
       new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
   ...
   theIndexWriter.optimize();
   theIndexWriter.close();

 For search we open the index on application startup like this:

   Directory theDirectory = FSDirectory.getDirectory(theConfigPath);
   IndexReader indexReader = IndexReader.open(theDirectory, true);
   IndexSearcher searcher = new IndexSearcher(indexReader);

 The exception appears when server A has finished recreating the index
 (closed the IndexWriter) and server B executes a search query over the index.
 Only if we restart the application does the problem no longer appear,
 because at that point the index is newly opened.

 How can we avoid this Exception?

 java.io.IOException: Stale NFS file handle
        at java.io.RandomAccessFile.readBytes(Native Method)
        at java.io.RandomAccessFile.read(Unknown Source)
        at org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal(FSDirectory.java:596)
        at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:136)
        at org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:247)
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
        at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:78)
        at org.apache.lucene.index.SegmentTermDocs.next(SegmentTermDocs.java:110)
        at org.apache.lucene.index.SegmentTermPositions.next(SegmentTermPositions.java:98)
        at org.apache.lucene.search.PhrasePositions.next(PhrasePositions.java:41)
        at org.apache.lucene.search.PhraseScorer.init(PhraseScorer.java:131)
        at org.apache.lucene.search.PhraseScorer.next(PhraseScorer.java:76)
        at org.apache.lucene.search.ConjunctionScorer.init(ConjunctionScorer.java:80)
        at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
        at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:319)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:136)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:123)
        at org.apache.lucene.search.Searcher.search(Searcher.java:86)


 Thanks,
 Claudio



Re: Lucene PMC += Mark Miller

2010-01-14 Thread Michael McCandless
Welcome!

Mike

On Thu, Jan 14, 2010 at 10:37 AM, Grant Ingersoll gsing...@apache.org wrote:
 I'm pleased to announce the Lucene PMC has elected to add Mark Miller to its 
 ranks in recognition of his longstanding contributions to the Lucene 
 community as a committer on both Lucene Java and Solr.

 Congrats, Mark!

 -Grant Ingersoll
 Lucene PMC Chair


Re: [spatial] Cartesian Tiers nomenclature

2009-12-30 Thread Michael McCandless
Right, NRQ is able to translate any requested range into the union
(OR) of brackets (from the trie) created during indexing.

Can spatial do the same thing, just with 2D instead of 1D?  Ie,
reconstruct any expressible shape (created at query time) as the union
of some number of grids/tiers, at finer & finer levels, created during
indexing?

Spatial, today, seems to do this, except it must also do precise
filtering on each matching doc, because some of the grids may contain
hits outside of the requested shape.

In fact, NRQ could also borrow from spatial's current approach -- ie,
create the union of some smallish number of coarse brackets.  Some of
the brackets will fall entirely within the requested range, and so
require no further filtering, while others will fall part inside /
part outside of the requested range, and so will require precise
filtering.  If NRQ did this, it should have much fewer postings to
enum, at the cost of having to do precise filtering on some of them
(and we'd have to somehow encode the orig value in the index).
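
To make the bracket idea concrete in the 1D case, here's a rough sketch
of indexing and querying a numeric field with the 2.9-era API, as I
remember it (NumericField at index time, NumericRangeQuery at query
time; precisionStep controls how coarse the trie brackets are) -- treat
the exact signatures as an assumption to verify:

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;
  import org.apache.lucene.search.NumericRangeQuery;
  import org.apache.lucene.search.Query;

  public class NumericBracketsSketch {

    // Index time: the value is also indexed at several coarser
    // precisions (the trie), so a range query can later be covered by
    // a handful of brackets instead of one term per matching value.
    public static Document makeDoc(long timestamp) {
      Document doc = new Document();
      doc.add(new NumericField("timestamp", 4, Field.Store.NO, true)
          .setLongValue(timestamp));
      return doc;
    }

    // Query time: the range is rewritten into an OR of trie brackets
    // that exactly cover it, so no per-document post-filtering is
    // needed -- the 1D analogue of covering a shape with grid cells
    // at mixed resolutions.
    public static Query makeRangeQuery(long min, long max) {
      return NumericRangeQuery.newLongRange("timestamp", 4, min, max, true, true);
    }
  }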

Mike

On Tue, Dec 29, 2009 at 8:42 PM, Yonik Seeley
yo...@lucidimagination.com wrote:
 On Tue, Dec 29, 2009 at 7:13 PM, Marvin Humphrey mar...@rectangular.com 
 wrote:
 ... but for this algorithm, different rasterization resolutions need not
 proceed by powers-of-two.

 Indeed - one way to further generalize would be to use something like
 Lucene's trie-based Numeric field, but with a square instead of a
  line.  That would allow tweaking the space/speed tradeoff.

 -Yonik
 http://www.lucidimagination.com



Re: [spatial] Cartesian Tiers nomenclature

2009-12-29 Thread Michael McCandless
It's great that there's such a sudden burst of energy to improve
spatial in both Solr and Lucene!

Isn't this concept the same as trie (for Lucene's numeric fields),
but in 2D not 1D?

If so, I think "tiles" doesn't convey that they recursively
subdivide.

Also: why does this notion even need naming so badly?  Why does this
concept leak out of the abstraction?  Shouldn't this (cartesian
tier, cartesian tier plotter) all be under the hood?  I make a
SpatialField, I index it, I can then make SpatialShapeQuery, a
SpatialDistanceSort, etc.?

Ie, trie is known within Lucene, but doesn't leak out -- the outside
world knows it as Numeric*.  Trie is an implementation detail,
inside Lucene.
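
To make the 2D analogy concrete, here's a toy sketch of the
recursive-subdivision idea -- emphatically not the contrib/spatial
code (which projects lat/long first and has its own box-id scheme),
just the structure: at tier t the plane is cut into 2^t x 2^t boxes,
and a point gets exactly one box id per tier:

  public class TierBoxSketch {

    // Box id of a point at the given tier, for x and y already
    // normalized into [0, 1).  Each tier doubles the resolution of
    // the tier above it.
    static long boxId(double x, double y, int tier) {
      long boxesPerAxis = 1L << tier;        // 2^tier boxes per axis
      long col = (long) (x * boxesPerAxis);
      long row = (long) (y * boxesPerAxis);
      return row * boxesPerAxis + col;       // unique within this tier
    }

    public static void main(String[] args) {
      // Crude normalization of a lat/long into the unit square; a real
      // implementation would apply a proper projection here.
      double lat = 40.7, lon = -74.0;
      double x = (lon + 180.0) / 360.0;
      double y = (lat + 90.0) / 180.0;
      for (int tier = 2; tier <= 6; tier++) {
        // Coarse tiers: big boxes, few postings, fuzzy edges.
        // Fine tiers: small boxes, more postings, tighter fit.
        System.out.println("tier " + tier + " -> box " + boxId(x, y, tier));
      }
    }
  }

A query shape is then covered by whole boxes from coarser tiers where
it can be, and by finer boxes (plus exact filtering on the boundary
boxes) where it must be -- the same union-of-brackets picture as the
numeric trie, just in 2D.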

(NOTE: I only know just enough about spatial to be dangerous...)

Mike

On Tue, Dec 29, 2009 at 2:49 AM, patrick o'leary pj...@pjaol.com wrote:
 Ah the language of math is the ultimate lingua franca -
 Nice !

 When you look at the coordinates entity from KML, ask why are the lat /
 longs reversed to long/ lat?
 Answer: because the folks working on the display thought in terms of *display
 not GIS*, the point is over Y degrees of longitude and down X degrees of
 latitude.

 But again that's not a convention used outside a little part of GeoTools or
 KML; GML / GeoRSS are again just the regular lat,long (NS,EW), or projected
 EPSG or other standard projections in OGC 05-011.
 To my knowledge google are the only real pushers of (EW,NS) these days.

 So what does this diatribe mean? We're kind of at the bleeding edge of
 defining the standard, hence the difficulty of finding data on it.
 This is one reason why locallucene and localsolr became popular: they solved
 a problem simply.

 Docs about it exist on gissearch.com
 dzone are doing articles on it
 http://java.dzone.com/articles/spatial-search-hibernate?utm_source=feedburnerutm_medium=feedutm_campaign=Feed%3A+javalobby%2Ffrontpage+%28Javalobby+%2F+Java+Zone%29

 Locallucene in google has over 8,000 results
 http://www.google.com/search?q=locallucene

 Localsolr has over 4,000 results
 http://www.google.com/search?q=localsolr

 I've seen and helped with installations all over the place; heck, even codehaus
 uses it, as do folks on github with the geonames db.
 I see naming it mathematically & scientifically correct, and it's gaining
 enough traction and popularity to start becoming part of the standard, not
 just duplicating one.

 I can't honestly see how a refactoring is bringing anything positive to
 this, when there isn't a good standard out there yet.


 On Mon, Dec 28, 2009 at 10:22 PM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Patrick,

 Interesting. It seems like there is a precedent already in the Local Lucene
 and Local SOLR packages that define CartesianTier as lingua franca.

 Like I said in an earlier email it depends on who you talk to regarding the
 preference of what to call these Tiles/Grids/Tiers, etc., and that seems to
 be further evidenced by your research.

 I for one don't really have a preference, but precedent matters to me, and if
 Tiers have been used to date then there should be strong consideration to
 use that nomenclature; +1 from me.

 Cheers,
 Chris

 On 12/28/09 9:25 PM, patrick o'leary pj...@pjaol.com wrote:

  So trying not to drag this out, the most frequent generic term used in GIS
  software is SRID
  http://en.wikipedia.org/wiki/SRID
 
  Again this provides just a basic nomenclature for the high level element,
  somewhat the blackbird of objects rather than defining the magpie
 (sorry
  for the CS 101 reference)
 
  But it should show that every implementation is unique in some format.
  Perhaps as unique as CartesianTier's ( sorry Ted ! )
 
 
 
  On Mon, Dec 28, 2009 at 5:26 PM, patrick o'leary pj...@pjaol.com
 wrote:
 
  Hmm, depends, tiles indicate to me a direct correlation between the id
 and
  a map tile, which will depend upon using the right projection
  with the cartesian plotter
 
 
  On Mon, Dec 28, 2009 at 2:56 PM, Grant Ingersoll gsing...@apache.org
 wrote:
 
 
  On Dec 28, 2009, at 4:19 PM, patrick o'leary wrote:
 
  Hmm, but when you say grid, to me that's just a bunch of regularly
  spaced
  lines..
 
  Yeah, I hear you.  I chose spatial tiles for the Solr patch, but
 spatial
  grid would work too.  Or map tiles/map grids.  That anchors it into the
  spatial world, since we're calling Lucene's spatial contrib/spatial and
  Solr's Solr Spatial.
 
 
  On Mon, Dec 28, 2009 at 1:16 PM, Grant Ingersoll gsing...@apache.org
  wrote:
 
 
  On Dec 28, 2009, at 3:51 PM, patrick o'leary wrote:
 
   So Grant, here's the deal behind the name.
   Cartesian because it's a simple x,y coordinate system.
   Tier because there are multiple tiers, i.e. levels of resolution.
 
  If you look at it closer:
  - To programmers there's a quadtree implementation
  - To web users who use maps these are grids / tiles.
  - To GIS experts this is a form of multi-resolution raster-ing.
  - To astrophysicists these are tiers.
  - To the MS folks I've 
