Re: How to find RAM/disk usage of each vector field

2024-11-06 Thread Michael McCandless
On Tue, Nov 5, 2024 at 5:17 PM Adrien Grand wrote:

Why is it important to break down per field as opposed to scaling based on
> the total volume of vector data?
>

It's really for internal planning purposes / service telemetry ... on the
Amazon product search team (where I also work w/ Tanmay -- hi Tanmay!) we
have a number of teams using our Lucene search service to experiment with
KNN search, varying the number of dimensions, whether quantization is in
use, which ML model, etc.  These fields come and go, sometimes without our
(low level infrastructure) service knowing ahead of time how they are
changing.  So we would ideally have an efficient way to break out
per-field KNN disk usage and "ideal hot RAM" online (in our production
service), instead of offline and inefficiently, e.g. by rewriting the whole
index into separate files (Robert's cool DiskUsage tool).

It's tricky with KNN and features like scalar quantization (
https://www.elastic.co/search-labs/blog/scalar-quantization-in-lucene) and
soon RabitQ (https://github.com/apache/lucene/pull/13651) because the
on-disk form (which retains full float32 precision vectors) is different
from what searching really uses (the quantized byte-per-dimension form).
So the disk consumed by each field is larger than the amount of effective
"hot RAM" you might need.

Mike McCandless

http://blog.mikemccandless.com


Re: How to find RAM/disk usage of each vector field

2024-11-06 Thread Michael McCandless
On Tue, Nov 5, 2024 at 7:31 PM Patrick Zhai  wrote:

I wouldn't call this a good way, but as the last resort you can parse the
> metadata files yourself, as it is not so hard to parse (yet)


Yeah ... the Lucene codec itself knows precisely how much disk is used for
each field, and indeed stores it simply in its metadata.  And it's
incredibly fast to peek into that metadata to get the per-field metrics.

We will likely take this approach (on top of Lucene), but it is clearly an
abstraction violation: it is brittle to future Codec changes, is not
supported, etc.

We've taken the same brittle approach (at Amazon product search) to track
per-field disk usage of terms dictionary and postings (inverted lexical
index) for similar reasons (so many teams indexing so many fields with so
many words!).

I would think many multi-tenant users of Lucene would want some resource
tracking along these lines... but "officially" supporting this in Lucene's
Codec APIs would be an added dev burden.

As for RAM usage, OnHeapHnswGraph actually implements the Accountable API
> <
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L279C15-L279C27
> >,
> and HnswGraphBuilder also have InfoStream passed in
> <
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L199
> >
> so
> I think
> it's ok and reasonable to report the RAM usage at the end of graph build
> maybe? Tho this won't include the off heap vector sizes but that one can be
> estimated easily I think?
>

I think the OnHeapHnswGraph is used only during indexing?  But +1 to have
the infoStream print the RAM size of that graph during indexing if it
doesn't already ... Tanmay maybe open a spinoff issue for this small
improvement?
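Something like this trivial sketch is all I have in mind (HnswRamReporter is a
made-up helper name; OnHeapHnswGraph already implements Accountable, and
InfoStream / RamUsageEstimator are existing Lucene utilities):

import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.InfoStream;
import org.apache.lucene.util.RamUsageEstimator;

final class HnswRamReporter {
  // log the heap used by the in-memory HNSW graph once it is fully built
  static void report(InfoStream infoStream, Accountable graph) {
    if (infoStream.isEnabled("HNSW")) {
      infoStream.message(
          "HNSW",
          "graph RAM used: " + RamUsageEstimator.humanReadableUnits(graph.ramBytesUsed()));
    }
  }
}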

Thanks Patrick.

Mike McCandless

http://blog.mikemccandless.com


Re: Performance Difference between files getting opened with IoContext.RANDOM vs IoContext.READ

2024-09-29 Thread Michael McCandless
Hi Navneet,

With the RANDOM IOContext, on modern OSes / Java versions, Lucene will hint to
the OS that IO against the memory-mapped segment will be random, using the
madvise POSIX API with the MADV_RANDOM flag.

For the READ IOContext, Lucene maybe hints with MADV_SEQUENTIAL, I'm not sure.
Or maybe it doesn't hint anything?

It's up to the OS to then take these hints and do something "interesting"
to try to optimize IO and page caching based on them.  I think
modern Linux OSes will readahead (and pre-warm the page cache) for
MADV_SEQUENTIAL?  And maybe skip the page cache and readahead for MADV_RANDOM?
Not certain...
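For reference, here is a rough sketch of how an application sees the two
contexts when opening files itself (the file name and path are just examples;
whether a hint is actually applied depends on the Directory implementation,
OS and JDK):

import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

class IOContextSketch {
  static void openBothWays(Directory dir) throws IOException {
    // likely madvise(MADV_RANDOM): fine for HNSW graph hops, bad for a sequential checksum pass
    try (IndexInput random = dir.openInput("_0.vec", IOContext.RANDOM)) {
      // random-access reads ...
    }
    // READ context: better suited to one sequential pass over the whole file
    try (IndexInput sequential = dir.openInput("_0.vec", IOContext.READ)) {
      // sequential reads ...
    }
  }
}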

For computing the checksum, which is always a sequential operation, using
MADV_RANDOM (which is stupid) is indeed expected to perform worse since there
is no readahead pre-caching.  50% worse (what you are seeing) is indeed quite
an impact ...

Maybe open an issue?  At least for checksumming we should open even .vec
files for sequential read?  But, then, if it's the same IndexInput which
will then be used "normally" (e.g. for merging), we would want THAT one to
be open for random access ... might be tricky to fix.

One simple workaround an application can do is to ask MMapDirectory to
pre-touch all bytes/pages in .vec/.veq files -- this asks the OS to cache
all of those bytes into page cache (if there is enough free RAM).  We do
this at Amazon (product search) for our production searching processes.
Otherwise paging in all .vec/.veq pages via random access provoked through
HNSW graph searching is crazy slow...
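Roughly like this -- a minimal sketch assuming Lucene 9.x's
MMapDirectory.setPreload(boolean); newer releases expose a per-file predicate
variant instead, and the index path is just an example:

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.store.MMapDirectory;

class PreloadSketch {
  static MMapDirectory openPreloaded() throws IOException {
    MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
    // pre-touch all pages of newly opened files (e.g. .vec/.veq) into the page cache
    dir.setPreload(true);
    return dir;
  }
}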

Mike McCandless

http://blog.mikemccandless.com

On Sun, Sep 29, 2024 at 4:06 AM Navneet Verma 
wrote:

> Hi Lucene Experts,
> I wanted to understand the performance difference between opening and
> reading the whole file using an IndexInput with IoContext as RANDOM vs
> READ.
>
> I can see .vec files(storing the flat vectors) are opened with RANDOM and
> whereas dvd files are opened as READ. As per my testing with files close to
> size 5GB storing (~1.6M docs with each doc 3072 bytes), I can see that when
> full file checksum validation is happening for a file opened via READ
> context it is faster than RANDOM. The amount of time difference I am seeing
> is close to 50%. Hence the performance question is coming up, I wanted to
> understand is this understanding correct?
>
> Thanks
> Navneet
>


Re: Excessive reads while doing commit in lucene

2024-09-04 Thread Michael McCandless
It's odd to have a ~500X difference in writes versus reads.  Are you sure?
Is it possible you are also opening IndexReaders and searching the commit
points?

Lucene does re-read previously written (already indexed) documents during
segment merges.  But at default settings (as long as you did not change
merge settings in IndexWriterConfig) this read/write amplification should
be log(N) in cost, i.e. maybe ~5X at worst, not ~500X.

More comments inlined below:

On Wed, Sep 4, 2024 at 7:07 AM Gopal Sharma 
wrote:

In my use case i am committing every 100k records (because in my test
> scenarios committing per million was taking a lot of time)
>

Setting as large an IndexWriterConfig.setRAMBufferSizeMB as possible, and
committing as rarely as possible, should minimize overall IO (lower
read/write amplification due to merging).
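For example, a minimal sketch (the buffer size and path are arbitrary; tune
the buffer to your heap):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

class BigBufferSketch {
  static IndexWriter open() throws IOException {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // flush segments by RAM used, as large as the heap allows
    iwc.setRAMBufferSizeMB(2048);
    // then addDocument() freely and call commit() as rarely as durability allows
    return new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), iwc);
  }
}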


> Below is the snippet on how i am instantiating lucene writter
>
> FSDirectory indexDirectory = NIOFSDirectory.open(indexDir.toPath());
>

Have you tried MMapDirectory instead?  NIOFSDirectory does buffered
reads... so I wonder if your odd "read amplification" is somehow caused by
that?  That would not be good -- it's a performance bug in NIOFSDirectory,
if so.  I'd be curious whether this fixes (works around) your ~500X read
amplification during indexing.
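I.e., the one-line change to your snippet (hedged: FSDirectory.open already
picks MMapDirectory on 64-bit JVMs, so either form below should get you off
the buffered NIOFSDirectory reads):

// instead of: FSDirectory indexDirectory = NIOFSDirectory.open(indexDir.toPath());
FSDirectory indexDirectory = new MMapDirectory(indexDir.toPath());
// or simply let Lucene choose the default (MMapDirectory on 64-bit platforms):
FSDirectory indexDirectory2 = FSDirectory.open(indexDir.toPath());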

Still, using MMapDirectory just foists the problem (which pages to
read/cache) onto the OS.  But Lucene's reads should be largely sequential,
so it ought to be easy for the OS to cache/readahead well and not burn too
much read IO from EFS.

Mike McCandless

http://blog.mikemccandless.com


Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-15 Thread Michael McCandless
Thanks Jerven, more responses inlined below:

On Tue, May 14, 2024 at 12:58 PM Jerven Tjalling Bolleman
 wrote:

The index that had an issue when merging into one segment definitely had
> more than 1 billion times the word "positional" in it. I hope to be able
> to give a closer number once re-indexing finished with a "work-around".
>
> Of course the "work-around" is to just fix this correctly by not having
> that word so often in the index and definitely not as docs, freqs and
> postings.
>

To be clear, indexing a given token like "positional" (nice token btw) as
many times as you like into a Lucene index, even force merging down to a
single segment, is perfectly allowed, and it certainly should not throw an
exception, let alone a cryptic one like this!  That's a valid use-case.

So we really need to understand why you're even hitting an exception in the
first place ...

Mike McCandless

http://blog.mikemccandless.com


Re: ArithmeticException: due to integer overflow during lucene merging

2024-05-14 Thread Michael McCandless
I think we should at least open an issue to try to improve the exception
message?  We might catch the exception higher up (where we know the field
name) and rethrow with the field name, maybe.  We can discuss options on
the issue ...
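Purely as an illustrative sketch of that idea (writePostingsForField is a
hypothetical stand-in for wherever the overflow surfaces once the field name
is in scope; this is not actual Lucene code):

try {
  writePostingsForField(fieldName);  // hypothetical call that may hit the integer overflow
} catch (ArithmeticException e) {
  // rethrow with the field name attached so users can tell which field blew past the limit
  throw new IllegalStateException(
      "integer overflow while writing postings for field \"" + fieldName + "\"", e);
}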

If you are not using custom term frequencies it's not clear to me how you
could even have hit this exception.

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 7, 2024 at 4:10 PM Michael Sokolov  wrote:

> This is definitely a confusing error condition. If we can add more
> information without creating an undue burden for the indexer it would
> be nice, but I think this will be very challenging here since the
> exception is thrown at a low level in the code where there might not
> be a lot of useful info (ie the field name) to provide. And I expect
> there are other places that make a similar assumption we would have to
> track down?
>
> On Tue, May 7, 2024 at 9:10 AM Jerven Tjalling Bolleman
>  wrote:
> >
> > Dear Michael,
> >
> > Looking deeper into this. I think we overflowed a term frequency field.
> > Looking in some statistics, in a previous release we had 1,288,526,281
> > of a certain field, this would be larger now. Each of these would have
> > had a limited set of values. But crucially nearly all of them would have
> > had the term "positional" or "non-positional" added to the document.
> >
> > There is no good reason to do this today, we should just turn this into
> > a boolean field and update the UI. I will do this and report back.
> >
> > Do you think that a patch for a try/catch for a more informative log
> > message be appreciated by the community? e.g. mentioning the field name
> > in the exception?
> >
> > Regards,
> > Jerven
> >
> > On 5/7/24 14:52, Jerven Tjalling Bolleman wrote:
> > > Dear Michael,
> > >
> > > Thank you for your help.
> > >
> > > We don't use custom term frequencies (I just double checked with a code
> > > search).
> > > We also always merge down to one segment (historical but also we index
> > > once and then there are no changes for a week to a month and then we
> > > reindex every document from scratch).
> > >
> > > Your response is very helpful already and I very much appreciate it as
> > > it cuts down the search space significantly.
> > >
> > > Regards,
> > > Jerven
> > >
> > >
> > > On 5/7/24 14:03, Michael Sokolov wrote:
> > >> It seems as if the term frequency for some term exceeded the maximum.
> > >> This can happen if you supplied custom term frequencies eg with
> > >>
> https://lucene.apache.org/core/9_10_0/core/org/apache/lucene/analysis/tokenattributes/TermFrequencyAttribute.html?is-external=true
> > >> . The behavior didn't change since 8.x but it's possible that the
> > >> merging brought together some very "high frequency" terms that were
> > >> previously not in the same segment?
> > >>
> > >> On Tue, May 7, 2024 at 4:03 AM Jerven Tjalling Bolleman
> > >>  wrote:
> > >>>
> > >>> Dear Lucene community,
> > >>>
> > >>> This morning I found this exception in our logs. This was the first
> time
> > >>> we indexed this data with lucene 9.10. Before we were still on the
> > >>> lucene 8.x branch. between the last indexing with 8 and this one with
> > >>> 9.10 we have a bit more data so it could be something else that went
> > >>> over an limit.
> > >>>
> > >>> Unfortunately, from this log message I am at a loss for what is going
> > >>> on. And what I could do to prevent this from happening. Does anyone
> have
> > >>> any ideas?
> > >>>
> > >>> Regards,
> > >>> Jerven Bolleman
> > >>>
> > >>>
> > >>> Exception in thread "Lucene Merge Thread #202"
> > >>> org.apache.lucene.index.MergePolicy$MergeException:
> > >>> java.lang.ArithmeticException: integer overflow
> > >>> at
> > >>>
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:735)
> > >>> at
> > >>>
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:727)
> > >>> Caused by: java.lang.ArithmeticException: integer overflow
> > >>> at java.base/java.lang.Math.toIntExact(Math.java:1135)
> > >>> at
> > >>>
> org.apache.lucene.store.DataOutput.writeGroupVInts(DataOutput.java:354)
> > >>> at
> > >>>
> org.apache.lucene.codecs.lucene99.Lucene99PostingsWriter.finishTerm(Lucene99PostingsWriter.java:379)
> > >>> at
> > >>>
> org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:173)
> > >>> at
> > >>>
> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.write(Lucene90BlockTreeTermsWriter.java:1097)
> > >>> at
> > >>>
> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write(Lucene90BlockTreeTermsWriter.java:398)
> > >>> at
> org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:95)
> > >>> at
> > >>>
> org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:205)
> > >>> at
> > >>>
> org.apache.lucene.index.SegmentMerger.mergeTerms

Re: recommended index size

2024-01-04 Thread Michael McCandless
Hi Vincent,

Lucene has a hard limit of ~2.1 B documents in a single index; hopefully
you hit your ~50 - 100 GB target well before that.

Otherwise it's very application dependent: how much latency can you
tolerate during searching, how fast are the underlying IO devices at random
and large sequential IO, the types of queries, etc.

Lucene should not require much additional RAM as the index gets larger --
much work has been done in recent years to move data structures off-heap.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jan 2, 2024 at 9:49 AM  wrote:

> Hello,
>
> is there a recommended / rule of thumb maximum size for index?
> I try to target between 50 and 100 Gb, before spreading to other servers.
> or is this just a matter of how much memory and cpu I have?
> this is a log aggregation use case. a lot of write, smaller number of reads
> obviously.
> I am using lucene 9.
> thanks,
> Vincent
>


Re: Performance changes within the Lucene 8 branch

2023-12-14 Thread Michael McCandless
Hi Marc,

How are you retrieving your hits?  Lucene's stored fields, or doc values,
or both?

Do you sort the hits' docids and then retrieve them in docid order (NOT in
the sorted order Lucene returned them in)?  I think that might be faster
since Lucene's stored fields use block compression, and if there are multiple
docids in one underlying block you amortize the overhead of that
decompression across the N docs in it.
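Something along these lines (a sketch against the 8.x API; topDocs and
searcher come from your existing search call):

import java.util.Arrays;
import java.util.Comparator;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.ScoreDoc;

ScoreDoc[] byDocId = topDocs.scoreDocs.clone();
Arrays.sort(byDocId, Comparator.comparingInt((ScoreDoc sd) -> sd.doc));  // retrieve in docid order
for (ScoreDoc sd : byDocId) {
  Document doc = searcher.doc(sd.doc);  // consecutive docids often share one stored-fields block
  // ... collect what you need, then re-sort the results back into score order for display
}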

Also, make sure you are using Mode.BEST_SPEED when you create the Codec at
IndexWriter write time.  It's the default, so if you are not changing that
to BEST_COMPRESSION, great.
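I.e. (sketch assuming Lucene 8.11, where the default codec class is
Lucene87Codec; the exact class name changes per release, and analyzer is
whatever you already use):

import org.apache.lucene.codecs.lucene87.Lucene87Codec;
import org.apache.lucene.index.IndexWriterConfig;

IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setCodec(new Lucene87Codec(Lucene87Codec.Mode.BEST_SPEED));  // the default; avoid BEST_COMPRESSION here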

Separately: why are you retrieving so many results?  Are you doing some
multi-phased ranking/filtering or so?  It's best to push any/all
filters/ranks as deep into the original Lucene search as possible/feasible
so that you don't have problems like this.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Dec 12, 2023 at 4:36 PM Marc Davenport
 wrote:

> Hello,
>
> We have a search application built around Lucene 8.  Motivated by the list
> of performance enhancements and optimizations in the change notes we
> upgraded from 8.1 to 8.11.2.  We track the performance of different
> activities within our application and can clearly see an improvement in our
> facet count queries (about 15% on our p50 to 5% on our p95 execution
> times).
>
> But when looking at the calls to retrieve matching documents, the parts of
> our application where we allow the retrieval of many top hits (up to 10k)
> has really suffered. For instance when the number of documents matching our
> query exceeds 5k I see the performance of the doc matching degrade 50% on
> the p50 and 35% on the p95.  The facet collection still shows improvements
> of 1-5%. The counter point is that when a query will match 500 documents
> then we see across the board improvements in both document matching and
> facet count generation.
>
> Has anyone else seen this sort of performance change in applications that
> allow such a high number of top docs to be returned from
> IndexSearcher.search()?  Is there some tuning or flag setting that I should
> consider when allowing such a large number of top documents to be
> returned?
>
> Thank you,
> Marc Davenport
>


Re: Consistent NRT searching with SearcherLifetimeManager and multiple instances

2023-12-14 Thread Michael McCandless
Hi Steven,

Great question!  I'm so glad to hear your app is providing consistent
pagination :)  I've long felt Lucene (with NRT segment replication) could
do a great job at this, yet so few apps manage to implement it.  Every time
I interact with a search engine and go to the next page it irks me that I
might be missing some results...

First off, the point-in-time IndexSearchers are keyed into the
SearcherLifetimeManager by their underlying IndexReader.getVersion(), which
returns the long value from the underlying SegmentInfos.getVersion().

This is good news because it means all of your replicas will see the same
long version mapping to the same point-in-time view of the index,
even across replicas, since that same SegmentInfos is sent to all replicas
by the primary node.

Second, each of your replicas should simply assume that every point-in-time
IndexSearcher may be used at any time by an incoming search request, and
enroll all refreshed IndexSearchers into the local
SearcherLifetimeManager.  This way, no matter where the follow-on requests
go, that replica will have that IndexSearcher version.  This is not as
costly as it sounds because a refreshed IndexSearcher will in general share
nearly all of its segments with the prior one(s).
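In code, the per-replica flow I have in mind is roughly this (a sketch:
SearcherLifetimeManager.record/acquire/release are the real APIs, everything
around them, like refreshedSearcher, is pseudo-wiring):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherLifetimeManager;

SearcherLifetimeManager lifetimeMgr = new SearcherLifetimeManager();

// on every refresh, on every replica, enroll the new point-in-time searcher:
long version = lifetimeMgr.record(refreshedSearcher);

// on a page-2 request carrying `version` back from page 1 (possibly served elsewhere):
IndexSearcher searcher = lifetimeMgr.acquire(version);
if (searcher == null) {
  // this replica hasn't refreshed to that point-in-time yet: retry another replica, or wait
} else {
  try {
    // run the follow-on search against the same point-in-time view
  } finally {
    lifetimeMgr.release(searcher);
  }
}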

This requires a periodic refresh schedule, and all replicas should quickly
refresh when the primary publishes a new point-in-time SegmentInfos.

There is some small risk if replicas do not refresh consistently around
the same time, and page 2 for a query goes to a replica that has not yet
refreshed.  This ought to be rare, since it'd mean a human loaded page 1
from a replica that had already refreshed, consumed the results, then
clicked on page 2, and by then replicas should (typically) all have
refreshed.  When it happens, you could either have the query wait for the
refresh to completely (somewhat dangerous since such queries could pile up
if something is seriously wrong with that node and its refreshing is
sluggish), or, simply retry the query to another replica: eventually it
will find a replica that has the point-in-time IndexSearcher already
refreshed.

Mike McCandless

http://blog.mikemccandless.com


On Wed, Dec 13, 2023 at 4:47 PM Steven Schlansker <
stevenschlans...@gmail.com> wrote:

> Hi lucene-users,
>
> We use the lucene-replicator to have a single indexing node push commits
> and NRT updates
> to a set of replicas.
>
> Currently, each replica has the full dataset - there is no sharding.
>
> We use a SearcherLifetimeManager to try to provide consistent pagination
> over results.
>
> So when we present the first page of results, we return the result of
> `record(IndexSearcher)` to the client,
> with the expectation that at a later time (but not too much later) they
> might request page 2 of results with version X.
>
> This works fine for a single instance, since the SearcherLifetimeManager
> keeps the remembered version around.
> However, with multiple instances, this doesn't seem to work at all -
> your first request goes to replica A, who calls `record(searcher) -> X`.
>
> The second request likely goes to a different instance B,
> whose lifetime manager never saw a call to `record` at all - so the
> `acquire(X)` fails and returns null.
>
> Surely there must be a way to solve this -
> how do you implement consistent versioned searching like
> SearcherLifetimeManager, but with multiple Lucene replicas
> who otherwise do not coordinate about which NRT versions get opened or
> recorded?
>
> Thanks for any advice,
> Steven
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Michael McCandless
You can use either the "doc values" implementation for facets
(SortedSetDocValuesFacetField), or the "taxonomy" implementation
(FacetField, in which case, yes, you need to create a TaxonomyWriter).

It used to be that the "doc values" based faceting did not support
arbitrary hierarchy, but I think that was fixed at some point.
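A rough sketch of the two flavors (field/dimension names are just examples;
writer and taxoWriter are your IndexWriter and DirectoryTaxonomyWriter):

import org.apache.lucene.document.Document;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetField;

FacetsConfig config = new FacetsConfig();

// "doc values" implementation: no separate taxonomy index needed
Document doc1 = new Document();
doc1.add(new SortedSetDocValuesFacetField("media-type", "book"));
writer.addDocument(config.build(doc1));

// "taxonomy" implementation: FacetField plus a DirectoryTaxonomyWriter alongside the IndexWriter
Document doc2 = new Document();
doc2.add(new FacetField("topic", "physics", "quantum"));
writer.addDocument(config.build(taxoWriter, doc2));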

Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 20, 2023 at 9:03 AM Michael Wechner 
wrote:

> Hi Mike
>
> Thanks for your feedback!
>
> IIUC in order to have the actual advantages of Facets one has to
> "connect" it with a TaxonomyWriter
>
> FacetsConfig config = new FacetsConfig();
> DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
> indexWriter.addDocument(config.build(taxoWriter, doc));
>
> right?
>
> Thanks
>
> Michael
>
>
>
>
> Am 20.10.23 um 12:19 schrieb Michael McCandless:
> > There are some differences.
> >
> > StringField is indexed into the inverted index (postings) so you can do
> > efficient filtering.  You can also store in stored fields to retrieve.
> >
> > FacetField does everything StringField does (filtering, storing
> (maybe?)),
> > but in addition it stores data for faceting.  I.e. you can compute facet
> > counts or simple aggregations at search time.
> >
> > FacetField is also hierarchical: you can filter and facet by different
> > points/levels of your hierarchy.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Oct 20, 2023 at 5:43 AM Michael Wechner <
> michael.wech...@wyona.com>
> > wrote:
> >
> >> Hi
> >>
> >> I have found the following simple Facet Example
> >>
> >>
> >>
> https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleFacetsExample.java
> >>
> >> whereas for a simple categorization of documents I currently use
> >> StringField, e.g.
> >>
> >> doc1.add(new StringField("category", "book"));
> >> doc1.add(new StringField("category", "quantum_physics"));
> >> doc1.add(new StringField("category", "Neumann"))
> >> doc1.add(new StringField("category", "Wheeler"))
> >>
> >> doc2.add(new StringField("category", "magazine"));
> >> doc2.add(new StringField("category", "astro_physics"));
> >>
> >> which works well, but would it be better to use Facets for this, e.g.
> >>
> >> doc1.add(new FacetField("media-type", "book"));
> >> doc1.add(new FacetField("topic", "physics", "quantum");
> >> doc1.add(new FacetField("author", "Neumann");
> >> doc1.add(new FacetField("author", "Wheeler");
> >>
> >> doc1.add(new FacetField("media-type", "magazine"));
> >> doc1.add(new FacetField("topic", "physics", "astro");
> >>
> >> ?
> >>
> >> IIUC the StringField approach is more general, whereas the FacetField
> >> approach allows to do a more specific categorization / search.
> >> Or do I misunderstand this?
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Michael McCandless
There are some differences.

StringField is indexed into the inverted index (postings) so you can do
efficient filtering.  You can also store in stored fields to retrieve.

FacetField does everything StringField does (filtering, storing (maybe?)),
but in addition it stores data for faceting.  I.e. you can compute facet
counts or simple aggregations at search time.

FacetField is also hierarchical: you can filter and facet by different
points/levels of your hierarchy.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 20, 2023 at 5:43 AM Michael Wechner 
wrote:

> Hi
>
> I have found the following simple Facet Example
>
>
> https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleFacetsExample.java
>
> whereas for a simple categorization of documents I currently use
> StringField, e.g.
>
> doc1.add(new StringField("category", "book"));
> doc1.add(new StringField("category", "quantum_physics"));
> doc1.add(new StringField("category", "Neumann"))
> doc1.add(new StringField("category", "Wheeler"))
>
> doc2.add(new StringField("category", "magazine"));
> doc2.add(new StringField("category", "astro_physics"));
>
> which works well, but would it be better to use Facets for this, e.g.
>
> doc1.add(new FacetField("media-type", "book"));
> doc1.add(new FacetField("topic", "physics", "quantum");
> doc1.add(new FacetField("author", "Neumann");
> doc1.add(new FacetField("author", "Wheeler");
>
> doc1.add(new FacetField("media-type", "magazine"));
> doc1.add(new FacetField("topic", "physics", "astro");
>
> ?
>
> IIUC the StringField approach is more general, whereas the FacetField
> approach allows to do a more specific categorization / search.
> Or do I misunderstand this?
>
> Thanks
>
> Michael
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene Index Writer in a distributed system

2023-10-19 Thread Michael McCandless
Hi Gopal,

Indeed, for a single Lucene index, only one writer may be open at a time.
Lucene tries to catch you if you mess this up, using file-based locking.

If you really need concurrent indexing, you could have N IndexWriters each
writing into a private Directory, and then periodically use addIndexes to
"reduce" these indices down into a single index which you then use for
searching.
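A minimal sketch of that "reduce" step (the paths are hypothetical; each
per-writer directory should not have an open IndexWriter while it is being
added):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

class ReduceSketch {
  static void reduce() throws IOException {
    try (Directory main = FSDirectory.open(Paths.get("/search/index"));
        IndexWriter mainWriter =
            new IndexWriter(main, new IndexWriterConfig(new StandardAnalyzer()));
        Directory shard0 = FSDirectory.open(Paths.get("/writers/shard0"));
        Directory shard1 = FSDirectory.open(Paths.get("/writers/shard1"))) {
      mainWriter.addIndexes(shard0, shard1);  // copy the private indices into the searchable one
      mainWriter.commit();
    }
  }
}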

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 19, 2023 at 5:49 AM Gopal Sharma 
wrote:

> Hello Team,
>
> I am new to Lucene and want to use Lucene in a distributed system to write
> in a Amazon EFS index.
>
> As per my understanding, the index writer for a particular index needs to
> be opened by 1 server only. Is there a way we can achieved this in
> distributed system to write parallelly in Lucence
>
> Any document/article where i can read about this problem will be of great
> help, Many thanks in advance!!
>
> Gopal
>


Re: Reindexing leaving behind 0 live doc segments

2023-08-31 Thread Michael McCandless
Hi Rahul,

Please do not pursue Approach 2 :)  ReadersAndUpdates.release is not
something the application should be calling.  This path can only lead to
pain.

It sounds to me like something in Solr is holding open an old reader (maybe
the last commit point, or a reader from before the refresh after you
re-indexed all docs in a given, now 100% deleted, segment).

Does Solr keep old readers open, older than the most recent commit?  Do you
have queries in flight that might be holding the old reader open?

Given that your small by-hand test case (3 docs) correctly showed the 100%
deleted segment being reclaimed after the soft commit interval or a manual
hard commit, something must be different in the larger use case that is
causing Solr to keep an old reader open.  Is there any logging you can
enable to understand Solr's handling of its IndexReaders' lifecycle?

Mike McCandless

http://blog.mikemccandless.com


On Mon, Aug 28, 2023 at 10:20 PM Rahul Goswami 
wrote:

> Hello,
> I am trying to execute a program to read documents segment-by-segment and
> reindex to the same index. I am reading using Lucene apis and indexing
> using solr api (in a core that is currently loaded).
>
> What I am observing is that even after a segment has been fully processed
> and an autoCommit (as well as autoSoftCommit ) has kicked in, the segment
> with 0 live docs gets left behind. *Upon Solr restart, the segment does get
> cleared succesfully.*
>
> I tried to replicate same thing without the code by indexing 3 docs on an
> empty test core, and then reindexing the same docs. The older segment gets
> deleted as soon as softCommit interval hits or an explicit commit=true is
> called.
>
> Here are the two approaches that I have tried. Approach 2 is inspired by
> the merge logic of accessing segments in case opening a DirectoryReader
> (Approach 1) externally is causing this issue.
>
> But both approaches leave undeleted segments behind until I restart Solr
> and load the core again. What am I missing? I don't have any more brain
> cells left to fry on this!
>
> Approach 1:
> =
> try (FSDirectory dir = FSDirectory.open(Paths.get(core.getIndexDir()));
> IndexReader reader = DirectoryReader.open(dir)) {
> for (LeafReaderContext lrc : reader.leaves()) {
>
>//read live docs from each leaf , create a
> SolrInputDocument out of Document and index using Solr api
>
> }
> }catch(Exception e){
>
> }
>
> Approach 2:
> ==
> ReadersAndUpdates rld = null;
> SegmentReader segmentReader = null;
> RefCounted iwRef =
> core.getSolrCoreState().getIndexWriter(core);
>  iw = iwRef.get();
> try{
>   for (SegmentCommitInfo sci : segmentInfos) {
>  rld = iw.getPooledInstance(sci, true);
>  segmentReader = rld.getReader(IOContext.READ);
>
> //process all live docs similar to above using the segmentReader.
>
> rld.release(segmentReader);
> iw.release(rld);
> }finally{
>if (iwRef != null) {
>iwRef.decref();
> }
> }
>
> Help would be much appreciated!
>
> Thanks,
> Rahul
>


Re: Vector Search with OpenAI Embeddings: Lucene Is All You Need

2023-08-31 Thread Michael McCandless
Thanks Michael, very interesting!  I of course agree that Lucene is all you
need, heh ;)

Jimmy Lin also tweeted about the strength of Lucene's HNSW:
https://twitter.com/lintool/status/1681333664431460353?s=20

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 31, 2023 at 3:31 AM Michael Wechner 
wrote:

> Hi Together
>
> You might be interesed in this paper / article
>
> https://arxiv.org/abs/2308.14963
>
> Thanks
>
> Michael
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: LuceneTestCase altered the default query cache policy

2023-06-27 Thread Michael McCandless
Hi Yuan,

[Disclaimer: I work in the same team at Amazon, customer facing product
search, where we heavily use Lucene at high scale!]

LuceneTestCase already has similar assertions, e.g. to confirm that no
system properties were changed, no threads leaked, not too many static
objects were left referenced, etc.

I think it'd make sense to also assert that tests did not leave
IndexSearcher's default cache and cache policy altered?  In general, tests
should not alter any mutable global state in the JVM.

Maybe open an issue / PR on GitHub and we can discuss it there?

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 26, 2023 at 5:26 PM Yuan Xiao  wrote:

> Hello community,
>
> I am a developer that work for Amazon Product Search. Recently I have
> experienced a use scenario that I altered the IndexSearcher’s default query
> cache policy but some of our unit tests that extended to LuceneTestCase
> failed. It take me some time to figure it out LuceneTestCase actually
> override the IndexSearcher’s defaultQueryCahchingPolicy in its before
> class.
>
> We think it would be a great feature that if LuceneTestCase could make an
> assertion that if custom query cache policy has been set, we should not
> need to override the default query policy? Or at least make an assertion,
> so developer can catch the test bug earlier?
>
> Want to share and know community’s idea on this.
>
> Thanks&Regards,
> Yuan
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene in action

2023-06-10 Thread Michael McCandless
Hi Vimal,

Indeed I think it is unlikely I have the energy for a 3rd edition ... but
anyone can drive the 3rd edition, not just the prior authors.  New authors
welcome!

> Since 2nd edition ( based on lucene 4),

I'm sorry to say that the 2nd edition is based on Lucene 3.0, not 4!  It's even
older than you thought ;)

Many of the core concepts in LIA 2 are still accurate, but the book
does not cover all the new and exciting stuff added to Lucene since.

> So should we expect 3rd edition soon or any other references to learn new
things in latest lucene ?

Well, I try to write up blog posts on exciting new features at
https://blog.mikemccandless.com/ but even that writing is challenging to
make time for...

Many other people write up exciting Lucene related blog posts too.  And
conferences like Community Over Code and Berlin Buzzwords are a great
chance to learn and meet fellow developers...

Mike McCandless

http://blog.mikemccandless.com


On Sat, Jun 10, 2023 at 3:07 AM Mark Miller  wrote:

> Nature abhors being anything but an author by name on a second tech book.
> The ruse is up after one when you have the inputs crystalized and the
> hourly wage in hand. Hard to find anything but executive producers after
> that. I’d shoot for a persuasive crowdfunding attempt.
>


Re: Analyzer.createComponents(String fieldname) only being called once, when indexing multiple documents

2023-06-09 Thread Michael McCandless
Hi Usman,

Long ago Lucene switched to reusing these analysis components (per
Analyzer, per thread), so that explains why createComponents is called once.

However, the reuse policy is controllable (expert usage), so in theory you
could implement an Analyzer.ReuseStrategy that never reuses and pass that
to super() when you create your custom Analyzer.

However, that is generally not a great idea -- it hurts indexing
throughput.

Another possibility is to create a Field with a pre-analyzed TokenStream,
basically bypassing Analyzer entirely and making your own TokenStream chain
that will alter these payload values.

Usually payloads are set/derived from the incoming tokens and would not be
dynamically set externally.  Or, such a parameter that changes per document
but not per token could be set in a doc values field instead.
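To make the two alternatives concrete (a sketch: BulletinPayloadsFilter,
boost and bulletin come from your snippet, and writer/text are assumed to
exist in the surrounding code):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FloatDocValuesField;
import org.apache.lucene.document.TextField;

Document doc = new Document();

// option A: pre-analyzed field -- you own the TokenStream chain, so per-document params are easy
StandardTokenizer src = new StandardTokenizer();
src.setReader(new StringReader(text));
TokenStream chain = new BulletinPayloadsFilter(src, boost, bulletin);
doc.add(new TextField("body", chain));

// option B: store the per-document parameter in a doc values field instead of payloads
doc.add(new FloatDocValuesField("boost", boost));

writer.addDocument(doc);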

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jun 8, 2023 at 7:08 AM Usman Shaikh  wrote:

> Hello
>
> I hope somebody can offer suggestions/advice regarding this.
>
> I'm going through some old Lucene code and have a custom Analyzer which
> overrides the createComponents method. See snippet below
>
> public class BulletinPayloadsAnalyzer extends Analyzer {
>   private boolean bulletin;
>   private float boost;
>
>   BulletinPayloadsAnalyzer(float boost) {
> this.boost = boost;
>   }
>
>   public void setBulletin(boolean bulletin) {
> this.bulletin = bulletin;
>   }
>
>   @Override
>   protected TokenStreamComponents createComponents(String fieldName) {
> Tokenizer src = new StandardTokenizer();
> BulletinPayloadsFilter result = new BulletinPayloadsFilter(src, boost,
> bulletin);
> return new TokenStreamComponents(src, result);
>   }
>
> I then use the boost and bulletin params inside my BullletinPayloadsFilter
> for some specialized logic e.g. if bulletin is true, and a keyword is
> tokenized, then boost the document by setting a PayloadAttribute with the
> boost amount.
> However I've noticed when indexing several documents at once, the
> createComponents method is only called the first time. For all subsequent
> documents execution goes straight into the incrementToken method of my
> custom BulletinPayloadsFilter.
>
> Is there a way of ensuring the createComponents method is called when
> indexing each document? As I need to  make sure the correct parameters are
> passed to the filter. These params could change for each document.
>
> Thank you
> Usman
>


Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-09 Thread Michael McCandless
I'd also love to understand this:

> using SimpleFSDirectoryFactory  (since Mmap doesn't  quite work well on
Windows for our index sizes which commonly run north of 1 TB)

Is this a known problem on certain versions of Windows?  Normally memory
mapped IO can scale to very large sizes (well beyond system RAM) and the OS
does the right thing (caches the frequently accessed parts of the index).

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 7, 2023 at 7:23 AM Adrien Grand  wrote:

> I agree it's worth discussing. I opened
> https://github.com/apache/lucene/issues/12355 and
> https://github.com/apache/lucene/issues/12356.
>
> On Tue, Jun 6, 2023 at 9:17 PM Rahul Goswami 
> wrote:
> >
> > Thanks Adrien. I spent some time trying to understand the readByte() in
> > ReverseRandomAccessReader (through FST) and compare with 7.x.  Although I
> > don't understand ALL of the details and reasoning for always loading the
> > FST (and in turn the term index) off-heap (as discussed in
> > https://github.com/apache/lucene/issues/10297 ) I understand that this
> is
> > essentially causing disk access for every single byte during readByte().
> >
> > Does this warrant a JIRA for regression?
> >
> > As mentioned, I am noticing a 10x slowdown in
> SegmentTermsEnum.seekExact()
> > affecting atomic update performance . For setups like mine that can't use
> > mmap due to large indexes this would be a legit regression, no?
> >
> > - Rahul
> >
> > On Tue, Jun 6, 2023 at 10:09 AM Adrien Grand  wrote:
> >
> > > Yes, this changed in 8.x:
> > >  - 8.0 moved the terms index off-heap for non-PK fields with
> > > MMapDirectory. https://github.com/apache/lucene/issues/9681
> > >  - Then in 8.6 the FST was moved off-heap all the time.
> > > https://github.com/apache/lucene/issues/10297
> > >
> > > More generally, there's a few files that are no longer loaded in heap
> > > in 8.x. It should be possible to load them back in heap by doing
> > > something like that (beware, I did not actually test this code):
> > >
> > > class MyHeapDirectory extends FilterDirectory {
> > >
> > >   MyHeapDirectory(Directory in) {
> > > super(in);
> > >   }
> > >
> > >   @Override
> > >   public IndexInput openInput(String name, IOContext context) throws
> > > IOException {
> > > if (context.load == false) {
> > >   return super.openInput(name, context);
> > > } else {
> > >   try (IndexInput in = super.openInput(name, context)) {
> > > byte[] bytes = new byte[Math.toIntExact(in.length())];
> > > in.readBytes(bytes, bytes.length);
> > > ByteBuffer bb =
> > >
> ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asReadOnlyBuffer();
> > > return new ByteBuffersIndexInput(new
> > > ByteBuffersDataInput(Collections.singletonList(bb)),
> > > "ByteBuffersIndexInput(" + name + ")");
> > >   }
> > > }
> > >   }
> > >
> > > }
> > >
> > > On Tue, Jun 6, 2023 at 3:41 PM Rahul Goswami 
> > > wrote:
> > > >
> > > > Thanks Adrien. Is this behavior of FST something that has changed in
> > > Lucene
> > > > 8.x (from 7.x)?
> > > > Also, is the terms index not loaded into memory anymore in 8.x?
> > > >
> > > > To your point on MMapDirectoryFactory, it is much faster as you
> > > > anticipated, but the indexes commonly being >1 TB makes the Windows
> > > machine
> > > > freeze to a point I sometimes can't even connect to the VM.
> > > > SimpleFSDirectory works well for us from that standpoint.
> > > >
> > > > To add, both NIOFS and SimpleFS have similar indexing benchmarks on
> > > > Windows. I understand it is because of the Java bug which
> synchronizes
> > > > internally in the native call for NIOFs.
> > > >
> > > > -Rahul
> > > >
> > > > On Tue, Jun 6, 2023 at 9:32 AM Adrien Grand 
> wrote:
> > > >
> > > > > +Alan Woodward helped me better understand what is going on here.
> > > > > BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
> > > > > doesn't play well with the fact that the FST reads bytes backwards:
> > > > > every call to readByte() triggers a refill of 1kB because it wants
> to
> > > > > read the byte that is just before what the buffer contains.
> > > > >
> > > > > On Tue, Jun 6, 2023 at 2:07 PM Adrien Grand 
> wrote:
> > > > > >
> > > > > > My best guess based on your description of the issue is that
> > > > > > SimpleFSDirectory doesn't like the fact that the terms index now
> > > reads
> > > > > > data directly from the directory instead of loading the terms
> index
> > > in
> > > > > > heap. Would you be able to run the same benchmark with
> MMapDirectory
> > > > > > to check if it addresses the regression?
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 6, 2023 at 5:47 AM Rahul Goswami <
> rahul196...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > Hello,
> > > > > > > We started experiencing slowness with atomic updates in Solr
> after
> > > > > > > upgrading from 7.7.2 to 8.11.1. Running several tests revealed
> the
> > > > > > > slowness to be in RealTimeGet's

Re: Info required on licensing of Lucene component

2023-05-18 Thread Michael McCandless
Wonderful, thank you for raising the issue and bringing closure now!  Keep
the issues coming :)

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 18, 2023 at 12:59 AM external-opensource-requests(mailer list) <
external-opensource-reque...@cisco.com> wrote:

> Hello Michael
>
>
>
> We do see the fix included in Lucene 9.6.0.
>
> Appreciate your prompt response and thank you so much for resolving the
> issue!
>
>
>
> Regards,
>
> Open Source Request Team
>
>
>
> *From:* Michael McCandless 
> *Sent:* 11 May 2023 07:07 PM
> *To:* external-opensource-requests(mailer list) <
> external-opensource-reque...@cisco.com>
> *Cc:* java-user@lucene.apache.org
> *Subject:* Re: Info required on licensing of Lucene component
>
>
>
> Thank you Mike S for adding the 9.6.0 Milestone tag to the issue!
>
>
>
> I wonder if we are able to close an issue AND attach a milestone label in
> a commit message?  I tried to research a bit and didn't find anything
> except all the synonyms for "closes":
> https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue
>
>
>
> I'll start a separate thread ...
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
>
>
> On Wed, May 10, 2023 at 12:28 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> Hello,
>
>
>
> That's a great question, and, looking through the GitHub PR that was
> merged, it sure is hard to track down whether it was backported to 9.x and
> exactly which release.
>
>
>
> In the Jira days we would have a clear "fix version" to make this clear.
> How does one do this with GitHub issues?
>
>
>
> But digging in the commit logs, it looks to me like this fix will be
> included in Lucene 9.6.x, the next feature release, which should be
> released any day now (the release VOTE just passed yesterday).
>
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
>
>
> On Wed, May 10, 2023 at 10:01 AM external-opensource-requests(mailer list)
>  wrote:
>
> Hello Michael
>
>
>
> The thread https://github.com/apache/lucene/issues/12226 seems to be
> Closed now and as per the updates to *Cleanup NOTICE.txt *
> https://github.com/apache/lucene/pull/12227 - JUnit is not packaged in a
> Lucene release.
>
> But we can still see the Junit reference in Notices.txt file in maven for
> lucene components, for example
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
> and
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
>
>
>
> Hence, just wanted to confirm exactly which Lucene release is the
> update/pull request applied to?
>
>
>
>
>
> Thanks,
>
> Open Source Request Team
>
>
>
>
>
> *From:* Michael McCandless 
> *Sent:* 06 April 2023 03:39 PM
> *To:* java-user@lucene.apache.org; external-opensource-requests(mailer
> list) 
> *Subject:* Re: Info required on licensing of Lucene component
>
>
>
> > In that case, can you’ll update your source repo for Lucene to exclude
> references to ‘junit’ from Notices.txt file since it is something which is
> not part of distribution for Lucene.
>
>
>
> That sounds reasonable to me.  I'll open an issue in our GitHub repo, but
> IANAL and I'm not sure how to specifically proceed.
>
>
>
> I opened https://github.com/apache/lucene/issues/12226 -- let's continue
> discussion there?
>
>
> Thanks for raising this Open Source Request Team at Cisco!
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
>
>
> On Tue, Mar 21, 2023 at 7:58 PM Michael Sokolov 
> wrote:
>
> Lucene is licensed under the Apache license, just as it says in the
> LICENSE file. junit is used for testing Lucene and is not
> redistributed with it. Using Lucene in your code does not mean you are
> using junit, except in some extremely philosophical sense. EG Lucene
> developers may have developed Lucene using Windows on their laptops -
> that doesn't mean you need a WIndows license to use Lucene. IANAL, so
> you should ask yours - I'm sure someone at Cisco can help you sort
> this out?
>
> On Tue, Mar 21, 2023 at 10:13 AM external-opensource-requests(mailer
> list)  wrote:
> >
> > Hello Team
> >
> > I hope you are doing well!!
> >
> > This is regarding Lucene component licensing.
> > The maven repo link
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
> for lucene-queries 4.10.4 shows Apache 2.0 license associated wi

Re: Question - Why stopwords.txt provided by smartcn contains blank lines?

2023-05-15 Thread Michael McCandless
Hi Jerry,

I agree, that makes no sense!  Maybe the stop words loader should ignore
truly blank lines?

Also, the comments on lines 57 and 59 are confusing -- there are no
(default) English and Chinese stopwords in the file.  I guess they are
placeholders.
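As a workaround in the meantime, you could build the stop set yourself and
skip blank lines, something like this sketch (the "//" comment-marker
filtering is an assumption about that file's format):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

class StopwordsSketch {
  static CharArraySet loadStopSet() throws IOException {
    List<String> words = new ArrayList<>();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
        SmartChineseAnalyzer.class.getResourceAsStream("stopwords.txt"),
        StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty() && !line.startsWith("//")) {  // drop blank lines and comments
          words.add(line);
        }
      }
    }
    return new CharArraySet(words, true);  // then pass to new SmartChineseAnalyzer(stopSet)
  }
}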

Could you open an issue in Lucene's GitHub issue tracker (
https://github.com/apache/lucene/issues ) and let's iterate from there?

Thanks!

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 15, 2023 at 5:25 AM Jerry Chin  wrote:

> Hi all,
>
> This following line contains two blank lines, including line 56 & 58:
>
> https://github.com/apache/lucene/blob/main/lucene/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn/smart/stopwords.txt
>
> As a result,  SmartChineseAnalyzer.getDefaultStopSet() will produce a empty
> string as stop words, but it makes no sense to have empty string as stop
> word right?
>
> Much appreciated for your help!
>
>
>
>
> *Regards,Jerry Chin.*
> *下述真理不证自明:凡为人类,生而平等,秉造物者之赐,拥诸不可剥夺之权利,包含生命权、自由权、及追求幸福权。*
>


Re: Info required on licensing of Lucene component

2023-05-11 Thread Michael McCandless
Thank you Mike S for adding the 9.6.0 Milestone tag to the issue!

I wonder if we are able to close an issue AND attach a milestone label in a
commit message?  I tried to research a bit and didn't find anything except
all the synonyms for "closes":
https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue

I'll start a separate thread ...

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 10, 2023 at 12:28 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> Hello,
>
> That's a great question, and, looking through the GitHub PR that was
> merged, it sure is hard to track down whether it was backported to 9.x and
> exactly which release.
>
> In the Jira days we would have a clear "fix version" to make this clear.
> How does one do this with GitHub issues?
>
> But digging in the commit logs, it looks to me like this fix will be
> included in Lucene 9.6.x, the next feature release, which should be
> released any day now (the release VOTE just passed yesterday).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, May 10, 2023 at 10:01 AM external-opensource-requests(mailer list)
>  wrote:
>
>> Hello Michael
>>
>>
>>
>> The thread https://github.com/apache/lucene/issues/12226 seems to be
>> Closed now and as per the updates to *Cleanup NOTICE.txt *
>> https://github.com/apache/lucene/pull/12227 - JUnit is not packaged in a
>> Lucene release.
>>
>> But we can still see the Junit reference in Notices.txt file in maven for
>> lucene components, for example
>> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
>> and
>> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
>>
>>
>>
>> Hence, just wanted to confirm exactly which Lucene release is the
>> update/pull request applied to?
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Open Source Request Team
>>
>>
>>
>>
>>
>> *From:* Michael McCandless 
>> *Sent:* 06 April 2023 03:39 PM
>> *To:* java-user@lucene.apache.org; external-opensource-requests(mailer
>> list) 
>> *Subject:* Re: Info required on licensing of Lucene component
>>
>>
>>
>> > In that case, can you’ll update your source repo for Lucene to exclude
>> references to ‘junit’ from Notices.txt file since it is something which is
>> not part of distribution for Lucene.
>>
>>
>>
>> That sounds reasonable to me.  I'll open an issue in our GitHub repo, but
>> IANAL and I'm not sure how to specifically proceed.
>>
>>
>>
>> I opened https://github.com/apache/lucene/issues/12226 -- let's continue
>> discussion there?
>>
>>
>> Thanks for raising this Open Source Request Team at Cisco!
>>
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>>
>>
>>
>> On Tue, Mar 21, 2023 at 7:58 PM Michael Sokolov 
>> wrote:
>>
>> Lucene is licensed under the Apache license, just as it says in the
>> LICENSE file. junit is used for testing Lucene and is not
>> redistributed with it. Using Lucene in your code does not mean you are
>> using junit, except in some extremely philosophical sense. EG Lucene
>> developers may have developed Lucene using Windows on their laptops -
>> that doesn't mean you need a WIndows license to use Lucene. IANAL, so
>> you should ask yours - I'm sure someone at Cisco can help you sort
>> this out?
>>
>> On Tue, Mar 21, 2023 at 10:13 AM external-opensource-requests(mailer
>> list)  wrote:
>> >
>> > Hello Team
>> >
>> > I hope you are doing well!!
>> >
>> > This is regarding Lucene component licensing.
>> > The maven repo link
>> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
>> for lucene-queries 4.10.4 shows Apache 2.0 license associated with the
>> component.
>> > Also, the archive (lucene-queries-4.10.4-sources.jar) uploaded has a
>> LICENSE.txt file which has Apache 2.0 license, but it also includes a
>> NOTICE.txt file which shows JUnit (junit-4.10) licensed under the Common
>> Public License v. 1.0. But there is no code associated with Junit included
>> in the source archive (lucene-queries-4.10.4-sources.jar) file.
>> >
>> > In this case, since Common Public License 1.0 is more restrictive
>> compared to Apache 2.0, for our better understanding,  can you clarify to
>> us on what is the actual Open Source license associated with the Lucene
>> component?
>> >
>> > Mentioning just two of the lucene components in mail as example for
>> your reference "lucene-backward-codecs 9.3.0"
>> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
>> >
>> > Looking forward to your reply.
>> >
>> >
>> > Thanks ,
>> > Open Source Request Team
>> >
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>


Re: Info required on licensing of Lucene component

2023-05-10 Thread Michael McCandless
Hello,

That's a great question, and, looking through the GitHub PR that was
merged, it sure is hard to track down whether it was backported to 9.x and
exactly which release.

In the Jira days we would have a clear "fix version" to make this clear.
How does one do this with GitHub issues?

But digging in the commit logs, it looks to me like this fix will be
included in Lucene 9.6.x, the next feature release, which should be
released any day now (the release VOTE just passed yesterday).

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 10, 2023 at 10:01 AM external-opensource-requests(mailer list) <
external-opensource-reque...@cisco.com> wrote:

> Hello Michael
>
>
>
> The thread https://github.com/apache/lucene/issues/12226 seems to be
> Closed now and as per the updates to *Cleanup NOTICE.txt *
> https://github.com/apache/lucene/pull/12227 - JUnit is not packaged in a
> Lucene release.
>
> But we can still see the Junit reference in Notices.txt file in maven for
> lucene components, for example
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
> and
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
>
>
>
> Hence, just wanted to confirm exactly which Lucene release is the
> update/pull request applied to?
>
>
>
>
>
> Thanks,
>
> Open Source Request Team
>
>
>
>
>
> *From:* Michael McCandless 
> *Sent:* 06 April 2023 03:39 PM
> *To:* java-user@lucene.apache.org; external-opensource-requests(mailer
> list) 
> *Subject:* Re: Info required on licensing of Lucene component
>
>
>
> > In that case, can you’ll update your source repo for Lucene to exclude
> references to ‘junit’ from Notices.txt file since it is something which is
> not part of distribution for Lucene.
>
>
>
> That sounds reasonable to me.  I'll open an issue in our GitHub repo, but
> IANAL and I'm not sure how to specifically proceed.
>
>
>
> I opened https://github.com/apache/lucene/issues/12226 -- let's continue
> discussion there?
>
>
> Thanks for raising this Open Source Request Team at Cisco!
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
>
>
>
> On Tue, Mar 21, 2023 at 7:58 PM Michael Sokolov 
> wrote:
>
> Lucene is licensed under the Apache license, just as it says in the
> LICENSE file. junit is used for testing Lucene and is not
> redistributed with it. Using Lucene in your code does not mean you are
> using junit, except in some extremely philosophical sense. EG Lucene
> developers may have developed Lucene using Windows on their laptops -
> that doesn't mean you need a WIndows license to use Lucene. IANAL, so
> you should ask yours - I'm sure someone at Cisco can help you sort
> this out?
>
> On Tue, Mar 21, 2023 at 10:13 AM external-opensource-requests(mailer
> list)  wrote:
> >
> > Hello Team
> >
> > I hope you are doing well!!
> >
> > This is regarding Lucene component licensing.
> > The maven repo link
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
> for lucene-queries 4.10.4 shows Apache 2.0 license associated with the
> component.
> > Also, the archive (lucene-queries-4.10.4-sources.jar) uploaded has a
> LICENSE.txt file which has Apache 2.0 license, but it also includes a
> NOTICE.txt file which shows JUnit (junit-4.10) licensed under the Common
> Public License v. 1.0. But there is no code associated with Junit included
> in the source archive (lucene-queries-4.10.4-sources.jar) file.
> >
> > In this case, since Common Public License 1.0 is more restrictive
> compared to Apache 2.0, for our better understanding,  can you clarify to
> us on what is the actual Open Source license associated with the Lucene
> component?
> >
> > Mentioning just two of the lucene components in mail as example for your
> reference "lucene-backward-codecs 9.3.0"
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
> >
> > Looking forward to your reply.
> >
> >
> > Thanks ,
> > Open Source Request Team
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Info required on licensing of Lucene component

2023-04-06 Thread Michael McCandless
> In that case, can you’ll update your source repo for Lucene to exclude
references to ‘junit’ from Notices.txt file since it is something which is
not part of distribution for Lucene.

That sounds reasonable to me.  I'll open an issue in our GitHub repo, but
IANAL and I'm not sure how to specifically proceed.

I opened https://github.com/apache/lucene/issues/12226 -- let's continue
discussion there?

Thanks for raising this Open Source Request Team at Cisco!

Mike McCandless

http://blog.mikemccandless.com


On Tue, Mar 21, 2023 at 7:58 PM Michael Sokolov  wrote:

> Lucene is licensed under the Apache license, just as it says in the
> LICENSE file. junit is used for testing Lucene and is not
> redistributed with it. Using Lucene in your code does not mean you are
> using junit, except in some extremely philosophical sense. EG Lucene
> developers may have developed Lucene using Windows on their laptops -
> that doesn't mean you need a Windows license to use Lucene. IANAL, so
> you should ask yours - I'm sure someone at Cisco can help you sort
> this out?
>
> On Tue, Mar 21, 2023 at 10:13 AM external-opensource-requests(mailer
> list)  wrote:
> >
> > Hello Team
> >
> > I hope you are doing well!!
> >
> > This is regarding Lucene component licensing.
> > The maven repo link
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
> for lucene-queries 4.10.4 shows Apache 2.0 license associated with the
> component.
> > Also, the archive (lucene-queries-4.10.4-sources.jar) uploaded has a
> LICENSE.txt file which has Apache 2.0 license, but it also includes a
> NOTICE.txt file which shows JUnit (junit-4.10) licensed under the Common
> Public License v. 1.0. But there is no code associated with Junit included
> in the source archive (lucene-queries-4.10.4-sources.jar) file.
> >
> > In this case, since Common Public License 1.0 is more restrictive
> compared to Apache 2.0, for our better understanding,  can you clarify to
> us on what is the actual Open Source license associated with the Lucene
> component?
> >
> > Mentioning just two of the lucene components in mail as example for your
> reference "lucene-backward-codecs 9.3.0"
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
> >
> > Looking forward to your reply.
> >
> >
> > Thanks ,
> > Open Source Request Team
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Info required on licensing of Lucene component

2023-04-04 Thread Michael McCandless
Hello,

You may have missed the two responses already sent to this email, since by default
responses only go to the user list, not back to the individual.  See the
archived responses here:
https://lists.apache.org/thread/zg01tkq8wtmym27q3dolcg1msbtoxoxl

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 4, 2023 at 8:52 AM external-opensource-requests(mailer list)
 wrote:

> Hi Team
>
> Hope all is well.
> Just touching base for a quick update on the previous mail.
>
>
> Appreciate your help on this.
>
> Thanks,
> Open Source Request Team
>
> From: external-opensource-requests(mailer list)
> Sent: 21 March 2023 07:42 PM
> To: 'java-user@lucene.apache.org' 
> Subject: Info required on licensing of Lucene component
>
> Hello Team
>
> I hope you are doing well!!
>
> This is regarding Lucene component licensing.
> The maven repo link
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-queries/4.10.4
> for lucene-queries 4.10.4 shows Apache 2.0 license associated with the
> component.
> Also, the archive (lucene-queries-4.10.4-sources.jar) uploaded has a
> LICENSE.txt file which has Apache 2.0 license, but it also includes a
> NOTICE.txt file which shows JUnit (junit-4.10) licensed under the Common
> Public License v. 1.0. But there is no code associated with Junit included
> in the source archive (lucene-queries-4.10.4-sources.jar) file.
>
> In this case, since Common Public License 1.0 is more restrictive compared
> to Apache 2.0, for our better understanding,  can you clarify to us on what
> is the actual Open Source license associated with the Lucene component?
>
> Mentioning just two of the lucene components in mail as example for your
> reference "lucene-backward-codecs 9.3.0"
> https://mvnrepository.com/artifact/org.apache.lucene/lucene-backward-codecs/9.3.0
>
> Looking forward to your reply.
>
>
> Thanks ,
> Open Source Request Team
>
>


Re: Vector Search on Lucene

2023-03-16 Thread Michael McCandless
Note that Lucene's demo package (IndexFiles.java, SearchFiles.java) also
show examples of how to index and search KNN vectors.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Mar 2, 2023 at 4:46 AM Michael Wechner 
wrote:

> Hi Marcos
>
> The indexing looks kind of like this:
>
> Document doc = new Document();
> float[] vector = getEmbedding(text);
> FieldType vectorFieldType = KnnVectorField.createFieldType(vector.length,
> VectorSimilarityFunction.COSINE);
> KnnVectorField vectorField = new KnnVectorField("my_vector_field", vector,
> vectorFieldType);
> doc.add(vectorField);
> writer.addDocument(doc);
>
>
> And the searching / retrieval looks kind of like this:
>
> float[] queryVector = getEmbedding(question);
> int k = 7; // INFO: The number of documents to find
> Query query = new KnnVectorQuery("my_vector_field", queryVector, k);
> IndexSearcher searcher = new IndexSearcher(indexReader);
> TopDocs topDocs = searcher.search(query, k);
>
> Also see
>
> https://lucene.apache.org/core/9_5_0/demo/index.html#Embeddings
>
> https://lucene.apache.org/core/9_5_0/demo/org/apache/lucene/demo/knn/package-summary.html
>
> HTH
>
> Michael
>
>
>
>
>
> Am 02.03.23 um 10:25 schrieb marcos rebelo:
> > Hi all,
> >
> > I'm willing to use Vector Search with Lucene.
> >
> > I have vectors created for queries and documents outside Lucene.
> > I would like to upload the document vectors to a Lucene index, Then use
> > Lucene to filter the documents (like classical search) and rank the
> > remaining products with the Vectors.
> >
> > For performance reasons I would like some fast KNN for the rankers.
> >
> > I looked on Google and I didn't find any document with some code samples.
> >
> > 2 questions:
> >   * Is this a correct design pattern?
> >   * Is there a good article explaining how to do this with Lucene?
> >
> > Best Regards
> > Marcos Rebelo
> >
>


Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-25 Thread Michael McCandless
Thank you Tomoko!  It looks AWESOME!

I will work on fixing jirasearch to index from GitHub ...

Mike

On Thu, Aug 25, 2022 at 4:32 AM Vigya Sharma  wrote:

> Love this! Thanks for all the hard work, Tomoko.
> -
> Vigya
>
>
> On Wed, Aug 24, 2022 at 12:27 PM Michael Sokolov 
> wrote:
>
>> Thanks! It seems to be working nicely.
>>
>> Question about the fix-version: tagging. I wonder if going forward we
>> want to maintain that for new issues? I happened to notice there is also
>> this "milestone" feature in github -- does that seem like a place to
>> put version information?
>>
>> On Wed, Aug 24, 2022 at 3:20 PM Tomoko Uchida
>>  wrote:
>> >
>> > 
>> >
>> > Issue migration has been completed (except for minor cleanups).
>> > This is the Jira -> GitHub issue number mapping for possible future
>> usage.
>> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/issue-map.csv.20220823_final
>> >
>> > GitHub issue is now fully available for all issues.
>> > For issue label management (e.g. "fix-version"), please review this
>> manual.
>> >
>> https://github.com/apache/lucene/blob/main/dev-docs/github-issues-howto.md
>> >
>> > Tomoko
>> >
>> >
>> > On Mon, Aug 22, 2022 at 7:46 PM Michael McCandless wrote:
>> >>
>> >> Wooot!  Thank you so much Tomoko!!
>> >>
>> >> Mike
>> >>
>> >> On Mon, Aug 22, 2022 at 6:44 AM Tomoko Uchida <
>> tomoko.uchida.1...@gmail.com> wrote:
>> >>>
>> >>> 
>> >>>
>> >>> Issue migration has been started. Jira is now read-only.
>> >>>
>> >>> GitHub issue is available for new issues.
>> >>>
>> >>> - You should open new issues on GitHub. E.g.
>> https://github.com/apache/lucene/issues/1078
>> >>> - Do not touch issues that are in the middle of migration, please.
>> E.g. https://github.com/apache/lucene/issues/1072
>> >>>   - While you cannot break these issues, migration scripts can
>> modify/overwrite your comments on the issues.
>> >>> - Pull requests are not affected. You can open/update PRs as usual.
>> Please let me know if you have any trouble with PRs.
>> >>>
>> >>>
>> >>> Tomoko
>> >>>
>> >>>
>> >>> On Thu, Aug 18, 2022 at 6:23 PM Tomoko Uchida wrote:
>> >>>>
>> >>>> Hello all,
>> >>>>
>> >>>> The Lucene project decided to move our issue tracking system from
>> Jira to GitHub and migrate all Jira issues to GitHub.
>> >>>>
>> >>>> We start issue migration on Monday, August 22 at 8:00 UTC.
>> >>>> 1) We make Jira read-only before migration. You cannot update
>> existing issues until the migration is completed.
>> >>>> 2) You can use GitHub for opening NEW issues or pull requests during
>> migration.
>> >>>>
>> >>>> Note that issues should be raised in Jira at this moment, although
>> GitHub issue is already enabled in the Lucene repository.
>> >>>> Please do not raise issues in GitHub until we let you know that
>> GitHub issue is officially available. We immediately close any issues on
>> GitHub until then.
>> >>>>
>> >>>> Here are the detailed plan/migration steps.
>> >>>> https://github.com/apache/lucene-jira-archive/issues/7
>> >>>>
>> >>>> Tomoko
>> >>
>> >> --
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>>
>> -
>>
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
>
> --
> - Vigya
>
-- 
Mike McCandless

http://blog.mikemccandless.com


Re: [ANNOUNCE] Issue migration Jira to GitHub starts on Monday, August 22

2022-08-22 Thread Michael McCandless
Wooot!  Thank you so much Tomoko!!

Mike

On Mon, Aug 22, 2022 at 6:44 AM Tomoko Uchida 
wrote:

> 
>
> Issue migration has been started. Jira is now read-only.
>
> GitHub issue is available for new issues.
>
> - You should open new issues on GitHub. E.g.
> https://github.com/apache/lucene/issues/1078
> - Do not touch issues that are in the middle of migration, please. E.g.
> https://github.com/apache/lucene/issues/1072
>   - While you cannot break these issues, migration scripts can
> modify/overwrite your comments on the issues.
> - Pull requests are not affected. You can open/update PRs as usual. Please
> let me know if you have any trouble with PRs.
>
>
> Tomoko
>
>
> On Thu, Aug 18, 2022 at 6:23 PM Tomoko Uchida wrote:
>
>> Hello all,
>>
>> The Lucene project decided to move our issue tracking system from Jira to
>> GitHub and migrate all Jira issues to GitHub.
>>
>> We start issue migration on Monday, August 22 at 8:00 UTC.
>> 1) We make Jira read-only before migration. You cannot update existing
>> issues until the migration is completed.
>> 2) You can use GitHub for opening NEW issues or pull requests during
>> migration.
>>
>> Note that issues should be raised in Jira at this moment, although GitHub
>> issue is already enabled in the Lucene repository.
>> Please do not raise issues in GitHub until we let you know that GitHub
>> issue is officially available. We immediately close any issues on GitHub
>> until then.
>>
>> Here are the detailed plan/migration steps.
>> https://github.com/apache/lucene-jira-archive/issues/7
>>
>> Tomoko
>>
> --
Mike McCandless

http://blog.mikemccandless.com


Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
OK done:
https://github.com/apache/lucene-jira-archive/commit/13fa4cb46a1a6d609448240e4f66c263da8b3fd1

Mike McCandless

http://blog.mikemccandless.com


On Sat, Aug 6, 2022 at 10:29 AM Baris Kazar  wrote:

> I think so.
> Best regards
> --
> *From:* Michael McCandless 
> *Sent:* Saturday, August 6, 2022 10:12 AM
> *To:* java-user@lucene.apache.org 
> *Cc:* Baris Kazar 
> *Subject:* Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> before Thursday August 4 midnight (in your local time)
>
> Thanks Baris,
>
> And your Jira ID is bkazar right?
>
> Mike
>
> On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar 
> wrote:
>
>> My github username is bmkazar
>> can You please register me?
>> Best regards
>> 
>> From: Michael McCandless 
>> Sent: Saturday, August 6, 2022 6:05:51 AM
>> To: d...@lucene.apache.org 
>> Cc: Lucene Users ; java-dev <
>> java-...@lucene.apache.org>
>> Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
>> before Thursday August 4 midnight (in your local time)
>>
>> Hi Adam, I added your linked accounts here:
>>
>> https://urldefense.com/v3/__https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nLk1DO04g$
>>
>> And Tomoko added Rushabh's linked accounts here:
>>
>>
>> https://urldefense.com/v3/__https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nITwUFX0A$
>>
>> Keep the linked accounts coming!
>>
>> Mike
>>
>> On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah
>>  wrote:
>>
>> > Hi,
>> > My mapping is:
>> > JiraName,GitHubAccount,JiraDispName
>> > shahrs87, shahrs87, Rushabh Shah
>> >
>> > Thank you Tomoko and Mike for all of your hard work.
>> >
>> >
>> >
>> >
>> > On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless <
>> > luc...@mikemccandless.com> wrote:
>> >
>> >> Hello Lucene users, contributors and developers,
>> >>
>> >> If you have used Lucene's Jira and you have a GitHub account as well,
>> >> please check whether your user id mapping is in this file:
>> >>
>> https://urldefense.com/v3/__https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nLjA_KarQ$
>> >>
>> >> If not, please reply to this email and we will try to add you.
>> >>
>> >> Please forward this email to anyone you know might be impacted and who
>> >> might not be tracking the Lucene lists.
>> >>
>> >>
>> >> Full details:
>> >>
>> >> The Lucene project will soon migrate from Jira to GitHub for issue
>> >> tracking.
>> >>
>> >> There have been discussions, votes, a migration tool created / iterated
>> >> (thanks to Tomoko Uchida's incredibly hard work), all iterating on
>> Lucene's
>> >> dev list.
>> >>
>> >> When we run the migration, we would like to map Jira users to the right
>> >> GitHub users to properly @-mention the right person and make it easier
>> for
>> >> you to find issues you have engaged with.
>> >>
>> >> Mike McCandless
>> >>
>> >>
>> https://urldefense.com/v3/__http://blog.mikemccandless.com__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nIyHPa_wA$
>> >>
>> > --
>> Mike McCandless
>>
>>
>> https://urldefense.com/v3/__http://blog.mikemccandless.com__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nIyHPa_wA$
>>
> --
> Mike McCandless
>
> http://blog.mikemccandless.com
> <https://urldefense.com/v3/__http://blog.mikemccandless.com__;!!ACWV5N9M2RV99hQ!JIy9w3Oyvgxri_lPKzszX-rCz4T17oAvHWxs3gLwaxWQ3Ah7toRiMqu3hYT0YP-UnxPR1mSnuaqAoGbejVCNsw$>
>


Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
Thanks Baris,

And your Jira ID is bkazar right?

Mike

On Sat, Aug 6, 2022 at 10:05 AM Baris Kazar  wrote:

> My github username is bmkazar
> can You please register me?
> Best regards
> ____
> From: Michael McCandless 
> Sent: Saturday, August 6, 2022 6:05:51 AM
> To: d...@lucene.apache.org 
> Cc: Lucene Users ; java-dev <
> java-...@lucene.apache.org>
> Subject: Re: [HELP] Link your Apache Lucene Jira and GitHub account ids
> before Thursday August 4 midnight (in your local time)
>
> Hi Adam, I added your linked accounts here:
>
> https://urldefense.com/v3/__https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nLk1DO04g$
>
> And Tomoko added Rushabh's linked accounts here:
>
>
> https://urldefense.com/v3/__https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nITwUFX0A$
>
> Keep the linked accounts coming!
>
> Mike
>
> On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah
>  wrote:
>
> > Hi,
> > My mapping is:
> > JiraName,GitHubAccount,JiraDispName
> > shahrs87, shahrs87, Rushabh Shah
> >
> > Thank you Tomoko and Mike for all of your hard work.
> >
> >
> >
> >
> > On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Hello Lucene users, contributors and developers,
> >>
> >> If you have used Lucene's Jira and you have a GitHub account as well,
> >> please check whether your user id mapping is in this file:
> >>
> https://urldefense.com/v3/__https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nLjA_KarQ$
> >>
> >> If not, please reply to this email and we will try to add you.
> >>
> >> Please forward this email to anyone you know might be impacted and who
> >> might not be tracking the Lucene lists.
> >>
> >>
> >> Full details:
> >>
> >> The Lucene project will soon migrate from Jira to GitHub for issue
> >> tracking.
> >>
> >> There have been discussions, votes, a migration tool created / iterated
> >> (thanks to Tomoko Uchida's incredibly hard work), all iterating on
> Lucene's
> >> dev list.
> >>
> >> When we run the migration, we would like to map Jira users to the right
> >> GitHub users to properly @-mention the right person and make it easier
> for
> >> you to find issues you have engaged with.
> >>
> >> Mike McCandless
> >>
> >>
> https://urldefense.com/v3/__http://blog.mikemccandless.com__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nIyHPa_wA$
> >>
> > --
> Mike McCandless
>
>
> https://urldefense.com/v3/__http://blog.mikemccandless.com__;!!ACWV5N9M2RV99hQ!KNwyR7RuqeuKpyzEemagEZzGRGtdqjpE-OWaDfjjyZVHJ-zgsGLyYJhZ7ZWJCI1NrWR6H4DYdMbB8nIyHPa_wA$
>
-- 
Mike McCandless

http://blog.mikemccandless.com


Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-06 Thread Michael McCandless
Hi Adam, I added your linked accounts here:
https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c

And Tomoko added Rushabh's linked accounts here:

https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb

Keep the linked accounts coming!

Mike

On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah
 wrote:

> Hi,
> My mapping is:
> JiraName,GitHubAccount,JiraDispName
> shahrs87, shahrs87, Rushabh Shah
>
> Thank you Tomoko and Mike for all of your hard work.
>
>
>
>
> On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hello Lucene users, contributors and developers,
>>
>> If you have used Lucene's Jira and you have a GitHub account as well,
>> please check whether your user id mapping is in this file:
>> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
>>
>> If not, please reply to this email and we will try to add you.
>>
>> Please forward this email to anyone you know might be impacted and who
>> might not be tracking the Lucene lists.
>>
>>
>> Full details:
>>
>> The Lucene project will soon migrate from Jira to GitHub for issue
>> tracking.
>>
>> There have been discussions, votes, a migration tool created / iterated
>> (thanks to Tomoko Uchida's incredibly hard work), all iterating on Lucene's
>> dev list.
>>
>> When we run the migration, we would like to map Jira users to the right
>> GitHub users to properly @-mention the right person and make it easier for
>> you to find issues you have engaged with.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
> --
Mike McCandless

http://blog.mikemccandless.com


Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-07-31 Thread Michael McCandless
Thanks, added here:
https://github.com/apache/lucene-jira-archive/commit/d91534e67b35004f212100d73008283327f2f1e7

Please confirm it's right ;)

Mike

On Sun, Jul 31, 2022 at 7:27 AM 翁剑平  wrote:

> Hi, could you help to add my info, thanks a lot
> jira full name: jianping weng
> github id: wjp719
>
> the jira issue I create before:
> https://issues.apache.org/jira/browse/LUCENE-10425
> the github pr I submit before: https://github.com/apache/lucene/pull/780
>
>
> Best Regards,
> jianping weng
>
>
>
> On Sun, Jul 31, 2022 at 6:08 PM Michael McCandless wrote:
>
>> Hello Lucene users, contributors and developers,
>>
>> If you have used Lucene's Jira and you have a GitHub account as well,
>> please check whether your user id mapping is in this file:
>> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
>>
>> If not, please reply to this email and we will try to add you.
>>
>> Please forward this email to anyone you know might be impacted and who
>> might not be tracking the Lucene lists.
>>
>>
>> Full details:
>>
>> The Lucene project will soon migrate from Jira to GitHub for issue
>> tracking.
>>
>> There have been discussions, votes, a migration tool created / iterated
>> (thanks to Tomoko Uchida's incredibly hard work), all iterating on Lucene's
>> dev list.
>>
>> When we run the migration, we would like to map Jira users to the right
>> GitHub users to properly @-mention the right person and make it easier for
>> you to find issues you have engaged with.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
> --
Mike McCandless

http://blog.mikemccandless.com


[HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-07-31 Thread Michael McCandless
Hello Lucene users, contributors and developers,

If you have used Lucene's Jira and you have a GitHub account as well,
please check whether your user id mapping is in this file:
https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified

If not, please reply to this email and we will try to add you.

Please forward this email to anyone you know might be impacted and who
might not be tracking the Lucene lists.


Full details:

The Lucene project will soon migrate from Jira to GitHub for issue tracking.

There have been discussions, votes, a migration tool created / iterated
(thanks to Tomoko Uchida's incredibly hard work), all iterating on Lucene's
dev list.

When we run the migration, we would like to map Jira users to the right
GitHub users to properly @-mention the right person and make it easier for
you to find issues you have engaged with.

Mike McCandless

http://blog.mikemccandless.com


Re: Unclear on what position means

2022-07-22 Thread Michael McCandless
Hi Kendall,

"Position" and "Offset" are often confused in Lucene ;)

Lucene uses offset to track what you referred to ("(character, not byte)
offset into a text file", or into an indexed string).

Lucene uses position to track the Nth token: position 0 is the first token,
position 1 is the second token, etc.  But since tokens are usually N > 1
characters, the offsets grow faster than the positions.  These tokens need
not be only a linear sequence: they can be a graph structure when
multi-token synonyms are applied.

Lucene indexes both of these, and you can turn them individually on/off if
you want.
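
For example, here is a small sketch (not from the original thread; the
field name and text are placeholders, and it assumes WhitespaceAnalyzer
from the analyzers-common module) that prints both for each token:

Analyzer analyzer = new WhitespaceAnalyzer();
try (TokenStream ts = analyzer.tokenStream("body", "The quick brown fox")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
  OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
  ts.reset();
  int position = -1;
  while (ts.incrementToken()) {
    position += posIncr.getPositionIncrement();
    // e.g. "quick" is the second token (position 1) but starts at character offset 4
    System.out.println(term.toString() + ": position=" + position
        + ", offsets=" + offset.startOffset() + "-" + offset.endOffset());
  }
  ts.end();
}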

Finally, you might be interested in Lucene's highlighters module -- this
contains tooling to do hit highlighting, to solve the "final inch" problem
of showing your users precisely which words/excerpts matched inside each
matched hit.  Here's an example (searching Lucene's issues for the word "python").

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jul 22, 2022 at 12:51 AM Mikhail Khludnev  wrote:

> Hello, Kendall.
>
> You can read about Token Position Increments at
>
> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/analysis/package-summary.html#package.description
> Usually position is a number of word and offset is a number of symbol.
> Modeling entries via positions is boilerplate, I suppose. Nowadays we
> either denormalize by copying values across children into a single parent
> document. Also, here are more relational options
>
> https://lucene.apache.org/core/9_2_0/join/org/apache/lucene/search/join/package-summary.html
>
>
> On Fri, Jul 22, 2022 at 7:02 AM Kendall Shaw 
> wrote:
>
> > Hi,
> >
> > I'm trying to figure out if I should be learning to use Lucene. I
> > imagine wanting to provide a user with a way to search for something and
> > present that found thing, in some way. If what is ultimately searched is
> > text files, then position would be an offset into the text file, I
> > think. But, that seems like a pretty unlikely scenario.
> >
> > If I have stored structured data into a database of some sort, does
> > Lucene provide some way to associate a position with an entry in a
> > database? Or is that left to the programmer to implement, outside of
> > Lucene?
> >
> > Kendall
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Replicator PrimaryNode waits forever for remotes to close

2022-06-30 Thread Michael McCandless
+1 to provide a timeout, or to simply fix close() so it aggressively closes
regardless of what the replicas are doing?
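
Just to illustrate, a caller-side sketch of the timeout idea (the names and
the 30-second deadline are made up; primaryNode is your PrimaryNode instance):

ExecutorService executor = Executors.newSingleThreadExecutor();
Future<?> closing = executor.submit(() -> {
  try {
    primaryNode.close();
  } catch (IOException e) {
    throw new UncheckedIOException(e);
  }
});
try {
  closing.get(30, TimeUnit.SECONDS);  // normal case: replicas released their CopyStates in time
} catch (TimeoutException e) {
  // a replica never released its CopyState -- stop waiting and keep shutting down
  closing.cancel(true);
} catch (InterruptedException | ExecutionException e) {
  // close() itself failed, or we were interrupted; log/handle as appropriate
} finally {
  executor.shutdownNow();
}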

It's not a great design for the primary to be so dependent on the replicas
(though the reverse makes sense?).

Maybe open a Jira issue or a starting PR so we can discuss?

Thanks for uncovering this and proposing a fix!

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jun 29, 2022 at 7:36 PM Steven Schlansker <
stevenschlans...@gmail.com> wrote:

> Hi Lucene fans,
>
> We use lucene-replicator to copy our indexes from a primary to replica
> nodes.
> Usually, startup and shutdown are fine. In particular we call
> PrimaryNode.close.
>
> But, in some edge cases - dropped connection? IOException? some process
> crashed? -
> we sometimes hang in PrimaryNode.waitForAllRemotesToClose, which never
> returns.
>
> I suspect we have a reference counting bug: in some exceptional case, we
> forget to release our CopyState.
> This definitely should be fixed, but in the meantime, it's very unhelpful
> for the primary node to never come down.
>
> I was considering submitting a PR to add a configurable timeout for the
> shutdown wait - and after the timeout expires,
> continue with closing even though some replicas did not terminate.
> They will possibly crash with an "IOException: directory closed" later, or
> maybe never come back at all.
>
> Does this sound like a welcome change? Is there a better way to avoid
> hanging here, other than to be bug-free?
> It's quite challenging to figure out where the CopyState wasn't released,
> as only a count is kept.
>
> Thanks!
>
> Steven Schlansker
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
Antony, do you maybe have Microsoft Defender turned on, which might
quarantine files that it suspects are malicious?  I'm not sure if it is on
by default these days on modern Windows boxes ...

Mike McCandless

http://blog.mikemccandless.com


On Thu, May 5, 2022 at 10:34 AM Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Thu, May 5, 2022 at 10:30 AM Uwe Schindler  wrote:
>
>> To find all errors in an index, you should pass -ea to the java command
>> line to enable assertions.
>>
>
> +1
>
> Tempting to make CheckIndex demand that :)  Or at least, slow you down and
> make it clear why, if assertions are disabled.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>


Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
On Thu, May 5, 2022 at 10:30 AM Uwe Schindler  wrote:

> To find all errors in an index, you should pass -ea to the java command
> line to enable assertions.
>

+1

Tempting to make CheckIndex demand that :)  Or at least, slow you down and
make it clear why, if assertions are disabled.
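
For reference, a sketch of running the checker from code (the index path is
a placeholder; the same checks run from the command line via
org.apache.lucene.index.CheckIndex, and in both cases the JVM should be
started with -ea):

try (Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
     CheckIndex checker = new CheckIndex(dir)) {
  checker.setInfoStream(System.out);
  CheckIndex.Status status = checker.checkIndex();
  System.out.println("index is clean? " + status.clean);
}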

Mike McCandless

http://blog.mikemccandless.com


Re: Index corruption and repair

2022-05-05 Thread Michael McCandless
Hi Antony,

Sorry for the late reply.

Indeed the file _14gb.si is missing, yet _14gb.cfs is present (interesting
-- must have failed deletion because an IndexReader has it open).  And yet
when you run CheckIndex on this directory (without -exorcise), the index is
fine?  No errors reported?  Can you post the full CheckIndex output?

There are two segments_N files present, which is interesting.  Are
you using the default IndexDeletionPolicy (which deletes the old segments_N
file as soon as the new segments_N+1 is done being committed)?

Do you open near-real-time readers (passing IndexWriter to
DirectoryReader.open)?  Or filesystem based readers only (passing Directory
to DirectoryReader.open)?

How do you reopen/refresh those IndexReaders?  Is it "every N seconds"?  Or
is it timed to after the IndexWriter.commit() has finished?  How often are
you calling IndexWriter.commit()?
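
For context, the two flavors look roughly like this (a sketch; writer and
dir are placeholders for your IndexWriter and Directory):

// Near-real-time: the reader sees changes that are not yet committed
DirectoryReader nrtReader = DirectoryReader.open(writer);
// ... later, refresh (returns null if nothing changed):
DirectoryReader newNrtReader = DirectoryReader.openIfChanged(nrtReader, writer);

// Filesystem-based: the reader only sees changes after IndexWriter.commit()
DirectoryReader fsReader = DirectoryReader.open(dir);
DirectoryReader newFsReader = DirectoryReader.openIfChanged(fsReader);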

6.5.0 is quite old by now, and I poked around in our issue history
<https://jirasearch.mikemccandless.com/search.py?index=jira> to see if this
might be a known issue.  The only interesting issue I found was LUCENE-6835
<https://issues.apache.org/jira/browse/LUCENE-6835> which shifted
responsibility of retrying file deletions down into Directory (instead of
IndexWriter), but that landed in 6.0 and hopefully any bugs were ironed out
by 6.5.0.

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 4, 2022 at 3:44 PM Antony Joseph 
wrote:

> Hi Michael,
>
> Any update?
>
> Regards,
> Antony
>
> On Sun, 1 May 2022 at 19:35, Antony Joseph 
> wrote:
>
>> Hi Michael,
>>
>> Thank you for your reply. Please find responses to your questions below.
>>
>> Regards,
>> Antony
>>
>> On Sat, 30 Apr 2022 at 18:59, Michael McCandless <
>> luc...@mikemccandless.com> wrote:
>>
>>> Hi Antony,
>>>
>>> Hmm it looks like the root cause is this:
>>>
>>>   Caused by: java.nio.file.NoSuchFileException: D:\i\202204\_14gb.si
>>>
>>> Can you list all the files in the index directory at the time this
>>> exception happens, and reply here?  We need to figure out whether the file
>>> is really missing or what.
>>>
>> Below the index directory file listing. Yes, file is missing
>> (D:\i\202204\_14gb.si)
>>
>>>
>>> Do you run any virus scanner / disk file tree utilities / etc.?  In the
>>> distant past sometimes such programs might cause strange transient errors
>>> if they open a file for read exclusively or so, on windows.
>>>
>> There is no virus scanner running.
>>
>>>
>>> What is the actual drive you are storing the index on (D:)?  Is it a
>>> local disk or remote SMBFS mount?
>>>
>> It's a local disk (D:).
>>
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Sat, Apr 30, 2022 at 8:39 AM Antony Joseph <
>>> antony.dev.webm...@gmail.com> wrote:
>>>
>>>> Thank you for your reply.
>>>>
>>>> *The full stack trace is included:*
>>>>
>>>> , >
>>>> Java stacktrace:
>>>> org.apache.lucene.index.CorruptIndexException: Unexpected file read
>>>> error
>>>> while
>>>> reading index.
>>>>
>>>> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="D:\i\202204\segments_10fj")))
>>>> at
>>>> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
>>>> at
>>>>
>>>> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
>>>> at
>>>> org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:972)
>>>> Caused by: java.nio.file.NoSuchFileException: D:\i\202204\_14gb.si
>>>> at sun.nio.fs.WindowsException.translateToIOException(Unknown
>>>> Source)
>>>> at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown
>>>> Source)
>>>> at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown
>>>> Source)
>>>> at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown
>>>> Source)
>>>> at java.nio.channels.FileChannel.open(Unknown Source)
>>>> at java.nio.channels.FileChannel.open(Unknown Source)
>>>> at
>>>> org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
>>>> at
>>>> org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
>>

Re: Index corruption and repair

2022-04-30 Thread Michael McCandless
Hi Antony,

Hmm it looks like the root cause is this:

  Caused by: java.nio.file.NoSuchFileException: D:\i\202204\_14gb.si

Can you list all the files in the index directory at the time this
exception happens, and reply here?  We need to figure out whether the file
is really missing or what.

Do you run any virus scanner / disk file tree utilities / etc.?  In the
distant past sometimes such programs might cause strange transient errors
if they open a file for read exclusively or so, on windows.

What is the actual drive you are storing the index on (D:)?  Is it a local
disk or remote SMBFS mount?

Mike McCandless

http://blog.mikemccandless.com


On Sat, Apr 30, 2022 at 8:39 AM Antony Joseph 
wrote:

> Thank you for your reply.
>
> *The full stack trace is included:*
>
> , >
> Java stacktrace:
> org.apache.lucene.index.CorruptIndexException: Unexpected file read error
> while
> reading index.
>
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="D:\i\202204\segments_10fj")))
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
> at
> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:972)
> Caused by: java.nio.file.NoSuchFileException: D:\i\202204\_14gb.si
> at sun.nio.fs.WindowsException.translateToIOException(Unknown
> Source)
> at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
> at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
> at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown
> Source)
> at java.nio.channels.FileChannel.open(Unknown Source)
> at java.nio.channels.FileChannel.open(Unknown Source)
> at
> org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
> at
> org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
> at
>
> org.apache.lucene.codecs.lucene62.Lucene62SegmentInfoFormat.read(Lucene62SegmentInfoFormat.java:89)
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:288)
> ... 2 more
>
> Traceback (most recent call last):
>   File "index.py", line 112, in start
> writer = IndexWriter(index_directory, iconfig)
> lucene.JavaError: , >
> Java stacktrace:
> org.apache.lucene.index.CorruptIndexException: Unexpected file read error
> while
> reading index.
>
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="D:\i\202204\segments_10fj")))
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
> at
> org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
> at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:972)
> Caused by: java.nio.file.NoSuchFileException: D:\i\202204\_14gb.si
> at sun.nio.fs.WindowsException.translateToIOException(Unknown
> Source)
> at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
> at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
> at sun.nio.fs.WindowsFileSystemProvider.newFileChannel(Unknown
> Source)
> at java.nio.channels.FileChannel.open(Unknown Source)
> at java.nio.channels.FileChannel.open(Unknown Source)
> at
> org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
> at
> org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
> at
>
> org.apache.lucene.codecs.lucene62.Lucene62SegmentInfoFormat.read(Lucene62SegmentInfoFormat.java:89)
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
> at
> org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:288)
> ... 2 more
>
>
> Regards,
> Antony
>
> On Sat, 30 Apr 2022 at 10:59, Robert Muir  wrote:
>
> > The most helpful thing would be the full stacktrace of the exception.
> > This exception should be chaining the original exception and call
> > site, and maybe tell us more about this error you hit.
> >
> > To me, it looks like a windows-specific issue where the filesystem is
> > returning an unexpected error. So it would be helpful to see exactly
> > which one that is, and the full trace of where it comes from, to chase
> > it further
> >
> > On Thu, Apr 28, 2022 at 12:10 PM Antony Joseph
> >  wrote:
> > >
> > > Thank you for your reply.
> > >
> > > This isn't happening in a single environment. Our application is being
> > used
> > > by various clients and this has been reported by multiple users - all
> of
> > > whom were running the earlier pylucene (v4.10) - without issues.
> > >
> > > One thing to mention is that our earlier version used Python 2.7.15
> (with
> > > pylucene 4.10) and now we are using Python 3.8.10 with Pylucene 6.5.0 -
> > the
> > > indexing logic is the same...
> > >
> > > One other thing to note is that the i

Re: A question on PhraseQuery and slop

2021-12-13 Thread Michael McCandless
Hello Claude,

Hmm, that is interesting that you see slop=2 matching query "quick fox"
against document "the fox is quick".

Edit distance (Levenshtein) is a bit tricky because it might include a
transposition (just swapping the two words) as edit distance 1 OR 2.

So maybe Lucene's PhraseQuery is counting transposition as edit distance 1,
in which case, your test makes sense, and the javadocs are wrong?
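
A tiny reproduction sketch would settle it (not from the original report; it
assumes a recent 8.x with lucene-core plus analyzers-common, and uses
WhitespaceAnalyzer so all four tokens are kept):

try (Directory dir = new ByteBuffersDirectory()) {
  try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
    Document doc = new Document();
    doc.add(new TextField("body", "the fox is quick", Field.Store.NO));
    w.addDocument(doc);
  }
  try (IndexReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    for (int slop = 0; slop <= 3; slop++) {
      PhraseQuery q = new PhraseQuery(slop, "body", "quick", "fox");
      long hits = searcher.search(q, 10).totalHits.value;
      System.out.println("slop=" + slop + " -> hits=" + hits);
    }
  }
}

That should show exactly the smallest slop at which the reversed order
matches.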

I am far from an expert on PhraseQuery :)  Does anyone know if we changed
the behavior?  In any case, we must at least fix the javadocs.  Claude,
maybe open a Jira issue (
https://issues.apache.org/jira/projects/LUCENE/summary) and we can
discuss there?

Thank you for catching this!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Dec 10, 2021 at 8:47 AM Claude Lepere 
wrote:

> Hello.
>
>
> The explanation of
>
> https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop
> <
> https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop--
> >
> writes
> that the edit distance between "quick fox" and "the fox is quick" would be
> at an edit distance of 3;
> this seems inaccurate to me.
>
> I don't know if the edit distance used by Lucene is the Levenshtein
> distance (insertion, deletion, substitution, all of weight 1) - a standard
> in information retrieval - but a test of "quick fox" PhraseQuery with a
> slop of 2 hits the text "the fox is quick" (1 deletion + 1 insertion); the
> slop does not have to be 3.
>
> I wonder if I'm right.
>
>
> Claude Lepère, Belgium
>
> claudelep...@gmail.com
>
>
>


Re: Java 17 and Lucene

2021-10-20 Thread Michael McCandless
Nightly benchmarks managed to succeed (once, so far) on JDK 17:
https://home.apache.org/~mikemccand/lucenebench/

No obvious performance changes on quick look.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Oct 19, 2021 at 8:42 PM Gautam Worah  wrote:

> Thanks for the note of caution Uwe.
>
> > On our Jenkins server running with AMD Ryzen CPU it happens quite often
> that JDK 16, JDK 17 and JDK 18 hang during tests and stay unkillable (only
> a hard kill with" kill -9")
>
> Scary stuff.
> I'll try to reproduce the hang first and then try to get the JVM logs. I'll
> respond back here if I find something useful.
>
> > Do you get this error in lucene:core:ecjLintMain and not during compile?
> Then this is https://issues.apache.org/jira/browse/LUCENE-10185, solved
> already.
>
> Ahh. I should've been clearer with my comment. The error we see is because
> we have forked the class and have modified it a bit.
> I just assumed that the upstream Lucene package would've also gotten errors
> on the JDK17 build because it was untouched.
>
> -
> Gautam Worah.
>
>
> On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov 
> wrote:
>
> > > I would a bit careful: On our Jenkins server running with AMD Ryzen CPU
> > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests
> and
> > stay unkillable (only a hard kill with" kill -9"). Previous Java versions
> > don't hang. It happens not all the time (about 1/4th of all builds) and
> due
> > to the fact that the JVM is unresponsible it is not possible to get a
> stack
> > trace with "jstack". If you know a way to get the stack trace, I'd happy
> to
> > get help.
> >
> > ooh that sounds scary. I suppose one could maybe get core dumps using
> > the right signal and debug that way? Oh wait you said only 9 works,
> > darn! How about attaching using gdb? Do we maintain GC logs for these
> > Jenkins builds? Maybe something suspicious would show up there.
> >
> > By the way the JDK is absolutely "responsible" in this situation! Not
> > responsive maybe ...
> >
> > On Tue, Oct 19, 2021 at 4:46 AM Uwe Schindler  wrote:
> > >
> > > Hi,
> > >
> > > > Hey,
> > > >
> > > > Our team at Amazon Product Search recently ran our internal
> benchmarks
> > with
> > > > JDK 17.
> > > > We saw a ~5% increase in throughput and are in the process of
> > > > experimenting/enabling it in production.
> > > > We also plan to test the new Corretto Generational Shenandoah GC.
> > >
> > > I would a bit careful: On our Jenkins server running with AMD Ryzen CPU
> > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during tests
> and
> > stay unkillable (only a hard kill with" kill -9"). Previous Java versions
> > don't hang. It happens not all the time (about 1/4th of all builds) and
> due
> > to the fact that the JVM is unresponsible it is not possible to get a
> stack
> > trace with "jstack". If you know a way to get the stack trace, I'd happy
> to
> > get help.
> > >
> > > Once I figured out what makes it hang, I will open issues in OpenJDK (I
> > am OpenJDK member/editor). I have now many stuck JVMs running to analyze
> on
> > the server, so you're invited to help! At the moment, I have no time to
> > take care, so any help is useful.
> > >
> > > > On a side note, the Lucene codebase still uses the deprecated (as of
> > > > JDK17) AccessController
> > > > in the RamUsageEstimator class.
> > > > We suppressed the warning for now (based on recommendations
> > > > <http://mail-archives.apache.org/mod_mbox/db-derby-
> > > > dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800
> > > > 5...@atlassian.jira%3E>
> > > > from the Apache Derby mailing list).
> > >
> > > This should not be an issue, because we compile Lucene with javac
> > parameter "--release 11", so it won't show any warning that you need to
> > suppress. Looks like your build system at Amazon is not the original one
> by
> > Lucene's Gradle, which shows no warnings at all.
> > >
> > > Uwe
> > >
> > > > Gautam Worah.
> > > >
> > > >
> > > > On Mon, Oct 18, 2021 at 3:02 PM Michael McCandless <
> > > > luc...@mikemccandless.com> wrote:
> > > >
> > > > > Also, I try to semi-aggressively upgrade Lucene's

Re: Java 17 and Lucene

2021-10-18 Thread Michael McCandless
Also, I try to semi-aggressively upgrade Lucene's nightly benchmarks to new
JDK releases and leave an annotation on the nightly charts:
https://home.apache.org/~mikemccand/lucenebench/

I just now upgraded to JDK 17 and kicked off a new benchmark run ... in a
few hours it should show the new data points and then I'll try to remember
to annotate it tomorrow.

So let's see whether nightly benchmarks uncover any performance changes
from JDK17 :)

Mike McCandless

http://blog.mikemccandless.com


On Mon, Oct 18, 2021 at 5:36 PM Robert Muir  wrote:

> We test different releases on different platforms (e.g. Linux, Windows,
> Mac).
> We also test EA (Early Access) releases of openjdk versions during the
> development process.
> This finds bugs before they get released.
>
> More information about versions/EA testing: https://jenkins.thetaphi.de/
>
> On Mon, Oct 18, 2021 at 5:33 PM Kevin Rosendahl
>  wrote:
> >
> > Hello,
> >
> > We are using Lucene 8 and planning to upgrade from Java 11 to Java 17. We
> > are curious:
> >
> >- How lucene is testing against java versions. Are there correctness
> and
> >performance tests using java 17?
> >   - Additionally, besides Java 17, how are new Java releases tested?
> >- Are there any other orgs using Java 17 with Lucene?
> >- Any other considerations we should be aware of?
> >
> >
> > Best,
> > Kevin Rosendahl
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Issue regarding build

2021-08-19 Thread Michael McCandless
Hello Udit,

The screen shot did not come through for me -- it's a broken image.  Maybe
copy/paste the text of the error instead?

Also, try running "./gradlew assemble" from the command-line (in a console
shell, e.g. Terminal on OS X) instead?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Aug 19, 2021 at 10:04 AM Udit Prabhakar 
wrote:

> Hi,
> I'm a newbie to the open source contribution world, I would really
> appreciate some help. I cloned the project and opened it on Intellij Idea.
> I ran .\gradlew assemble to build gradle. But, its failing.
> [image: Screenshot (108).png]
>


Re: Info about the Lucene 4.10.4 version.

2021-06-22 Thread Michael McCandless
Hi Arvind,

I responded about this on the issue you also opened:
https://issues.apache.org/jira/browse/LUCENE-10013

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 22, 2021 at 10:04 AM Arvind Kumar Sahu
 wrote:

> Hi Team,
>
> Currently we are using Lucene 4.10.4 version. We are getting the below
> error:
>
> "Document contains at least one immense term in field (whose UTF8 encoding
> is longer than the max length 32766), all of which were skipped. Please
> correct the analyzer to not produce such terms. The prefix of the first
> immense term is: '[-41, -103, -41, -87, -41, -103, -41, -111, -41, -108,
> 32, 56, 56, 45, -41, -108, -41, -111, -41, -107, -41, -89, -41, -88, 44,
> 32, 40, 32, 49, 51]...', original message: bytes can be at most 32766 in
> length; got 35169".
>
> We understand from the Lucene JIRA ticket [LUCENE-5472] Long terms should
> generate a RuntimeException, not just infoStream - ASF JIRA (apache.org)<
> https://issues.apache.org/jira/browse/LUCENE-5472>, this issue has been
> resolved in 4.8 and 6.0.
>
> Please confirm us if this fix is included in 4.10.4.
>
> Thanks,
> Arvind
>


Re: Multiple merge-runs from same set of segments

2021-05-24 Thread Michael McCandless
Are you trying to rewrite your already created index into a different
segment geometry?

Maybe have a look at the new IndexRearranger tool
?  It is already doing
something like what you enumerated below, including mocking LiveDocs to get
the right documents into the right segments.

Mike McCandless

http://blog.mikemccandless.com


On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan <
ravikumar.govindara...@gmail.com> wrote:

> Hello,
>
> We have a use-case for index-rewrite on a "frozen index" where no new
> documents are added. It goes like this..
>
>1. Get all segments for the index (base-segment-list)
>2. Create a new segment from base-segment-list with unique set of docs
>(LiveDocs)
>3. Repeat step 2, for a fixed count. Like say 5 or 10 times
>
> Is something like this achievable via Merge Policy? We can disable commits
> too, till the full run is completed.
>
> Any help is appreciated
>
> Regards,
> Ravi
>


Re: Performance decrease with NRT use-case in 8.8.x (coming from 8.3.0)

2021-05-19 Thread Michael McCandless
> The update showed no issues (e.g. compiled without changes) but I noticed
that our test-suites take a lot longer to finish.

Hmm, that sounds bad.  We need our tests to stay fast but also do a good
job testing things ;)

Does your production usage also slow down?  Tests do other interesting
things, e.g. enable assertions, randomly swap in different codecs,
different Directory impls (if you are using Lucene's randomized test
infrastructure), etc.

> While in 8.8 files are opened for reading that do not (yet) exist.

That is incredibly strange!  Lucene should never do that (opening a
non-existing file for "read", causing that file to become created through
Windows CreateFile), that I am aware of.  Can you share the full source
code of the test case?

Try to eliminate parts of the test maybe?  Do you see the same slowdown if
you don't use NRTCachingDirectory at all? (Just straight to
MMapDirectory.)  There were some recent fixes to NRTCachingDirectory, e.g.
https://issues.apache.org/jira/browse/LUCENE-9115 and
https://issues.apache.org/jira/browse/LUCENE-8663 -- maybe they are related?
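
I.e. something like this in the test's setup, as an A/B comparison (a
sketch; indexPath and the two size thresholds are placeholders):

// A: current setup, NRTCachingDirectory wrapping the real directory
Directory dirA = new NRTCachingDirectory(FSDirectory.open(indexPath), 5.0, 60.0);

// B: same test, but straight to MMapDirectory
Directory dirB = new MMapDirectory(indexPath);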

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 19, 2021 at 7:23 AM Gietzen, Markus
 wrote:

> Hello,
>
> recently I updated the Lucene version in one of our products from 8.3 to
> 8.8.x (8.8.2 as of now).
> The update showed no issues (e.g. compiled without changes) but I noticed
> that our test-suites take a lot longer to finish.
>
> So I took a closer look at one test-case which showed a severe slowdown
> (it’s doing small update, flush, search  cycles in order to stress NRT;
> the purpose is to see performance-changes in an early stage 😉 ):
>
> Lucene 8.3:   ~2,3s
> Lucene 8.8.x:  25s
>
> This is a huge difference. Therefore I used YourKit to profile 8.3 and 8.8
> and do a comparison.
>
> The gap is caused by different amount of calls to
> sun.nio.fs.WindowsNativeDispatcher.CreateFile0(long, int, int, long, int,
> int) WindowsNativeDispatcher.java (native)
> 8.3:  about 150 calls
> 8.8:  about 12500 calls
>
> In order to hunt down what is causing this, I took a look at the open() in
> NRTDirectory.
> Here I could see that the amount of calls to that open is in the same
> ballpark for 8.3 and 8.8
>
> The difference is that in 8.3 nearly all files are available in the
> underlying RAMDirectory. While in 8.8 files are opened for reading that do
> not (yet) exist.
> This leads to a call to the WindowsNativeDispatcher.CreateFile0
>
> Add the end of the mail I added two example-stacktraces that show this
> behavior.
>
> Has someone an idea what change might cause this or if I need to do
> something different in 8.8 compared to 8.3?
>
>
> Thanks for any help,
>
> Markus
>
> Here is an example stacktrace that is causing such a try of a read-access
> to non-existing file:
>
> Filename= _0.fdm(IOContext is READ)   (I checked the directory on
> harddisk: it did not yet contain it nor in RAM-directory of the NRTCacheDir)
>
> openInput:100, FilterDirectory (org.apache.lucene.store)
> openInput:100, FilterDirectory (org.apache.lucene.store)
> openChecksumInput:157, Directory (org.apache.lucene.store)
> finish:140, FieldsIndexWriter (org.apache.lucene.codecs.compressing)
> finish:480, CompressingStoredFieldsWriter
> (org.apache.lucene.codecs.compressing)
> flush:81, StoredFieldsConsumer (org.apache.lucene.index)
> flush:239, DefaultIndexingChain (org.apache.lucene.index)
> flush:350, DocumentsWriterPerThread (org.apache.lucene.index)
> doFlush:476, DocumentsWriter (org.apache.lucene.index)
> flushAllThreads:656, DocumentsWriter (org.apache.lucene.index)
> getReader:605, IndexWriter (org.apache.lucene.index)
> doOpenIfChanged:277, StandardDirectoryReader (org.apache.lucene.index)
> openIfChanged:235, DirectoryReader (org.apache.lucene.index)
>
>
> In a consequence later accesses to such files also lead to the state that
> the file is not within the RAMDirectory but only on harddisk.
> Example:
>
>
>
> Filename _1.fdx  Context = READ   (file is on harddisk but not in
> RAMDirectory)
>
>
>
> openInput:100, FilterDirectory (org.apache.lucene.store)
>
> openInput:100, FilterDirectory (org.apache.lucene.store)
>
> openInput:100, FilterDirectory (org.apache.lucene.store)
>
> openChecksumInput:157, Directory (org.apache.lucene.store)
>
> write:90, Lucene50CompoundFormat (org.apache.lucene.codecs.lucene50)
>
> createCompoundFile:5316, IndexWriter (org.apache.lucene.index)
>
> sealFlushedSegment:457, DocumentsWriterPerThread (org.apache.lucene.index)
>
> flush:395, DocumentsWriterPerThread (org.apache.lucene.index)
>
> doFlush:476, DocumentsWriter (org.apache.lucene.index)
>
> flushAllThreads:656, DocumentsWriter (org.apache.lucene.index)
>
> getReader:605, IndexWriter (org.apache.lucene.index)
>
> doOpenFromWriter:290, StandardDirectoryReader (org.apache.lucene.index)
>
> doOpenIfChanged:275, StandardDirectoryReader (org.apache.lucene.index)
>

Re: Correct usage of synonyms with Japanese

2021-05-18 Thread Michael McCandless
Hi Geoffrey,

[Disclaimer: Geoffrey and I both work at Amazon on customer-facing product
search]

We absolutely must get SynonymGraphFilter consuming input graphs!  This is
just a (serious) bug in it!  But it's just software, let's fix it :)  That
is clearly the right fix, it is just rather fun and challenging. But it is
doable.  Could you open an issue?  I thought we had one for this but cannot
find it now.

I think you are using Kuromoji Japanese tokenizer?  Which produces nice
looking graphs right from the get-go (tokenizer), with compound words also
properly decompounded so both options are indexed/searched.

History: we created SynonymGraphFilter, along with other important
QueryParser (e.g. http://issues.apache.org/jira/browse/LUCENE-7603) and
Query improvements, to get multi-term synonyms working correctly, finally
in Lucene.  With the old SynonymFilter, positional queries involving
multi-term synonyms would have both false positive and false negative hits
... I tried to explain the messy situation here:
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

And, finally, with SynonymGraphFilter, used only at search time, and with
tokens consumed by a QueryParser that knows how to turn graphs into correct
positional queries, those bugs are finally fixed -- multi-term synonyms
work correctly.

When used during indexing, SynonymGraphFilter must eventually be followed
by FlattenGraphFilter, because Lucene's index does not store the posLength
attribute of each token.  I.e., the graph is lost anyways during indexing,
so FlattenGraphFilter tries to flatten the graph in the most
information-preserving way (but still loses information, resulting in false
positive/negative hits for positional queries).

Anyways, until we fix this, feeding a graph to SynonymGraphFilter will
indeed mess up its output in weird ways.

This problem has come up several times recently, e.g.
https://issues.apache.org/jira/browse/LUCENE-9173 and
https://issues.apache.org/jira/browse/LUCENE-9123.  There is also the more
revolutionary https://issues.apache.org/jira/browse/LUCENE-5012 but that is
too ambitious for this "small bug", I think.

SynonymGraphFilter also struggles with holes, since they might break the
token graph into two: https://issues.apache.org/jira/browse/LUCENE-8985

For short term workarounds, some possible ideas:

  * I think Kuromoji has an option to NOT produce the compounds/graph
output?  It has an indexing and searching mode.  That might be one
workaround, if maybe you could maybe then move the compounding into
SynonymGraphFilter?  I'm not sure that is possible, in general, since
Kuromoji is using more powerful information (dictionary) to make its graph
choices.

  * Use FlattenGraphFilter immediately before SynonymGraphFilter, and then
again at the end of your analysis chain (a rough sketch of this follows after
this list).  This loses information, since all tokens are "squashed" onto one
another, and we could no longer tell which sequence of tokens corresponded to
which compound word, and it might mean some synonyms fail to apply when they
should have.

  * Go back to SynonymFilter at indexing time.  It will also not fully
handle an input graph correctly, and will necessarily miss some synonyms
that should've applied, but it may produce a more "reasonable" bad output,
and then you shouldn't need FlattenGraphFilter at all.  But test this
carefully to understand what it is doing!
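
To make the FlattenGraphFilter workaround above concrete, an index-time
analyzer could be wired up roughly like this (a sketch only; the synonym
entry is made up, and it assumes the kuromoji and analyzers-common modules
are on the classpath):

SynonymMap.Builder builder = new SynonymMap.Builder(true);
builder.add(new CharsRef("空港"), new CharsRef("エアポート"), true);  // made-up single-token synonym
SynonymMap synonyms = builder.build();

Analyzer indexAnalyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH);
    TokenStream stream = new FlattenGraphFilter(tokenizer);   // flatten the tokenizer's graph first (lossy)
    stream = new SynonymGraphFilter(stream, synonyms, true);
    stream = new FlattenGraphFilter(stream);                   // and flatten again before indexing
    return new TokenStreamComponents(tokenizer, stream);
  }
};

Again, the flattening throws away posLength, so this is only a stopgap until
SynonymGraphFilter can consume a graph directly.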

But let's fix the issue for real!

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 18, 2021 at 6:17 AM Geoffrey Lawson 
wrote:

> Hello,
>
> I'm working on a project that involves search in Japanese and uses
> synonyms. The Japanese tokenizer creates an analysis graph, but the
> SynonymGraphFilter states it cannot take a graph as input. After a few
> tests I've seen it can create some unusual outputs if given a graph as
> input. The SynonymFilter is marked deprecated, and has documentation
> pointing out it doesn't handle multiple synonym paths correctly.
>
> My question is what is the 'correct' way to handle synonyms with Japanese
> in Lucene? Should the graph be flattened before the SynonymGraphFilter,
> then flattened again after? This seems extra lossy. Is the correct answer
> to make SynonymGraphFilter accept graphs as inputs? Is there another option
> that I'm missing?
>
> thanks,
> Geoff
>


Re: CorruptIndexException after failed segment merge caused by No space left on device

2021-03-24 Thread Michael McCandless
+1, this sounds like a bad bug in Lucene!  We try hard to test for and
prevent such bugs!

As long as you succeeded in at least one commit since creating the
index before you hit the disk full, restarting Lucene on the index should
have recovered from that last successful commit.

How often do you commit?  Did you have a successful commit before the disk
full event?

Please open an issue and put all possible comments detailing your context,
thanks,

Mike McCandless

http://blog.mikemccandless.com


On Wed, Mar 24, 2021 at 12:55 PM Robert Muir  wrote:

> On Wed, Mar 24, 2021 at 1:41 AM Alexander Lukyanchikov <
> alexanderlukyanchi...@gmail.com> wrote:
>
> > Hello everyone,
> >
> > Recently we had a failed segment merge caused by "No space left on
> device".
> > After restart, Lucene failed with the CorruptIndexException.
> > The expectation was that Lucene automatically recovers in such a
> > case, because there was no successful commit. Is it a correct assumption,
> > or am I missing something?
> > It would be great to know any recommendations to avoid such situations
> > in future and be able to recover automatically after restart.
> >
>
> I don't think you are missing something. It should not happen.
>
> Can you please open an issue:
> https://issues.apache.org/jira/projects/LUCENE
>
> If you don't mind, please supply all relevant info you are able to provide
> on the issue: OS, filesystem, JDK version, any hints as to how you are
> using Lucene (e.g. when you are committing / how you are indexing). There
> are a lot of tests in Lucene's codebase designed to simulate the disk-full
> condition and guarantee that stuff like this never happens, but maybe some
> case is missing, or some other unknown bug is causing the missing files.
>
> Thanks
>


Re: [VOTE] Lucene logo contest, third time's a charm

2020-12-21 Thread Michael McCandless
Thank you, Ryan, for pushing forward on our new logo.

Now that this VOTE has passed, are there issues open to actually "deliver
it" to the world?

E.g. I see https://lucene.apache.org still shows our old logo.

Branding is a lot of work!

Mike McCandless

http://blog.mikemccandless.com


On Tue, Sep 8, 2020 at 11:10 PM Tomoko Uchida 
wrote:

> I will take care of Luke app. It has Lucene logo and sticker.
> If by any chance a new logo for Luke itself is provided (maybe from fans
> with graphic designing skills), I will also change it - the priority is low
> I think.
>
>
> 2020年9月9日(水) 5:07 Anshum Gupta :
>
> > Thank you, Ryan and everyone else who was involved!
> >
> > Looking forward to the new Logo. It's been a while :)
> >
> > On Tue, Sep 8, 2020 at 8:55 AM Ryan Ernst  wrote:
> >
> > > This vote is now closed. The results are as follows:
> > >
> > > Binding Results
> > >   A1: 12 (55%)
> > >   D: 6 (27%)
> > >   A2: 4 (18%)
> > >
> > > All Results
> > >   A1: 16 (55%)
> > >   D: 7 (24%)
> > >   A2: 5 (17%)
> > >   B5d: 1 (3%)
> > >
> > > A1 is our winner! I will make the necessary changes.
> > >
> > > Thank you to Dustin Haver, Stamatis Zampetakis, Baris Kazar and all who
> > > voted!
> > >
> > > On Tue, Sep 1, 2020 at 1:21 PM Ryan Ernst  wrote:
> > >
> > > > Dear Lucene and Solr developers!
> > > >
> > > > Sorry for the multiple threads. This should be the last one.
> > > >
> > > > In February a contest was started to design a new logo for Lucene
> > > > [jira-issue]. The initial attempt [first-vote] to call a vote
> resulted
> > in
> > > > some confusion on the rules, as well as the request for one additional
> > > > submission. The second attempt [second-vote] yesterday had incorrect
> > > links
> > > > for one of the submissions. I would like to call a new vote, now with
> > > more
> > > > explicit instructions on how to vote, and corrected links.
> > > >
> > > > *Please read the following rules carefully* before submitting your
> > vote.
> > > >
> > > > *Who can vote?*
> > > >
> > > > Anyone is welcome to cast a vote in support of their favorite
> > > > submission(s). Note that only PMC member's votes are binding. If you
> > are
> > > a
> > > > PMC member, please indicate with your vote that the vote is binding,
> to
> > > > ease collection of votes. In tallying the votes, I will attempt to
> > verify
> > > > only those marked as binding.
> > > >
> > > >
> > > > *How do I vote?*
> > > > Votes can be cast simply by replying to this email. It is a
> > ranked-choice
> > > > vote [rank-choice-voting]. Multiple selections may be made, where the
> > > order
> > > > of preference must be specified. If an entry gets more than half the
> > > votes,
> > > > it is the winner. Otherwise, the entry with the lowest number of
> votes
> > is
> > > > removed, and the votes are retallied, taking into account the next
> > > > preferred entry for those whose first entry was removed. This process
> > > > repeats until there is a winner.
> > > >
> > > > The entries are broken up by variants, since some entries have
> multiple
> > > > color or style variations. The entry identifiers are first a capital
> > > > letter, followed by a variation id (described with each entry below),
> > if
> > > > applicable. As an example, if you prefer variant 1 of entry A,
> followed
> > > by
> > > > variant 2 of entry A, variant 3 of entry C, entry D, and lastly
> variant
> > > 4e
> > > > of entry B, the following should be in your reply:
> > > >
> > > > (binding)
> > > > vote: A1, A2, C3, D, B4e
> > > >
> > > > *Entries*
> > > >
> > > > The entries are as follows:
> > > >
> > > > A*.* Submitted by Dustin Haver. This entry has two variants, A1 and
> A2.
> > > >
> > > > [A1]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> > > > [A2]
> > > >
> > https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png
> > > >
> > > > B. Submitted by Stamatis Zampetakis. This has several variants.
> Within
> > > the
> > > > linked entry there are 7 patterns and 7 color palettes. Any vote for
> B
> > > > should contain the pattern number followed by the lowercase letter of
> > the
> > > > color palette. For example, B3e or B1a.
> > > >
> > > > [B]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
> > > >
> > > > C. Submitted by Baris Kazar. This entry has 8 variants.
> > > >
> > > > [C1]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
> > > > [C2]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf
> > > > [C3]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf
> > > > [C4]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/attachment/13006395/lucene_logo4_full.pdf
> > > > [C5]
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/attachment/13006396/lucene_logo5_f

Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread Michael McCandless
Hello,

Yes, that is exactly what MMapDirectory.setPreload is trying to do, but it
makes no promises (it is best effort).  I think it asks the OS to touch all pages in
the mapped region so they are cached in RAM, if you have enough RAM.

Make your JVM heap as low as possible to let the OS have more RAM to use to
load your index.
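
Something like this minimal sketch (Lucene 8.x API; the path is just a
placeholder):

    MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
    dir.setPreload(true);  // best effort: ask the OS to load mapped pages up front
    DirectoryReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);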

Mike McCandless

http://blog.mikemccandless.com


On Sun, Dec 13, 2020 at 4:18 PM  wrote:

> Hi,-
>
> it would be nice to create a Lucene index in files and then effectively
> load it into memory once (since I use it in read-only mode). I am looking into
> whether this is doable in Lucene.
>
> I wish there were an option to load the whole Lucene index into memory:
>
> Both of below urls have links to the blog url where i quoted a very nice
> section:
>
>
> https://lucene.apache.org/core/8_5_0/core/org/apache/lucene/store/MMapDirectory.html
>
> https://lucene.apache.org/core/8_5_2/core/org/apache/lucene/store/MMapDirectory.html
>
> The following blog mentions such an option
> to run in memory: (see the underlined sentence below)
>
>
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html?m=1
>
> MMapDirectory will not load the whole index into physical memory. Why
> should it do this? We just ask the operating system to map the file into
> address space for easy access, by no means we are requesting more. Java and
> the O/S optionally provide the option to try loading the whole file into
> RAM (if enough is available), but Lucene does not use that option (we may
> add this possibility in a later version).
>
> My question is: is there such an option?
> Is the method setPreload for this purpose:
> to load the whole Lucene index into memory?
>
> I would like to use MMapDirectory and set my
> JVM heap to 16G or a bit less (since my index is
> around this much).
>
> The Lucene 8.5.2 (8.5.0 as well) javadocs say:
> public void setPreload(boolean preload)
> Set to true to ask mapped pages to be loaded into physical memory on init.
> The behavior is best-effort and operating system dependent.
>
> For example Lucene 4.0.0 does not have the setPreload method.
>
>
> https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/store/MMapDirectory.html
>
> Happy Holidays
> Best regards
>
>
> P.S. I know there is also the ByteBuffersDirectory class for in-memory Lucene,
> but this requires creating the Lucene index on the fly.
>
> This is great for only such kind of Lucene indexes that can be created
> quickly on the fly.
>
> Ekaterina has a nice article on this ByteBuffersDirectory class:
>
>
> https://medium.com/@ekaterinamihailova/in-memory-search-and-autocomplete-with-lucene-8-5-f2df1bc71c36
>
>


Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
That's great Rob!  Thanks for bringing closure.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Nov 13, 2020 at 9:13 AM Rob Audenaerde 
wrote:

> To follow up, based on a quick JMH-test with 2M docs with some random data
> I see a speedup of 70% :)
> That is a nice friday-afternoon gift, thanks!
>
> For ppl that are interested:
>
> I added a BinaryDocValues field like this:
>
> doc.add(new BinaryDocValuesField("GROUPS_ALLOWED_EMPTY", new BytesRef(new byte[] {0x01})));
>
> And used finalQuery.add(new DocValuesFieldExistsQuery("GROUPS_ALLOWED_EMPTY"),
> BooleanClause.Occur.SHOULD);
>
> On Fri, Nov 13, 2020 at 2:09 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > Maybe NormsFieldExistsQuery as a MUST_NOT clause?  Though, you must
> enable
> > norms on your field to use that.
> >
> > TermRangeQuery is indeed a horribly costly way to execute this, but if
> you
> > cache the result on each refresh, perhaps it is OK?
> >
> > You could also index a dedicated doc values field indicating that the
> > field is empty and then use DocValuesFieldExistsQuery.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Nov 13, 2020 at 7:56 AM Rob Audenaerde  >
> > wrote:
> >
> >> Hi all,
> >>
> >> We have implemented some security on our index by adding a field
> >> 'groups_allowed' to documents, and wrap a boolean must query around the
> >> original query, that checks if one of the given user-groups matches at
> >> least one groups_allowed.
> >>
> >> We chose to leave the groups_allowed field empty when the document should
> >> be able to be retrieved by all users, so we need to also select a document
> >> if the 'groups_allowed' is empty.
> >>
> >> What would be the faster Query construction to do so?
> >>
> >>
> >> Currently I use a TermRangeQuery that basically matches all values and put
> >> that in a MUST_NOT clause combined with a MatchAllDocsQuery(), but that gets
> >> rather slow when the number of groups is high.
> >>
> >> Thanks!
> >>
> >
>


Re: best way (performance wise) to search for field without value?

2020-11-13 Thread Michael McCandless
Maybe NormsFieldExistsQuery as a MUST_NOT clause?  Though, you must enable
norms on your field to use that.
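
Roughly like this (untested sketch, using the field name from your example;
you would add this as a SHOULD clause next to your user-groups clause):

    // matches documents that have NO indexed value for groups_allowed
    Query noGroups = new BooleanQuery.Builder()
        .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
        .add(new NormsFieldExistsQuery("groups_allowed"), BooleanClause.Occur.MUST_NOT)
        .build();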

TermRangeQuery is indeed a horribly costly way to execute this, but if you
cache the result on each refresh, perhaps it is OK?

You could also index a dedicated doc values field indicating that the field
is empty and then use DocValuesFieldExistsQuery.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Nov 13, 2020 at 7:56 AM Rob Audenaerde 
wrote:

> Hi all,
>
> We have implemented some security on our index by adding a field
> 'groups_allowed' to documents, and wrap a boolean must query around the
> original query, that checks if one of the given user-groups matches at
> least one groups_allowed.
>
> We chose to leave the groups_allowed field empty when the document should
> be able to be retrieved by all users, so we need to also select a document if
> the 'groups_allowed' is empty.
>
> What would be the faster Query construction to do so?
>
>
> Currently I use a TermRangeQuery that basically matches all values and put
> that in a MUST_NOT clause combined with a MatchAllDocsQuery(), but that gets
> rather slow when the number of groups is high.
>
> Thanks!
>


Re: BooleanQuery normal form

2020-09-27 Thread Michael McCandless
Hi Patrick,

I don't think Lucene supports CNF or DNF for BooleanQuery?

BooleanQuery will try to do some rewriting simplifications for degenerate
cases, e.g. a BooleanQuery with a single clause.  Probably it could do more
optimizing?  It is quite complex already :)
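
E.g. this is the sort of simplification it already does (small sketch):

    BooleanQuery bq = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST)
        .build();
    // rewrite collapses the degenerate single-clause BooleanQuery to the inner TermQuery
    Query rewritten = searcher.rewrite(bq);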

Mike

On Tue, Sep 22, 2020 at 10:33 PM Haoyu Zhai  wrote:

> Hi there,
>
> Does Lucene support normalizing a BooleanQuery into normalized form (like
> CNF or DNF)?
>
> If not, is there a suggested way of doing it?
>
> Also, I wonder whether there'll be a performance difference between
> different forms of essentially the same BooleanQuery?
>
> Thanks
> Patrick
>
> --
Mike McCandless

http://blog.mikemccandless.com


Re: Optimizing term-occurrence counting (code included)

2020-09-21 Thread Michael McCandless
I left a comment on the issue.

Mike McCandless

http://blog.mikemccandless.com


On Sun, Sep 20, 2020 at 1:08 PM Alex K  wrote:

> Hi all, I'm still a bit stuck on this particular issue. I posted an issue on
> the Elastiknn repo outlining some measurements and thoughts on potential
> solutions: https://github.com/alexklibisz/elastiknn/issues/160
>
> To restate the question: Is there a known optimal way to find and count
> docs matching 10s to 100s of terms? It seems the bottleneck is in the
> PostingsFormat implementation. Perhaps there is a PostingsFormat better
> suited for this usecase?
>
> Thanks,
> Alex
>
> On Fri, Jul 24, 2020 at 7:59 AM Alex K  wrote:
>
> > Thanks Ali. I don't think that will work in this case, since the data I'm
> > counting is managed by lucene, but that looks like an interesting
> project.
> > -Alex
> >
> > On Fri, Jul 24, 2020, 00:15 Ali Akhtar  wrote:
> >
> >> I'm new to lucene so I'm not sure what the best way of speeding this up
> in
> >> Lucene is, but I've previously used https://github.com/npgall/cqengine
> >> for
> >> similar stuff. It provided really good performance, especially if you're
> >> just counting things.
> >>
> >> On Fri, Jul 24, 2020 at 6:55 AM Alex K  wrote:
> >>
> >> > Hi all,
> >> >
> >> > I am working on a query that takes a set of terms, finds all documents
> >> > containing at least one of those terms, computes a subset of candidate
> >> docs
> >> > with the most matching terms, and applies a user-provided scoring
> >> function
> >> > to each of the candidate docs
> >> >
> >> > Simple example of the query:
> >> > - query terms ("aaa", "bbb")
> >> > - indexed docs with terms:
> >> >   docId 0 has terms ("aaa", "bbb")
> >> >   docId 1 has terms ("aaa", "ccc")
> >> > - number of top candidates = 1
> >> > - simple scoring function score(docId) = docId + 10
> >> > The query first builds a count array [2, 1], because docId 0 contains
> >> two
> >> > matching terms and docId 1 contains 1 matching term.
> >> > Then it picks docId 0 as the candidate subset.
> >> > Then it applies the scoring function, returning a score of 10 for
> docId
> >> 0.
> >> >
> >> > The main bottleneck right now is doing the initial counting, i.e. the
> >> part
> >> > that returns the [2, 1] array.
> >> >
> >> > I first started by using a BoolQuery containing a Should clause for
> >> every
> >> > Term, so the returned score was the count. This was simple but very
> >> slow.
> >> > Then I got a substantial speedup by copying and modifying the
> >> > TermInSetQuery so that it tracks the number of times each docId
> >> contains a
> >> > query term. The main construct here seems to be PrefixCodedTerms.
> >> >
> >> > At this point I'm not sure if there's any faster construct, or
> perhaps a
> >> > more optimal way to use PrefixCodedTerms?
> >> >
> >> > Here is the specific query, highlighting some specific parts of the
> >> code:
> >> > - Build the PrefixCodedTerms (in my case the terms are called
> 'hashes'):
> >> >
> >> >
> >>
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
> >> > - Count the matching terms in a segment (this is the main bottleneck
> in
> >> my
> >> > query):
> >> >
> >> >
> >>
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
> >> >
> >> > I appreciate any suggestions you might have.
> >> >
> >> > - Alex
> >> >
> >>
> >
>


Re: [VOTE] Lucene logo contest, third time's a charm

2020-09-02 Thread Michael McCandless
A2, A1, C5, D (binding)

Thank you to everyone for working so hard to make such cool looking
possible future Lucene logos!  And to Ryan for the challenging job of
calling this VOTE :)

Mike McCandless

http://blog.mikemccandless.com


On Tue, Sep 1, 2020 at 4:21 PM Ryan Ernst  wrote:

> Dear Lucene and Solr developers!
>
> Sorry for the multiple threads. This should be the last one.
>
> In February a contest was started to design a new logo for Lucene
> [jira-issue]. The initial attempt [first-vote] to call a vote resulted in
> some confusion on the rules, as well as the request for one additional
> submission. The second attempt [second-vote] yesterday had incorrect links
> for one of the submissions. I would like to call a new vote, now with more
> explicit instructions on how to vote, and corrected links.
>
> *Please read the following rules carefully* before submitting your vote.
>
> *Who can vote?*
>
> Anyone is welcome to cast a vote in support of their favorite
> submission(s). Note that only PMC member's votes are binding. If you are a
> PMC member, please indicate with your vote that the vote is binding, to
> ease collection of votes. In tallying the votes, I will attempt to verify
> only those marked as binding.
>
>
> *How do I vote?*
> Votes can be cast simply by replying to this email. It is a ranked-choice
> vote [rank-choice-voting]. Multiple selections may be made, where the order
> of preference must be specified. If an entry gets more than half the votes,
> it is the winner. Otherwise, the entry with the lowest number of votes is
> removed, and the votes are retallied, taking into account the next
> preferred entry for those whose first entry was removed. This process
> repeats until there is a winner.
>
> The entries are broken up by variants, since some entries have multiple
> color or style variations. The entry identifiers are first a capital
> letter, followed by a variation id (described with each entry below), if
> applicable. As an example, if you prefer variant 1 of entry A, followed by
> variant 2 of entry A, variant 3 of entry C, entry D, and lastly variant 4e
> of entry B, the following should be in your reply:
>
> (binding)
> vote: A1, A2, C3, D, B4e
>
> *Entries*
>
> The entries are as follows:
>
> A*.* Submitted by Dustin Haver. This entry has two variants, A1 and A2.
>
> [A1]
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> [A2]
> https://issues.apache.org/jira/secure/attachment/12997172/LuceneLogo.png
>
> B. Submitted by Stamatis Zampetakis. This has several variants. Within the
> linked entry there are 7 patterns and 7 color palettes. Any vote for B
> should contain the pattern number followed by the lowercase letter of the
> color palette. For example, B3e or B1a.
>
> [B]
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
>
> C. Submitted by Baris Kazar. This entry has 8 variants.
>
> [C1]
> https://issues.apache.org/jira/secure/attachment/13006392/lucene_logo1_full.pdf
> [C2]
> https://issues.apache.org/jira/secure/attachment/13006393/lucene_logo2_full.pdf
> [C3]
> https://issues.apache.org/jira/secure/attachment/13006394/lucene_logo3_full.pdf
> [C4]
> https://issues.apache.org/jira/secure/attachment/13006395/lucene_logo4_full.pdf
> [C5]
> https://issues.apache.org/jira/secure/attachment/13006396/lucene_logo5_full.pdf
> [C6]
> https://issues.apache.org/jira/secure/attachment/13006397/lucene_logo6_full.pdf
> [C7]
> https://issues.apache.org/jira/secure/attachment/13006398/lucene_logo7_full.pdf
> [C8]
> https://issues.apache.org/jira/secure/attachment/13006399/lucene_logo8_full.pdf
>
> D. The current Lucene logo.
>
> [D]
> https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>
> Please vote for one of the above choices. This vote will close about one
> week from today, Mon, Sept 7, 2020 at 11:59PM.
>
> Thanks!
>
> [jira-issue] https://issues.apache.org/jira/browse/LUCENE-9221
> [first-vote]
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202006.mbox/%3cCA+DiXd74Mz4H6o9SmUNLUuHQc6Q1-9mzUR7xfxR03ntGwo=d...@mail.gmail.com%3e
> [second-vote]
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202009.mbox/%3cCA+DiXd7eBrQu5+aJQ3jKaUtUTJUqaG2U6o+kUZfNe-m=smn...@mail.gmail.com%3e
> [rank-choice-voting] https://en.wikipedia.org/wiki/Instant-runoff_voting
>


Re: Hierarchical facet select a subtree but one child

2020-08-17 Thread Michael McCandless
I think this is a missing API in DrillDownQuery?

Nicola, could you open an issue?

The filtering is as Mike Sokolov described, but I think we should add a
sugar method, e.g. DrillDownQuery.remove or something, to add a negated
query clause.

And until this API is added and you can upgrade to it, you can construct
your own TermQuery and then add it as a MUST_NOT clause.  Look at how
DrillDownQuery.add converts incoming facet paths to terms and use the
public DrillDownQuery.term method it exposes to create your own negated
TermQuery.
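
Until then, the workaround could look roughly like this (untested sketch;
"$facets" is the default indexed facet field, and config/baseQuery are your
own FacetsConfig and base query):

    DrillDownQuery ddq = new DrillDownQuery(config, baseQuery);
    ddq.add("Facet", "V1");  // select the whole V1 subtree

    BooleanQuery.Builder b = new BooleanQuery.Builder();
    b.add(ddq, BooleanClause.Occur.MUST);
    // negate just the V1.1 child, using the same term DrillDownQuery.add would create
    b.add(new TermQuery(DrillDownQuery.term("$facets", "Facet", "V1", "V1.1")),
        BooleanClause.Occur.MUST_NOT);
    Query query = b.build();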

Mike McCandless

http://blog.mikemccandless.com


On Sat, Aug 15, 2020 at 11:55 AM Michael Sokolov  wrote:

> If you are trying to show documents that have facet value V1 excluding
> those with facet value V1.1, then you would need to issue a query
> like:
>
> +f:V1 -f:V1.1
>
> assuming your facet values are indexed in a field called "f". I don't
> think this really has anything to do with faceting; it's just a
> filtering problem.
>
> On Tue, Aug 4, 2020 at 4:47 AM nbuso  wrote:
> >
> > Hi,
> >
> > is there someone that can point me to the right API to negate facet
> > values?
> > Maybe this DrillDownQuery#add(dim, query) is the API that permits this use
> > case?
> >
> https://lucene.apache.org/core/8_5_2/facet/org/apache/lucene/facet/DrillDownQuery.html#add-java.lang.String-org.apache.lucene.search.Query-
> >
> >
> > Nicola
> >
> >
> > On 2020-07-29 10:27, nbuso wrote:
> > > Hi,
> > >
> > > I'm a bit rusty with Lucene facets API and I have a common use case
> > > that I would like to solve.
> > > Suppose the following facet values tree:
> > >
> > > Facet
> > >  - V1
> > >- V1.1
> > >- V1.2
> > >- V1.3
> > >- V1.4
> > >- (not topK values)
> > >  - V2
> > >- V2.1
> > >- V2.2
> > >- V2.3
> > >- V2.4
> > >- (not topK values)
> > >
> > > With (not topK values) I mean values you are not showing in the UI
> > > because of space/visualization problems. You usually see them with the
> > > links "More ..."
> > >
> > > Use case:
> > > 1 - select V1 => all V1.x are selected
> > > 2 - de-select V1.1
> > >
> > > How can I achieve this? from the search results I know the values
> > > V1.[1-4] but I don't know the values that are not in topK. How can I
> > > select all the V1 subtree but V1.1?
> > >
> > > Please let me know if you need more info.
> > >
> > >
> > > Nicola Buso - EBI
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Adding fields with same field type complains that they have different term vector settings

2020-06-30 Thread Michael McCandless
Hi Albert,

You're welcome.

Yes, it is likely Lucene changed behavior on this front since 3.0, and I
fear you are correct: I do not remember!

You could get away with loading a Document and making a new Document (for
indexing) from that one, if you use the loaded Document only to create e.g.
string and numeric values from the original document, but not schema level
information like whether offsets/positions are indexed into postings and
term vectors for each field, or not.  That would be safe, if you are trying
to avoid the cost of retrieving the full values for all fields from your
backing source of truth.

But you must make all such fields (that might need re-indexing) stored in
your index, so that their original value is returned when you load the
Document.  This may increase the size of your index beyond what would be
needed purely for searching.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 29, 2020 at 10:39 AM Albert MacSweeny <
albert.macswe...@profium.com> wrote:

> Hi MIke,
>
> Thanks for the quick reply. I guess this approach worked ok in version
> 3.0.0 since the project I'm working on relied on it? I know it's a long
> time ago maybe you don't remember :)
>
> I'm worried on my side that reconstructing a full doc in this situation
> might have a high performance cost so I'd like to avoid it if I can (or I
> might not have the original value of all fields available). Do you think it
> would work to just reconstruct the values for the field being modified, or
> am I likely to just run into more issues by modifying a loaded Document?
>
> Regards,
> Albert
>
> > From: "Michael McCandless" 
> > To: "java-user" , "albert macsweeny"
> > 
> > Sent: Monday, 29 June, 2020 15:23:43
> > Subject: Re: Adding fields with same field type complains that they have
> > different term vector settings
>
> > Hi Albert,
> > Unfortunately, you have fallen into a common and sneaky Lucene trap.
> > The problem happens because you loaded a Document from the index's
> stored fields
> > (the one you previously indexed) and then tried to modify that one and
> > re-index.
>
> > Lucene does not guarantee that this will work, because Lucene does not
> store all
> > information necessary to precisely reconstruct the original document you
> had
> > indexed.
>
> > The Document you loaded from the index is subtly different from the one
> you had
> > previously indexed. In particular, your custom FIELD_TYPE details were
> lost.
>
> > To sidestep this tar pit you must fully reconstruct the document
> yourself each
> > time you add it to the index.
>
> > Mike McCandless
>
> > http://blog.mikemccandless.com
>
> > On Mon, Jun 29, 2020 at 9:56 AM Albert MacSweeny <
> > albert.macswe...@profium.com> wrote:
>
> >> Hi,
>
> >> I'm upgrading a project to lucene 8.5.2 which had been using 3.0.0.
>
> >> Some tests are failing with a strange issue. The gist of it is, we
> create fields
> >> that need position and offset information. Inserting one field works
> ok, but
> >> then searching for the document and adding another value for the same
> field
> >> results in the following exception
>
> >> java.lang.IllegalArgumentException: all instances of a given field name
> must
> >> have the same term vectors settings (storeTermVectorPositions changed
> for
> >> field="f1")
> >> at
> >>
> org.apache.lucene.index.TermVectorsConsumerPerField.start(TermVectorsConsumerPerField.java:166)
> >> at
> org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:294)
> >> at
> >>
> org.apache.lucene.index.FreqProxTermsWriterPerField.start(FreqProxTermsWriterPerField.java:72)
> >> at
> >>
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:810)
> >> at
> >>
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
> >> at
> >>
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
> >> at
> >>
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
> >> at
> >>
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
> >> at
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
> >> at
> org.apache.lucene.index.IndexWriter.addDocument(Ind

Re: Adding fields with same field type complains that they have different term vector settings

2020-06-29 Thread Michael McCandless
Hi Albert,

Unfortunately, you have fallen into a common and sneaky Lucene trap.

The problem happens because you loaded a Document from the index's stored
fields (the one you previously indexed) and then tried to modify that one
and re-index.

Lucene does not guarantee that this will work, because Lucene does not
store all information necessary to precisely reconstruct the original
document you had indexed.

The Document you loaded from the index is subtly different from the one you
had previously indexed.  In particular, your custom FIELD_TYPE details were
lost.

To sidestep this tar pit you must fully reconstruct the document yourself
each time you add it to the index.
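
I.e. something along these lines (sketch; the "id" field and the original
field values are assumed to come from your own storage, not from
searcher.doc(), and FIELD_TYPE is your frozen FieldType):

    Document fresh = new Document();
    fresh.add(new Field("f1", "foo", FIELD_TYPE));  // original value, from your source of truth
    fresh.add(new Field("f1", "bar", FIELD_TYPE));  // the newly added value
    writer.updateDocument(new Term("id", docId), fresh);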

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 29, 2020 at 9:56 AM Albert MacSweeny <
albert.macswe...@profium.com> wrote:

> Hi,
>
> I'm upgrading a project to lucene 8.5.2 which had been using 3.0.0.
>
> Some tests are failing with a strange issue. The gist of it is, we create
> fields that need position and offset information. Inserting one field works
> ok, but then searching for the document and adding another value for the
> same field results in the following exception
>
> java.lang.IllegalArgumentException: all instances of a given field name
> must have the same term vectors settings (storeTermVectorPositions changed
> for field="f1")
> at
> org.apache.lucene.index.TermVectorsConsumerPerField.start(TermVectorsConsumerPerField.java:166)
> at
> org.apache.lucene.index.TermsHashPerField.start(TermsHashPerField.java:294)
> at
> org.apache.lucene.index.FreqProxTermsWriterPerField.start(FreqProxTermsWriterPerField.java:72)
> at
> org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:810)
> at
> org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
> at
> org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
> at
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
> at
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
> at
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
> at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213)
> at com.profium.sir.LuceneTest.writeDoc(LuceneTest.java:66)
> at com.profium.sir.LuceneTest.testLucene(LuceneTest.java:58)
>
> This is happening even though the exact same FieldType object is being
> used in the field each time, and it is frozen.
>
> I've isolated the problem to the following code snippet which reproduces
> it:
>
>
> import java.io.IOException;
> import java.nio.file.Path;
>
> import org.apache.lucene.analysis.en.EnglishAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.FieldType;
> import org.apache.lucene.index.DirectoryReader;
> import org.apache.lucene.index.IndexOptions;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.MMapDirectory;
>
> public class LuceneTest {
>
>     private static FieldType FIELD_TYPE = new FieldType();
>
>     static {
>         FIELD_TYPE.setStored(true);
>         FIELD_TYPE.setTokenized(true);
>         FIELD_TYPE.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
>         FIELD_TYPE.setStoreTermVectors(true);
>         FIELD_TYPE.setStoreTermVectorPayloads(true);
>         FIELD_TYPE.setStoreTermVectorPositions(true);
>         FIELD_TYPE.setStoreTermVectorOffsets(true);
>         FIELD_TYPE.freeze();
>     }
>
>     public static void main(String[] args) throws IOException {
>         testLucene();
>     }
>
>     public static void testLucene() throws IOException {
>         Document doc = new Document();
>         doc.add(new Field("f1", "foo", FIELD_TYPE));
>         writeDoc(doc);
>
>         IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(getDirectory()));
>         doc = searcher.doc(0);
>
>         doc.add(new Field("f1", "bar", FIELD_TYPE));
>         writeDoc(doc);
>     }
>
>     private static void writeDoc(Document doc) throws IOException {
>         Directory directory = getDirectory();
>         IndexWriterConfig conf = new IndexWriterConfig(new EnglishAnalyzer());
>         IndexWriter writer = new IndexWriter(directory, conf);
>         writer.addDocument(doc);
>         writer.flush();
>         writer.close();
>     }
>
>     private static Directory getDirectory() throws IOException {
>         return new MMapDirectory(Path.of("lucenttest"));
>     }
> }
>

Re: Sharing buffer between large number of IndexWriters?

2020-06-22 Thread Michael McCandless
Hello Marcin,

Alas, Lucene does not have this capability out of the box.

However, you are able to live-update the
IndexWriterConfig.setRAMBufferSizeMB, and the change should take effect on
the next document indexed in that IndexWriter instance.  So you could build
your own "proportional RAM" on top of that.

But I would worry about the little not-accounted-for RAM that IndexWriter
uses ... summed across a few thousand instances that might start to matter.

When there are no merges running, IndexWriter should be quick to close and
re-open; maybe you want to do that more aggressively.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Jun 16, 2020 at 9:25 AM Marcin Okraszewski  wrote:

> Hi,
> I want to create a separate index per tenant in application. It is due to
> both strong data separation requirements as well as query performance
> (active tenants with large indices affect others). The number of active
> IndexWriters would go into the thousands. One of the concerns that arises
> is the RAM buffers needed by IndexWriters, as even a few MBs of buffer per
> writer translates into heavy GBs of RAM.
>
> Is there any way to give all IndexWriters one cumulative limit of RAM so
> that they can share it proportionally to their traffic?
>
> Thank you,
> Marcin
>


Re: [VOTE] Lucene logo contest

2020-06-17 Thread Michael McCandless
Change is good :)

I vote Option A (binding PMC vote).

Thank you to all the open-source artists who helped out here.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 15, 2020 at 6:08 PM Ryan Ernst  wrote:

> Dear Lucene and Solr developers!
>
> In February a contest was started to design a new logo for Lucene [1].
> That contest concluded, and I am now (admittedly a little late!) calling a
> vote.
>
> The entries are labeled as follows:
>
> A. Submitted by Dustin Haver [2]
>
> B. Submitted by Stamatis Zampetakis [3] Note that this has several
> variants. Within the linked entry there are 7 patterns and 7 color
> palettes. Any vote for B should contain the pattern number, like B1 or B3.
> If a B variant wins, we will have a followup vote on the color palette.
>
> C. The current Lucene logo [4]
>
> Please vote for one of the three (or nine depending on your perspective!)
> above choices. Note that anyone in the Lucene+Solr community is invited to
> express their opinion, though only Lucene+Solr PMC cast binding votes
> (indicate non-binding votes in your reply, please). This vote will close
> one week from today, Mon, June 22, 2020.
>
> Thanks!
>
> [1] https://issues.apache.org/jira/browse/LUCENE-9221
> [2]
> https://issues.apache.org/jira/secure/attachment/12999548/Screen%20Shot%202020-04-10%20at%208.29.32%20AM.png
> [3]
> https://issues.apache.org/jira/secure/attachment/12997768/zabetak-1-7.pdf
> [4]
> https://lucene.apache.org/theme/images/lucene/lucene_logo_green_300.png
>


Re: CheckIndex complaining about -1 for norms value

2020-06-11 Thread Michael McCandless
Maybe we should fix CheckIndex to print norms as unsigned integers?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jun 11, 2020 at 3:00 AM Adrien Grand  wrote:

> To my knowledge, -1 always represented the maximum supported length, both
> before and after 7.0 (when we changed the norms encoding). One thing that
> changed when we introduced sparse norms is that documents with no value
> moved from having 0 as a norm to not having a norm at all, but I don't see
> how this could explain what you are seeing either.
>
> Do you know what is the Lucene version that initially indexed this document
> (and thus computed the norm value)?
>
> On Thu, Jun 11, 2020 at 8:45 AM Trejkaz  wrote:
>
> > Well,
> >
> > We're using the default Lucene similarity. But as far as I know, we've
> > always disabled norms as well. So I'm surprised I'm even seeing norms
> > mentioned in the context of our own index, which is why I wondered
> > whether -1 might have been an older placeholder for "no value" which
> > later became 0 or something.
> >
> > About the only thing I'm sure about at the moment is that whatever is
> > going on is weird.
> >
> > TX
> >
> > On Thu, 11 Jun 2020 at 15:38, Adrien Grand  wrote:
> > >
> > > Hi Trejkaz,
> > >
> > > Negative norm values are legal. The problem here is that Lucene expects
> > > that documents that have no terms must either not have a norm value
> > > (typically because the document doesn't have a value for the field),
> or a
> > > norm value equal to 0 (typically because the token stream over the
> field
> > > value produced no tokens).
> > >
> > > Are you using a custom similarity or one of the Lucene ones? One would
> > only
> > > get -1 as a norm with the Lucene similarities if it had a number of
> > tokens
> > > that is very close to Integer.MAX_VALUE.
> > >
> > > On Thu, Jun 11, 2020 at 4:22 AM Trejkaz  wrote:
> > >
> > > > Hi all.
> > > >
> > > > We use CheckIndex as a post-migration sanity check and are seeing
> this
> > > > quirk, and I'm wondering whether negative norms is even legit or
> > > > whether it should have been treated as if it were zero...
> > > >
> > > > TX
> > > >
> > > >
> > > > 0.00% total deletions; 378 documents; 0 deletions
> > > > Segments file=segments_1 numSegments=1 version=8.5.1
> > > > id=52isly98kogao7j0cnautwknj
> > > >   1 of 1: name=_0 maxDoc=378
> > > > version=8.5.1
> > > > id=52isly98kogao7j0cnautwkni
> > > > codec=Lucene84
> > > > compound=false
> > > > numFiles=18
> > > > size (MB)=0.663
> > > > diagnostics = {java.vendor=Oracle Corporation, os=Mac OS X,
> > > > java.version=1.8.0_191, java.vm.version=25.191-b12,
> > > > lucene.version=8.5.1, os.arch=x86_64,
> > > > java.runtime.version=1.8.0_191-b12,
> source=addIndexes(CodecReader...),
> > > > os.version=10.15.5, timestamp=1591841756208}
> > > > no deletions
> > > > test: open reader.OK [took 0.004 sec]
> > > > test: check integrity.OK [took 0.002 sec]
> > > > test: check live docs.OK [took 0.000 sec]
> > > > test: field infos.OK [36 fields] [took 0.000 sec]
> > > > test: field norms.OK [26 fields] [took 0.001 sec]
> > > > test: terms, freq, prox...ERROR: java.lang.RuntimeException:
> > > > Document 0 doesn't have terms according to postings but has a norm
> > > > value that is not zero: -1
> > > >
> > > > java.lang.RuntimeException: Document 0 doesn't have terms according
> to
> > > > postings but has a norm value that is not zero: -1
> > > > at
> org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1678)
> > > > at
> > org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1871)
> > > > at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:724)
> > > > at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)
> > > >
> > > > test: stored fields...OK [15935 total field count; avg 42.2
> > > > fields per doc] [took 0.003 sec]
> > > > test: term vectorsOK [1173 total term vector count; avg
> > > > 3.1 term/freq vector fields per doc] [took 0.170 sec]
> > > > test: docvalues...OK [16 docvalues fields; 11 BINARY; 2
> > > > NUMERIC; 0 SORTED; 2 SORTED_NUMERIC; 1 SORTED_SET] [took 0.003 sec]
> > > > test: points..OK [4 fields, 1509 points] [took 0.000
> > sec]
> > > > FAILED
> > > > WARNING: exorciseIndex() would remove reference to this segment;
> > > > full exception:
> > > > java.lang.RuntimeException: Term Index test failed
> > > > at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:750)
> > > > at org.apache.lucene.index.CheckIndex.doCheck(CheckIndex.java:2973)
> > > >
> > > > WARNING: 1 broken segments (containing 378 documents) detected
> > > > Took 0.355 sec total.
> > > > WARNING: would write new segments file, and 378 documents would be
> > > > lost, if -exorcise were specified
> > > >
> > > > -
> > > > To unsubscribe, e-ma

Re: Lucene Migration issue

2020-06-08 Thread Michael McCandless
You're welcome!

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 8, 2020 at 10:48 AM Adarsh Sunilkumar <
adarshsunilkuma...@gmail.com> wrote:

> Hi Michael,
>
> Thanks for your information.
>
>
> Thanks&Regards,
> Adarsh Sunilkumar
>
> On Mon, Jun 8, 2020, 20:15 Michael McCandless 
> wrote:
>
>> Ahh, yes it does!  That is the change that made Lucene catch this
>> mis-use, whereas previously it would silently throw things away (term
>> frequencies and positions).
>>
>> If you want to simply continue throwing things away like Lucene did
>> before, without rebuilding your index, switch your indexing to
>> IndexOptions.DOCS.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Jun 8, 2020 at 2:07 AM Adarsh Sunilkumar <
>> adarshsunilkuma...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Thanks for the information. Does the error have any relationship with
>>> this patch: https://issues.apache.org/jira/browse/LUCENE-8134
>>>
>>> Thanks& Regards,
>>> Adarsh Sunilkumar
>>>
>>> On Fri, Jun 5, 2020 at 7:28 PM Michael McCandless <
>>> luc...@mikemccandless.com> wrote:
>>>
>>>>> This just means you previously indexed only docids (skipping term
>>>> frequencies, positions) for at least one of the fields in at least one
>>>> document in your existing index.
>>>>
>>>> But now you are trying to also index with term frequencies and
>>>> positions, which Lucene cannot do.
>>>>
>>>> You either have to reindex with
>>>> IndexOptions.DOCS_AND_FREQS_AND_POSITIONS, or keep your old index but index
>>>> that offending field with IndexOptions.DOCS.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Fri, Jun 5, 2020 at 6:58 AM Adarsh Sunilkumar <
>>>> adarshsunilkuma...@gmail.com> wrote:
>>>>
>>>>> Hi team,
>>>>>
>>>>> I am getting this error when I have tried migrating from lucene 7.3 to
>>>>> 8.5.1
>>>>> cannot change field "abc" from index options=DOCS to inconsistent index
>>>>> options=DOCS_AND_FREQS_AND_POSITIONS
>>>>> can I know the issue ?
>>>>> what should I change ?
>>>>> thanks in advance.
>>>>>
>>>>>
>>>>> Thanks&Regards,
>>>>> Adarsh Sunilkumar
>>>>>
>>>>


Re: Lucene Migration issue

2020-06-08 Thread Michael McCandless
Ahh, yes it does!  That is the change that made Lucene catch this mis-use,
whereas previously it would silently throw things away (term frequencies
and positions).

If you want to simply continue throwing things away like Lucene did before,
without rebuilding your index, switch your indexing to IndexOptions.DOCS.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jun 8, 2020 at 2:07 AM Adarsh Sunilkumar <
adarshsunilkuma...@gmail.com> wrote:

> Hi Michael,
>
> Thanks for the information. Does the error have any relationship with this
> patch: https://issues.apache.org/jira/browse/LUCENE-8134
>
> Thanks& Regards,
> Adarsh Sunilkumar
>
> On Fri, Jun 5, 2020 at 7:28 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> This just means you previously indexed only docids (skipping term
>> frequencies, positions) for at least one of the fields in at least one
>> document in your existing index.
>>
>> But now you are trying to also index with term frequencies and positions,
>> which Lucene cannot do.
>>
>> You either have to reindex with
>> IndexOptions.DOCS_AND_FREQS_AND_POSITIONS, or keep your old index but index
>> that offending field with IndexOptions.DOCS.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Fri, Jun 5, 2020 at 6:58 AM Adarsh Sunilkumar <
>> adarshsunilkuma...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I am getting this error when I have tried migrating from lucene 7.3 to
>>> 8.5.1
>>> cannot change field "abc" from index options=DOCS to inconsistent index
>>> options=DOCS_AND_FREQS_AND_POSITIONS
>>> can I know the issue ?
>>> what should I change ?
>>> thanks in advance.
>>>
>>>
>>> Thanks&Regards,
>>> Adarsh Sunilkumar
>>>
>>


Re: Lucene Migration issue

2020-06-05 Thread Michael McCandless
This just means you previously indexed only docids (skipping term
frequencies, positions) for at least one of the fields in at least one
document in your existing index.

But now you are trying to also index with term frequencies and positions,
which Lucene cannot do.

You either have to reindex with IndexOptions.DOCS_AND_FREQS_AND_POSITIONS,
or keep your old index but index that offending field with
IndexOptions.DOCS.
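
E.g., to keep the old index, something like this (sketch; the TextField
defaults are just a starting point):

    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setIndexOptions(IndexOptions.DOCS);  // match what the existing index has for "abc"
    ft.freeze();
    doc.add(new Field("abc", value, ft));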

Mike McCandless

http://blog.mikemccandless.com


On Fri, Jun 5, 2020 at 6:58 AM Adarsh Sunilkumar <
adarshsunilkuma...@gmail.com> wrote:

> Hi team,
>
> I am getting this error when I have tried migrating from lucene 7.3 to
> 8.5.1
> cannot change field "abc" from index options=DOCS to inconsistent index
> options=DOCS_AND_FREQS_AND_POSITIONS
> can I know the issue ?
> what should I change ?
> thanks in advance.
>
>
> Thanks&Regards,
> Adarsh Sunilkumar
>


Re: Retrieving query-time join fromQuery hits

2020-06-03 Thread Michael McCandless
Actually, I do not see how this can work efficiently with per-hit queries
after the join.

For each of the final joined hits, you must 1) retrieve the join key
value(s) by pulling doc values iterators and advancing to the right docid,
2) run another query to "join backwards" to the hits from the left side of
the join.

I don't see how step 2) can work efficiently when there are many possible
hits on the left side that might have matched those join keys?

Elasticsearch offers query time joins ... I wonder how it retrieves and
returns hits from both left and right?  It seems like the left side of the
join must retain some state, to know which top hits corresponded to those
join values, and then add an API to retrieve them?

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 20, 2020 at 6:31 PM Michael McCandless <
luc...@mikemccandless.com> wrote:

> I am trying first to understand the proposed solution from the previous
> thread.
>
> You run query #1, it returns top N hits.  From those hits you ask JoinUtil
> to create the "joined" query #2.  You run the query #2 to get the top final
> (joined) hits.
>
> Then, to reconstruct which docids from query #1 matched which hits from
> query #2, do you run a new query for every hit out of query #2?  E.g. if
> you want top 10 hits, you must run 10 new queries in the end, to match up
> each docid in the final result set with each docid hit from query #1?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, May 12, 2020 at 12:23 PM Stefan Onofrei 
> wrote:
>
>> Hi,
>>
>> When using Lucene’s query-time join feature [1], how can the hits from the
>> first phase which determine / contribute to the returned results be
>> retrieved?
>>
>> This topic has been brought up before [2], and at the time the
>> recommendation was to re-run the query with added constraints based on the
>> join fields values. Is there any alternative way of doing this when trying
>> to get the contributing hits for every returned result and in the context
>> of having multiple terms in the toField?
>>
>> I see that the info that is being tracked by the Join API refers to the
>> scores and the terms collected in the first phase. During this feature’s
>> development [3] there was also a 3-phased approach taken into
>> consideration, which involved recording fromSearcher’s docIds, translating
>> them into joinable terms and then recording toSearcher’s docIds. However,
>> even if docId info would be recorded between phases, it would then have to
>> be exposed somehow.
>>
>> Thanks,
>> Stefan Onofrei
>>
>> [1]
>>
>> https://lucene.apache.org/core/8_5_1/join/org/apache/lucene/search/join/JoinUtil.html
>> [2]
>>
>> https://lucene.472066.n3.nabble.com/access-to-joined-documents-td4412376.html
>> [3] https://issues.apache.org/jira/browse/LUCENE-3602
>>
>


Re: Retrieving query-time join fromQuery hits

2020-05-20 Thread Michael McCandless
I am trying first to understand the proposed solution from the previous
thread.

You run query #1, it returns top N hits.  From those hits you ask JoinUtil
to create the "joined" query #2.  You run the query #2 to get the top final
(joined) hits.

Then, to reconstruct which docids from query #1 matched which hits from
query #2, do you run a new query for every hit out of query #2?  E.g. if
you want top 10 hits, you must run 10 new queries in the end, to match up
each docid in the final result set with each docid hit from query #1?

Mike McCandless

http://blog.mikemccandless.com


On Tue, May 12, 2020 at 12:23 PM Stefan Onofrei 
wrote:

> Hi,
>
> When using Lucene’s query-time join feature [1], how can the hits from the
> first phase which determine / contribute to the returned results be
> retrieved?
>
> This topic has been brought up before [2], and at the time the
> recommendation was to re-run the query with added constraints based on the
> join fields values. Is there any alternative way of doing this when trying
> to get the contributing hits for every returned result and in the context
> of having multiple terms in the toField?
>
> I see that the info that is being tracked by the Join API refers to the
> scores and the terms collected in the first phase. During this feature’s
> development [3] there was also a 3-phased approach taken into
> consideration, which involved recording fromSearcher’s docIds, translating
> them into joinable terms and then recording toSearcher’s docIds. However,
> even if docId info would be recorded between phases, it would then have to
> be exposed somehow.
>
> Thanks,
> Stefan Onofrei
>
> [1]
>
> https://lucene.apache.org/core/8_5_1/join/org/apache/lucene/search/join/JoinUtil.html
> [2]
>
> https://lucene.472066.n3.nabble.com/access-to-joined-documents-td4412376.html
> [3] https://issues.apache.org/jira/browse/LUCENE-3602
>


Re: Resizable LRUQueryCache

2020-03-10 Thread Michael McCandless
Maybe start with your own cache implementation that implements a resize
method?  The cache is pluggable through IndexSearcher.
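
Roughly like this (sketch; a truly resizable cache would be your own class
implementing QueryCache, shown here with the stock LRUQueryCache for brevity):

    QueryCache cache = new LRUQueryCache(1_000, 64 * 1024 * 1024);  // maxSize, maxRamBytesUsed
    IndexSearcher.setDefaultQueryCache(cache);  // process-wide default
    // or per searcher:
    searcher.setQueryCache(cache);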

Fully discarding the cache and swapping in a newly sized (empty) one could
also be jarring, so I can see the motivation for this method.  Especially
for usages that are hosting many separate indices and want some control
over total heap usage.

Mike McCandless

http://blog.mikemccandless.com


On Tue, Mar 10, 2020 at 1:48 PM Aadithya C  wrote:

> Hi Atri,
>
> Thanks for pointing out potential performance issues for high QPS workloads
> when downsizing. Depending on the size of the cache, it might make sense to
> create a new one with the necessary entries. On the other hand, if you
> consider cases where the process is under heavy memory pressure and
> performance is already bad, resizing the cache will only help irrespective
> of the workload. Additionally, we can also explore performance
> optimizations for large evictions.
> I can create a new derivative but I was wondering if this change can be
> beneficial to other users and should be a part of lucene. One advantage of
> being part of the LRUQueryCache is that this feature will evolve with other
> cache changes.
>
> Adithya
>
> On Thu, Mar 5, 2020 at 7:17 PM Atri Sharma  wrote:
>
> > On Fri, Mar 6, 2020 at 1:04 AM Aadithya C  wrote:
> > >
> > > In my personal opinion, there are a few advantages of resizing -
> > >
> > >
> > > 1) The size of the cache is unpredictable as there is a
> > fixed(guesstimate)
> > > accounting for the key size. With a resizable cache, we can potentially
> > > cache heavier queries and exploratively resize the cache when faced
> with
> > > memory pressure.
> > >
> > >
> > > 2) With a resizable cache, we can control memory allocation dynamically
> > > based on the workload. For a large cache, dropping the entire cache
> when
> > we
> > > want to reallocate some memory seems like an excessive action.
> > >
> > >
> > > 3) The query cache effectiveness is very workload dependent. I have
> > > observed the cache hit-rate can be in single digits for certain
> workloads
> > > and memory can be effectively used elsewhere.
> >
> > When you say resizing, do you mean only increasing sizes or decreasing as
> > well?
> >
> > IMO, this would require defining a model for on demand eviction until
> > the cache reaches a target size (be mindful of what happens to
> > incoming caching requests in that period). A potentially large
> > downsizing request can lead to a large eviction time -- not sure what
> > impact it would have on query performance.
> >
> > I would agree with Adrien's suggestion of using a new cache unless you
> > have a very compelling reason not to. Even then, I would be wary of
> > muddying LRUQueryCache's waters -- you might want to create your own
> > custom derivative of it?
> >
> > Atri
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
> >
>


Re: Lucene 7.7.2 Indexwriter.numDocs() replacement in Lucene 8.4.1

2020-02-26 Thread Michael McCandless
Yes.
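
I.e. (small sketch):

    IndexWriter.DocStats stats = writer.getDocStats();
    int numDocs = stats.numDocs;  // like the old numDocs(): doc count excluding deletions
    int maxDoc = stats.maxDoc;    // doc count including deletions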

Mike McCandless

http://blog.mikemccandless.com


On Mon, Feb 24, 2020 at 5:55 PM  wrote:

> A typo corrected below.
>
> Best regards
>
>
> On 2/24/20 5:54 PM, baris.ka...@oracle.com wrote:
> > Hi,-
> >
> >  I hope everyone is doing great.
> >
> >
> > I think the Lucene 7.7.2  Indexwriter.numDocs()
> >
> >
> https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#numDocs--
> >
> >
> > can be replaced by the following in Lucene 8.4.1, right?
> >
> >
> https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/index/IndexWriter.html#getDocStats--
> >
> > --->>>
> >
> https://lucene.apache.org/core/8_4_1/core/org/apache/lucene/index/IndexWriter.DocStats.html#numDocs
> >
> > i.e., IndexWriter.getDocStats().numDocs
> >
> > Best regards
> >
> >
> >
>


Re: Searching number of tokens in text field

2020-01-02 Thread Michael McCandless
Norms encode the number of tokens in the field, but in a lossy manner (1
byte by default), so you could probably create a custom query that filtered
based on that, if you could tolerate the loss in precision?  Or maybe
change your norms storage to more precision?

You could use NormsFieldExistsQuery as a starting point for the sources for
your custom query.  Or maybe there's already a more similar Query based on
norms?
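
A minimal sketch of reading the (lossy) per-document norm for a field, assuming
"reader" is an open IndexReader; the field name "body" is just an example, and
how the norm maps back to a token count depends on the Similarity used at
index time:

    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.DocIdSetIterator;

    for (LeafReaderContext ctx : reader.leaves()) {
      NumericDocValues norms = ctx.reader().getNormValues("body");
      if (norms == null) {
        continue;  // no norms for this field in this segment
      }
      while (norms.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
        long encodedLength = norms.longValue();
        // decode/compare against the desired minimum token count here
      }
    }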

Mike McCandless

http://blog.mikemccandless.com


On Mon, Dec 30, 2019 at 8:07 AM Erick Erickson 
wrote:

> This comes up occasionally, it’d be a neat thing to add to Solr if you’re
> motivated. It gets tricky though.
>
> - part of the config would have to be the name of the length field to put
> the result into, that part’s easy.
>
> - The trickier part is “when should the count be incremented?”. For
> instance, say you add 15 synonyms for a particular word. Would that add 1
> or 16 to the count? What about WordDelimiterGraphFilterFactory, that can
> output N tokens in place of one. Do stopwords count? What about shingles?
> CJK languages? The list goes on.
>
> If you tackle this I suggest you open a JIRA for discussion, probably a
> Lucene JIRA ‘cause the folks who deal with Lucene would have the best
> feedback. And probably ignore most of the possible interactions with other
> filters and document that most users should just put it immediately after
> the tokenizer and leave it at that ;)
>
> I can think of a few other options, but about the only thing that I think
> makes sense is something like “countTokensInTheSamePosition=true|false”
> (there’s _GOT_ to be a better name for that!), defaulting to false so you
> could control whether synonym expansion and WDGFF insertions incremented
> the count or not. And I suspect that if you put such a filter after WDGFF,
> you’d also want to document that it should go after
> FlattenGraphFilterFactory, but trust any feedback on a Lucene JIRA over my
> suspicion...
>
> Best,
> Erick
>
> > On Dec 29, 2019, at 7:57 PM, Matt Davis  wrote:
> >
> > That is a clever idea.  I would still prefer something cleaner but this
> > could work.  Thanks!
> >
> > On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov 
> wrote:
> >
> >> I don't know of any pre-existing thing that does exactly this, but how
> >> about a token filter that counts tokens (or positions maybe), and then
> >> appends some special token encoding the length?
> >>
> >> On Sat, Dec 28, 2019, 9:36 AM Matt Davis 
> wrote:
> >>
> >>> Hello,
> >>>
> >>> I was wondering if it is possible to search for the number of tokens
> in a
> >>> text field.  For example find book titles with 3 or more words.  I
> don't
> >>> mind adding a field that is the number of tokens to the search index
> but
> >> I
> >>> would like to avoid analyzing the text two times.   Can Lucene search
> for
> >>> the number of tokens in a text field?  Or can I get the number of
> tokens
> >>> after analysis and add it to the Lucene document before/during
> indexing?
> >>> Or do I need to analysis the text myself and add the field to the
> >> document
> >>> (analyze the text twice, once myself, once in the IndexWriter).
> >>>
> >>> Thanks,
> >>> Matt Davis
> >>>
> >>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene Index Cloud Replication

2019-07-09 Thread Michael McCandless
+1 to share code for doing 1) and 3) both of which are tricky!

Safely moving / copying bytes around is a notoriously difficult problem ...
but Lucene's "end to end checksums" and per-segment-file-GUID make this
safer.

I think Lucene's replicator module is a good place for this?

Mike McCandless

http://blog.mikemccandless.com


On Wed, Jul 3, 2019 at 4:15 PM Michael Froh  wrote:

> Hi there,
>
> I was talking with Varun at Berlin Buzzwords a couple of weeks ago about
> storing and retrieving Lucene indexes in S3, and realized that "uploading a
> Lucene directory to the cloud and downloading it on other machines" is a
> pretty common problem and one that's surprisingly easy to do poorly. In my
> current job, I'm on my third team that needed to do this.
>
> In my experience, there are three main pieces that need to be implemented:
>
> 1. Uploading/downloading individual files (i.e. the blob store), which can
> be eventually consistent if you write once.
> 2. Describing the metadata for a specific commit point (basically what the
> Replicator module does with the "Revision" class). In particular, we want a
> downloader to reliably be able to know if they already have specific files
> (and don't need to download them again).
> 3. Sharing metadata with some degree of consistency, so that multiple
> writers don't clobber each other's metadata, and so readers can discover
> the metadata for the latest commit/revision and trust that they'll
> (eventually) be able to download the relevant files.
>
> I'd like to share what I've got for 1 and 3, based on S3 and DynamoDB, but
> I'd like to do it with  interfaces that lend themselves to other
> implementations for blob and metadata storage.
>
> Is it worth opening a Jira issue for this? Is this something that would
> benefit the Lucene community?
>
> Thanks,
> Michael Froh
>
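
A minimal sketch of the kind of interfaces points 1 and 3 above describe; all
names here are illustrative only, not an existing Lucene API:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Map;

    /** Write-once blob storage (e.g. S3); eventual consistency is acceptable. */
    interface BlobStore {
      void upload(String name, InputStream data, long length) throws IOException;
      InputStream download(String name) throws IOException;
    }

    /** Consistent metadata store (e.g. DynamoDB) naming the files of a commit. */
    interface CommitMetadataStore {
      /** Publish the file list and checksums for a revision; fail if a newer one exists. */
      void publish(long generation, Map<String, Long> fileChecksums) throws IOException;
      /** Discover the latest published revision so readers know what to download. */
      Map<String, Long> latest() throws IOException;
    }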


Re: find documents with big stored fields

2019-07-01 Thread Michael McCandless
Hi Rob,

The codec records per docid how many bytes each document consumes -- maybe
instrument the codec's sources locally, then open your index and have it
visit stored fields for every doc in the index and gather stats?

Or, to avoid touching Lucene level code, you could make a small tool that
loads stored fields for each doc and gathers stats on total string length and
stored field count of all fields in the doc?
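
A minimal sketch of such a tool; the index path and the reporting threshold
are just example values, and binary stored fields are ignored here:

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexableField;
    import org.apache.lucene.store.FSDirectory;

    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      for (int docId = 0; docId < reader.maxDoc(); docId++) {
        Document doc = reader.document(docId);  // loads all stored fields for this doc
        long chars = 0;
        int fieldCount = 0;
        for (IndexableField f : doc.getFields()) {
          fieldCount++;
          if (f.stringValue() != null) {
            chars += f.stringValue().length();
          }
        }
        if (chars > 1_000_000) {  // report suspiciously large documents
          System.out.println("doc=" + docId + " storedFields=" + fieldCount + " chars=" + chars);
        }
      }
    }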

Mike McCandless

http://blog.mikemccandless.com


On Mon, Jul 1, 2019 at 5:24 AM Rob Audenaerde 
wrote:

> Hello,
>
> We are currently trying to investigate an issue where in the index-size is
> disproportionally large for the number of documents. We see that the .fdt
> file is more than 10 times the regular size.
>
> Reading the docs, I found that this file contains the fielddata.
>
> I would like to find the documents and/or field names/contents with extreme
> sizes, so we can delete those from the index without needing to re-index
> all data.
>
> What would be the best approach for this?
>
> Thanks,
> Rob Audenaerde
>


Re: ArrayIndexOutOfBoundsException during System.arraycopy in BKDWriter

2019-05-03 Thread Michael McCandless
Note that  the -Xint flag will make your code run tremendously more
slowly!  Likely to the point of not really being usable.  But it'd be
interesting to see if that side-steps the bug.

Is it possible to test with OpenJDK as well?

The BKDWriter code is quite complex, so it is also possible there is a
Lucene bug at work.

Can you open an issue in Lucene's jira and we can iterate there?

Thanks,

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 1, 2019 at 9:34 AM Torben Riis  wrote:

> Hi,
>
>
>
> I’m a bit stuck here and needs a clue or two in order to continue our
> investigations. Hope that someone can help. :)
>
>
>
> Periodically, around once a month, we get the below
> ArrayIndexOutOfBoundsException on our system. We use multiple indexes and
> the error can originate from any of them, but the error always occurs in
> line 1217 in BKDWriter (during a System.arraycopy).
>
>
>
> We found a couple of issues on the net regarding JIT optimization problem
> related to J9, but they all looks like they have been resolved and cannot
> be reproduced anymore. But nevertheless, we have just added the -Xint flag
> (disable JIT compiler) in order to see whether this has any impact.
> Unfortunately we do not have the result of this yet, but I’ll of course
> post it when it is known.
>
>
>
> Are there any of you clever guys out there, that has some good ideas
> further investigations? Or have seen such issue before?
>
>
>
> We are using Lucene 6.6.0 and runs on IBM J9 on the IBM I platform.
>
>
>
>
>
> *Java version*
>
> java version "1.8.0_191"
>
> Java(TM) SE Runtime Environment (build 8.0.5.25 -
> pap6480sr5fp25-20181030_01(SR5 FP25))
>
> IBM J9 VM (build 2.9, JRE 1.8.0 OS/400 ppc64-64-Bit Compressed References
> 20181029_400846 (JIT enabled, AOT enabled)
>
> OpenJ9   - c5c78da
>
> OMR  - 3d5ac33
>
> IBM  - 8c1bdc2)
>
> JCL - 20181022_01 based on Oracle jdk8u191-b26
>
> NOTICE: If no version information is found above, this could indicate a
> corrupted Java installation!
>
> Java detected was: /QOpenSys/QIBM/ProdData/JavaVM/jdk80/64bit/bin/java
>
> -Dmultiarchive.basepath=/home/NEXTOWN/Multi-Support/Next -Xms128m -Xmx2048m
>
>
>
>
>
> *Stacktrace*
>
> Exception in thread "Lucene Merge Thread #0" 2019-05-01T06:10:07.970 CEST
> [Lucene Merge Thread #0]
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.ArrayIndexOutOfBoundsException
>
> at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
>
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
>
> Caused by: 2019-05-01T06:10:07.971 CEST [Lucene Merge Thread #0]
> java.lang.ArrayIndexOutOfBoundsException
>
> at
> org.apache.lucene.util.bkd.BKDWriter.recursePackIndex(BKDWriter.java:1217)
>
> at
> org.apache.lucene.util.bkd.BKDWriter.recursePackIndex(BKDWriter.java:1197)
>
> at
> org.apache.lucene.util.bkd.BKDWriter.packIndex(BKDWriter.java:1078)
>
> at
> org.apache.lucene.util.bkd.BKDWriter.writeIndex(BKDWriter.java:1245)
>
> at
> org.apache.lucene.util.bkd.BKDWriter.access$600(BKDWriter.java:82)
>
> at
> org.apache.lucene.util.bkd.BKDWriter$OneDimensionBKDWriter.finish(BKDWriter.java:648)
>
> at
> org.apache.lucene.util.bkd.BKDWriter.merge(BKDWriter.java:560)
>
> at
> org.apache.lucene.codecs.lucene60.Lucene60PointsWriter.merge(Lucene60PointsWriter.java:212)
>
> at
> org.apache.lucene.index.SegmentMerger.mergePoints(SegmentMerger.java:173)
>
> at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:122)
>
> at
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356)
>
> at
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931)
>
> at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
>
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
>
> Exception in thread "Lucene Merge Thread #0" 2019-05-01T06:10:08.075 CEST
> [Lucene Merge Thread #0]
> org.apache.lucene.index.MergePolicy$MergeException:
> java.lang.ArrayIndexOutOfBoundsException
>
> at
> org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:703)
>
> at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:683)
>
> Caused by: 2019-05-01T06:10:08.076 CEST [Lucene Merge Thread #0]
> java.lang.ArrayIndexOutOfBoundsException
>
> at
> org.apache.lucene.util.bkd.BKDWriter.recursePackIndex(BKDWriter.java:1217)
>
> at
> org.apache.lucene.util.bkd.BKDWriter.recursePackIndex(BKDWriter.java:1197)
>
> at
> org.apache.lucene.util.bkd.BKDWriter.packIndex(BKDWriter.java:1078)
>
> at
> org.apache.lucene.

Re: Ask about Lucene/Core/Index DocumentsWriter

2019-03-19 Thread Michael McCandless
Can you try increasing your IndexWriter.setRAMBufferSizeMB?  That flush
control logic will block incoming threads if the number of bytes trying to
flush to disk is too large relative to your RAM buffer.
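
A minimal sketch of the setting; the 1024 MB figure is only an example to
illustrate the knob, not a recommendation:

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setRAMBufferSizeMB(1024);              // default is 16 MB
    IndexWriter writer = new IndexWriter(dir, iwc);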

Mike McCandless

http://blog.mikemccandless.com


On Mon, Mar 18, 2019 at 2:30 PM yuncheng lu  wrote:

> When I check the lucene/core DocumentsWriter.preUpdate code, I see that
> flushControl is used when a thread is stalled.
> When we write a lot of documents to disk (an SSD), we monitored that all
> threads are in flush while requests keep calling addDocument, which goes
> through the preUpdate check and puts more flush requests into the IO
> queue. In this situation all requests become blocked, yet disk utilization
> is not at full capacity. So maybe this optimization could take more
> conditions into consideration?
>
>
> Thanks & Best Regards
> Yuncheng Lu
>


Re: FlattenGraphFilter assertion error

2019-03-12 Thread Michael McCandless
Hello Nicolás,

Can you please open an issue for this?  Thanks.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Mar 7, 2019 at 10:08 PM Nicolás Lichtmaier 
wrote:

> Oops, sorry... in that code there's a "camelCase" parameter that is not
> implemented in normal Lucene. That is an option I've added for better camel
> case support, but the bug happens without that option as well.
> On 7/3/19 at 20:33, Nicolás Lichtmaier wrote:
>
> After a lot of time... Here's an small example that triggers that
> assertion.
>
> Builder builder = CustomAnalyzer.builder();
>
> builder.withTokenizer(StandardTokenizerFactory.class);
> builder.addTokenFilter(WordDelimiterGraphFilterFactory.class,
> "camelCase", "1", "preserveOriginal", "1");
> builder.addTokenFilter(StopFilterFactory.class);
>
> builder.addTokenFilter(FlattenGraphFilterFactory.class);
> Analyzer analyzer = builder.build();
>
> TokenStream ts = analyzer.tokenStream("*", new
> StringReader("x7in"));
> ts.reset();
> while(ts.incrementToken())
> ;
>
> This gives:
>
> Exception in thread "main" java.lang.AssertionError: 2
> at
> org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195)
> at
> org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:258)
> at com.wolfram.textsearch.AnalyzerError.main(AnalyzerError.java:32)
>
> It's the interaction between WordDelimiterGraphFilter and stop word
> removal, it seems, that trigger an assertion when flattening.
>
>
> On 12/10/17 at 19:18, Michael McCandless wrote:
>
> Hmm, that's not good!  Clearly there is a bug somewhere.
>
> Are you able to isolate a small example, e.g. text input and synonyms you
> fed to SynonymGraphFilter, to show this assertion trip?
>
> Are you using any custom analysis components before the FlattenGraphFilter?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Oct 10, 2017 at 11:24 AM, Nicolás Lichtmaier  > wrote:
>
>> Hi!
>>
>> I was getting an exception in FlattenGraphFilter and, as I saw there was
>> assertion statements nearby, I reran everything with assertions enabled.
>> And I see it crashes here (FlattenGraphFilter.java:174)
>>
>>
>> At this point inputNode has all fields with -1 (except nextOut, which is
>> 0).. and outputFrom's value is 395.
>>
>> The code is pretty complex, so before trying to undestand it I thought
>> maybe someone could know what's happening just seeing this, maybe not. =)
>>
>> I'll keep the debugging session open for a while in case some more
>> variables could be useful to debug this.
>>
>> Thanks!
>>
>>
>>
>


Re: IndexWriter concurrent flushing

2019-02-17 Thread Michael McCandless
+1 to make it simple to let multiple threads help with commit/refresh
operations.

IW.yield is a simple way to achieve it, matching (roughly) how IW's
commit/refresh work today, hijacking incoming indexing threads to gain
concurrency.  I think this would be a small change?

Adding an ExecutorService to e.g. IndexWriterConfig, so all ops (commit,
refresh, eventually also merging which today still spawns its own threads)
could be concurrent when possible would be a nice longer term solution but
I suspect that's a much more invasive change than the simple IW.yield.

Progress not perfection :)

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 15, 2019 at 4:11 PM Michael Sokolov  wrote:

> I noticed that commit() was taking an inordinately long time. It turned out
> IndexWriter was flushing using only a single thread because it relies on
> its caller to supply it with threads (via updateDocument, deleteDocument,
> etc), which it then "hijacks" to do flushing. If (as we do) a caller
> indexes a lot of documents and then calls commit at the end of a large
> batch, when no indexing is ongoing, the commit() takes much longer than
> needed since it is unable to make use of multiple cores to do concurrent
> I/O.
>
> How can we support this batch-mode use case better? I think we should -
> it's not an unreasonable thing to do, since it can lead to the shortest
> overall indexing time if you have sufficient RAM and don't need to search
> until you're done indexing. I tried adding an IndexWriter.yield() method
> that just flushes pending segments and does other queued work; the caller
> can invoke this in order to provide resources. A more convenient API would
> be to grant IndexWriter an ExecutorService of its own, but this is more
> involved since it would ne necessary to arbitrate where the work should be
> done. Maybe a middle ground would be to offer a commit(ExecutorService)
> method. Any other ideas? Any interest in a patch for IndexWriter.yield()?
>
> -Mike
>


Re: prorated early termination

2019-02-03 Thread Michael McCandless
On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov  wrote:

 > > In single-threaded mode we can check against minCompetitiveScore and
> terminate collection for each segment appropriately,
>
> > Does Lucene do this today by default?  That should be a nice
> optimization,
> and it'd be safe/correct.
>
> Yes, it does that today (in TopFieldCollector -- see
>
> https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> )
>

Ahh -- great, thanks for finding that.


> Re: our high cost of collection in static ranking phase -- that is true,
> Mike, but I do also see a nice improvement on the luceneutil benchmark
> (modified to have a sorted index and collect concurrently) using just a
> vanilla TopFieldCollector. I looked at some profiler output, and it just
> seems to be showing more time spent walking postings.
>

Yeah, understood -- I think pro-rating the N collected per segment makes a
lot of sense.

Mike McCandless

http://blog.mikemccandless.com


Re: prorated early termination

2019-02-03 Thread Michael McCandless
I think this is because our per-hit cost is sometimes very high -- we have
"post filters" that are sometimes very restrictive.  We are working to get
those post-filters out into an inverted index to make them more efficient,
but net/net reducing how many hits we must collect for each segment can
help latencies and throughput.

Mike McCandless

http://blog.mikemccandless.com

On Fri, Feb 1, 2019 at 11:42 AM Michael Sokolov  wrote:

> That's a good question. I don't have a very satisfying answer other than to
> say we saw some improvements, and I would have to dig more now to say why.
> It may be that in our system we have some additional per-document costs in
> a custom collector that were saved by this reduction, and that's why we saw
> a latency reduction. However I also note these luceneutil benchmarks
> results:
>
> ** This is with -topN=500, and a searcher threadpool=8
> Report after iter 19:
> TaskQPS baseline  StdDevQPS candidate
> StdDevPct diff
>HighTermDayOfYearSort  391.23  (2.4%)  627.92  (4.8%)
> 60.5% (  52% -   69%)
>
> ** This is with -topN=500, and no searcher threadpool
> Report after iter 19:
> TaskQPS baseline  StdDevQPS candidate
> StdDevPct diff
>HighTermDayOfYearSort  239.52  (3.3%)  232.70  (3.0%)
> -2.8% (  -8% -3%)
>
> which show QPS improvement from using threads even in the baseline case
> (and then an additional improvement from prorating).
>
> On Fri, Feb 1, 2019 at 11:28 AM Adrien Grand  wrote:
>
> > Something makes me curious: queries that can leverage sorted indices
> > should be _very_ fast, for instance in your case they only need to
> > look at 500 documents per segment at most (less in practice since we
> > stop collecting as soon as a non-competitive hit is found), so why do
> > you need to parallelize query execution?
> >
> > On Fri, Feb 1, 2019 at 3:18 PM Michael Sokolov 
> wrote:
> > >
> > > I want to propose an optimization to early termination that gives nice
> > > speedups for large result sets when searching with multiple threads at
> > the
> > > cost of a small (controllable) probability of collecting documents out
> of
> > > order: in benchmarks I see +60-70% QPS for tasks like
> > HighTermDayOfYearSort
> > > when topN=500, using 8 threads to search, and in our production system
> > this
> > > optimization cut the latency of our slowest queries substantially.
> > >
> > > In a multi-phase ranking scenario a typical pattern is to retrieve a
> > > largish number of matches in a first pass using indexed sort, followed
> > by a
> > > second pass that re-ranks and selects a smaller top K, using a more
> > > expensive ranking. N is chosen to provide sufficient probabililty of
> > > finding the desired top K across the whole index, given that the index
> > sort
> > > is some approximation to the desired sort. When ranking by indexed
> sort,
> > as
> > > in TopFieldCollector, we can now early-terminate when a sufficient
> number
> > > of matches have been found so that we only need retrieve N documents
> from
> > > each segment. In single-threaded mode we can check against
> > > minCompetitiveScore and terminate collection for each segment
> > > appropriately, but when using multiple threads to search concurrently
> > there
> > > is no such coordination and we end up collecting N documents *per
> > segment*,
> > > which are then merged down to N.
> > >
> > > We do not need to collect so many documents though. For any given
> > segment,
> > > let p=(leaf.maxDoc/topLevel.maxDoc) be the proportion of documents in
> > that
> > > segment. Assuming that documents are distributed randomly among
> segments,
> > > we can expect that on average we will find p*N of the top N documents
> in
> > > the given segment. If we only collect p*N documents, we will sometimes
> > miss
> > > some documents that we should have collected, collecting some
> > > less-competitive documents from one segment, while not collecting all
> the
> > > competitive documents from another. But how many should we collect in
> > order
> > > to make this occur only very rarely?
> > >
> > > The worst case is that all top N documents occur in a single segment.
> For
> > > even small values of N and small numbers of segments S, this
> probability
> > is
> > > vanishingly small (N=10, S=10) -> 10^(1-N) = 1/10^9. More generally,
> this
> > > distribution of documents among segments is a multinomial distribution,
> > and
> > > the variance of the number of documents in a single segment is that of
> a
> > > binomial distribution. The binomial variance in this case
> (p=probability
> > of
> > > document in the segment, N number of documents) is p*(1-p)*N; we can
> use
> > > this to compute the number of documents to collect per leaf in order to
> > > bound the probability of a ranking error. I'm using a cutoff of 3
> > standard
> > > deviations, i.e.  collecting p*N + 3*(p*(1-p)*N)^1/2 documents f

Re: prorated early termination

2019-02-03 Thread Michael McCandless
One question about this:

> In single-threaded mode we can check against minCompetitiveScore and
terminate collection for each segment appropriately,

Does Lucene do this today by default?  That should be a nice optimization,
and it'd be safe/correct.

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 1, 2019 at 9:18 AM Michael Sokolov  wrote:

> I want to propose an optimization to early termination that gives nice
> speedups for large result sets when searching with multiple threads at the
> cost of a small (controllable) probability of collecting documents out of
> order: in benchmarks I see +60-70% QPS for tasks like HighTermDayOfYearSort
> when topN=500, using 8 threads to search, and in our production system this
> optimization cut the latency of our slowest queries substantially.
>
> In a multi-phase ranking scenario a typical pattern is to retrieve a
> largish number of matches in a first pass using indexed sort, followed by a
> second pass that re-ranks and selects a smaller top K, using a more
> expensive ranking. N is chosen to provide sufficient probabililty of
> finding the desired top K across the whole index, given that the index sort
> is some approximation to the desired sort. When ranking by indexed sort, as
> in TopFieldCollector, we can now early-terminate when a sufficient number
> of matches have been found so that we only need retrieve N documents from
> each segment. In single-threaded mode we can check against
> minCompetitiveScore and terminate collection for each segment
> appropriately, but when using multiple threads to search concurrently there
> is no such coordination and we end up collecting N documents *per segment*,
> which are then merged down to N.
>
> We do not need to collect so many documents though. For any given segment,
> let p=(leaf.maxDoc/topLevel.maxDoc) be the proportion of documents in that
> segment. Assuming that documents are distributed randomly among segments,
> we can expect that on average we will find p*N of the top N documents in
> the given segment. If we only collect p*N documents, we will sometimes miss
> some documents that we should have collected, collecting some
> less-competitive documents from one segment, while not collecting all the
> competitive documents from another. But how many should we collect in order
> to make this occur only very rarely?
>
> The worst case is that all top N documents occur in a single segment. For
> even small values of N and small numbers of segments S, this probability is
> vanishingly small (N=10, S=10) -> 10^(1-N) = 1/10^9. More generally, this
> distribution of documents among segments is a multinomial distribution, and
> the variance of the number of documents in a single segment is that of a
> binomial distribution. The binomial variance in this case (p=probability of
> document in the segment, N number of documents) is p*(1-p)*N; we can use
> this to compute the number of documents to collect per leaf in order to
> bound the probability of a ranking error. I'm using a cutoff of 3 standard
> deviations, i.e.  collecting p*N + 3*(p*(1-p)*N)^1/2 documents for each
> segment. For N=500, p=0.2, we can collect 67 documents instead of 500 at
> the cost of an error that occurs < 3/1000.
>
> Also note that the kind of errors we make are typically benign. In most
> cases we will return the correct top N-1 documents, but instead of
> returning the Nth-ranked document in position N, we return the N+1st.
>
> Implementing this in Lucene requires a small patch to TopFieldCollector to
> introduce a leafHitsThreshold comparable to the existing
> totalHitsThreshold. Given the possibility of error, it might be good to
> have a way to disable this, but my inclination would be to enable it
> whenever approximate counts are enabled (ie by default), and disable when
> totalHitsThreshold is MAX_VALUE.
>
> What do you think? Shall I open an issue?
>
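
A minimal sketch of the per-leaf collection budget described in the proposal
above, using the 3-standard-deviation bound on the binomial variance:

    // p = leafMaxDoc / totalMaxDoc, topN = requested number of hits
    static int leafHitsBudget(int leafMaxDoc, int totalMaxDoc, int topN) {
      double p = (double) leafMaxDoc / totalMaxDoc;
      double expected = p * topN;
      double stdDev = Math.sqrt(p * (1 - p) * topN);
      return (int) Math.ceil(expected + 3 * stdDev);
    }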


Re: RamUsageCrawler

2018-12-06 Thread Michael McCandless
I think you mean RamUsageEstimator (in Lucene's test-framework)?

It's entirely possible it fails to dig into Maps correctly with newer Java
releases; maybe Dawid or Uwe would know?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Dec 4, 2018 at 12:18 PM Michael Sokolov  wrote:

> Hi, I'm using RamUsageCrawler to size some things, and I find it seems
> to underestimate the size of Map (eg HashMap and ConcurrentHashMap).
> This is using a Java 10 runtime, with code compiled to Java 8. I
> looked at the implementation and it seems as if for JRE classes, when
> JRE >= 9, we can no longer use reflection to size them accurately?
> Instead the implementation estimates the map size by treating it as an
> array of (keys and value) plus some constant header size. But this
> seems to neglect the size of the HashMap$Node (in the case of HashMap
> - I haven't looked at ConcurrentHashMap or TreeMap or anything). In my
> case, I have a great many maps of a relatively small number of shared
> keys and values, so the crawler seems to be wildly under-counting. I'm
> comparing to sizes gleaned from heap dumps, eclipse mat, and OOM
> events.
>
> I wonder if we can we improve on the estimates for Maps and other
> Collections?
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Race condition between IndexWriter.commit and IndexWriter.close

2018-12-05 Thread Michael McCandless
Yeah I agree we should fix the javadocs to say that you should not try to
close and commit concurrently from different threads.

Wanna open a quick PR or issue with a patch?

Thanks,

Mike

On Wed, Dec 5, 2018, 4:06 AM Boris Petrov  wrote:
> So you're saying that this race-condition is OK? Nowhere in the
> documentation does it say that these two calls should be synchronized...
> at least that must be fixed. :)
>
> On 12/1/18 6:25 PM, Michael McCandless wrote:
> > I think if you call commit and close concurrently the results are
> undefined
> > and so this is acceptable.
> >
> > Mike
> >
> > On Thu, Nov 29, 2018 at 5:53 AM Boris Petrov 
> > wrote:
> >
> >> Hi all,
> >>
> >> We're getting the following exception:
> >>
> >> java.lang.IllegalStateException: cannot close: prepareCommit was already
> >> called with no corresponding call to commit
> >> at
> org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1025)
> >> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1078)
> >> ...
> >>
> >> We are only calling "commit" on IndexWriter. By reading the code I can
> >> see that if you call IndexWriter.commit in parallel to
> >> IndexWriter.close, it is possible to get this exception. More
> >> specifically, after setting "IndexWriter.pendingCommit" on line 4779
> >> (this is using Lucene 7.5.0) and before setting it to "null" on line
> >> 4793 this problem could happen.
> >>
> >> Is this by design or is it a bug?
> >>
> >> Thanks,
> >>
> >> Boris Petrov
> >>
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >> --
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
>


Re: Race condition between IndexWriter.commit and IndexWriter.close

2018-12-01 Thread Michael McCandless
I think if you call commit and close concurrently the results are undefined
and so this is acceptable.

Mike

On Thu, Nov 29, 2018 at 5:53 AM Boris Petrov 
wrote:

> Hi all,
>
> We're getting the following exception:
>
> java.lang.IllegalStateException: cannot close: prepareCommit was already
> called with no corresponding call to commit
> at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:1025)
> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1078)
> ...
>
> We are only calling "commit" on IndexWriter. By reading the code I can
> see that if you call IndexWriter.commit in parallel to
> IndexWriter.close, it is possible to get this exception. More
> specifically, after setting "IndexWriter.pendingCommit" on line 4779
> (this is using Lucene 7.5.0) and before setting it to "null" on line
> 4793 this problem could happen.
>
> Is this by design or is it a bug?
>
> Thanks,
>
> Boris Petrov
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
Mike McCandless

http://blog.mikemccandless.com


Re: SearcherManager not seeing changes in IndexWriteral and

2018-11-12 Thread Michael McCandless
Thanks for bringing closure, Boris.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Nov 12, 2018 at 7:13 AM Boris Petrov  wrote:

> Hello,
>
> OK, so actually this appears to be a bug in our code - Lucene is searching
> correctly, we were doing something wrong with the result after that. Sorry
> all for the confusion! Thanks for the support!
> 
> From: Erick Erickson 
> Sent: Friday, November 9, 2018 6:16 PM
> To: java-user
> Subject: Re: SearcherManager not seeing changes in IndexWriteral and
>
> You might be able to do this with a couple of threads in a single
> program, but certainly up to you.
>
> Best,
> Erick
> On Fri, Nov 9, 2018 at 7:47 AM Boris Petrov  wrote:
> >
> > Well, while debugging I put a bunch of println's which showed the
> > expected order. And, besides, I've written the code, I know that writing
> > to the index happens way before searching in it - the test makes sure of
> > that.
> >
> > If you think there indeed might be some problem, I'll try to reproduce
> > it in a small example. But I will try only the obvious (which perhaps
> > you've tried a million times and have tests for) - I'll have one thread
> > write to the index, another (which starts after the first) search in it
> > and I'll create a bash script that runs the program until it fails (what
> > I currently do with our test). I'll do this beginning of next week.
> >
> > Thank you for the support!
> >
> > On 11/9/18 5:37 PM, Erick Erickson wrote:
> > > If it's hard to do in a single thread, how about timestamping the
> > > events to insure that they happen in the expected order?
> > >
> > > That would verify the sequencing is happening as you expect and
> > > _still_ not see the expected docs...
> > >
> > > Assuming it does fail, do you think you could reduce it to a
> > > multi-threaded test case?
> > >
> > > Best,
> > > Erick
> > > On Fri, Nov 9, 2018 at 7:03 AM Boris Petrov 
> wrote:
> > >> Well, it's a bit involved to try it in a single thread as I've
> > >> oversimplified the example. But as far as I understand this should
> work,
> > >> right? So something else is wrong? Committing the writer and then
> > >> "maybeRefreshBlocking" should be enough to have the changes visible,
> yes?
> > >>
> > >> On 11/9/18 4:45 PM, Michael Sokolov wrote:
> > >>> That should work, I think, but if you are serializing these threads
> so
> > >>> that they cannot run concurrently, maybe try running both operations
> > >>> in a single thread, at least as a test.
> > >>> On Fri, Nov 9, 2018 at 9:16 AM Boris Petrov 
> wrote:
> >  If you mean the synchronization of the threads, it is not in the
> >  example, but Thread 2 is *started* after Thread 1 finished
> executing the
> >  code that I gave as an example. So there is happens-before between
> them.
> >  If you mean synchronization on the Lucene level - isn't that what
> >  "maybeRefreshBlocking" should do?
> > 
> >  On 11/9/18 3:29 PM, Michael Sokolov wrote:
> > > I'm not seeing anything there that would synchronize, or
> serialize, the
> > > read after the write and commit. Did you expect that for some
> reason?
> > >
> > > On Fri, Nov 9, 2018, 6:00 AM Boris Petrov  wrote:
> > >
> > >> Hi all,
> > >>
> > >> I'm using Lucene version 7.5.0. We have a test that does
> something like:
> > >>
> > >> Thread 1:
> > >>
> > >> Field idStringField = new StringField("id", id,
> > >> Field.Store.YES);
> > >> Field contentsField = new TextField("contents",
> reader);
> > >> Document document = new Document();
> > >> document.add(idStringField);
> > >> document.add(contentsField);
> > >>
> > >> writer.updateDocument(new Term(ID_FIELD, id),
> document);
> > >> writer.flush(); // not sure this flush is needed?
> > >> writer.commit();
> > >>
> > >> Thread 2:
> > >>
> > >> searchManager.maybeRefreshBlocking();
> > >> IndexSearcher searcher = searchManager.acquire();
> > >> try {
> > >> QueryParser parser = new QueryParser("contents",
> analyzer);
> > >> Query luceneQuery = parser.parse(queryText);
> > >> ScoreDoc[] hits = searcher.search(luceneQuery,
> > >> 50).scoreDocs;
> > >> } finally {
> > >> searchManager.release(searcher);
> > >> }
> > >>
> > >> Thread 1 happens before Thread 2.
> > >>
> > >> Sometimes, only sometimes, the commit from thread 1 is not
> *immediately*
> > >> visible in Thread 2. If I put a "Thread.sleep(1000)" it always
> works.
> > >> Without it, sometimes the search is empty. I'm not sure if I'm
> doing
> > >> something wrong or this is a bug?
> > >>
> > >> Thanks!
> > >>
> > >>
> > >> --

Re: MultiPhraseQuery or PhraseQuery to take the synonyms into account?

2018-09-22 Thread Michael McCandless
PhraseQuery can indeed be used to represent a multi-token synonym.

In fact, I mis-spoke before: MultiPhraseQuery can also represent a
multi-token synonym when the multiple tokens are all the same except in one
spot.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Sep 20, 2018 at 2:32 PM baris.kazar  wrote:

> i should have asked this way as Mike made clear for MultiPhraseQuery:
> is PhraseQuery ok to account for synonyms?
> Best
>
> > On Sep 20, 2018, at 2:02 PM, baris.ka...@oracle.com wrote:
> >
> > Hi,-
> >
> >  should i use MultiPhraseQuery or PhraseQuery to take synonyms into
> account?
> >
> > Best regards
> >
> > baris
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Question About FST, multiple-column index

2018-09-22 Thread Michael McCandless
You might want to index the name field normally (as StringField, for
example), then index the age as a NumericDocValuesField, and then make a
BooleanQuery with two required clauses, one clause TermQuery on the name,
the other a NumericDocValuesField.newSlowExactQuery.  Even though its name
is "slow", it can be very fast for cases like what you are doing, where you
expect very few matches by name, and many many matches with exactly a
specific age.
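
A minimal sketch of that combination; the field names and values are only
examples, and "writer" is assumed to be an open IndexWriter:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericDocValuesField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Index time:
    Document doc = new Document();
    doc.add(new StringField("user", "xxx", Field.Store.YES));
    doc.add(new NumericDocValuesField("age", 30));
    writer.addDocument(doc);

    // Query time: both clauses are required
    Query q = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("user", "xxx")), BooleanClause.Occur.MUST)
        .add(NumericDocValuesField.newSlowExactQuery("age", 30), BooleanClause.Occur.MUST)
        .build();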

This is assuming you want precise (including case) matching on the name; if
you do not, then index the name as TextField, and analyzing the search
terms at query time using a query parser.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Sep 20, 2018 at 10:57 AM ly铖 <52048...@qq.com> wrote:

> Hi,
>
>
> When I use Lucene as a full-text search engine, I have a question about
> multi-field indexes. For example, we have two fields: user and age. We always
> want to search for one user whose name is "xxx" and who has exactly a given
> age, so we add both fields to Lucene (maybe there are better ways; I just
> want to explain my question). In this case the user result set is small,
> while the age result set is much larger, even if Lucene uses the leading
> query to reduce the result bitsets. I wonder whether there is any combined
> index structure like the multiple-column indexes in MySQL? Is there any
> solution that extends the FST so that a FINAL state connects to another FST?
>
>
> THANKS


Re: MultiPhraseQuery

2018-09-18 Thread Michael McCandless
Yes, +1 for a patch to improve the docs!

MultiPhraseQuery only works for single term synonyms, and is usually
produced by query parsers when the incoming query text had single term
synonyms matching, I think?  The query parser will use other (span?)
queries for multi token synonyms.

I think the example in the javadoc should be simplified to not use "app*",
e.g. maybe just matching "Microsoft Excel|Word"?
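
A minimal sketch of that simpler example, matching the phrase "microsoft
excel" or "microsoft word" (assuming lowercase analysis on a "body" field):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.MultiPhraseQuery;

    MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
    builder.add(new Term("body", "microsoft"));
    builder.add(new Term[] { new Term("body", "excel"), new Term("body", "word") });
    MultiPhraseQuery query = builder.build();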

Mike McCandless

http://blog.mikemccandless.com


On Wed, Sep 19, 2018 at 5:59 AM Erick Erickson 
wrote:

> bq. i wish the Javadocs has examples like PhraseQuery Javadocs gave.
>
> This is where someone coming into the examples for the first time is
> invaluable, javadoc patches are most welcome! It can be hard to back
> off enough to remember what the confusing bits are when you wrote the
> code ;)
> On Tue, Sep 18, 2018 at 1:56 PM  wrote:
> >
> > Any suggestions please?
> > Two main questions:
> > - how do synonyms get utilized by MultiPhraseQuery?
> > - how do we get second token "app" applied to the example on
> > MultiPhraseQuery javadocs page? (and how do we get Terms[] array from
> > Terms object?)
> >
> > Now three questions :)
> >
> > i wish the Javadocs has examples like PhraseQuery Javadocs gave.
> >
> > Best
> >
> > On 9/18/18 4:45 PM, baris.ka...@oracle.com wrote:
> > > Trying to implement the example on
> > >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_core_6-5F6-5F1_core_org_apache_lucene_search_MultiPhraseQuery.html&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4&m=7WmT3NC9wzVk4FPBupACoALoL4kho6V7-c2o4Kac5QM&s=gM6_4hvpLEZY1_7r-CEInZbUb-ublYDcJOQ8rmeAgVA&e=
> > >
> > > // A generalized version of PhraseQuery, with the possibility of
> > > adding more than one term at the same position that are treated as a
> > > disjunction (OR). To use this class to search for the phrase
> > > "Microsoft app*" first create a Builder and use
> > >
> > > // MultiPhraseQuery.Builder.add(Term) on the term "microsoft"
> > > (assuming lowercase analysis), then find all terms that have "app" as
> > > prefix using LeafReader.terms(String), seeking to "app" then iterating
> > > and collecting terms until there is no longer that prefix,
> > >
> > > // and finally use MultiPhraseQuery.Builder.add(Term[]) to add them.
> > > MultiPhraseQuery.Builder.build() returns the fully constructed (and
> > > immutable) MultiPhraseQuery.
> > >
> > >
> > > IndexSearcher is = new IndexSearcher(indexReader);
> > >
> > > MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
> > > builder.add(new Term("body", "one"), 0);
> > >
> > > Terms terms = LeafReader.terms("body"); // will this be slow? and how
> > > do we incorporate token/word "app" here?
> > >
> > > // i STILL dont see how to get individual Term objects from terms
> > > object and plus do i need to declare LeafReader object?
> > >
> > > Term[] termArr = new Term[k]; // i will get this filled via using
> > > Terms.iterator
> > > builder.add(termArr);
> > > MultiPhraseQuery mpq = builder.build();
> > > TopDocs hits = is.search(mpq, 20);// 20 hits
> > >
> > >
> > > Best regards
> > >
> > >
> > > On 9/18/18 4:16 PM, baris.ka...@oracle.com wrote:
> > >> Hi,-
> > >>
> > >>  how does MultiPhraseQuery treat synonyms?
> > >>
> > >> is the following possible?
> > >>
> > >> ... (created index with synonyms and indexReader object has the index)
> > >>
> > >> IndexSearcher is = new IndexSearcher(indexReader);
> > >>
> > >> MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
> > >> builder.add(new Term("body", "one"), 0);
> > >> builder.add(new Term("body", "two"), 1);
> > >> MultiPhraseQuery mpq = builder.build();
> > >> TopDocs hits = is.search(mpq, 20);// 20 hits
> > >>
> > >> Best regards
> > >>
> > >>
> > >> -
> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >>
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: SynonymGraphFilter

2018-09-11 Thread Michael McCandless
Try reading the blog post I wrote about token stream graphs?

http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 11, 2018 at 1:35 PM,  wrote:

> Any comments please?
>
> Thanks
>
>
> On 9/10/18 5:07 PM, baris.ka...@oracle.com wrote:
>
>> Any examples on this? i think it would be nice if Javadocs had an example
>> on this:
>>
>> However, if you use this during indexing, you must follow it with
>> FlattenGraphFilter to squash tokens on top of one another like
>> SynonymFilter, because the indexer can't directly consume a graph. To get
>> fully correct positional queries when your synonym replacements are
>> multiple tokens, you should instead apply synonyms using this TokenFilter
>> at query time and translate the resulting graph to a TermAutomatonQuery
>> e.g. using TokenStreamToTermAutomatonQuery.
>>
>> multiple tokens means: a synonym with multiple equivalents??
>>
>> or does it mean a synonym with multiple words?
>>
>> this is not clear to me.
>>
>> Best regards
>>
>>
>> On 9/10/18 3:15 PM, baris.ka...@oracle.com wrote:
>>
>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.
>>> apache.org_core_6-5F4-5F1_analyzers-2Dcommon_org_apache_luce
>>> ne_analysis_synonym_SynonymGraphFilter.html&d=DwICaQ&c=RoP1Y
>>> umCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=nlG5z5NcNdIbQAiX-BK
>>> NeyLlULCbaezrgocEvPhQkl4&m=E2-7wwk3FgEU_ykuPnXNoOe0IIkgxivSa
>>> YV3p-2lGfY&s=guRDJ6HEg5JJkMQqdDVZkKs0gbuI7naZK2TUXFHN9w8&e=
>>>
>>> Does this mean i dont have to repeat it in the search analyzer when i do
>>> this at indexing time?
>>>
>>> Best regards
>>>
>>>
>>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: SynonymMap.Builder.add method

2018-09-11 Thread Michael McCandless
That's correct.

When the input sequence is seen during tokenization, the synonym (graph)
filter will also insert the output tokens into the TokenStream, as if they
"naturally" occurred.

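A minimal sketch of adding one rule and building the map; the words are only
examples, and the boolean in the constructor is the dedup flag discussed
elsewhere in this thread:

    import org.apache.lucene.analysis.synonym.SynonymMap;
    import org.apache.lucene.util.CharsRef;
    import org.apache.lucene.util.CharsRefBuilder;

    SynonymMap.Builder builder = new SynonymMap.Builder(true);  // true = dedup identical rules
    // Multi-word phrases separate words with U+0000; join() builds that for you:
    CharsRef output = SynonymMap.Builder.join(new String[] {"united", "states"}, new CharsRefBuilder());
    builder.add(new CharsRef("usa"), output, true);             // true = keep the original token too
    SynonymMap map = builder.build();                           // throws IOException
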
Mike McCandless

http://blog.mikemccandless.com

On Tue, Sep 11, 2018 at 1:35 PM,  wrote:

> Any comments please?
>
> Thanks
>
>
> On 9/10/18 5:21 PM, baris.ka...@oracle.com wrote:
>
>> i am trying to understand the add method here
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.
>> apache.org_core_6-5F4-5F1_analyzers-2Dcommon_org_apache_luce
>> ne_analysis_synonym_SynonymMap.Builder.html&d=DwICaQ&c=RoP1Y
>> umCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=nlG5z5NcNdIbQAiX-BK
>> NeyLlULCbaezrgocEvPhQkl4&m=gx_fm_PM8HtJte5j2BP3ZjtjQaw3Es0Vj
>> aNaxfHBFIk&s=PU0Pe1g1osmT_cIQepEV5jp4fQ1XTaOuc_EGPIbWPbo&e=
>>
>>
>> public void add(CharsRef input,
>>                 CharsRef output,
>>                 boolean includeOrig)
>> Add a phrase->phrase synonym mapping. Phrases are character sequences
>> where words are separated with character zero (U+0000). Empty words (two
>> U+0000s in a row) are not allowed in the input nor the output!
>> Parameters:
>> input - input phrase
>> output - output phrase
>> includeOrig - true if the original should be included
>>
>>
>> That means if the search string has the input expression, it looks for
>> all output expressions and treats them to be equivalent, right?
>>
>> Best regards
>>
>>
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: SynonymMap

2018-09-10 Thread Michael McCandless
The SynonymMap.Builder constructor takes a dedup parameter to tell it what
to do in that case (when input and output are identical across added rules).

Mike McCandless

http://blog.mikemccandless.com

On Thu, Sep 6, 2018 at 2:06 PM, Baris Kazar  wrote:

> Hi,-
> how does SynonymMap deal with repeated values?
> Best regards
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: offsets

2018-07-29 Thread Michael McCandless
How would a fixup API work?  We would try to provide correctOffset
throughout the full analysis chain?

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov  wrote:

> I've run into some difficulties with offsets in some TokenFilters I've been
> writing, and I wonder if anyone can shed any light. Because characters may
> be inserted or removed by prior filters (eg ICUFoldingFilter does this with
> ellipses), and there is no offset-correcting data structure available to
> TokenFilters (as there is in CharFilter), there doesn't seem to be any
> reliable way to calculate the offset at a point interior to a token, which
> means that essentially the only reasonable thing to do with OffsetAttribute
> is to preserve the offsets from the input. This is means that filters that
> split their tokens (like WordDelimiterGraphFilter) have no reliable way of
> mapping their split tokens' offsets. One can try, but it seems inevitably
> to require making some arbitrary "fixup" stage in order to guarantee that
> the offsets are nondecreasing and properly bounded by the original text
> length.
>
> If this analysis is correct, it seems one should really never call
> OffsetAttribute.setOffset at all? Which makes it seem like a trappy kind of
> method to provide. (hmm now I see this comment in OffsetAttributeImpl
> suggesting making the method call-once). If that really is the case, I
> think some assertion, deprecation, or other API protection would be useful
> so the policy is clear.
>
> Alternatively, do we want to consider providing a "fixup" API as we have
> for CharFilter? OffsetAttribute, eg, could do the fixup if we provide an
> API for setting offset deltas. This would make more precise highlighting
> possible in these cases, at least. I'm not sure what other use cases folks
> have come up with for offsets?
>
> -Mike
>


Re: Deleted documents and NRT Readers

2018-07-20 Thread Michael McCandless
Yeah it is surprising that Lucene applied that one delete when you said it
didn't have to.

Which Lucene version?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 19, 2018 at 5:54 PM, Stuart Goldberg 
wrote:

> Understood. But I would think that in a tiny program where I add one
> document and then update it, that the load is so small that it for sure
> would not have applied the delete.
>
> Why am I wrong in thinking this?
>
>
> On Thu, Jul 19, 2018, 5:50 PM Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Passing applyDeletes=false means Lucene does not have to apply all of its
>> buffered deletes.
>>
>> But, it still may have already applied some deletes, so there's no
>> guarantee that it won't have applied deletes.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Jul 19, 2018 at 3:23 PM, Stuart Goldberg 
>> wrote:
>>
>>> I used NRT readers all the time. I create then with 'applyDeletes' set to
>>> false for performance reasons and take the javadoc at its word that my
>>> code
>>> has to be prepared to deal with deleted documents. I thought I understood
>>> that and I wrote my code to be deleted-document-safe.
>>>
>>> But I have recently revisited the issue and tried to understand what
>>> happens using a little test program. I create a document and add it to
>>> the
>>> index. I then create a new document that mirrors the first one but I
>>> change
>>> the value of a field. Then I call IndexWriter.updateDocument() which is a
>>> delete and an add.
>>>
>>> I then get a NRT reader with applyDeletes set to false and do a
>>> MatchAllDocsQuery search. I would expect to get 2 documents back: the
>>> current one and the updated one. But I only get back the updated one.
>>>
>>> But I know in real code with 1000's of documents flying into the index
>>> that
>>> I have gotten deleted documents returned.
>>>
>>> Can someone explain to me why my small test program doesn't get the
>>> deleted
>>> documents back?
>>>
>>> Stuart M Goldberg
>>>
>>> Senior Vice President of Software Develpment
>>> *FIX Flyer LLC*
>>> http://www.FIXFlyer.com/ <http://www.fixflyer.com/>
>>>
>>>
>>
>>


Re: Deleted documents and NRT Readers

2018-07-19 Thread Michael McCandless
Passing applyDeletes=false means Lucene does not have to apply all of its
buffered deletes.

But, it still may have already applied some deletes, so there's no
guarantee that it won't have applied deletes.
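
A minimal sketch of opening such a reader; the exact open() overload varies a
bit across Lucene versions, this is the 7.x form:

    import org.apache.lucene.index.DirectoryReader;

    DirectoryReader nrtReader = DirectoryReader.open(
        writer,
        false,   // applyAllDeletes: don't force buffered deletes to be applied
        false);  // writeAllDeletes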

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 19, 2018 at 3:23 PM, Stuart Goldberg 
wrote:

> I used NRT readers all the time. I create then with 'applyDeletes' set to
> false for performance reasons and take the javadoc at its word that my code
> has to be prepared to deal with deleted documents. I thought I understood
> that and I wrote my code to be deleted-document-safe.
>
> But I have recently revisited the issue and tried to understand what
> happens using a little test program. I create a document and add it to the
> index. I then create a new document that mirrors the first one but I change
> the value of a field. Then I call IndexWriter.updateDocument() which is a
> delete and an add.
>
> I then get a NRT reader with applyDeletes set to false and do a
> MatchAllDocsQuery search. I would expect to get 2 documents back: the
> current one and the updated one. But I only get back the updated one.
>
> But I know in real code with 1000's of documents flying into the index that
> I have gotten deleted documents returned.
>
> Can someone explain to me why my small test program doesn't get the deleted
> documents back?
>
> Stuart M Goldberg
>
> Senior Vice President of Software Develpment
> *FIX Flyer LLC*
> http://www.FIXFlyer.com/ 
>
>


Re: Lucene Speed

2018-07-18 Thread Michael McCandless
Hi Ehson,

Have you looked at the luceneutil source code that runs the benchmarks?
https://github.com/mikemccand/luceneutil

The sources are not super clean, but that's what's running the nightly
benchmarks, starting from src/main/perf/Indexer.java.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 18, 2018 at 10:11 AM, Ehson Umrani  wrote:

> Hello,
>
> My name is Ehson Umrani and I am currently running some experiments using
> Lucene. For the experiments I am running I need Lucene to run as fast as
> possible. Do you have any suggestions on how to achieve speeds listed on
> the nightly benchmark page. I am also using 1kb Wikipedia files and the
> same 2048 MB RAM buffer but the speeds I am getting are not close to the
> speeds achieved by you guys
> Any help would be awesome!!!
>
> Thanks,
> Ehson Umrani
>


Re: Recreating index lucene without stopping client applications

2018-07-18 Thread Michael McCandless
If you use IndexWriter.deleteAll, and not any of the other delete by Query,
Term methods, it should be quite efficient to delete, as IndexWriter just
drops all segments.

That API is also transactional, so you could call IW.deleteAll, proceed to
reindex all your documents, and if somehow that crashes before finishing,
your index will still reflect the old index with nothing deleted or
updated.  Only once you successfully commit will the new index become
visible to maybeRefresh() calls on a non-NRT reader.
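
A minimal sketch of that wipe-and-reindex cycle; buildDocuments() is only a
placeholder for however the application produces its documents, and "writer"
is the existing IndexWriter:

    writer.deleteAll();                    // cheap: just drops all segments
    for (Document doc : buildDocuments()) {
      writer.addDocument(doc);
    }
    writer.commit();                       // only now does maybeRefresh() see the new index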

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 17, 2018 at 5:15 PM, Michael Sokolov  wrote:

> If you create a completely new index, rather than applying updates to an
> existing index, you will not be able to see that by calling maybeRefresh(),
> I think, since that is looking for updates to an existing index.
> Conceivably you could open a writer on the existing index, delete all of
> its documents, and then write new ones and commit. After that, your refresh
> call would see the updates. But I wouldn't recommend this since it might be
> inefficient to do all those deletions. Instead I would suggest creating a
> new index directory, and having some process that watches for a new
> directory being created. Then when it sees that, it could open a new
> searcher using that directory, and replace your existing searcher. In other
> words, implement the refresh yourself, since you have taken over the
> process of writing new indexes outside of what Lucene manages. Another
> possibility would be to maintain a timestamp on your documents, write all
> your new documents, and then query-and-delete any documents with old
> timestamps.  But the key point here is that you can't just create a new
> index and expect your reader to know about it just because you stuck it in
> the same file system directory where the old one was.
>
> On Wed, Jul 11, 2018 at 11:46 AM Eduardo Costa Lopes <
> eduardo-costa.lo...@serpro.gov.br> wrote:
>
> > Hi Marco,
> >
> > Basically, the content of the Lucene index directory is deleted and then
> > the index is recreated (under the same directory). Months ago I researched
> > how to "refresh" the Lucene access to get the newest data without
> > restarting the web applications, and since version 6.1.0 there is the class
> > SearcherManager, whose maybeRefresh() method should, according to the
> > documentation, be called periodically to reopen the index. Our "reopen
> > scheduler" runs hourly, and even though it executes successfully it seems
> > the data isn't the newest.
> >
> > Thanks.
> >
> >
> > ==
> > Eduardo Costa Lopes
> > SERPRO - SUPDE/DEPAE/DE009
> >
> > e-mail: eduardo-costa.lo...@serpro.gov.br
> > telefone: (51) 2129 - 1180
> >
> > - Mensagem original -
> > De: "Marco Reis" 
> > Para: "java-user" 
> > Enviadas: Quarta-feira, 11 de julho de 2018 12:06:18
> > Assunto: Re: Recreating index lucene without stopping client applications
> >
> > Hi Eduardo,
> >
> > The index recreation process isn't clear to me, but I think you have two
> > different SearcherManagers, one for the app and a different one for the
> > command line. At some point, one of them could see the document deletion
> > and the JBoss one doesn't. Maybe reopening the index directory could help.
> >
> >
> >
> >
> > On Wed, Jul 11, 2018 at 11:46 AM Eduardo Costa Lopes <
> > eduardo-costa.lo...@serpro.gov.br> wrote:
> >
> > > Hello,
> > >
> > > I have a JBoss application querying a Lucene index to get some customer
> > > info. Sometimes the index is recreated while the application is running.
> > > Basically, the old index is erased and a new one is created. On the
> > > application side we have a scheduler calling
> > > org.apache.lucene.search.SearcherManager.maybeRefresh(), in order to get
> > > a new connection to the index. The issue is: today we updated the index
> > > and, searching for a certain name, our command line returns 4955 hits,
> > > but in the web app we got 4058 hits (three more). The correct hit count
> > > only shows up if we restart JBoss. I'd like to know how we can recreate
> > > the Lucene index without needing to restart the applications.
> > >
> > > Thanks in advance,
> > >
> > > Eduardo Lopes.
> > >
> > >
> > >
> > >

Re: UTF8TaxonomyWriterCache inconsistency

2018-07-02 Thread Michael McCandless
Yes please create a Jira issue!

Mike

On Mon, Jul 2, 2018, 12:31 AM Руслан Торобаев  wrote:

> Hi!
>
> I’m facing a problem with taxonomy writer cache inconsistency. At some
> point in time UTF8TaxonomyWriterCache starts to return the wrong ord for
> some facet labels. As a result, wrong ords are written into the documents'
> facet fields, and wrong counts are returned (undercounts) during search.
> This bug shows up on different servers with different index contents (we
> have several separate indexes with unique data).
> Unfortunately I can’t reproduce this behaviour in tests. All I have now is
> the taxonomy dir state and a UTF8TaxonomyWriterCache dump I created on a
> “broken" application instance. I’ve also created a simple app to load and
> compare the cache state with the taxonomy, and I can share it.
> We are using Lucene 7.1.0 and AFAIK there have been no major changes in the
> facets cache code since that release.
>
> Can someone help me investigate this situation? Should I create ticket in
> Lucene bug tracker?
>
>
> -
>
> Regards
> Ruslan Torobaev
>
>


Re: Help! - Max Segment name reached

2018-04-21 Thread Michael McCandless
Well I think as time goes on we'll see more and more people running into it
;)

But you really need to commit at a surprisingly high rate, and have a
surprisingly long lived index, to overflow the int that holds the segment
number.  E.g. if you commit once per second, it should take ~68 years to
overflow the int.
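
For what it's worth, a rough sketch of the addIndexes() recovery Uwe describes in the quoted message below. The paths are placeholders, and on 4.10.x the FSDirectory.open and IndexWriterConfig signatures differ slightly from this (they take a File and a Version argument):

    Directory broken = FSDirectory.open(Paths.get("/path/to/broken-index")); // placeholder
    Directory fresh  = FSDirectory.open(Paths.get("/path/to/new-index"));    // placeholder, must be empty

    try (IndexWriter writer = new IndexWriter(fresh, new IndexWriterConfig(new StandardAnalyzer()))) {
        writer.addIndexes(broken);  // copies the existing segments over, assigning fresh segment names
        writer.commit();
    }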

Mike McCandless

http://blog.mikemccandless.com

On Tue, Apr 17, 2018 at 4:04 PM, Stuart Goldberg 
wrote:

> Thanks, I will try that.
>
> Why haven't more people run into this issue? The next segment number is
> persisted, so if an index has a long life it should eventually run into
> this problem.
>
> Stuart M Goldberg
> Senior Vice President of Software Development
> FIX Flyer LLC
> http://www.FIXFlyer.com/
>
> -Original Message-
> From: Uwe Schindler 
> Sent: Tuesday, April 17, 2018 4:02 PM
> To: java-user@lucene.apache.org
> Subject: Re: Help! - Max Segment name reached
>
> Hi,
>
> Create a new empty index in a new directory and use addIndex() using the
> other directory with the broken index.
>
> This will copy all segments but renumber them.
>
> Uwe
>
> On April 17, 2018 3:52:27 PM UTC, Stuart Goldberg <
> sgoldb...@fixflyer.com> wrote:
> >We have an index that has run into this bug:
> >https://issues.apache.org/jira/browse/LUCENE-7999
> >
> >
> >
> >Although this is reported to be fixed in Lucene 7.2, we are at 4.10.4
> >and cannot upgrade.
> >
> >
> >
> >By looking at the code it seems that the last segment number counter is
> >persisted in segment_h. When creating a new segment, it names the
> >segment based on the persisted counter. If this counter is larger than
> >Integer.MAX_VALUE, how can we recover this index?
> >
> >
> >
> >Is there anything we can do?
> >
> >
> >
> >Stuart M Goldberg
> >
> >Senior Vice President of Software Development FIX Flyer LLC
> > http://www.FIXFlyer.com/
> >
> >
> >
>
> --
> Uwe Schindler
> Achterdiek 19, 28357 Bremen
> https://www.thetaphi.de
>
>
>
>


Re: WordDelimiterGraphFilter does not respect KeywordAttribute

2018-04-21 Thread Michael McCandless
+1

Mike

On Fri, Apr 20, 2018, 9:42 AM Michael Sokolov  wrote:

> I have a use case that generates some tokens containing punctuation
> (fractions and other numerical constructs), but I am handling most
> punctuation with WordDelimiterGraphFilter, which then decomposes those
> tokens into parts and re-composes, so eg 1/2 becomes {1, 2, 12}. I thought
> at first that I could mark those tokens as keywords to prevent any future
> analysis, but I discovered WDGF ignores that.
>
> I have a workaround using Arabic numerals as separators instead of
> punctuation (1/2 -> 1١2) -- these are classified as digits, so WDGF does
> not split on them --, but someday I would like to support Arabic (or Hindi)
> language numbers as well, and then this hack will bite me.
>
> Does it seem reasonable to update WDGF (and its cousin WDF) to respect
> KeywordAttribute? I think it can be done with a very small change.
>


Re: IndexWriter updateDocument is removing doc from index

2018-03-16 Thread Michael McCandless
Yes you can add documents by calling updateDocument -- if no prior
documents matched the deletion Term you provide, nothing is deleted and
your new doc is added.

Hmm are you sure your 2nd update really updated and then added 12 new
docs?   Dropping segment 1 makes sense because you deleted the one doc
(from your first update) and Lucene drops 100% deleted segments.

But your 3rd segment should have had 13 docs if you really added 12 new
docs and updated.
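
As a small illustration of that add-or-update behavior (the "id" field and its value are made up here, and writer is assumed to be an open IndexWriter):

    Document doc = new Document();
    doc.add(new StringField("id", "42", Field.Store.YES));           // assumed unique-key field
    doc.add(new TextField("body", "the updated text", Field.Store.NO));

    // Atomically deletes whatever currently matches id:42 and adds the new doc;
    // if nothing matches, it behaves exactly like addDocument.
    writer.updateDocument(new Term("id", "42"), doc);
    writer.commit();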

Mike McCandless

http://blog.mikemccandless.com

On Thu, Mar 15, 2018 at 9:17 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> While writing some tools to build and maintain lucene indexes I noticed
> some strange behavior during testing.
> A doc disappears from lucene index while using IndexWriter updateDocument.
>
> The API of lucene 6.4.2 states:
> "Updates a document by first deleting the document(s) containing term and
> then adding the new document. The delete and then add are atomic as seen
> by a reader on the same index (flush may happen only after the add)."
>
> I could reproduce it, but it might be that it works as designed and I have
> to call some "flush" after using updateDocument?
>
> Any known issue or pitfall with 
> org.apache.lucene.index.IndexWriter.updateDocument
> ?
>
> Steps I took:
> - created a new lucene index with 8 docs and 1 segment
>   segment_0 with DelCount:0, DelGen:-1, numDocs:8, maxDocs:8
> - updated 1 doc in the index with updateDocument which results in
>   segment_0 with DelCount:1, DelGen:1, numDocs:7, maxDocs:8
>   segment_1 with DelCount:0, DelGen:-1, numDocs:1, maxDocs:1
> so far OK, but now:
> - updated again the same doc as before and added 12 new docs
>   segment_0 with DelCount:1, DelGen:1, numDocs:7, maxDocs:8
>   segment_2 with DelCount:0, DelGen:-1, numDocs:12, maxDocs:12
>
> The result is that segment_1 disappeared and therefore the updated
> document.
> Only the 7 docs of segment_0 and the 12 new added documents of segment_2.
>
> By the way, is it allowed to use updateDocument to also add new docs?
>
> Regards
> Bernd
>
>
>
>


Re: any api to get segment number of index

2018-01-14 Thread Michael McCandless
How about IndexSearcher.getIndexReader().leaves().size()?
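
In code, something along these lines (searcher is just an assumed IndexSearcher):

    int segmentCount = searcher.getIndexReader().leaves().size();  // one leaf reader per segment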

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jan 10, 2018 at 5:19 AM, Yonghui Zhao  wrote:

> Hi,
>
> Is there any public API that I can get segment number of current version
> index?
>
> I didn't find in indexwriter or indexsearcher in lucene 4.10.
>


Re: typed IntPoint.RangeQuery & LongPoint.rangeQuery

2018-01-09 Thread Michael McCandless
Lucene doesn't (shouldn't?) let you add 'a' at first as an IntPoint and
then later as a LongPoint -- they must always be consistent.

So however you indexed it, you must use the corresponding class to
construct the query.

String 'hi' can only be found if you had indexed a token 'hi' in that field
-- the terms + postings are separate from the dimensional points (and from
doc values) so they cannot "cross talk".
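
A small sketch of the "index and query with the same class" point, assuming field 'a' is always indexed as a long (the values are made up):

    // indexing: field "a" is always a LongPoint
    Document doc = new Document();
    doc.add(new LongPoint("a", 5L));
    writer.addDocument(doc);                               // writer: an assumed open IndexWriter

    // querying: use the matching class; the bounds are inclusive
    Query q = LongPoint.newRangeQuery("a", 4L, 101L);
    TopDocs hits = searcher.search(q, 10);                 // searcher: an assumed IndexSearcher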

Mike McCandless

http://blog.mikemccandless.com

On Sun, Dec 31, 2017 at 7:17 AM, Cristian Lorenzetto <
cristian.lorenze...@gmail.com> wrote:

> Hi, I have a doubt.
> Suppose I want to search a document field 'a' within a range. The same
> problem also applies if I search for an exact point.
> I know it is possible for one document to contain a property 'a' with type
> Integer and another with type Long. For example, document 1 contains {a:5}
> where 5 is an int, and document 2 contains {a:100} where 100 is a long.
>
> How do I search all documents in the range [4,101]?
>
> Is the right query
>  IntPoint.rangequery(a,4,101) or LongPoint.rangequery(a,4,101) ?
>
> The question is more general: every type has its own matching algorithm, so
> I thought that, theoretically, while searching for the string 'hi' I might
> also find a document with a non-string property, because it contains the
> same BytesRef?
>


Re: index sorting merge

2017-12-28 Thread Michael McCandless
You should upgrade to newer versions of Lucene, where all segments are
sorted, not just merged segments.
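
For example, on Lucene 6.2+ the sort is declared on the IndexWriterConfig and then applies to every flushed and merged segment. The "timestamp" field here is an assumption, and it must also be indexed with doc values:

    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
    IndexWriter writer = new IndexWriter(dir, iwc);        // dir: an assumed Directory
    // each document then also needs: doc.add(new NumericDocValuesField("timestamp", value));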

Mike McCandless

http://blog.mikemccandless.com

On Thu, Dec 28, 2017 at 11:13 AM, Yonghui Zhao 
wrote:

> Hi,
>
> I specified a SortingMergePolicy in my case. I find that only the first N-1
> segments are sorted as expected; the last segment is still not sorted when
> I call forceMerge(N), N > 1.
>
> I think it is by design, but is there any way to make all segments sorted.
>
> Thanks !
>


Re: may be lucene bug

2017-12-28 Thread Michael McCandless
I think there's a bug in your code: this line:

 doc.doc <= leaf.docBase + leaf.reader().maxDoc())

should be < not <=.
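
A sketch of the corrected per-hit lookup, letting ReaderUtil pick the leaf instead of hand-rolling the bounds check (the method and field name are illustrative; the random-access get() is the 6.x doc-values API):

    static String loadBinaryValue(IndexSearcher searcher, int docId, String field) throws IOException {
        List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
        LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(docId, leaves));
        BinaryDocValues values = DocValues.getBinary(leaf.reader(), field);
        return values.get(docId - leaf.docBase).utf8ToString();
    }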

Mike McCandless

http://blog.mikemccandless.com

On Thu, Dec 28, 2017 at 6:15 AM, 291699763 <291699...@qq.com> wrote:

> Lucene version:6.6.0
>
> when Index
> document.add(new BinaryDocValuesField("CBID.CCID", new
> BytesRef(myValue)));
>
> and when search
>
>
> int totalHits = indexSearcher.count(SpanNearQuery);
> int from = 0;
> int size = 1;
> int pageTime = 0;
> int loadTime = 0;
> Set<String> fieldsToLoad = new HashSet<>();
> fieldsToLoad.add("CBID.CCID");
> List<LeafReaderContext> leaves = indexSearcher.getIndexReader()
> .leaves();
> while (from < totalHits) {
> if (from > 0) {
> // paging
> Stopwatch started = Stopwatch.createStarted();
> TopDocs search = indexSearcher.search(content, from);
> ScoreDoc scoreDoc = search.scoreDocs[search.scoreDocs.length
> - 1];
> TopDocs topDocs = indexSearcher.searchAfter(scoreDoc,
> content, size);
> pageTime += started.elapsed(TimeUnit.MILLISECONDS);
>
> started = Stopwatch.createStarted();
> ScoreDoc[] scoreDocs = topDocs.scoreDocs;
> for (ScoreDoc doc : scoreDocs) {
> for (LeafReaderContext leaf : leaves) {
> BinaryDocValues binary = 
> DocValues.getBinary(leaf.reader(),
> "CBID.CCID");
> if (doc.doc >= leaf.docBase && doc.doc <=
> leaf.docBase + leaf.reader().maxDoc()) {
> BytesRef bytesRef = binary.get(doc.doc -
> leaf.docBase);
> keyValue.add(bytesRef.utf8ToString());
> }
> }
> }
> loadTime += started.elapsed(TimeUnit.MILLISECONDS);
> } else {
> // no paging
> Stopwatch started = Stopwatch.createStarted();
> TopDocs search = indexSearcher.search(content, size);
> pageTime += started.elapsed(TimeUnit.MILLISECONDS);
> started = Stopwatch.createStarted();
> ScoreDoc[] scoreDocs = search.scoreDocs;
> for (ScoreDoc doc : scoreDocs) {
> for (LeafReaderContext leaf : leaves) {
> BinaryDocValues binary = 
> DocValues.getBinary(leaf.reader(),
> "CBID.CCID");
> if (doc.doc >= leaf.docBase && doc.doc <=
> leaf.docBase + leaf.reader().maxDoc()) {
> BytesRef bytesRef = binary.get(doc.doc -
> leaf.docBase);
> keyValue.add(bytesRef.utf8ToString());
> }
> }
> }
> loadTime += started.elapsed(TimeUnit.MILLISECONDS);
> }
> from += size;
> }
>
>
> but it throws an exception:
> Exception in thread "main" java.lang.RuntimeException:
> java.io.EOFException: read past EOF: MMapIndexInput(path="/data/
> home/p_wxuwang/index_withDocValues/_26.cfs") [slice=_26_Lucene54_0.dvd]
> [slice=var-binary]
> at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$6.
> get(Lucene54DocValuesProducer.java:740)
> at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$
> LongBinaryDocValues.get(Lucene54DocValuesProducer.java:1197)
> at com.yuewen.nrzx.keyword.Main2.getWithDocValues(Main2.java:111)
> at com.yuewen.nrzx.keyword.Main2.main(Main2.java:187)
> Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/data/
> home/p_wxuwang/index_withDocValues/_26.cfs") [slice=_26_Lucene54_0.dvd]
> [slice=var-binary]
> at org.apache.lucene.store.ByteBufferIndexInput.readBytes(
> ByteBufferIndexInput.java:98)
> at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$6.
> get(Lucene54DocValuesProducer.java:736)
>
>
> I don't know why??
>
>
>
>
>
>
>
> Wang Xu   Technology Dept. / Data Support
> 18302118258|291699763
> Building 6, 690 Bibo Road, Pudong New District, Shanghai (201203)
> www.yuewen.com
>

