Re: Excessive reads while doing commit in lucene

2024-09-04 Thread Robert Muir
On Wed, Sep 4, 2024 at 7:07 AM Gopal Sharma  wrote:
>
> Hi Team,
>
> I am using aws efs to store a lucene index.

That's the issue: EFS is NFS, and you shouldn't put a Lucene index on NFS!




Re: SpanMultiTermQueryWrapper with PrefixQuery hitting num clause limit

2024-03-28 Thread Robert Muir
Using spans and wildcards together is asking for trouble: you will hit
limits, and it is inherently inefficient.

I'd recommend changing your indexing so that your queries are fast
and you aren't using wildcards that enumerate many terms at
search time.
Don't index words such as "bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6"
and then use wildcards to match just "bar".
Instead, add a synonym "bar" (or similar, whatever you want) for
"bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6".
This way you can match it with an ordinary term query: "bar".

E.g. for your simple example, this would look approximately like this:
instead of: "abc foo bar_" + UUID.randomUUID()
index something like: "abc foo bar bar_" + UUID.randomUUID()

If you instead add the synonym through your analyzer (e.g. with a synonym
filter), then bar_294e50e1-fc3c-450f-a04f-7b4ad79587d6 and its synonym "bar"
will sit at the same position, so your spans/sloppy phrases will work fine.
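
The query side would then look roughly like this (a sketch only; the imports
assume Lucene 9.x/10.x, where span queries live in the queries module, and
"field" matches the field name from your test below):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.spans.SpanNearQuery;
import org.apache.lucene.queries.spans.SpanTermQuery;

// Sketch only: with the marker token "bar" indexed at (or next to) the position of the
// UUID-suffixed term, the prefix wildcard is no longer needed, so the clause count stays
// constant no matter how many bar_<uuid> terms exist in the index.
SpanNearQuery query = new SpanNearQuery.Builder("field", true)
    .setSlop(1)
    .addClause(new SpanTermQuery(new Term("field", "abc")))
    .addClause(new SpanTermQuery(new Term("field", "bar")))   // ordinary term clause, no rewrite
    .build();
```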

On Thu, Mar 28, 2024 at 11:37 AM Yixun Xu  wrote:
>
> Hello,
>
> We are trying to search for phrases where the last term is a prefix match.
> For example, find all documents that contain "foo bar.*", with a
> configurable slop between "foo" and "bar". We were able to do this using
> `SpanNearQuery` where the last clause is a `SpanMultiTermQueryWrapper` that
> wraps a `PrefixQuery`. However, this seems to run into the limit of 1024
> clauses very quickly if the last term appears as a common prefix in the
> index.
>
> I have a branch that reproduces the query at
> https://github.com/apache/lucene/compare/main...yixunx:yx/span-query-limit?expand=1,
> and also pasted the code below.
>
> It seems that if slop = 0 then we can use `MultiPhraseQuery` instead, which
> doesn't hit the clause limit. For the slop != 0 case, is it intended that
> `SpanMultiTermQueryWrapper` can easily hit the clause limit, or am I using
> the queries wrong? Is there a workaround other than increasing
> `maxClauseCount`?
>
> Thank you for the help!
>
> ```java
> public class TestSpanNearQueryClauseLimit extends LuceneTestCase {
>
>     private static final String FIELD_NAME = "field";
>     private static final int NUM_DOCUMENTS = 1025;
>
>     /**
>      * Creates an index with NUM_DOCUMENTS documents. Each document has a
>      * text field in the form of "abc foo bar_[UUID]".
>      */
>     private Directory createIndex() throws Exception {
>         Directory dir = newDirectory();
>         try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
>             for (int i = 0; i < NUM_DOCUMENTS; i++) {
>                 Document doc = new Document();
>                 doc.add(new TextField("field", "abc foo bar_" + UUID.randomUUID(), Field.Store.YES));
>                 writer.addDocument(doc);
>             }
>             writer.commit();
>         }
>         return dir;
>     }
>
>     public void testSpanNearQueryClauseLimit() throws Exception {
>         Directory dir = createIndex();
>
>         // Find documents that match "abc  bar.*", which should match all documents.
>         try (IndexReader reader = DirectoryReader.open(dir)) {
>             Query query = new SpanNearQuery.Builder(FIELD_NAME, true)
>                     .setSlop(1)
>                     .addClause(new SpanTermQuery(new Term(FIELD_NAME, "abc")))
>                     .addClause(new SpanMultiTermQueryWrapper<>(new PrefixQuery(new Term(FIELD_NAME, "bar"))))
>                     .build();
>
>             // This throws an exception if NUM_DOCUMENTS is > 1024:
>             // org.apache.lucene.search.IndexSearcher$TooManyNestedClauses:
>             //     Query contains too many nested clauses; maxClauseCount is set to 1024
>             TopDocs docs = new IndexSearcher(reader).search(query, 10);
>             System.out.println(docs.totalHits);
>         }
>
>         dir.close();
>     }
> }
> ```
>
> Thank you,
> Yixun Xu




Re: Re-ranking using cross-encoder after vector search (bi-encoder)

2023-02-10 Thread Robert Muir
I think it would be good to provide something like a VectorRerankField
(sorry for the bad name, maybe FastVectorField would be amusing too)
that just stores vectors as docvalues (no HNSW) and has a
newRescorer() method that implements
org.apache.lucene.search.Rescorer. Then it's easy to do what that
document describes: pull the top 500 hits with BM25 and rerank them with
your vectors. Very fast, only 500 calculations required, no HNSW or
anything needed. Of course you could use a vector search instead of a
BM25 search as the initial search to pull the top 500 hits too.

So it could meet both use-cases and provide a really performant option
for users that want to integrate vector search.
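
Until something like that exists, here is a rough sketch of the rerank idea
with current APIs; it is not the proposed field, just an illustration that
assumes each document stores its embedding as little-endian floats in a
BinaryDocValues field named "embedding" (field name and encoding are my
assumptions):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

public final class DocValuesVectorRerank {
  /** First pass: BM25 (or anything else). Second pass: dot product against the stored vector. */
  public static TopDocs rerank(IndexSearcher searcher, Query firstPass, float[] queryVector)
      throws IOException {
    TopDocs top = searcher.search(firstPass, 500);
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
    for (ScoreDoc hit : top.scoreDocs) {
      LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
      // fresh iterator per hit, so hits don't need to be in docid order (fine for ~500 docs)
      BinaryDocValues vectors = DocValues.getBinary(leaf.reader(), "embedding");
      if (vectors.advanceExact(hit.doc - leaf.docBase)) {
        BytesRef raw = vectors.binaryValue();
        ByteBuffer buf = ByteBuffer.wrap(raw.bytes, raw.offset, raw.length).order(ByteOrder.LITTLE_ENDIAN);
        float dot = 0f;
        for (float q : queryVector) {
          dot += q * buf.getFloat();
        }
        hit.score = dot;  // only ~500 similarity computations, no HNSW involved
      }
    }
    Arrays.sort(top.scoreDocs, Comparator.comparingDouble((ScoreDoc d) -> -d.score));
    return top;
  }
}
```

The same second pass could also be wrapped in a Rescorer implementation if you
want to plug it into existing rescoring code.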

On Fri, Feb 10, 2023 at 10:21 AM Michael Wechner
 wrote:
>
> Hi
>
> I use the vector search of Lucene, with embeddings that I get from
> SentenceBERT, for example.
>
> According to
>
> https://www.sbert.net/examples/applications/retrieve_rerank/README.html
>
> a re-ranking with a cross-encoder after the vector search (bi-encoding)
> can improve the ranking.
>
> Would it make sense to add this kind of functionality to Lucene or is
> somebody already working on something similar?
>
> Thanks
>
> Michael
>




Re: Prioritising certain documents in the search results

2023-02-01 Thread Robert Muir
check out 
https://lucene.apache.org/core/9_5_0/core/org/apache/lucene/document/FeatureField.html

I think this is how you want to do it. The docs have some suggestions on
how to pick starting values without any training; see the section "if you
don't know where to start".
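
A rough sketch of that shape; the field/feature names ("features",
"releaseNote") and the version/textQuery variables are placeholders, not
anything from your setup:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FeatureField;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// Indexing: only ReleaseNote documents get the feature; its value grows with Version.
Document doc = new Document();
doc.add(new FeatureField("features", "releaseNote", version));  // e.g. version = 10.0f

// Searching: the feature adds a bounded, saturating boost next to the normal text query,
// so an exact phrase match in a normal document can still outrank a boosted one.
Query boosted = new BooleanQuery.Builder()
    .add(textQuery, BooleanClause.Occur.MUST)
    .add(FeatureField.newSaturationQuery("features", "releaseNote"), BooleanClause.Occur.SHOULD)
    .build();
```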

On Wed, Feb 1, 2023 at 12:03 PM Trevor Nicholls
 wrote:
>
> Hi
>
>
>
> I'm currently using Lucene 8.6.3, and indexing a few thousand documents.
> Some of these documents need to be prioritised in the search results, but
> not by too much; e.g. an exact phrase match in a normal document still needs
> to top the rankings ahead of a priority document that just matches the
> individual words.
>
> I'm tokenising and indexing a Title field and a Content field, and storing a
> Category field and a Version field. Searches are executed with a standard
> QueryParser on Title and Content.
>
>
>
> Most documents do not have a category or a version, but if a document has
> the category ReleaseNote then it needs to be boosted, and the value of the
> boost factor should be correlated with the Version (so Version = 10 is
> boosted more than Version = 6, etc).
>
>
>
> Looking back through the older versions of the documentation it appears that
> document.setBoost might have provided something I could work with, but later
> versions of Lucene appear to have dropped this feature. If it still existed
> I expect I'd have to experiment with the actual coefficients until I got
> reasonable results but in principle it should be possible to arrive at a
> good result.
>
>
>
> I can't see how to achieve this same thing in the later versions of Lucene.
> Best I can come up with is to calculate a factor from the category and
> version number and store it as a separate field, then somehow use this field
> to scale up (or down) the actual scores reported for the query results. But
> I'm sure there must be a better way to do it - can anyone show me what it
> is?
>
>
>
> cheers
>
> T
>
>
>




Re: Handling Indian regional languages

2023-01-16 Thread Robert Muir
On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian
 wrote:
>
> For handling Indian regional languages, what is the advisable approach?
>
> 1. Indexing each language data(Tamil, Hindi etc) in specific fields like
> content_tamil, content_hindi with specific per field Analyzer like Tamil
> for content_tamil, HindiAnalyzer for content_hindi?

You don't need to do this just to tokenize. You only need to do this
if you want to do something fancier on top (e.g. stemming and so on).
If you look at newer lucene versions, there are more analyzers for
more languages.

>
> 2. Indexing all language data in the same field but handling tokenization
> with specific unicode range(similar to THAI) in tokenizer like mentioned
> below..
>
> > THAI   = [\u0E00-\u0E59]
> > TAMIL  = [\u0B80-\u0BFF]
> > // basic word: a sequence of digits & letters (includes Thai to enable
> > ThaiAnalyzer to function)
> > ALPHANUM   = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+

Don't do this: just use StandardTokenizer instead of ClassicTokenizer.
StandardTokenizer can tokenize all the Indian writing systems
out of the box.
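
For example, a quick sketch using StandardAnalyzer (StandardTokenizer plus
lowercasing) on mixed Tamil/Hindi/English text:

```java
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// One field, one analyzer: StandardTokenizer splits Tamil, Devanagari and Latin words
// without any hand-written Unicode ranges in the grammar.
try (StandardAnalyzer analyzer = new StandardAnalyzer();
     TokenStream ts = analyzer.tokenStream("content", "தமிழ் உரை and हिन्दी पाठ")) {
  CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    System.out.println(term.toString());
  }
  ts.end();
}
```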




Re: Recurring index corruption

2023-01-02 Thread Robert Muir
Your files are getting truncated. Nothing Lucene can do.

If this is really the only way you can store data in this azure cloud,
and this is how they treat it, then run away... don't just walk... to
a different cloud.

On Mon, Jan 2, 2023 at 5:19 AM S S  wrote:
>
> We are experimenting with Elastic Search deployed in Azure Container 
> Instances (Debian + OpenJDK). The ES indexes are stored into an Azure file 
> share mounted via SMB (3.0). The Elastic Search cluster is made up of 4 
> nodes, each one has a separate file share to store the indices.
>
> This configuration has been influenced by some ACIs limitations, specifically:
>
> we cannot set the max_map_count value as we do not have access to the 
> underlying host 
> (https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html).
>  Unfortunately, this is required to run an ES cluster, therefore we were 
> forced to use NIOF
> ACI’s storage is ephemeral, therefore we had to map volumes to persist the 
> indexes. ACIs only allow volume mappings using Azure File Shares, which only 
> works with NFS or SMB.
>
> We are experiencing recurring index corruption, specifically a "read past 
> EOF" exception. I asked on the Elastic Search forum but the answer I got was 
> a bit generic and not really helpful other than confirming that, from ES 
> point of view, ES should work on an SMB share as long as it behaves as a 
> local drive. As the underlying exception relates to an issue with a Lucene 
> index, I was wondering if you could help out? Specifically, can Lucene work 
> on SMB? I can only find sparse information on this configuration and, while 
> NFS seems a no-no, for SMB is not that clear. Below is the exception we are 
> getting.
>
> java.io.IOException: read past EOF: 
> NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
>  buffer: java.nio.HeapByteBuffer[pos=0 lim=1024 cap=1024] chunkLen: 1024 end: 
> 2331: 
> NIOFSIndexInput(path="/bitnami/elasticsearch/data/indices/mS2bUbLtSeG0FSAMuKX7JQ/0/index/_ldsn_1.fnm")
>   at 
> org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:200)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:291)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:55)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:39)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.codecs.CodecUtil.readBEInt(CodecUtil.java:667) 
> ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:184) 
> ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:253) 
> ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read(Lucene90FieldInfosFormat.java:128)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.SegmentReader.initFieldInfos(SegmentReader.java:205) 
> ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.SegmentReader.(SegmentReader.java:156) 
> ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.ReadersAndUpdates.createNewReaderWithLatestLiveDocs(ReadersAndUpdates.java:738)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.ReadersAndUpdates.swapNewReaderWithLatestLiveDocs(ReadersAndUpdates.java:754)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.ReadersAndUpdates.writeFieldUpdates(ReadersAndUpdates.java:678)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.ReaderPool.writeAllDocValuesUpdates(ReaderPool.java:251)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3743) 
> ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:591) 
> ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
>  ~[lucene-core-9.3.0.jar:?]
>   at 
> org.elasticsearch.index.en

Re: Is there a way to customize segment names?

2022-12-17 Thread Robert Muir
No, you can't control them. And we must not open up anything to try to
support this.

On Fri, Dec 16, 2022 at 7:28 PM Patrick Zhai  wrote:
>
> Hi Mike, Robert
>
> Thanks for replying, the system is almost like what Mike has described: one 
> writer is primary,
> and the other is trying to catch up and wait, but in our internal discussion 
> we found there might
> be a small chance that the secondary mistakenly thinks itself primary (due 
> to errors in another component)
> while the primary is still alive, and thus goes into the situation I described.
> And because we want to tolerate the error in case we can't prevent it from 
> happening, we're looking at customizing
> filenames.
>
> Thanks again for discussing this with me and I've learnt that playing with 
> filenames can become quite
> troublesome, but still, even out of my own curiosity, I want to understand 
> whether we're able to control
> the segment names in some way?
>
> Best
> Patrick
>
>
> On Fri, Dec 16, 2022 at 6:36 AM Michael Sokolov  wrote:
>>
>> +1 trying to coordinate multiple writers running independently will
>> not work. My 2c for availability: you can have a single primary active
>> writer with a backup one waiting, receiving all the segments from the
>> primary. Then if the primary goes down, the secondary one has the most
>> recent commit replicated from the primary (identical commit, same
>> segments etc) and can pick up from there. You would need a mechanism
>> to replay the writes the primary never had a chance to commit.
>>
>> On Fri, Dec 16, 2022 at 5:41 AM Robert Muir  wrote:
>> >
>> > You are still talking "Multiple writers". Like I said, going down this
>> > path (playing tricks with filenames) isn't going to work out well.
>> >
>> > On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai  wrote:
>> > >
>> > > Hi Robert,
>> > >
>> > > Maybe I didn't explain it clearly but we're not going to constantly 
>> > > switch
>> > > between writers or share effort between writers, it's purely for
>> > > availability: the second writer only kicks in when the first writer is 
>> > > not
>> > > available for some reason.
>> > > And as far as I know the replicator/nrt module has not provided a 
>> > > solution
>> > > for when the primary node (main indexer) is down: how would we recover 
>> > > with
>> > > a backup indexer?
>> > >
>> > > Thanks
>> > > Patrick
>> > >
>> > >
>> > > On Thu, Dec 15, 2022 at 7:16 PM Robert Muir  wrote:
>> > >
>> > > > This multiple-writer isn't going to work and customizing names won't
>> > > > allow it anyway. Each file also contains a unique identifier tied to
>> > > > its commit so that we know everything is intact.
>> > > >
>> > > > I would look at the segment replication in lucene/replicator and not
>> > > > try to play games with files and mixing multiple writers.
>> > > >
>> > > > On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  
>> > > > wrote:
>> > > > >
>> > > > > Hi Folks,
>> > > > >
>> > > > > We're trying to build a search architecture using segment replication
>> > > > (indexer and searcher are separated and indexer shipping new segments 
>> > > > to
>> > > > searchers) right now and one of the problems we're facing is: for
>> > > > availability reasons we need to have multiple indexers running, and 
>> > > > when the
>> > > > searcher is switching from consuming one indexer to another, there are
>> > > > chances where the segment names collide with each other (because 
>> > > > segment
>> > > > names are count based) and the searcher has to reload the whole index.
>> > > > > To avoid that we're looking for a way to name the segments so that
>> > > > Lucene is able to tell the difference and load only the difference (by
>> > > > calling `openIfChanged`). I've checked the IndexWriter and the
>> > > > DocumentsWriter and it seems it is controlled by a private final method
>> > > > `newSegmentName()` so likely not possible there. So I wonder whether
>> > > > there's any other ways people are aware of

Re: Is there a way to customize segment names?

2022-12-16 Thread Robert Muir
You are still talking "Multiple writers". Like I said, going down this
path (playing tricks with filenames) isn't going to work out well.

On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai  wrote:
>
> Hi Robert,
>
> Maybe I didn't explain it clearly but we're not going to constantly switch
> between writers or share effort between writers, it's purely for
> availability: the second writer only kicks in when the first writer is not
> available for some reason.
> And as far as I know the replicator/nrt module has not provided a solution
> for when the primary node (main indexer) is down: how would we recover with
> a backup indexer?
>
> Thanks
> Patrick
>
>
> On Thu, Dec 15, 2022 at 7:16 PM Robert Muir  wrote:
>
> > This multiple-writer isn't going to work and customizing names won't
> > allow it anyway. Each file also contains a unique identifier tied to
> > its commit so that we know everything is intact.
> >
> > I would look at the segment replication in lucene/replicator and not
> > try to play games with files and mixing multiple writers.
> >
> > On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  wrote:
> > >
> > > Hi Folks,
> > >
> > > We're trying to build a search architecture using segment replication
> > (indexer and searcher are separated and indexer shipping new segments to
> > searchers) right now and one of the problems we're facing is: for
> > availability reasons we need to have multiple indexers running, and when the
> > searcher is switching from consuming one indexer to another, there are
> > chances where the segment names collide with each other (because segment
> > names are count based) and the searcher has to reload the whole index.
> > > To avoid that we're looking for a way to name the segments so that
> > Lucene is able to tell the difference and load only the difference (by
> > calling `openIfChanged`). I've checked the IndexWriter and the
> > DocumentsWriter and it seems it is controlled by a private final method
> > `newSegmentName()` so likely not possible there. So I wonder whether
> > there's any other ways people are aware of that can help control the
> > segment names?
> > >
> > > An example of the situation described above:
> > > Searcher previously consuming from indexer 1, and have following
> > segments: _1, _2, _3, _4
> > > Indexer 2 previously sync'd from indexer 1, sharing the first 3
> > segments, and produced its own 4th segment (denoted as _4', but it shares
> > the same "_4" name): _1, _2, _3, _4'
> > > Suddenly Indexer 1 dies and searcher switched from Indexer 1 to Indexer
> > 2, then when it finished downloading the segments and trying to refresh the
> > reader, it will likely hit the exception here, and seems all we can do
> > right now is to reload the whole index and that could be potentially a high
> > cost.
> > >
> > > Sorry for the long email and thank you in advance for any replies!
> > >
> > > Best
> > > Patrick
> > >
> >




Re: Is there a way to customize segment names?

2022-12-15 Thread Robert Muir
This multiple-writer isn't going to work and customizing names won't
allow it anyway. Each file also contains a unique identifier tied to
its commit so that we know everything is intact.

I would look at the segment replication in lucene/replicator and not
try to play games with files and mixing multiple writers.

On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai  wrote:
>
> Hi Folks,
>
> We're trying to build a search architecture using segment replication 
> (indexer and searcher are separated and indexer shipping new segments to 
> searchers) right now and one of the problems we're facing is: for 
> availability reason we need to have multiple indexers running, and when the 
> searcher is switching from consuming one indexer to another, there are 
> chances where the segment names collide with each other (because segment 
> names are count based) and the searcher have to reload the whole index.
> To avoid that we're looking for a way to name the segments so that Lucene is 
> able to tell the difference and load only the difference (by calling 
> `openIfChanged`). I've checked the IndexWriter and the DocumentsWriter and it 
> seems it is controlled by a private final method `newSegmentName()` so likely 
> not possible there. So I wonder whether there's any other ways people are 
> aware of that can help control the segment names?
>
> An example of the situation described above:
> Searcher previously consuming from indexer 1, and have following segments: 
> _1, _2, _3, _4
> Indexer 2 previously sync'd from indexer 1, sharing the first 3 segments, and 
> produced its own 4th segment (denoted as _4', but it shares the same "_4" 
> name): _1, _2, _3, _4'
> Suddenly Indexer 1 dies and searcher switched from Indexer 1 to Indexer 2, 
> then when it finished downloading the segments and trying to refresh the 
> reader, it will likely hit the exception here, and seems all we can do right 
> now is to reload the whole index and that could be potentially a high cost.
>
> Sorry for the long email and thank you in advance for any replies!
>
> Best
> Patrick
>




Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
https://github.com/apache/lucene/pull/11955

On Sat, Nov 19, 2022 at 10:43 PM Robert Muir  wrote:
>
> Hi,
>
> Is this 'synchronized' really needed?
>
> 1. Lucene tokenstreams are only used by a single thread. If you index
> with 10 threads, 10 tokenstreams are used.
> 2. These OpenNLP Factories make a new *Op for each tokenstream that
> they create, so there's no thread hazard.
> 3. If I remove the 'synchronized' keyword everywhere from the opennlp module
> (NLPChunkerOp, NLPNERTaggerOp, NLPPOSTaggerOp, NLPSentenceDetectorOp,
> NLPTokenizerOp), then all the tests pass.
>
> On Sat, Nov 19, 2022 at 10:26 PM Luke Kot-Zaniewski (BLOOMBERG/ 919
> 3RD A)  wrote:
> >
> > Greetings,
> > I would greatly appreciate anyone sharing their experience doing 
> > NLP/lemmatization and am also very curious to gauge the opinion of the 
> > lucene community regarding open-nlp. I know there are a few other libraries 
> > out there, some of which can’t be directly included in the lucene project 
> > because of licensing issues. If anyone has any suggestions/experiences, 
> > please do share them :-)
> > As a side note I’ll add that I’ve been experimenting with open-nlp’s 
> > PoS/lemmatization capabilities via lucene’s integration. During the process 
> > I uncovered some issues which made me question whether open-nlp is the 
> > right tool for the job. The first issue was a “low-hanging bug”, which 
> > would have most likely been addressed sooner if this solution was popular, 
> > this simple bug was at least 5 years old -> 
> > https://github.com/apache/lucene/issues/11771
> >
> > Second issue has more to do with the open-nlp library itself. It is not 
> > thread-safe in some very unexpected ways. Looking at the library internals 
> > reveals unsynchronized lazy initialization of shared components. 
> > Unfortunately the lucene integration kind of sweeps this under the rug by 
> > wrapping everything in a pretty big synchronized block, here is an example 
> > https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
> >  . This itself is problematic because these functions run in really tight 
> > loops and probably shouldn’t be blocking. Even if one did decide to do 
> > blocking initialization, it can still be done at a much lower level than 
> > currently. From what I gather, the functions that are synchronized at the 
> > lucene-level could be made thread-safe in a much more performant way if 
> > they were fixed in open-nlp. But I am also starting to doubt if this is 
> > worth pursuing since I don't know whether anyone would find this useful, 
> > hence the original inquiry.
> > I’ll add that I have separately used the open-nlp sentence break iterator 
> > (which suffers from the same problem 
> > https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39
> >  ) at production scale and discovered really bad performance during certain 
> > conditions which I attribute to this unnecessary synching. I suspect this 
> > may have impacted others as well 
> > https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
> > Many thanks,
> > Luke Kot-Zaniewski
> >




Re: Integrating NLP into Lucene Analysis Chain

2022-11-19 Thread Robert Muir
Hi,

Is this 'synchronized' really needed?

1. Lucene tokenstreams are only used by a single thread. If you index
with 10 threads, 10 tokenstreams are used.
2. These OpenNLP Factories make a new *Op for each tokenstream that
they create, so there's no thread hazard.
3. If I remove the 'synchronized' keyword everywhere from the opennlp module
(NLPChunkerOp, NLPNERTaggerOp, NLPPOSTaggerOp, NLPSentenceDetectorOp,
NLPTokenizerOp), then all the tests pass.

On Sat, Nov 19, 2022 at 10:26 PM Luke Kot-Zaniewski (BLOOMBERG/ 919
3RD A)  wrote:
>
> Greetings,
> I would greatly appreciate anyone sharing their experience doing 
> NLP/lemmatization and am also very curious to gauge the opinion of the lucene 
> community regarding open-nlp. I know there are a few other libraries out 
> there, some of which can’t be directly included in the lucene project because 
> of licensing issues. If anyone has any suggestions/experiences, please do 
> share them :-)
> As a side note I’ll add that I’ve been experimenting with open-nlp’s 
> PoS/lemmatization capabilities via lucene’s integration. During the process I 
> uncovered some issues which made me question whether open-nlp is the right 
> tool for the job. The first issue was a “low-hanging bug”, which would have 
> most likely been addressed sooner if this solution was popular, this simple 
> bug was at least 5 years old -> https://github.com/apache/lucene/issues/11771
>
> Second issue has more to do with the open-nlp library itself. It is not 
> thread-safe in some very unexpected ways. Looking at the library internals 
> reveals unsynchronized lazy initialization of shared components. 
> Unfortunately the lucene integration kind of sweeps this under the rug by 
> wrapping everything in a pretty big synchronized block, here is an example 
> https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
>  . This itself is problematic because these functions run in really tight 
> loops and probably shouldn’t be blocking. Even if one did decide to do 
> blocking initialization, it can still be done at a much lower level than 
> currently. From what I gather, the functions that are synchronized at the 
> lucene-level could be made thread-safe in a much more performant way if they 
> were fixed in open-nlp. But I am also starting to doubt if this is worth 
> pursuing since I don't know whether anyone would find this useful, hence the 
> original inquiry.
> I’ll add that I have separately used the open-nlp sentence break iterator 
> (which suffers from the same problem 
> https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39
>  ) at production scale and discovered really bad performance during certain 
> conditions which I attribute to this unnecessary synching. I suspect this may 
> have impacted others as well 
> https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
> Many thanks,
> Luke Kot-Zaniewski
>




Re: pagination with searchAfter

2022-09-24 Thread Robert Muir
You don't need a server-side cache, as the searchAfter value has all
the information; it is just your "current position".
For example, if you are sorting by ID and you return IDs 1,2,3,4,5, the
searchAfter value is basically 5.
So when you query the next time with searchAfter=5, it skips over
any value <= 5 and returns 6,7,8,9,10.
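
As a minimal sketch (searcher and query assumed to exist, with "id" indexed
as a sortable numeric doc-values field):

```java
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Page 1, sorted by id; the last hit doubles as the cursor for page 2.
Sort sort = new Sort(new SortField("id", SortField.Type.LONG));
TopDocs page1 = searcher.search(query, 10, sort);
ScoreDoc cursor = page1.scoreDocs[page1.scoreDocs.length - 1];

// Page 2: skips everything up to and including the cursor; no server-side ScoreDoc cache needed.
TopDocs page2 = searcher.searchAfter(cursor, query, 10, sort);
```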

So depending on what you are doing, refreshing the index might indeed
give some inconsistencies in pagination. But this is nothing new, you
may also see inconsistencies in pagination across refreshes with
ordinary searches that don't use searchAfter, too.

If you want to reduce these, have a look at SearcherLifetimeManager,
which provides a way to keep "older points in time" open for
consistency:
https://blog.mikemccandless.com/2011/11/searcherlifetimemanager-prevents-broken.html
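
A minimal sketch of that pattern (how the token travels to and from the
client is up to you):

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherLifetimeManager;

SearcherLifetimeManager lifetimes = new SearcherLifetimeManager();

// When serving page 1: remember the point-in-time searcher and hand its token to the client.
long token = lifetimes.record(searcher);

// When serving the next page with the client's token:
IndexSearcher same = lifetimes.acquire(token);   // null if it has already been pruned
if (same != null) {
  try {
    // run searchAfter against 'same' so paging stays consistent across refreshes
  } finally {
    lifetimes.release(same);
  }
}

// Periodically drop old searchers, e.g. anything older than 10 minutes:
lifetimes.prune(new SearcherLifetimeManager.PruneByAge(600.0));
```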

On Sat, Sep 24, 2022 at 1:58 AM  wrote:
>
> I’ve never used searchAfter before so looking for some tips and hints.
>
> I understand that I need to maintain a server side cache with the relevant 
> ScoreDocs, right?
>
> The index is refreshed every couple of minutes. How will that affect the 
> cached ScoreDocs?
>
> I don’t mind too much having some inconsistencies but I want to make sure 
> that following queries will still return reasonable results.
>
>
>
> Thanks,
>
> Erel
>




Re: Lucene 9.2.0 build fails on Windows

2022-09-14 Thread Robert Muir
I opened an issue with one idea of how we can fix this, for
discussion: https://github.com/apache/lucene/issues/11772

On Wed, Sep 14, 2022 at 11:27 AM Uwe Schindler  wrote:
>
> Hi,
>
> do you have Microsoft Visual Studio installed? It looks like Gradle
> tries to detect it and fails with some NullPointerException while
> parsing a JSON file from its installation.
>
> The misc module contains some (optional) native code that will get
> compiled (optionally) with Visual C++. It looks like that breaks.
>
> I have no idea how to fix this. Dawid: Maybe we can also make the
> configuration of that native stuff only opt-in? So only detect Visual
> Studio when you actively activate native code compilation?
>
> Uwe
>
> Am 13.09.2022 um 21:00 schrieb Rahul Goswami:
> > Hi Dawid,
> > I believe you. Just that for some reason I have never been able to get it
> > to work on Windows. Also, being a complete newbie to gradle doesn't help
> > much. So would appreciate some help on this while I find my footing. Here
> > is the link to the diagnostics that you requested (since attachments/images
> > won't make it through):
> >
> > https://drive.google.com/file/d/15pt9Qt1H98gOvA5e0NrtY8YYHao0lgdM/view?usp=sharing
> >
> >
> > Thanks,
> > Rahul
> >
> > On Tue, Sep 13, 2022 at 1:18 PM Dawid Weiss  wrote:
> >
> >> Hi Rahul,
> >>
> >> Well, that's weird.
> >>
> >>> "releases/lucene/9.2.0"  -> Run "gradlew help"
> >>>
> >>> If you need additional stacktrace or other diagnostics I am happy to
> >>> provide the same.
> >> Could you do the following:
> >>
> >> 1) run: git --version so that we're on the same page as to what the
> >> git version is (I don't think this matters),
> >> 2) run: gradlew help --stacktrace
> >>
> >> Step (2) should provide the exact place that fails. Something is
> >> definitely wrong because I'm on Windows and it works for me like a
> >> charm.
> >>
> >> Dawid
> >>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>




Re: Lucene 9.2.0 build fails on Windows

2022-09-13 Thread Robert Muir
Looks to me like a gradle bug, detecting and trying to run some visual
studio command (vswhere.exe) elsewhere on your system, and it does the
wrong thing parsing its output.

On Tue, Sep 13, 2022 at 3:00 PM Rahul Goswami  wrote:
>
> Hi Dawid,
> I believe you. Just that for some reason I have never been able to get it
> to work on Windows. Also, being a complete newbie to gradle doesn't help
> much. So would appreciate some help on this while I find my footing. Here
> is the link to the diagnostics that you requested (since attachments/images
> won't make it through):
>
> https://drive.google.com/file/d/15pt9Qt1H98gOvA5e0NrtY8YYHao0lgdM/view?usp=sharing
>
>
> Thanks,
> Rahul
>
> On Tue, Sep 13, 2022 at 1:18 PM Dawid Weiss  wrote:
>
> > Hi Rahul,
> >
> > Well, that's weird.
> >
> > > "releases/lucene/9.2.0"  -> Run "gradlew help"
> > >
> > > If you need additional stacktrace or other diagnostics I am happy to
> > > provide the same.
> >
> > Could you do the following:
> >
> > 1) run: git --version so that we're on the same page as to what the
> > git version is (I don't think this matters),
> > 2) run: gradlew help --stacktrace
> >
> > Step (2) should provide the exact place that fails. Something is
> > definitely wrong because I'm on Windows and it works for me like a
> > charm.
> >
> > Dawid
> >




Re: Index corruption and repair

2022-04-29 Thread Robert Muir
The most helpful thing would be the full stacktrace of the exception.
This exception should chain the original exception and call site, which
may tell us more about the error you hit.

To me, it looks like a Windows-specific issue where the filesystem is
returning an unexpected error. So it would be helpful to see exactly
which error that is, and the full trace of where it comes from, to chase
it further.

On Thu, Apr 28, 2022 at 12:10 PM Antony Joseph
 wrote:
>
> Thank you for your reply.
>
> This isn't happening in a single environment. Our application is being used
> by various clients and this has been reported by multiple users - all of
> whom were running the earlier pylucene (v4.10) - without issues.
>
> One thing to mention is that our earlier version used Python 2.7.15 (with
> pylucene 4.10) and now we are using Python 3.8.10 with Pylucene 6.5.0 - the
> indexing logic is the same...
>
> One other thing to note is that the issue described has (so far!) only
> occurred on MS Windows - none of our Linux customers have complained about
> this.
>
> Any ideas?
>
> Regards,
> Antony
>
> On Thu, 28 Apr 2022 at 17:00, Adrien Grand  wrote:
>
> > Hi Anthony,
> >
> > This isn't something that you should try to fix programmatically,
> > corruptions indicate that something is wrong with the environment,
> > like a broken disk or corrupt RAM. I would suggest running a memtest
> > to check your RAM and looking at system logs in case they have
> > anything to tell about your disks.
> >
> > Can you also share the full stack trace of the exception?
> >
> > On Thu, Apr 28, 2022 at 10:26 AM Antony Joseph
> >  wrote:
> > >
> > > Hello,
> > >
> > > We are facing a strange situation in our application as described below:
> > >
> > > *Using*:
> > >
> > >- Python 3.8.10
> > >- Pylucene 6.5.0
> > >- Java 8 (1.8.0_181)
> > >- Runs on Linux and Windows (error seen on Windows)
> > >
> > > We suddenly get the following *error*:
> > >
> > > 2022-02-10 09:58:09.253215: ERROR : writer | Failed to get index
> > > (D:\i\202202) writer, Exception:
> > > org.apache.lucene.index.CorruptIndexException: Unexpected file read error
> > > while reading index.
> > >
> > (resource=BufferedChecksumIndexInput(MMapIndexInput(path="D:\i\202202\segments_fo")))
> > >
> > >
> > > After this, no further indexing happens - trying to open the index for
> > > writing throws the above error - and the index writer does not open.
> > >
> > > FYI, our code contains the following *settings*:
> > >
> > > index_path = "D:\i\202202"
> > > index_directory = FSDirectory.open(Paths.get(index_path))
> > > iconfig = IndexWriterConfig(wrapper_analyzer)
> > > iconfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND)
> > > iconfig.setRAMBufferSizeMB(16.0)
> > > writer = IndexWriter(index_directory, iconfig)
> > >
> > >
> > > *Repairing*
> > > We tried 'repairing' the index with the following command / tool:
> > >
> > > java -cp lucene-core-6.5.0.jar:lucene-backward-codecs-6.5.0.jar
> > > org.apache.lucene.index.CheckIndex "D:\i\202202" -exorcise
> > >
> > > This however returns saying "No problems found with the index."
> > >
> > >
> > > *Work around*
> > > We have to manually delete the problematic segment file:
> > > D:\i\202202\segments_fo
> > > after which the application starts again... until the next corruption. We
> > > can't spot a specific pattern.
> > >
> > >
> > > *Two questions:*
> > >
> > >1. Can we handle this situation programmatically, so that no manual
> > >intervention is needed?
> > >2. Any reason why we are facing the corruption issue in the first
> > place?
> > >
> > >
> > > Before this we were using Pylucene 4.10 and we didn't face this problem -
> > > the application logic is the same.
> > >
> > > Also, while the application runs on both Linux and Windows, so far we
> > have
> > > observed this situation only on various Windows platforms.
> > >
> > > Would really appreciate some assistance. Thanks in advance.
> > >
> > > Regards,
> > > Antony
> >
> >
> >
> > --
> > Adrien
> >




Re: How to handle corrupt Lucene index

2022-04-13 Thread Robert Muir
If you are looking at the files in hex, you can see the file format
docs online for your version:
https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/index/SegmentInfos.html
SegID is written right after SegName; it is 16 bytes (a 128-bit number).
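
If you'd rather not read it with a hex editor, here is a small sketch that
prints what a given commit references (the index path and the segments_93
generation are assumptions):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.StringHelper;

// Prints each segment name and its 128-bit unique id as referenced by segments_93.
try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"))) {
  SegmentInfos infos = SegmentInfos.readCommit(dir, "segments_93");
  for (SegmentCommitInfo sci : infos) {
    System.out.println(sci.info.name + " id=" + StringHelper.idToString(sci.info.getId()));
  }
}
```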

On Wed, Apr 13, 2022 at 10:59 PM Robert Muir  wrote:
>
> Honestly, the only time I've seen the mixed-up files before (and the
> motivation for the paranoid checks in Lucene) was bugs in some
> distributed replication code. In that case, code that was copying files
> across the network had some bugs (e.g. used hashing of file contents
> to try to reduce network chatter but didn't handle hash collisions
> properly). So it would actually most commonly happen for .si file
> simply because it is typically a tiny file and more likely to cause
> hash collisions in some distributed code doing that. This was the
> motivation for adding unique id to each segment and all files
> corresponding to that segment... basically as a library, we can't
> trust filenames to be what they claim.
>
> segments_N doesn't just reference your segments by names like _8w and
> _94, it also has each segment's unique ID, too. Would have to look at
> its file format to tell you how to see this with your hex editor. But
> in general, the segment unique ID is referenced everywhere, starting
> from segments_N. This way, when loading any index files for that
> segment (including *.si), lucene checks they have matching ID so that
> we know they really do belong to that segment. Because we can't trust
> filenames when users may manipulate them :)
>
> If the file really belongs to another segment (e.g. because files got
> mixed up), there's a clear error this way that files are mixed up.
> otherwise, without this check, you get pure insanity trying to debug
> problems when files get mixed up.
>
> On Wed, Apr 13, 2022 at 10:39 PM Tim Whittington  wrote:
> >
> > Using a known-broken Lucene index directory, I dropped down to the Lucene
> > API and tracked this down a bit further.
> >
> > My directory listing is this:
> >
> > 
> > 17 Mar 13:39 _8w.fdt
> > 17 Mar 13:39 _8w.fdx
> > 17 Mar 13:39 _8w.fnm
> > 17 Mar 13:39 _8w.nvd
> > 17 Mar 13:39 _8w.nvm
> > 17 Mar 13:39 _8w.si
> > 17 Mar 13:39 _8w_Lucene50_0.doc
> > 17 Mar 13:39 _8w_Lucene50_0.pos
> > 17 Mar 13:39 _8w_Lucene50_0.tim
> > 17 Mar 13:39 _8w_Lucene50_0.tip
> > 17 Mar 13:39 _8w_Lucene70_0.dvd
> > 17 Mar 13:39 _8w_Lucene70_0.dvm
> > 17 Mar 14:33 _8x.cfe
> > 17 Mar 14:33 _8x.cfs
> > 20 Mar 21:19 _8x.fdt
> > 20 Mar 21:19 _8x.fdx
> > 20 Mar 21:19 _8x.fnm
> > 20 Mar 21:19 _8x.nvd
> > 20 Mar 21:19 _8x.nvm
> > 20 Mar 21:19 _8x.si
> > 20 Mar 21:19 _8x_Lucene50_0.doc
> > 20 Mar 21:19 _8x_Lucene50_0.pos
> > 20 Mar 21:19 _8x_Lucene50_0.tim
> > 20 Mar 21:19 _8x_Lucene50_0.tip
> > 20 Mar 21:19 _8x_Lucene70_0.dvd
> > 20 Mar 21:19 _8x_Lucene70_0.dvm
> > 20 Mar 21:19 _8y.cfe
> > 20 Mar 21:19 _8y.cfs
> > 20 Mar 21:19 _8y.si
> > 20 Mar 21:19 _8z.cfe
> > 20 Mar 21:19 _8z.cfs
> > 20 Mar 21:19 _8z.si
> > 20 Mar 21:19 _90.cfe
> > 20 Mar 21:19 _90.cfs
> > 20 Mar 21:19 _90.si
> > 20 Mar 21:19 _91.cfe
> > 20 Mar 21:19 _91.cfs
> > 20 Mar 21:19 _91.si
> > 20 Mar 21:19 _92.cfe
> > 20 Mar 21:19 _92.cfs
> > 20 Mar 21:19 _92.si
> > 20 Mar 21:19 _93.cfe
> > 20 Mar 21:19 _93.cfs
> > 20 Mar 21:19 _93.si
> > 20 Mar 21:19 _94.cfe
> > 20 Mar 21:19 _94.cfs
> > 20 Mar 21:19 _94.si
> > 20 Mar 21:19 _95.cfe
> > 20 Mar 21:19 _95.cfs
> > 20 Mar 21:19 _95.si
> > 18 Mar 06:49 segments_93
> > 20 Mar 21:19 segments_96
> > 6 Mar 21:22 write.lock
> >
> > 
> >
> > When I load SegmentInfos for segments_96 directly, it succeeds, and I can
> > see it's referencing all the SegmentInfo except for _8w.
> > If I try to load SegmentInfos for segments_93, it gets past loading _8w and
> > fails on _8x.
> > Checking with a hex editor, segments_93 is referencing _8w ... _94 and
> > segments_96 is referencing _8x ... _95
> >
> > The IndexWriter failure is due to the IndexFileDeleter attempting to load
> > segments_93 to track referenced commit infos.
> >
> > Is this a state an IndexWriter could get the directory into, or does it
> > involve higher level interference (like copying files around)?
> >
> > Tim
> >
> > On Thu, 14 Apr 2022 at 13:20, Baris Kazar  wrote:
> >
> > > yes that is a great point to look at first and that would eliminate any
> > > jdbc related i

Re: How to handle corrupt Lucene index

2022-04-13 Thread Robert Muir
Honestly, the only time I've seen the mixed-up files before (and the
motivation for the paranoid checks in Lucene) was bugs in some
distributed replication code. In that case, code that was copying files
across the network had some bugs (e.g. used hashing of file contents
to try to reduce network chatter but didn't handle hash collisions
properly). So it would actually most commonly happen for .si file
simply because it is typically a tiny file and more likely to cause
hash collisions in some distributed code doing that. This was the
motivation for adding unique id to each segment and all files
corresponding to that segment... basically as a library, we can't
trust filenames to be what they claim.

segments_N doesn't just reference your segments by names like _8w and
_94, it also has each segment's unique ID, too. Would have to look at
its file format to tell you how to see this with your hex editor. But
in general, the segment unique ID is referenced everywhere, starting
from segments_N. This way, when loading any index files for that
segment (including *.si), lucene checks they have matching ID so that
we know they really do belong to that segment. Because we can't trust
filenames when users may manipulate them :)

If the file really belongs to another segment (e.g. because files got
mixed up), there's a clear error this way that files are mixed up.
otherwise, without this check, you get pure insanity trying to debug
problems when files get mixed up.

On Wed, Apr 13, 2022 at 10:39 PM Tim Whittington  wrote:
>
> Using a known-broken Lucene index directory, I dropped down to the Lucene
> API and tracked this down a bit further.
>
> My directory listing is this:
>
> 
> 17 Mar 13:39 _8w.fdt
> 17 Mar 13:39 _8w.fdx
> 17 Mar 13:39 _8w.fnm
> 17 Mar 13:39 _8w.nvd
> 17 Mar 13:39 _8w.nvm
> 17 Mar 13:39 _8w.si
> 17 Mar 13:39 _8w_Lucene50_0.doc
> 17 Mar 13:39 _8w_Lucene50_0.pos
> 17 Mar 13:39 _8w_Lucene50_0.tim
> 17 Mar 13:39 _8w_Lucene50_0.tip
> 17 Mar 13:39 _8w_Lucene70_0.dvd
> 17 Mar 13:39 _8w_Lucene70_0.dvm
> 17 Mar 14:33 _8x.cfe
> 17 Mar 14:33 _8x.cfs
> 20 Mar 21:19 _8x.fdt
> 20 Mar 21:19 _8x.fdx
> 20 Mar 21:19 _8x.fnm
> 20 Mar 21:19 _8x.nvd
> 20 Mar 21:19 _8x.nvm
> 20 Mar 21:19 _8x.si
> 20 Mar 21:19 _8x_Lucene50_0.doc
> 20 Mar 21:19 _8x_Lucene50_0.pos
> 20 Mar 21:19 _8x_Lucene50_0.tim
> 20 Mar 21:19 _8x_Lucene50_0.tip
> 20 Mar 21:19 _8x_Lucene70_0.dvd
> 20 Mar 21:19 _8x_Lucene70_0.dvm
> 20 Mar 21:19 _8y.cfe
> 20 Mar 21:19 _8y.cfs
> 20 Mar 21:19 _8y.si
> 20 Mar 21:19 _8z.cfe
> 20 Mar 21:19 _8z.cfs
> 20 Mar 21:19 _8z.si
> 20 Mar 21:19 _90.cfe
> 20 Mar 21:19 _90.cfs
> 20 Mar 21:19 _90.si
> 20 Mar 21:19 _91.cfe
> 20 Mar 21:19 _91.cfs
> 20 Mar 21:19 _91.si
> 20 Mar 21:19 _92.cfe
> 20 Mar 21:19 _92.cfs
> 20 Mar 21:19 _92.si
> 20 Mar 21:19 _93.cfe
> 20 Mar 21:19 _93.cfs
> 20 Mar 21:19 _93.si
> 20 Mar 21:19 _94.cfe
> 20 Mar 21:19 _94.cfs
> 20 Mar 21:19 _94.si
> 20 Mar 21:19 _95.cfe
> 20 Mar 21:19 _95.cfs
> 20 Mar 21:19 _95.si
> 18 Mar 06:49 segments_93
> 20 Mar 21:19 segments_96
> 6 Mar 21:22 write.lock
>
> 
>
> When I load SegmentInfos for segments_96 directly, it succeeds, and I can
> see it's referencing all the SegmentInfo except for _8w.
> If I try to load SegmentInfos for segments_93, it gets past loading _8w and
> fails on _8x.
> Checking with a hex editor, segments_93 is referencing _8w ... _94 and
> segments_96 is referencing _8x ... _95
>
> The IndexWriter failure is due to the IndexFileDeleter attempting to load
> segments_93 to track referenced commit infos.
>
> Is this a state an IndexWriter could get the directory into, or does it
> involve higher level interference (like copying files around)?
>
> Tim
>
> On Thu, 14 Apr 2022 at 13:20, Baris Kazar  wrote:
>
> > yes that is a great point to look at first and that would eliminate any
> > jdbc related issues that may lead to such problems.
> > Best regards
> > 
> > From: Tim Whittington 
> > Sent: Wednesday, April 13, 2022 9:17:44 PM
> > To: java-user@lucene.apache.org 
> > Subject: Re: How to handle corrupt Lucene index
> >
> > Thanks for this - I'll have a look at the database server code that is
> > managing the Lucene indexes and see if I can track it down.
> >
> > Tim
> >
> > On Thu, 14 Apr 2022 at 12:41, Robert Muir  wrote:
> >
> > > On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
> > >  wrote:
> > > >
> > > > I'm working with/on a database system that uses Lucene for full text
> > > > indexes (currently using 7.3.0).
> > > > We're enco

Re: How to handle corrupt Lucene index

2022-04-13 Thread Robert Muir
On Wed, Apr 13, 2022 at 8:24 PM Tim Whittington
 wrote:
>
> I'm working with/on a database system that uses Lucene for full text
> indexes (currently using 7.3.0).
> We're encountering occasional problems that occur after unclean shutdowns
> of the database , resulting in
> "org.apache.lucene.index.CorruptIndexException: file mismatch" errors when
> the IndexWriter is constructed.
>
> In all of the cases this has occurred, CheckIndex finds no issues with the
> Lucene index.
>
> The database has write-ahead-log and recovery facilities, so making the
> Lucene indexes durable wrt database operations is doable, but in this case
> the IndexWriter itself is failing to initialise, so it looks like there
> needs to be a lower-level validation/recovery operation before reconciling
> transactions can take place.
>
> Can anyone provide any advice about how the database can detect and recover
> from this situation?
>

"File mismatch" means files are getting mixed up. It is the equivalent
of swapping, say, /etc/hosts and /etc/passwd on your computer.

In your case you have a .si file (let's say it is named _79.si) that
really belongs to another segment (e.g. _42).

This isn't a Lucene issue; it is something else you must be using
that is "transporting files around", and it is mixing the files up.




Re: Java 17 and Lucene

2021-10-27 Thread Robert Muir
> >
> > > > > On Tue, Oct 19, 2021 at 5:07 AM Michael Sokolov 
> > > > > wrote:
> > > > >
> > > > > > > I would be a bit careful: On our Jenkins server running with AMD 
> > > > > > > Ryzen
> > > > CPU
> > > > > > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during 
> > > > > > tests
> > > > > and
> > > > > > stay unkillable (only a hard kill with" kill -9"). Previous Java
> > > > versions
> > > > > > don't hang. It happens not all the time (about 1/4th of all builds) 
> > > > > > and
> > > > > due
> > > > > > to the fact that the JVM is unresponsible it is not possible to get 
> > > > > > a
> > > > > stack
> > > > > > trace with "jstack". If you know a way to get the stack trace, I'd
> > > > happy
> > > > > to
> > > > > > get help.
> > > > > >
> > > > > > ooh that sounds scary. I suppose one could maybe get core dumps 
> > > > > > using
> > > > > > the right signal and debug that way? Oh wait you said only 9 works,
> > > > > > darn! How about attaching using gdb? Do we maintain GC logs for 
> > > > > > these
> > > > > > Jenkins builds? Maybe something suspicious would show up there.
> > > > > >
> > > > > > By the way the JDK is absolutely "responsible" in this situation! 
> > > > > > Not
> > > > > > responsive maybe ...
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 4:46 AM Uwe Schindler 
> > > wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > Hey,
> > > > > > > >
> > > > > > > > Our team at Amazon Product Search recently ran our internal
> > > > > benchmarks
> > > > > > with
> > > > > > > > JDK 17.
> > > > > > > > We saw a ~5% increase in throughput and are in the process of
> > > > > > > > experimenting/enabling it in production.
> > > > > > > > We also plan to test the new Corretto Generational Shenandoah 
> > > > > > > > GC.
> > > > > > >
> > > > > > > I would be a bit careful: On our Jenkins server running with AMD 
> > > > > > > Ryzen
> > > > CPU
> > > > > > it happens quite often that JDK 16, JDK 17 and JDK 18 hang during 
> > > > > > tests
> > > > > and
> > > > > > stay unkillable (only a hard kill with" kill -9"). Previous Java
> > > > versions
> > > > > > don't hang. It happens not all the time (about 1/4th of all builds) 
> > > > > > and
> > > > > due
> > > > > > to the fact that the JVM is unresponsible it is not possible to get 
> > > > > > a
> > > > > stack
> > > > > > trace with "jstack". If you know a way to get the stack trace, I'd
> > > > happy
> > > > > to
> > > > > > get help.
> > > > > > >
> > > > > > > Once I figured out what makes it hang, I will open issues in 
> > > > > > > OpenJDK
> > > > (I
> > > > > > am OpenJDK member/editor). I have now many stuck JVMs running to
> > > > analyze
> > > > > on
> > > > > > the server, so you're invited to help! At the moment, I have no 
> > > > > > time to
> > > > > > take care, so any help is useful.
> > > > > > >
> > > > > > > > On a side note, the Lucene codebase still uses the deprecated 
> > > > > > > > (as
> > > > of
> > > > > > > > JDK17) AccessController
> > > > > > > > in the RamUsageEstimator class.
> > > > > > > > We suppressed the warning for now (based on recommendations
> > > > > > > > <http://mail-archives.apache.org/mod_mbox/db-derby-
> > > > > > > >
> > > dev/202106.mbox/%3CJIRA.13369440.1617476525000.615331.16239514800
> > > > > > > > 5...@atlassian.jir

Re: Java 17 and Lucene

2021-10-18 Thread Robert Muir
We test different releases on different platforms (e.g. Linux, Windows, Mac).
We also test EA (Early Access) releases of openjdk versions during the
development process.
This finds bugs before they get released.

More information about versions/EA testing: https://jenkins.thetaphi.de/

On Mon, Oct 18, 2021 at 5:33 PM Kevin Rosendahl
 wrote:
>
> Hello,
>
> We are using Lucene 8 and planning to upgrade from Java 11 to Java 17. We
> are curious:
>
>- How lucene is testing against java versions. Are there correctness and
>performance tests using java 17?
>   - Additionally, besides Java 17, how are new Java releases tested?
>- Are there any other orgs using Java 17 with Lucene?
>- Any other considerations we should be aware of?
>
>
> Best,
> Kevin Rosendahl




Re: Search while typing (incremental search)

2021-10-06 Thread Robert Muir
TLDR: use the Lucene suggest/ package. Start by building a suggester
from your query logs (either from a file or by indexing them).
These have a lot of flexibility about how the matches happen: for
example pure prefixes, edit-distance typos, infix matching, analysis
chain, even now Japanese input-method integration :)

Run that suggester on the user input, retrieving, say, the top 5-10
relevant query suggestions.
Return those in the UI (typical autosuggest-type field), but also run
a search on the first one.

The user gets the instant-search experience, but when they type 'tes',
you search on 'tesla' (if that's the top-suggested query, the
highlighted one in the autocomplete). If they arrow down to another
suggestion such as 'test', or type a 't', or use the mouse or whatever,
then the process runs again and they see the results for that.

IMO for most cases this leads to a saner experience than trying to
rank all documents based on a prefix 'tes': the problem is there is
still too much query ambiguity, not really any "keywords" yet, so
trying to rank those documents won't be very useful. Instead you try
to "interact" with the user to present results in a useful way that
they can navigate.

On the other hand if you really want to just search on prefixes and
jumble up the results (perhaps because you are gonna just sort by some
custom document feature instead of relevance), then you can do that if
you really want. You can use the n-gram/edge-ngram/shingle filters in
the analysis package for that.
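
Going back to the recommended route, a bare-bones sketch of the suggester
approach; the directory path and queryLogIterator (an InputIterator over
your logged queries and weights) are placeholders:

```java
import java.nio.file.Paths;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.suggest.Lookup;
import org.apache.lucene.search.suggest.analyzing.AnalyzingInfixSuggester;
import org.apache.lucene.store.FSDirectory;

// Build once (or rebuild periodically) from your query logs, then look up on every keystroke.
AnalyzingInfixSuggester suggester =
    new AnalyzingInfixSuggester(FSDirectory.open(Paths.get("/var/data/suggest")), new StandardAnalyzer());
suggester.build(queryLogIterator);  // queryLogIterator: an InputIterator over logged queries

List<Lookup.LookupResult> top = suggester.lookup("tes", false, 5);
for (Lookup.LookupResult suggestion : top) {
  System.out.println(suggestion.key + " (weight=" + suggestion.value + ")");
}
// show all of them in the dropdown, and run the real search on top.get(0).key
```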

On Wed, Oct 6, 2021 at 5:37 PM Michael Wechner
 wrote:
>
> Hi
>
> I am trying to implement a search with Lucene similar to what for
> example various "Note Apps" (e.g. "Google Keep" or "Samsung Notes") are
> offering, that with every new letter typed a new search is being executed.
>
> For example when I type "tes", then all documents are being returned
> containing the word "test" or "tesla" and when I continue typing, for
> example "tesö" and there are no documents containing the string "tesö",
> then the app will tell me that there are no matches.
>
> I have found a couple of articles related to this kind of search, for
> example
>
> https://stackoverflow.com/questions/10828825/incremental-search-using-lucene
>
> https://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
>
> but would be great to know whether there exist other possibilities or
> what the best practice is?
>
> I am even not sure what the right term for this kind of search is, is it
> really "incremental search" or something else?
>
> Looking forward to your feedback and will be happy to extend the Lucene
> FAQ once I understand better :-)
>
> Thanks
>
> Michael
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: NRT readers and overall indexing/querying throughput

2021-08-08 Thread Robert Muir
On Tue, Aug 3, 2021 at 10:43 PM Alexander Lukyanchikov
 wrote:
>
> Maybe I have wrong expectations, and less frequent commits with NRT refresh
> were not intended to improve overall performance?
>
> Some details about the tests -
> Base implementation commits and refreshes a regular reader every second.
> NRT implementation commits every 60 seconds and refreshes NRT reader every
> second.

fyi: if you really want to delay the sync of the data for a long interval
such as 60 seconds, you may also have to modify the filesystem mount
options to get that.

For example, with ext4, it will do the sync itself every 5 seconds by
default, see 'commit=' option:
https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
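
For the pattern being compared in the quoted message (commit every 60 seconds, refresh the NRT reader every second), here is a bare-bones sketch; the intervals, the thread pool and the error handling are illustrative assumptions only.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;

public class NrtSchedulingSketch {
  public static void schedule(IndexWriter writer, SearcherManager manager) {
    ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);
    // cheap: make recent index changes visible to newly acquired searchers
    pool.scheduleAtFixedRate(() -> {
      try {
        manager.maybeRefresh();
      } catch (Exception e) {
        e.printStackTrace();
      }
    }, 1, 1, TimeUnit.SECONDS);
    // expensive: fsync files and write a new commit point for durability
    pool.scheduleAtFixedRate(() -> {
      try {
        writer.commit();
      } catch (Exception e) {
        e.printStackTrace();
      }
    }, 60, 60, TimeUnit.SECONDS);
  }
}
```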

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Tuning MoreLikeThis scoring algorithm

2021-05-28 Thread Robert Muir
See https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages
which has some broken nabble links, but is still valid.

TLDR: Scoring just doesn't work the way you think. Don't try to
interpret it as an absolute value, it is a relative one.

On Fri, May 28, 2021 at 1:36 PM TK Solr  wrote:
>
> I'd like to have suggestions on changing the scoring algorithm
> of MoreLikeThis.
>
> When I feed the identical string as the content of a document in the index
> to MoreLikeThis.like("field", new StringReader(docContent)),
> I get a score less than 1.0 (0.944 in one of my test cases) instead of the 1.0 that I expect.
>
> What is the easiest way to change this so that the score is 1.0 when
> all the terms in the query match all the terms of a document?
> The score should be less than 1.0 if the query contains only a part of the 
> terms
> from the document. (Needless to say, the score should also be less than 1.0
> if only part of the query terms are found in the document.)
>
> For my purpose, I don't need a sophisticated search relevancy technique
> like TF-IDF. I'd like it work faster/cheaper.
>
> I tried using BooleanSimilarity, but that ended up returning a score above 
> 1.0.
> Also the score is the same as long as all the terms in the query are matched.
> For example, querying "quick brown fox" and "quick brown" yield the same score
> against
> the doc that has the famous test string.
>
>
> TK
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene 8 causing app server threads to hang due to high rate of network usage

2021-04-28 Thread Robert Muir
Don't use filesystems such as NFS (that is what EFS is) with lucene! This
is really bad design, and it is the root cause of your issue.

On Tue, Apr 27, 2021 at 1:21 PM Hilston, Kathleen <
kathleen.hils...@snapon.com> wrote:

> Hello,
>
>
>
> My name is Kathleen Hilston, and I am a Software Engineer Sr working for
> Snap-on Business Solutions (SBS).
>
>
>
> We hope you can help us with a problem that we are facing.
>
>
>
> *Issue*: Lucene 8 causing app server threads to hang due to high rate of
> network usage.
>
>
>
> *Further details*: Recently we migrated from Lucene 7.5.0 to Lucene 8.6.3
> and we have encountered severe performance issues after this upgrade.  Our
> Lucene index has multilingual terms, is large in size, and is hosted on a
> network file storage (EFS at AWS).  Our Lucene queries construct a lot of
> Boolean term queries, and we suspect the off-heap FST introduced with
> Lucene 8 could be the root cause.  The specific issue we are facing after
> the Lucene upgrade is that, when a user searches for any given term, the
> tomcat server thread will hang while reading the bytes from an unexpectedly
> huge inbound flow of data from the Lucene Index on network storage.  We
> have seen inbound data flows ranging from 5% up to 45% of the total index
> size for a single search, primarily when searching for a term in a
> different language.  This issue does not occur with Lucene 7.
>
>
>
> Here is a typical call stack highlighting the point of contention in the
> Tomcat threads when we encounter this performance issue:
>
>
>
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:432)
>
> org.apache.lucene.search.IndexSearcher.searchAfter(IndexSearcher.java:421)
>
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:574)
>
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:445)
>
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:658)
>
> org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:330)
>
> org.apache.lucene.search.Weight.bulkScorer(Weight.java:181)
>
> org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:344)
>
>
> org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
>
>
> org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
>
>
> org.apache.lucene.search.BooleanWeight.scorerSupplier(BooleanWeight.java:379)
>
> org.apache.lucene.search.Weight.scorerSupplier(Weight.java:147)
>
> org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:115)
>
>
> org.apache.lucene.codecs.blocktree.SegmentTermsEnum.impacts(SegmentTermsEnum.java:1017)
>
>
> org.apache.lucene.codecs.lucene84.Lucene84PostingsReader.impacts(Lucene84PostingsReader.java:272)
>
>
> org.apache.lucene.codecs.lucene84.Lucene84PostingsReader$BlockImpactsDocsEnum.<init>(Lucene84PostingsReader.java:1061)
>
>
> org.apache.lucene.codecs.lucene84.Lucene84SkipReader.init(Lucene84SkipReader.java:103)
>
>
> org.apache.lucene.codecs.MultiLevelSkipListReader.init(MultiLevelSkipListReader.java:208)
>
>
> org.apache.lucene.codecs.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:229)
>
> org.apache.lucene.store.DataInput.readVLong(DataInput.java:190)
>
> org.apache.lucene.store.DataInput.readVLong(DataInput.java:205)
>
>
> org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:80)
>
> org.apache.lucene.store.ByteBufferGuard.getByte(ByteBufferGuard.java:99)
>
>
>
> When researching, we found the Lucene JIRA issue LUCENE-8635 (which is referenced
> in https://www.elastic.co/blog/whats-new-in-lucene-8 section ‘Moving the
> terms dictionary off-heap’).  Would this help the issue?
>
>
>
> Please advise.
>
>
>
> Thank you
>
>
>
> *Kathleen Hilston* | Software Engineer Sr
>
> Snap-on Business Solutions
>
>
>
>
> 4025 Kinross Lakes Parkway | Richfield, OH 44286
>
> Office: 330-659-1818
>
> kathleen.hils...@snapon.com
>
>
>
>
>


Re: CorruptIndexException after failed segment merge caused by No space left on device

2021-03-24 Thread Robert Muir
On Wed, Mar 24, 2021 at 1:41 AM Alexander Lukyanchikov <
alexanderlukyanchi...@gmail.com> wrote:

> Hello everyone,
>
> Recently we had a failed segment merge caused by "No space left on device".
> After restart, Lucene failed with the CorruptIndexException.
> The expectation was that Lucene automatically recovers in such
> case, because there was no succesul commit. Is it a correct assumption, or
> I am missing something?
> It would be great to know any recommendations to avoid such situations
> in future and be able to recover automatically after restart.
>

I don't think you are missing something. It should not happen.

Can you please open an issue: https://issues.apache.org/jira/projects/LUCENE

If you don't mind, please supply all relevant info you are able to provide
on the issue: OS, filesystem, JDK version, any hints as to how you are
using lucene (e.g. when you are committing / how you are indexing). There
are a lot of tests in lucene's codebase designed to simulate the disk full
condition and guarantee that stuff like this never happens, but maybe some
case is missing, or some other unknown bug is causing the missing files.

Thanks


Re: Incorrect CollectionStatistics if IndexWriter.close is not called

2021-03-03 Thread Robert Muir
Marc, you don't need to reindex to have fewer deletes and less impact from
this. Merging will get rid of the deletes.

If updates are coming in batches, you could consider calling
IndexWriter.forceMergeDeletes() after updating a batch to keep things
tidy.

Otherwise, if updates are coming in continuously at all hours, it gets
trickier, but you can still adjust things such as merge policy parameters
so that deletes are more aggressively merged away in a continuous fashion
(at the cost of increased merging of course): Look at stuff such as
setDeletesPctAllowed on TieredMergePolicy.
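
A small sketch of both knobs mentioned above; the 20% value and the analyzer are example choices, not recommendations from this thread.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;

public class DeletesTuningSketch {
  static IndexWriter openWriter(Directory dir) throws Exception {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setDeletesPctAllowed(20.0); // merge deletes away more aggressively than the default
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setMergePolicy(mergePolicy);
    return new IndexWriter(dir, config);
  }

  static void afterBatch(IndexWriter writer) throws Exception {
    // for batched updates: reclaim deleted docs once the batch has been applied
    writer.forceMergeDeletes();
  }
}
```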

On Tue, Mar 2, 2021 at 4:35 AM Marc F  wrote:

> I'm indexing medical documents, and documents on emerging topics are
> updated quite often.
> For example, right now, "COVID" will be overrepresented in my index
> because deleted documents are still counted, and then a "COVID" query
> will have a lower score than a query on a "unfashionable" topic,
> because the idf also takes into account the "number of documents
> containing term".
>
> I did not expect this behaviour but I can understand that it's needed
> for performance reasons and the only thing I can think of to have
> accurate scoring, it's to reindex my documents more often and I don't
> always have this luxury.
>
> Thanks for your replies :)
>
> Le lun. 1 mars 2021 à 20:47, Diego Ceccarelli (BLOOMBERG/ LONDON)
>  a écrit :
> >
> > I'm not sure that closing and opening the index writer will always work
> - I think the 'problem' will be solved once the segment with the deleted
> document  will be merged with another segment - that might happen during
> the close but might also *not* happen (e.g., if you have only one segment,
> and you delete, probably closing/opening won't fix).
> >
> > Can you describe your problem that you are trying to solve? why do you
> need such accuracy? if this is for some type of scoring the ranking
> shouldn't be affected if you have X or X-1 documents in the collection...
> >
> > Cheers,
> > diego
> >
> > From: java-user@lucene.apache.org At: 03/01/21 16:23:48To:  Diego
> Ceccarelli (BLOOMBERG/ LONDON ) ,  java-user@lucene.apache.org
> > Subject: Re: Incorrect CollectionStatistics if IndexWriter.close is not
> called
> >
> > Hi,
> >
> > You're right the documentation of Terms.getDocCount says that "this
> > measure does not take deleted documents into account".
> > So if we want correct counts and correct query scores, the IndexWriter
> > has to be closed after documents are deleted/updated and a new one has
> > to be created when new documents arrive.
> >
> > Thanks
> >
> > Le dim. 28 févr. 2021 à 17:04, Diego Ceccarelli (BLOOMBERG/ LONDON)
> >  a écrit :
> > >
> > > I *guess* it's due to the fact that the update is implemented as
> remove and
> > reinsert the document. Deletes in Lucene are lazy: the deleted document
> is just
> > flagged as deleted in a bitmap and then removed from the index only when
> > segments are merged.  Did you check IndexSearcher.collectionStatistic
> > documentation? it should mention something about that..
> > >
> > > Cheers,
> > > diego
> > >
> > >
> > > From: java-user@lucene.apache.org At: 02/28/21 11:09:52To:
> > java-user@lucene.apache.org
> > > Subject: Incorrect CollectionStatistics if IndexWriter.close is not
> called
> > >
> > > Hi,
> > >
> > > I don't understand if I'm doing something wrong or if it is the
> > > expected behaviour.
> > >
> > > My problem is when a document is updated the collectionStatistics
> > > returns counts as if a new document is added in the index, even after
> > > a call to IndexWriter.commit and to
> > > SearcherManager.maybeRefreshBlocking.
> > > If I call the IndexWriter.close, the counts are correct again, but the
> > > documentation of IndexWriter.close says to try to reuse the
> > > IndexWriter so I'm a bit confused.
> > >
> > > Ex:
> > > If I add two documents to an empty index
> > >
> > > IndexSearcher.collectionStatistics("TEXT")) returns
> > > "field="TEXT",maxDoc=2,docCount=2,sumTotalTermFreq=5,sumDocFreq=5" ->
> > > OK
> > >
> > > then I update one of the document and call commit()
> > >
> > > IndexSearcher.collectionStatistics("TEXT")) returns
> > > "field="TEXT",maxDoc=3,docCount=3,sumTotalTermFreq=9,sumDocFreq=9" ->
> > > NOK
> > >
> > > If I call close() now
> > >
> > > IndexSearcher.collectionStatistics("TEXT")) returns
> > > "field="TEXT",maxDoc=2,docCount=2,sumTotalTermFreq=6,sumDocFreq=6" ->
> > > OK
> > >
> > > Note that the counts are correct if the index contains only one
> document.
> > >
> > >
> > > I attached a test case.
> > >
> > > Am I doing something wrong somewhere?
> > >
> > >
> > > Julien
> > >
> > >
> > > 
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
> >
> > 

Re: BigIntegerPoint

2021-02-26 Thread Robert Muir
It was added to the sandbox originally (along with InetAddressPoint for ip
addresses) and just never graduated from there:
https://issues.apache.org/jira/browse/LUCENE-7043

The InetAddressPoint was moved to core because it seems pretty common that
people want to do range queries on IP hosts and so on. So it got love.

Not many people need 128-bit range queries I suppose?
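
For reference, the core sibling looks like this (the field name and addresses are made up); BigIntegerPoint in the sandbox module follows the same add/newRangeQuery pattern.

```java
import java.net.InetAddress;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.search.Query;

public class IpRangeSketch {
  static Document indexHost(String ip) throws Exception {
    Document doc = new Document();
    doc.add(new InetAddressPoint("host_ip", InetAddress.getByName(ip)));
    return doc;
  }

  static Query privateTenRange() throws Exception {
    // matches every host in 10.0.0.0 - 10.255.255.255
    return InetAddressPoint.newRangeQuery("host_ip",
        InetAddress.getByName("10.0.0.0"),
        InetAddress.getByName("10.255.255.255"));
  }
}
```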

On Fri, Feb 26, 2021 at 1:25 PM Michael Kleen  wrote:

> Hello,
>
> I am interested in using BigIntegerPoint. What is the reason that it is
> part of the sandbox ? Is it ready for use ?
>
> Many thanks,
>
> Michael
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
The preload isn't magical.
It only "reads in the whole file" to get it cached, same as if you did that
yourself with 'cat' or 'dd'.
It "warms" the file.

It just does this in an efficient way at the low level to make the warming
itself efficient. It madvise()s the kernel to announce some read-ahead and then
reads the first byte of every mmap'd page (which is enough to fault it in).

At the end of the day it doesn't matter if you wrote a shitty shell script
that uses 'dd' to read in each index file and send it to /dev/null, or
whether you spent lots of time writing fancy java code to call this preload
thing: you get the same result, same end state.

Maybe the preload takes 18 seconds to "warm" the index, vs. your crappy
shell script which takes 22 seconds. It is mainly more important for
servers and portability (e.g. it will work fine on windows, but obviously
will not call madvise).
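
If you do decide you want that warming from the Java side, the switch is roughly this (the path is made up; setPreload has to be called before readers open the files):

```java
import java.nio.file.Paths;
import org.apache.lucene.store.MMapDirectory;

public class PreloadSketch {
  static MMapDirectory openWarmed() throws Exception {
    // ask MMapDirectory to touch every page of the files it maps, up front;
    // same end state as warming the files with 'cat' or 'dd' before searching
    MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
    dir.setPreload(true);
    return dir;
  }
}
```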

On Tue, Feb 23, 2021 at 4:18 PM  wrote:

> Thanks again, Robert. Could you please explain "preload"? Which
> functionality is that? we discussed in this thread before about a preload.
>
> Is there a Lucene url / site that i can look at for preload?
>
> Thanks for the explanations. This thread will be useful for many folks i
> believe.
>
> Best regards
>
>
> On 2/23/21 4:15 PM, Robert Muir wrote:
>
>
>
> On Tue, Feb 23, 2021 at 4:07 PM  wrote:
>
>> What i want to achieve: Problem statement:
>>
>> base case is disk based Lucene index with FSDirectory
>>
>> speedup case was supposed to be in memory Lucene index with MMapDirectory
>>
> On 64-bit systems, FSDirectory just invokes MMapDirectory already. So you
> don't need to do anything.
>
> Either way MMapDirectory or NIOFSDirectory are doing the same thing:
> reading your index as a normal file and letting the operating system cache
> it.
> The MMapDirectory is just better because it avoids some overheads, such as
> read() system call, copying and buffering into java memory space, etc etc.
> Some of these overheads are only getting worse, e.g. spectre/meltdown-type
> fixes make syscalls 8x slower on my computer. So it is good that
> MMapDirectory avoids it.
>
> So I suggest just stop fighting the operating system, don't give your J2EE
> container huge amounts of ram, let the kernel do its job.
> If you want to "warm" a cold system because nothing is in kernel's cache,
> then look into preload and so on. It is just "reading files" to get them
> cached.
>
>


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
On Tue, Feb 23, 2021 at 4:07 PM  wrote:

> What i want to achieve: Problem statement:
>
> base case is disk based Lucene index with FSDirectory
>
> speedup case was supposed to be in memory Lucene index with MMapDirectory
>
On 64-bit systems, FSDirectory just invokes MMapDirectory already. So you
don't need to do anything.

Either way MMapDirectory or NIOFSDirectory are doing the same thing:
reading your index as a normal file and letting the operating system cache
it.
The MMapDirectory is just better because it avoids some overheads, such as
read() system call, copying and buffering into java memory space, etc etc.
Some of these overheads are only getting worse, e.g. spectre/meltdown-type
fixes make syscalls 8x slower on my computer. So it is good that
MMapDirectory avoids it.

So I suggest just stop fighting the operating system, don't give your J2EE
container huge amounts of ram, let the kernel do its job.
If you want to "warm" a cold system because nothing is in kernel's cache,
then look into preload and so on. It is just "reading files" to get them
cached.


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
speedup over what? You are probably already using MMapDirectory (it is the
default). So I don't know what you are trying to achieve, but giving lots
of memory to your java process is not going to help.

If you just want to prevent the first few queries to a fresh cold machine
instance from being slow, you can use the preload for that before you make
it available. You could also use 'cat' or 'dd'.

On Tue, Feb 23, 2021 at 3:45 PM  wrote:

> Thanks but then how will MMapDirectory help gain speedup?
>
> i will try tmpfs and see what happens. i was expecting to get on order of
> magnitude of speedup from already very fast on disk Lucene indexes.
>
> So i was expecting really really really fast response with MMapDirectory.
>
> Thanks
>
>
> On 2/23/21 3:40 PM, Robert Muir wrote:
>
> Don't give gobs of memory to your java process, you will just make things
> slower. The kernel will cache your index files.
>
> On Tue, Feb 23, 2021 at 1:45 PM  wrote:
>
>> Ok, but how is this MMapDirectory used then?
>>
>> Best regards
>>
>>
>> On 2/23/21 7:03 AM, Robert Muir wrote:
>> >
>> >
>> > On Tue, Feb 23, 2021 at 2:30 AM > > <mailto:baris.ka...@oracle.com>> wrote:
>> >
>> > Hi,-
>> >
>> >   I tried MMapDirectory and i allocated as big as index size on my
>> > J2EE
>> > Container but
>> >
>> >
>> > Don't allocate java heap memory for the index, MMapDirectory does not
>> > use java heap memory!
>>
>


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
Don't give gobs of memory to your java process, you will just make things
slower. The kernel will cache your index files.

On Tue, Feb 23, 2021 at 1:45 PM  wrote:

> Ok, but how is this MMapDirectory used then?
>
> Best regards
>
>
> On 2/23/21 7:03 AM, Robert Muir wrote:
> >
> >
> > On Tue, Feb 23, 2021 at 2:30 AM  > <mailto:baris.ka...@oracle.com>> wrote:
> >
> > Hi,-
> >
> >   I tried MMapDirectory and i allocated as big as index size on my
> > J2EE
> > Container but
> >
> >
> > Don't allocate java heap memory for the index, MMapDirectory does not
> > use java heap memory!
>


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2021-02-23 Thread Robert Muir
On Tue, Feb 23, 2021 at 2:30 AM  wrote:

> Hi,-
>
>   I tried MMapDirectory and i allocated as big as index size on my J2EE
> Container but
>
>
Don't allocate java heap memory for the index, MMapDirectory does not use
java heap memory!


Re: MMapDirectory vs In Memory Lucene Index (i.e., ByteBuffersDirectory)

2020-12-14 Thread Robert Muir
On Mon, Dec 14, 2020 at 1:59 PM Uwe Schindler  wrote:
>
> Hi,
>
> as the writer of the original blog post, here are my comments:
>
> Yes, MMapDirectory.setPreload() is the feature mentioned in my blog post
> to load everything into memory - but that does not guarantee anything!
> Still, I would not recommend using that function, because all it does is
> touch every page of the file, so the linux kernel puts it into OS cache
> - nothing more; IMHO very ineffective, as it slows down opening the index for a
> stupid for-each-page-touch loop. It will do this with EVERY page, whether it is
> later used or not! So this may take some time until it is done. Later on,
> Lucene still needs to open index files, initialize its own data
> structures,...
>
> In general it is much better to open the index with MMapDirectory and execute
> some "sample" queries. This will do exactly the same as the preload
> function, but it is more "selective". Parts of the index which are not used
> won't be touched, and on top, it will also load ALL the required index
> structures to heap.
>

The main purpose of this thing is a fast warming option for random
access files such as "i want to warm all my norms in RAM" or "i want
to warm all my docvalues in RAM"... really it should only be used with
the FileSwitchDirectory for a targeted purpose such as that: it is
definitely a waste to set it for your entire index. It is just
exposing the 
https://docs.oracle.com/javase/7/docs/api/java/nio/MappedByteBuffer.html#load()
which first calls madvise(MADV_WILLNEED) and then touches every page.
If you want to "warm" an ENTIRE very specific file for a reason like
this (e.g. per-doc scoring value, ensuring it will be hot for all
docs), it is hard to be more efficient than that.
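
A rough illustration of that targeted use, preloading only norms and doc values data files; the extension list is an assumption and depends on the codec version.

```java
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FileSwitchDirectory;
import org.apache.lucene.store.MMapDirectory;

public class TargetedPreloadSketch {
  static Directory open(Path path) throws Exception {
    Set<String> preloadExtensions = new HashSet<>();
    preloadExtensions.add("nvd"); // norms data
    preloadExtensions.add("dvd"); // doc values data
    MMapDirectory preloaded = new MMapDirectory(path);
    preloaded.setPreload(true);   // warm these files as they are opened
    MMapDirectory lazy = new MMapDirectory(path);
    // files with the listed extensions go to the preloading directory, the rest are mapped lazily
    return new FileSwitchDirectory(preloadExtensions, preloaded, lazy, true);
  }
}
```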

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Multi-IDF for a single term possible?

2019-12-03 Thread Robert Muir
it is enough to give each its own field.
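
Roughly like this; the "body_" naming scheme is just an illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PerTypeFieldSketch {
  // index the text under a field named after the doc type, e.g. "body_invoice"
  static Document build(String docType, String text) {
    Document doc = new Document();
    doc.add(new TextField("body_" + docType, text, Field.Store.NO));
    return doc;
  }

  // query the matching field, so IDF statistics come only from that doc type
  static Query query(String docType, String term) {
    return new TermQuery(new Term("body_" + docType, term));
  }
}
```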

On Tue, Dec 3, 2019 at 7:57 AM Adrien Grand  wrote:

> Is there any reason why you are not storing each DOC_TYPE in its own index?
>
> On Tue, Dec 3, 2019 at 1:50 PM Ravikumar Govindarajan
>  wrote:
> >
> > Hello,
> >
> > We are using TF-IDF for scoring (Yet to migrate to BM25). Different
> > entities (DOC_TYPES) are crunched & stored together in a single index.
> >
> > When it comes to IDF, I find that there is a single value computed across
> > documents & stored as part of TermStats, whereas our documents are not
> > homogeneous. So, a single IDF value doesn't work for us
> >
> > We would like to compute IDF for each <DOC_TYPE, term> pair, store it &
> > later use the paired-IDF values during query time. Is something like this
> > possible via Codecs or other mechanisms?
> >
> > Any help is much appreciated
> >
> > --
> > Ravi
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-02 Thread Robert Muir
I think instead of using range queries for custom pagination, it would
be best to simply sort by the field normally (with or without
searchAfter)
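
A rough sketch of that; the field name is invented, and the sort field needs to be indexed as a SortedDocValuesField for the STRING sort to work.

```java
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopFieldDocs;

public class SortedPaginationSketch {
  static void firstTwoPages(IndexSearcher searcher, Query query) throws Exception {
    Sort byName = new Sort(new SortField("lastname", SortField.Type.STRING));
    TopFieldDocs page1 = searcher.search(query, 20, byName);
    if (page1.scoreDocs.length == 20) {
      // hand the last hit of page 1 back as the cursor for page 2
      FieldDoc after = (FieldDoc) page1.scoreDocs[page1.scoreDocs.length - 1];
      TopDocs page2 = searcher.searchAfter(after, query, 20, byName);
      // ... render page2.scoreDocs
    }
  }
}
```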

On Tue, Apr 2, 2019 at 7:34 AM Stamatis Zampetakis  wrote:
>
> Thanks for the explanation regarding IntersectTermsEnum.next(). I
> understand better now.
>
> How about indexing a string field containing the firstname and lastname of
> a person or possibly its address.
> Values for these fields have variable length and usually they have more
> than 16 bytes. If my understanding is correct then they should (can) not be
> indexed as points.
> Range queries on this fields appear mostly when users implement custom
> pagination (without using searchAfter) on the result set.
>
> The alternative that I consider and seems to work well for the use-cases I
> mentioned is using doc values and SortedDocValuesField.newSlowRangeQuery.
>
>
>
>
>
> On Tue, Apr 2, 2019 at 3:01 PM, Robert Muir wrote:
>
> > Can you explain a little more about your use-case? I think that's the
> > biggest problem here for term range query. Pretty much all range
> > use-cases are converted over to space partitioned data structures
> > (Points) so it's unclear why this query would even be used for anything
> > serious.
> >
> > To answer your question, IntersectTermsEnum.next() doesn't iterate
> > over the whole index, just the term dictionary. But the lines are
> > blurred since if there's only one posting for a term, that single
> > posting is inlined directly into the term dictionary and often a
> > ton of terms fit into this category, so seeing 80% time in the term
> > dictionary wouldn't surprise me.
> >
> > On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis 
> > wrote:
> > >
> > > Thanks Robert and Uwe for your feedback!
> > >
> > > I am not sure what you mean that the slowness does not come from the
> > iteration over the term index.
> > > I did a small profiling (screenshot attached) of sending repeatedly a
> > TermRangeQuery that matches almost the whole index and I observe that 80%
> > of the time is spend on IntersectTermsEnum.next() method.
> > > Isn't this the method which iterates over the index?
> > >
> > >
> > >
> > > On Mon, Apr 1, 2019 at 6:57 PM, Uwe Schindler wrote:
> > >>
> > >> Hi again,
> > >>
> > >> > The problem with TermRangeQueries is actually not the iteration over
> > the
> > >> > term index. The slowness comes from the fact that all terms between
> > start
> > >> > and end have to be iterated and their postings be fetched and those
> > postings
> > >> > be merged together. If the "source of terms" for doing this is just a
> > simple
> > >> > linear iteration of all terms from/to or the automaton intersection
> > does not
> > >> > really matter for the query execution. The change to prefer the
> > automaton
> > >> > instead of a simple term iteration is just to allow further
> > optimizations, for
> > >> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
> > >>
> > >> So the above issue brings no further improvements (currently). So as
> > said before, a simple enumeration of all terms and fetching their
> > postings will be as fast. By using an automaton, the idea here is that the
> > term dictionary may have some "precalculated" postings list for prefix
> > terms. So instead of iterating over all terms and merging their postings,
> > the terms dictionary could return a "virtual term" that contains all
> > documents for a whole prefix and store the merged postings list in the
> > index file. This was not yet implemented but is the place to hook into. You
> > could create an improved BlockTermsDict implementation, that allows to get
> > a PostingsEnum for a whole prefix of terms.
> > >>
> > >> Not sure how much of that was already implemented by LUCENE-5879, but
> > it allows to do this. So here is where you could step in and improve the
> > terms dictionary!
> > >>
> > >> Uwe
> > >>
> > >> > Uwe
> > >> >
> > >> > -
> > >> > Uwe Schindler
> > >> > Achterdiek 19, D-28357 Bremen
> > >> > http://www.thetaphi.de
> > >> > eMail: u...@thetaphi.de
> > >> >
> > >> > > -Original Message-
> > >> > > From:

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-02 Thread Robert Muir
Can you explain a little more about your use-case? I think that's the
biggest problem here for term range query. Pretty much all range
use-cases are converted over to space partitioned data structures
(Points) so it's unclear why this query would even be used for anything
serious.

To answer your question, IntersectTermsEnum.next() doesn't iterate
over the whole index, just the term dictionary. But the lines are
blurred since if there's only one posting for a term, that single
posting is inlined directly into the term dictionary and often a
ton of terms fit into this category, so seeing 80% time in the term
dictionary wouldn't surprise me.

On Tue, Apr 2, 2019 at 8:03 AM Stamatis Zampetakis  wrote:
>
> Thanks Robert and Uwe for your feedback!
>
> I am not sure what you mean that the slowness does not come from the 
> iteration over the term index.
> I did a small profiling (screenshot attached) of sending repeatedly a 
> TermRangeQuery that matches almost the whole index and I observe that 80% of 
> the time is spend on IntersectTermsEnum.next() method.
> Isn't this the method which iterates over the index?
>
>
>
> On Mon, Apr 1, 2019 at 6:57 PM, Uwe Schindler wrote:
>>
>> Hi again,
>>
>> > The problem with TermRangeQueries is actually not the iteration over the
>> > term index. The slowness comes from the fact that all terms between start
>> > and end have to be iterated and their postings be fetched and those 
>> > postings
>> > be merged together. If the "source of terms" for doing this is just a 
>> > simple
>> > linear iteration of all terms from/to or the automaton intersection does 
>> > not
>> > really matter for the query execution. The change to prefer the automaton
>> > instead of a simple term iteration is just to allow further optimizations, 
>> > for
>> > more info see https://issues.apache.org/jira/browse/LUCENE-5879
>>
>> So the above issue brings no further improvements (currently). So as said 
>> before, a simple enumeration of all terms and fetching their postings
>> will be as fast. By using an automaton, the idea here is that the term 
>> dictionary may have some "precalculated" postings list for prefix terms. So
>> instead of iterating over all terms and merging their postings, the terms 
>> dictionary could return a "virtual term" that contains all documents for a 
>> whole prefix and store the merged postings list in the index file. This was 
>> not yet implemented but is the place to hook into. You could create an 
>> improved BlockTermsDict implementation, that allows to get a PostingsEnum 
>> for a whole prefix of terms.
>>
>> Not sure how much of that was already implemented by LUCENE-5879, but it 
>> allows to do this. So here is where you could step in and improve the terms 
>> dictionary!
>>
>> Uwe
>>
>> > Uwe
>> >
>> > -
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > http://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>> >
>> > > -Original Message-
>> > > From: Robert Muir 
>> > > Sent: Monday, April 1, 2019 6:30 PM
>> > > To: java-user 
>> > > Subject: Re: Clarification regarding BlockTree implementation of
>> > > IntersectTermsEnum
>> > >
>> > > The regular TermsEnum is really designed for walking terms in linear 
>> > > order.
>> > > it does have some ability to seek/leapfrog. But this means paths in a 
>> > > query
>> > > automaton that match no terms result in a wasted seek and cpu, because
>> > the
>> > > api is designed to return the next term after regardless.
>> > >
>> > > On the other hand the intersect() is for intersecting two automata: query
>> > > and index. Presumably it can also remove more inefficiencies than just 
>> > > the
>> > > wasted seeks for complex wildcards and fuzzies and stuff, since it can
>> > > "see" the whole input as an automaton. so for example it might be able to
>> > > work on blocks of terms at a time and so on.
>> > >
>> > > On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis
>> > 
>> > > wrote:
>> > >
>> > > > Yes it is used.
>> > > >
>> > > > I think there are simpler and possibly more efficient ways to 
>> > > > implement a
>> > > > TermRangeQuery and that is why I am looking into this.
>> > > &

Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-01 Thread Robert Muir
The regular TermsEnum is really designed for walking terms in linear order.
It does have some ability to seek/leapfrog. But this means paths in a query
automaton that match no terms result in a wasted seek and cpu, because the
api is designed to return the next term after regardless.

On the other hand the intersect() is for intersecting two automata: query
and index. Presumably it can also remove more inefficiencies than just the
wasted seeks for complex wildcards and fuzzies and stuff, since it can
"see" the whole input as an automaton. so for example it might be able to
work on blocks of terms at a time and so on.

On Mon, Apr 1, 2019, 12:17 PM Stamatis Zampetakis  wrote:

> Yes it is used.
>
> I think there are simpler and possibly more efficient ways to implement a
> TermRangeQuery and that is why I am looking into this.
> But I am also curious to understand what IntersectTermsEnum is supposed to
> do.
>
> On Mon, Apr 1, 2019 at 5:34 PM, Robert Muir wrote:
>
> > Is this IntersectTermsEnum really being used for term range query? Seems
> > like using a standard TermsEnum, seeking to the start of the range, then
> > calling next until the end would be easier.
> >
> > On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis 
> > wrote:
> >
> > > Hi all,
> > >
> > > I am currently working on improving the performance of range queries on
> > > strings. I've noticed that using TermRangeQuery with low-selective
> > queries
> > > is a very bad idea in terms of performance but I cannot clearly explain
> > why
> > > since it seems related with how the IntersectTermsEnum#next method is
> > > implemented.
> > >
> > > The Javadoc of the class says that the terms index (the burst-trie
> > > datastructure) is not used by this implementation of TermsEnum.
> However,
> > > when I see the implementation of the next method I get the impression
> > that
> > > this is not accurate. Aren't we using the trie structure to skip parts
> of
> > > the data when  the automaton states do not match?
> > >
> > > Can somebody provide a high-level intutition of what
> > > IntersectTermsEnum#next does? Initially, I thought that it is
> traversing
> > > the whole trie structure (skipping some branches when necessary) but I
> > may
> > > be wrong.
> > >
> > > Thanks in advance,
> > > Stamatis
> > >
> >
>


Re: Clarification regarding BlockTree implementation of IntersectTermsEnum

2019-04-01 Thread Robert Muir
Is this IntersectTermsEnum really being used for term range query? Seems
like using a standard TermsEnum, seeking to the start of the range, then
calling next until the end would be easier.

On Mon, Apr 1, 2019, 10:05 AM Stamatis Zampetakis  wrote:

> Hi all,
>
> I am currently working on improving the performance of range queries on
> strings. I've noticed that using TermRangeQuery with low-selective queries
> is a very bad idea in terms of performance but I cannot clearly explain why
> since it seems related with how the IntersectTermsEnum#next method is
> implemented.
>
> The Javadoc of the class says that the terms index (the burst-trie
> datastructure) is not used by this implementation of TermsEnum. However,
> when I see the implementation of the next method I get the impression that
> this is not accurate. Aren't we using the trie structure to skip parts of
> the data when  the automaton states do not match?
>
> Can somebody provide a high-level intutition of what
> IntersectTermsEnum#next does? Initially, I thought that it is traversing
> the whole trie structure (skipping some branches when necessary) but I may
> be wrong.
>
> Thanks in advance,
> Stamatis
>


Re: Getting Exception : java.nio.channels.ClosedByInterruptException

2019-04-01 Thread Robert Muir
Some code interrupted (Thread.interrupt) a java thread while it was
blocked on I/O. This is not safe to do with lucene, because
unfortunately in this situation java's NIO code closes file
descriptors and releases locks.

The second exception is because the indexwriter tried to write when it
no longer actually held the lock.

See the "NOTE" at the beginning of NIOFSDirectory and MMapDirectory's
javadocs for more information:
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/store/NIOFSDirectory.html

On Sun, Mar 31, 2019 at 3:04 PM Chellasamy G
 wrote:
>
> Hi All,
>
>
>
> I am committing an index periodically using a scheduler. On a rare case I got 
> the below exception in the committing thread,
>
>
>
> Lucene Version : 7.4.0
>
>
>
> java.nio.channels.ClosedByInterruptException
>
> at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>
> at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:314)
>
> at 
> org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:182)
>
> at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:67)
>
> at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:466)
>
> at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:772)
>
> at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4739)
>
> at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3281)
>
> at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3449)
>
> at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3414)
>
>
>
>
>
>
>
> After this exception I am getting the below exception subsequently while 
> trying to commit the index,
>
>
>
>
>
> org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an 
> external force: 
> NativeFSLock(path=C:\luc-index\write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807
>  exclusive invalid],creationTime=2019-03-29T10:40:15.601535Z)
>
> at 
> org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:178)
>
> at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
>
> at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:771)
>
> at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4739)
>
> at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3281)
>
> at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3449)
>
> at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3414)
>
>
>
>
>
> Has anybody faced this exception before and know the root cause of this issue?
>
>
>
>
>
> Thanks in Advance,
>
> Satyan

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: prorated early termination

2019-02-05 Thread Robert Muir
OK. Thanks for uploading the PR. I think you should definitely open an issue?

It's still worth looking at the API for this specific proposal, as this
may not return the exact top-N, correct? I think we need to be careful
and keep the user involved when it comes to inexact results; not everyone's
data might meet the expected distribution and maybe others would
rather have the exact results. And of course this part of the API (the
way the user specifies such things) is way more important than how
ugly the collector code is behind the scenes. When I was looking at
this topic years ago, I looked at basically an enum being passed to
IndexSearcher with 3 possible values: exact top N with full counts
(slowest), exact top N without full counts (faster), inexact top N
(fastest). When Adrien did the actual work, the enum seemed overkill
because only 2 of the possibilities were implemented, but maybe it's
worth revisiting?

On Tue, Feb 5, 2019 at 7:50 AM Michael Sokolov  wrote:
>
> Hi Robert - yeah this is a complex subject. I think there's room for some
> exciting improvement though. There is some discussion in LUCENE-8675 that
> is pointing out some potential API-level problems we may have to address in
> order to make the most efficient use of the segment structure in a sorted
> index. I think generally speaking we're trying to think of ways to search
> partial segments. I don't have concrete proposals for API changes at this
> point, but it's clear there are some hurdles to be grappled with. For
> example, Adrien's point about BKD trees having a high up-front cost points
> out one difficulty. If we want to search a single segment in multiple "work
> units" (whether threaded or not), we want a way to share that up-front cost
> without needing to pay it over again for each work unit. I think a similar
> problem also occurs with some other query types (MultiTerm can produce a
> bitset I believe?).
>
> As far as the specific (prorated early termination) proposal here .. this
> is something very specific and localized within TopFieldCollector that
> doesn't require any public-facing API change or refactoring at all. It just
> terminates a little earlier based on the segment distribution. Here's a PR
> so you can see what this is: https://github.com/apache/lucene-solr/pull/564
>
>
> On Mon, Feb 4, 2019 at 8:44 AM Robert Muir  wrote:
>
> > Regarding adding a threshold to TopFieldCollector, do you have ideas
> > on what it would take to fix the relevant collector/indexsearcher APIs
> > to make this kind of thing easier? (I know this is a doozie, but we
> > should at least try to think about it, maybe make some progress)
> >
> > I can see where things become less efficient in this parallel+sorted
> > case with large top N, but there are also many other "top k
> > algorithms" that could be better for different use cases. in your
> > case, if you throw out the parallel and just think about doing your
> > sorted case segment-by-segment, the current code there may be
> > inefficient too (not as bad, but still doesn't really take total
> > advantage of sortedness). Maybe we improve that case by scoring some
> > initial "range" of docs for each/some segments first, and then handle
> > any "tail". With a simple google search I easily find many ideas for
> > how this logic could work: exact and inexact, sorted and unsorted,
> > distributed (parallel) and sequential.  So I think there are probably
> > other improvements that could be done here, but worry about what the
> > code might look like if we don't refactor it.
> >
> > On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
> >  wrote:
> > >
> > > On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov 
> > wrote:
> > >
> > >  > > In single-threaded mode we can check against minCompetitiveScore and
> > > > terminate collection for each segment appropriately,
> > > >
> > > > > Does Lucene do this today by default?  That should be a nice
> > > > optimization,
> > > > and it'd be safe/correct.
> > > >
> > > > Yes, it does that today (in TopFieldCollector -- see
> > > >
> > > >
> > https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > > > )
> > > >
> > >
> > > Ahh -- great, thanks for finding that.
> > >
> > >
> > > > Re: our high cost of collection in static ranking phase -- that is
> > true,
> > > > Mike, but I do also see a nice improvement on the luceneutil benchmark
> > >

Re: prorated early termination

2019-02-04 Thread Robert Muir
Regarding adding a threshold to TopFieldCollector, do you have ideas
on what it would take to fix the relevant collector/indexsearcher APIs
to make this kind of thing easier? (I know this is a doozie, but we
should at least try to think about it, maybe make some progress)

I can see where things become less efficient in this parallel+sorted
case with large top N, but there are also many other "top k
algorithms" that could be better for different use cases. in your
case, if you throw out the parallel and just think about doing your
sorted case segment-by-segment, the current code there may be
inefficient too (not as bad, but still doesn't really take total
advantage of sortedness). Maybe we improve that case by scoring some
initial "range" of docs for each/some segments first, and then handle
any "tail". With a simple google search I easily find many ideas for
how this logic could work: exact and inexact, sorted and unsorted,
distributed (parallel) and sequential.  So I think there are probably
other improvements that could be done here, but worry about what the
code might look like if we don't refactor it.

On Sun, Feb 3, 2019 at 3:14 PM Michael McCandless
 wrote:
>
> On Sun, Feb 3, 2019 at 10:41 AM Michael Sokolov  wrote:
>
>  > > In single-threaded mode we can check against minCompetitiveScore and
> > terminate collection for each segment appropriately,
> >
> > > Does Lucene do this today by default?  That should be a nice
> > optimization,
> > and it'd be safe/correct.
> >
> > Yes, it does that today (in TopFieldCollector -- see
> >
> > https://github.com/msokolov/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java#L225
> > )
> >
>
> Ahh -- great, thanks for finding that.
>
>
> > Re: our high cost of collection in static ranking phase -- that is true,
> > Mike, but I do also see a nice improvement on the luceneutil benchmark
> > (modified to have a sorted index and collect concurrently) using just a
> > vanilla TopFieldCollector. I looked at some profiler output, and it just
> > seems to be showing more time spent walking postings.
> >
>
> Yeah, understood -- I think pro-rating the N collected per segment makes a
> lot of sense.
>
> Mike McCandless
>
> http://blog.mikemccandless.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IndexOptions & LongPoints

2018-09-18 Thread Robert Muir
On Tue, Sep 18, 2018 at 7:00 AM, Seth Utecht  wrote:
>
> My concern is that it seems like LongPoint's FieldType has an IndexOptions
> that is always NONE. It strikes me as odd, because we are in fact indexing
> and searching against these LongPoint fields.
>

Points fields don't create an inverted index: so there aren't
frequencies, positions, payloads, offsets, etc. That's why those
inverted index options do not apply.

Instead they build structures like kd-trees optimized for
range/spatial searching and so on.
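
For completeness, the usual pattern looks like this (the field name is illustrative; add a separate StoredField or NumericDocValuesField if you also need to retrieve or sort on the value):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

public class LongPointSketch {
  static Document withTimestamp(long epochMillis) {
    Document doc = new Document();
    doc.add(new LongPoint("timestamp", epochMillis)); // goes into the points (kd-tree) structure
    return doc;
  }

  static Query lastDay(long now) {
    // no postings/positions involved: this runs against the points index
    return LongPoint.newRangeQuery("timestamp", now - 86_400_000L, now);
  }
}
```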

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Search in lines, so need to index lines?

2018-08-01 Thread Robert Muir
http://man7.org/linux/man-pages/man1/grep.1.html

On Wed, Aug 1, 2018 at 7:01 AM, Gordin, Ira  wrote:
> Hi Tomoko,
>
> I need to search in many files and we use Lucene for this purpose.
>
> Thanks,
> Ira
>
> -Original Message-
> From: Tomoko Uchida 
> Sent: Wednesday, August 1, 2018 1:49 PM
> To: java-user@lucene.apache.org
> Subject: Re: Search in lines, so need to index lines?
>
> Hi Ira,
>
>> I am trying to implement regex search in file
>
> Why are you using Lucene for regular expression search?
> You can implement this by simply using java.util.regex package?
>
> Regards,
> Tomoko
>
> 2018年8月1日(水) 0:18 Gordin, Ira :
>
>> Hi Uwe,
>>
>> I am trying to implement regex search in file the same as in editors, in
>> Notepad++ for example.
>>
>> Thanks,
>> Ira
>>
>> -Original Message-
>> From: Uwe Schindler 
>> Sent: Tuesday, July 31, 2018 6:12 PM
>> To: java-user@lucene.apache.org
>> Subject: RE: Search in lines, so need to index lines?
>>
>> Hi,
>>
>> you need to create your own tokenizer that splits tokens on \n or \r.
>> Instead of using WhitespaceTokenizer, you can use:
>>
>> Tokenizer tok = CharTokenizer. fromSeparatorCharPredicate(ch -> ch=='\r'
>> || ch=='\n');
>>
>> But I would first think of how to implement the whole thing correctly.
>> Using a regular expression as "default" query is slow and does not look
>> correct. What are you trying to do?
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -Original Message-
>> > From: Gordin, Ira 
>> > Sent: Tuesday, July 31, 2018 4:08 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Search in lines, so need to index lines?
>> >
>> > Hi all,
>> >
>> > I understand Lucene knows to find query matches in tokens. For example
>> if I
>> > use WhiteSpaceTokenizer and I am searching with /.*nice day.*/ regular
>> > expression, I'll always find nothing. Am I correct?
>> > In my project I need to find matches inside lines and not inside words,
>> so I
>> > am considering to tokenize lines. How I should to implement this idea?
>> > I'll really appriciate you have more ideas/implementations.
>> >
>> > Thanks in advance,
>> > Ira
>> >
>>
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> --
> Tomoko Uchida

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: any example on FunctionScoreQuery since Field.setBoost is deprecated with Lucene 6.6.0

2018-07-31 Thread Robert Muir
Does this example help?
https://lucene.apache.org/core/7_4_0/expressions/org/apache/lucene/expressions/Expression.html
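
Roughly along the lines of that javadoc, assuming the per-document boost was indexed as a FloatDocValuesField named "boost" at index time (the field name and the formula are assumptions, not part of the question):

```java
import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.SimpleBindings;
import org.apache.lucene.expressions.js.JavascriptCompiler;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;

public class BoostByDocValueSketch {
  static Query boosted(Query original) throws Exception {
    SimpleBindings bindings = new SimpleBindings();
    bindings.add("_score", DoubleValuesSource.SCORES);                  // the wrapped query's score
    bindings.add("boost", DoubleValuesSource.fromFloatField("boost"));  // per-doc boost doc value
    Expression expr = JavascriptCompiler.compile("_score * boost");
    return new FunctionScoreQuery(original, expr.getDoubleValuesSource(bindings));
  }
}
```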


On Tue, Jul 31, 2018 at 3:56 PM,   wrote:
> The following page says:
>
> http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/document/Field.html#setBoost-float-
>
> setBoost
> @Deprecated
> public void setBoost(float boost)
> Deprecated. Index-time boosts are deprecated, please index index-time
> scoring factors into a doc value field and combine them with the score at
> query time using eg. FunctionScoreQuery.
> Sets the boost factor on this field.
> Throws:
> IllegalArgumentException - if this field is not indexed, or if it omits
> norms.
> See Also:
> boost()
>
>
> Any example on how to use FunctionScoreQuery? How will this affect the
> IndexWriter?
>
> Best regards
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: offsets

2018-07-31 Thread Robert Muir
The problem is not a performance one, its a complexity thing. Really I
think only the tokenizer should be messing with the offsets...
They are the ones actually parsing the original content so it makes
sense they would produce the pointers back to them.
I know there are some tokenfilters out there trying to be tokenizers,
but we don't need to make our apis crazy to support that.

On Mon, Jul 30, 2018 at 11:53 PM, Michael Sokolov  wrote:
> Yes, in fact Tokenizer already provides correctOffset which just delegates
> to CharFilter. We could expand on this, moving correctOffset up to
> TokenStream, and also adding correct() so that TokenFilters can add to the
> character offset data structure (two int arrays) and share it across the
> analysis chain.
>
> Implementation-wise this could continue to delegate to CharFilter I guess,
> but I think it would be better to add a character-offset-map abstraction
> that wraps the two int arrays and provides the correct/correctOffset
> methods to both TokenStream and CharFilter.
>
> This would let us preserve correct offsets in the face of manipulations
> like replacing ellipses, ligatures (like AE, OE), trademark symbols
> (replaced by "tm") and the like so that we can have the invariant that
> correctOffset(OffsetAttribute.startOffset) + CharTermAttribute.length() ==
> correctOffset(OffsetAttribute.endOffset), roughly speaking, and enable
> token-splitting with correct offsets.
>
> I can work up a proof of concept; I don't think it would be too
> API-intrusive or change performance in a significant way.  Only
> TokenFilters that actually care about this (ie that insert or remove
> characters, or split tokens) would need to change; others would continue to
> work as-is.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: offsets

2018-07-25 Thread Robert Muir
I think you see it correctly. Currently, only tokenizers can really
safely modify offsets, because only they have access to the correction
logic from the charfilter.

Doing it from a tokenfilter just means you will have bugs...

On Wed, Jul 25, 2018 at 8:27 AM, Michael Sokolov  wrote:
> I've run into some difficulties with offsets in some TokenFilters I've been
> writing, and I wonder if anyone can shed any light. Because characters may
> be inserted or removed by prior filters (eg ICUFoldingFilter does this with
> ellipses), and there is no offset-correcting data structure available to
> TokenFilters (as there is in CharFilter), there doesn't seem to be any
> reliable way to calculate the offset at a point interior to a token, which
> means that essentially the only reasonable thing to do with OffsetAttribute
> is to preserve the offsets from the input. This is means that filters that
> split their tokens (like WordDelimiterGraphFilter) have no reliable way of
> mapping their split tokens' offsets. One can try, but it seems inevitably
> to require making some arbitrary "fixup" stage in order to guarantee that
> the offsets are nondecreasing and properly bounded by the original text
> length.
>
> If this analysis is correct, it seems one should really never call
> OffsetAttribute.setOffset at all? Which makes it seem like a trappy kind of
> method to provide. (hmm now I see this comment in OffsetAttributeImpl
> suggesting making the method call-once). If that really is the case, I
> think some assertion, deprecation, or other API protection would be useful
> so the policy is clear.
>
> Alternatively, do we want to consider providing a "fixup" API as we have
> for CharFilter? OffsetAttribute, eg, could do the fixup if we provide an
> API for setting offset deltas. This would make more precise highlighting
> possible in these cases, at least. I'm not sure what other use cases folks
> have come up with for offsets?
>
> -Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
If you customized the rules, maybe have a look at
https://issues.apache.org/jira/browse/LUCENE-8366

The rules got simpler and we also updated the customization example
used for the factory's test.

On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov  wrote:
> Yes that sounds good -- this ConditionalTokenFilter is going to be very
> helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke
> around and see about incorporating the emoji rules from there.  Thanks
> Robert
>
> On Tue, Jul 3, 2018 at 9:28 AM Robert Muir  wrote:
>
>> > Any thoughts?
>>
>> best idea I have would be to tokenize with ICUTokenizer, which will
>> tag emoji sequences with the "<EMOJI>" token type, then use
>> ConditionalTokenFilter to send all tokens EXCEPT those with a token type
>> of "<EMOJI>" to your WordDelimiterFilter. This way
>> WordDelimiterFilter never sees the emoji at all and can't screw them
>> up.
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
> Any thoughts?

best idea I have would be to tokenize with ICUTokenizer, which will
tag emoji sequences as "" token type, then use
ConditionalTokenFilter to send all tokens EXCEPT those with token type
of "<EMOJI>" to your WordDelimiterFilter. This way
WordDelimiterFilter never sees the emoji at all and can't screw them
up.
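
Roughly, such a chain could look like this (a sketch only: it assumes the ConditionalTokenFilter added in Lucene 7.4, uses the default ICUTokenizer config rather than your customized rules, and the WDGF flags are just placeholders):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.miscellaneous.ConditionalTokenFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new ICUTokenizer();   // or your customized ICUTokenizer
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
              | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS;   // use whatever flags you use today
    TokenStream sink = new ConditionalTokenFilter(source,
        in -> new WordDelimiterGraphFilter(in, flags, null)) {
      private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

      @Override
      protected boolean shouldFilter() throws IOException {
        // everything EXCEPT emoji-typed tokens goes through WordDelimiterGraphFilter
        return "<EMOJI>".equals(typeAtt.type()) == false;
      }
    };
    return new TokenStreamComponents(source, sink);
  }
};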

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WordDelimiterGraphFilter swallows emojis

2018-07-03 Thread Robert Muir
On Tue, Jul 3, 2018 at 8:00 AM, Michael Sokolov  wrote:
> WDGF (and WordDelimiterFilter) treat emoji as "SUBWORD_DELIM" characters
> like punctuation and thus remove them, but we would like to be able to
> search for emoji and use this filter for handling dashes, dots and other
> intra-word punctuation.
>
> These filters identify non-word and non-digit characters by two mechanisms:
> direct lookup in a character table, and fallback to Unicode class. The
> character table can't easily be used to handle emoji since it would need to
> be populated with the entire Unicode character set in order to reach
> emoji-land. On the other hand, if we change the handling of emoji by class,
> and say treat them as word-characters, this will also end up pulling in all
> the other OTHER_SYMBOL characters as well. Maybe that's OK, but I think
> some of these other symbols are more like punctuation (this class is a grab
> bag of all kinds of beautiful dingbats like trademark, degrees-symbols, etc
> https://www.compart.com/en/unicode/category/So). On the other other hand,
> how do we even identify emoji? I don't think the Java Character API is
> adequate to the task. Perhaps we must incorporate a table.

There are several unicode properties for doing emoji (see e.g. unicode
segmentation algorithms, and tagging function in ICUTokenizer), but
it's not based on general category. Additionally, an emoji may not be
a single character but a sequence, so it's more involved than what
WordDelimiterFilter is really ready for. I also don't think we should
start storing/maintaining unicode property tables ourselves, if we
want to fix WordDelimiterFilter, it should just depend on ICU instead.

> Suppose we come up with a good way to classify emoji; then how should they
> be treated in this class? Sometimes they may be embedded in tokens with
> other characters: I see people using emoji and other symbols as part of
> their names, and sometimes they stand alone (with whitespace separation). I
> think one way forward here would be to treat these as a special class akin
> to words and numbers, and provide similar options (SPLIT_ON_EMOJI,
> CATENATE_EMOJI) as we have for those classes.
>
> Or maybe as a convenience, we provide a way to get a table that encodes the
> default classifications of all characters up to some given limit, and then
> let the caller modify it? That would at least provide an easy way to treat
> emoji as letters.

There is already a way to provide a table to this thing. But a
bigger issue is that word delimiter filter doesn't operate on unicode
codepoints, so I don't think you are gonna be able to do what you
want, since most emoji are not in the BMP. WordDelimiterFilter is
really only suitable for categorizing characters in the BMP, it just
doesn't split surrogates.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ICUFoldingFilter

2018-06-04 Thread Robert Muir
There may be a trap: e.g. if you make such a filter with UnicodeSet,
I think you really need to call .freeze() before passing it to this
thing. I have not examined the sources in a while, but I think this
might be similar to "compiling a regexp" in that you'll then get good
performance when it's later called millions of times.

If you use the factories, it will do this for you. But if you use the
API directly it is currently a bit of a performance trap...
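
For example, something along these lines (a sketch assuming the two-argument ICUFoldingFilter constructor and the NORMALIZER constant that LUCENE-8129 added; "tokenizer" is whatever tokenizer you already have):

import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.UnicodeSet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;

// build the set once (e.g. a static final) and freeze it, then reuse it for every TokenStream
UnicodeSet ignoreCaret = new UnicodeSet("[^\\^]").freeze();
Normalizer2 norm = new FilteredNormalizer2(ICUFoldingFilter.NORMALIZER, ignoreCaret);
TokenStream tok = new ICUFoldingFilter(tokenizer, norm);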

On Mon, Jun 4, 2018 at 2:49 PM, Michael Sokolov  wrote:
> Ah thanks! That's very good to know. As it is I realized we already have an
> earlier component where we can handle this (we have a custom ICUTokenizer
> rbbi and can just split on "^"). So many flexibility
>
> -Mike
>
> On Mon, Jun 4, 2018 at 10:53 AM, Robert Muir  wrote:
>
>> actually, you now can choose to ignore certain characters by using
>> unicode filtering mechanism.
>>
>> This was added in https://issues.apache.org/jira/browse/LUCENE-8129
>>
>> So apply a filter such as [^\^] and the filter will ignore ^.
>>
>> On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir  wrote:
>> > This cannot be "tweaked" at runtime, it is implemented as custom
>> normalization.
>> >
>> > You can modify the sources / build your own ruleset or use a different
>> > tokenfilter to normalize characters.
>> >
>> > On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov 
>> wrote:
>> >> Hi, I'm using ICUFoldingFilter and for the most part it does exactly
>> what I
>> >> want. However there are some behaviors I'd like to tweak. For example it
>> >> maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that,
>> and
>> >> whether there is any way to prevent it.
>> >>
>> >> I spent a little time with
>> >> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I
>> guess
>> >> is the basis for what this filter does (it's referenced in the
>> javadocs),
>> >> but that didn't answer my questions. As an aside, it seems this tech
>> report
>> >> was withdrawn by the unicode consortium? Not sure what that means if
>> >> anything, but it seems ominous.
>> >>
>> >> Anyway, I would appreciate pointers to more info, and specifically,
>> whether
>> >> there are any alternatives to the utr30.nrm data file, or any
>> possibility
>> >> to select among the many transformations this filter applies.
>> >>
>> >> Thanks!
>> >>
>> >> Mike S
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ICUFoldingFilter

2018-06-04 Thread Robert Muir
actually, you now can choose to ignore certain characters by using the
unicode filtering mechanism.

This was added in https://issues.apache.org/jira/browse/LUCENE-8129

So apply a filter such as [^\^] and the filter will ignore ^.

On Mon, Jun 4, 2018 at 10:41 AM, Robert Muir  wrote:
> This cannot be "tweaked" at runtime, it is implemented as custom 
> normalization.
>
> You can modify the sources / build your own ruleset or use a different
> tokenfilter to normalize characters.
>
> On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov  wrote:
>> Hi, I'm using ICUFoldingFilter and for the most part it does exactly what I
>> want. However there are some behaviors I'd like to tweak. For example it
>> maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that, and
>> whether there is any way to prevent it.
>>
>> I spent a little time with
>> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I guess
>> is the basis for what this filter does (it's referenced in the javadocs),
>> but that didn't answer my questions. As an aside, it seems this tech report
>> was withdrawn by the unicode consortium? Not sure what that means if
>> anything, but it seems ominous.
>>
>> Anyway, I would appreciate pointers to more info, and specifically, whether
>> there are any alternatives to the utr30.nrm data file, or any possibility
>> to select among the many transformations this filter applies.
>>
>> Thanks!
>>
>> Mike S

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ICUFoldingFilter

2018-06-04 Thread Robert Muir
This cannot be "tweaked" at runtime, it is implemented as custom normalization.

You can modify the sources / build your own ruleset or use a different
tokenfilter to normalize characters.

On Mon, Jun 4, 2018 at 9:07 AM, Michael Sokolov  wrote:
> Hi, I'm using ICUFoldingFilter and for the most part it does exactly what I
> want. However there are some behaviors I'd like to tweak. For example it
> maps "aaa^bbb" to "aaabbb". I am trying to understand why it does that, and
> whether there is any way to prevent it.
>
> I spent a little time with
> http://www.unicode.org/reports/tr30/tr30-4.html#UnicodeData which I guess
> is the basis for what this filter does (it's referenced in the javadocs),
> but that didn't answer my questions. As an aside, it seems this tech report
> was withdrawn by the unicode consortium? Not sure what that means if
> anything, but it seems ominous.
>
> Anyway, I would appreciate pointers to more info, and specifically, whether
> there are any alternatives to the utr30.nrm data file, or any possibility
> to select among the many transformations this filter applies.
>
> Thanks!
>
> Mike S

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [EXTERNAL] - Lucene 4.5.1 payload corruption - ArrayIndexOutOfBoundsException

2018-02-02 Thread Robert Muir
I agree that it may be a useful test to narrow the problem down.

But given that you have deleted docs, i'm not sure what conclusions
could be drawn from it, because lots of other changes will happen too
(e.g. docs, positions, etc will compress differently).

On Fri, Feb 2, 2018 at 9:19 AM, Tony Ma  wrote:
> Thanks Robert.
>
> We are not going to use merge to repair corrupted index, the issue we are 
> seeing is that as a segment is already got corrupted, but merges usually run 
> automatically in background, I am trying to know that when this scenario 
> occurs, will merge stop with an exception or will merge complete with a new 
> corrupted segment.
>
> To be specific, we got a corrupted segment with following check index output,
>   1 of 5: name=_0 docCount=8341939
> codec=Lucene45
> compound=false
> numFiles=48
> size (MB)=16,446.275
> diagnostics = {os=Windows Server 2008 R2, java.vendor=Oracle Corporation, 
> java.version=1.7.0_80, lucene.version=4.5.1 1533280 - mark - 2013-10-17 
> 21:37:01, mergeMaxNumSegments=5, os.arch=amd64, source=merge, mergeFactor=6, 
> timestamp=1514627603337, os.version=6.1}
> has deletions [delGen=130]
> test: open reader.OK [4022 deleted docs]
> test: fields..OK [268 fields]
> test: field norms.OK [3 fields]
> test: terms, freq, prox...ERROR: 
> java.lang.ArrayIndexOutOfBoundsException: 105
> java.lang.ArrayIndexOutOfBoundsException: 105
> at 
> org.apache.lucene.codecs.lucene41.ForUtil.readBlock(ForUtil.java:196)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.refillPositions(Lucene41PostingsReader.java:1284)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.skipPositions(Lucene41PostingsReader.java:1505)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.nextPosition(Lucene41PostingsReader.java:1548)
> at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:979)
> at 
> org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1232)
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:623)
> at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:372)
>
>
> In checkindex, it will first check each position (all pass) and then do a 
> skip test(fail), and corruption seems to appear at skiplist. I am wondering 
> at this special case, it is possible that merge reconstruct a new skiplist 
> because each position is fine.
>
> So that at least I can know this segment is newly corrupted one or it is 
> previous corrupted and merge to a new one.
>
>
> On 2/2/18, 9:58 PM, "Robert Muir"  wrote:
>
> IMO this is not something you want to do.
>
> The only remedy CheckIndex has for a corrupted segment is to drop it
> completely: and if you choose to do that then you lose all the
> documents in that segment. So its not very useful to merge it with
> other segments into bigger corrupted segments since it will just
> spread more corruption.
>
> On Fri, Feb 2, 2018 at 3:08 AM, Tony Ma  wrote:
> > Hi experts,
> >
> > A question to corrupted index. If an index segment is already 
> corrupted, can it be merged with another segment. Or it depends on where it 
> got corrupted, for example corrupted in .pay file?
> >
> > From: 马江 
> > Date: Friday, January 19, 2018 at 9:52 AM
> > To: "java-user@lucene.apache.org" 
> > Subject: Re: [EXTERNAL] - Lucene 4.5.1 payload corruption - 
> ArrayIndexOutOfBoundsException
> >
> > Hi experts,
> >
> > Still about this issue, is there any known bug which will cause payload 
> file corruption? The stack trace indicates that the first byte of input
> should be an Integer <= 32, but actually it is 110.
> > Our customers seeing this kind of corruption several times, and all of 
> the corruption is from payload. Is there any possibility that the bytes put 
> into payload being incompatible with payload codec?
> >
> >
> >   void readBlock(IndexInput in, byte[] encoded, int[] decoded) throws 
> IOException {
> > final int numBits = in.readByte();
> > assert numBits <= 32 : numBits;
> >
> > if (numBits == ALL_VALUES_EQUAL) {
> >   final int value = in.readVInt();
> >   Arrays.fill(decoded, 0, BLOCK_SIZE, value);
> >   return;
> > }
> >
> > final int encodedSize = encodedSizes[numBits];
> > in.readBytes(encoded, 0, encodedSize);
> >
>   

Re: [EXTERNAL] - Lucene 4.5.1 payload corruption - ArrayIndexOutOfBoundsException

2018-02-02 Thread Robert Muir
IMO this is not something you want to do.

The only remedy CheckIndex has for a corrupted segment is to drop it
completely: and if you choose to do that then you lose all the
documents in that segment. So its not very useful to merge it with
other segments into bigger corrupted segments since it will just
spread more corruption.

On Fri, Feb 2, 2018 at 3:08 AM, Tony Ma  wrote:
> Hi experts,
>
> A question to corrupted index. If an index segment is already corrupted, can 
> it be merged with another segment. Or it depends on where it got corrupted, 
> for example corrupted in .pay file?
>
> From: 马江 
> Date: Friday, January 19, 2018 at 9:52 AM
> To: "java-user@lucene.apache.org" 
> Subject: Re: [EXTERNAL] - Lucene 4.5.1 payload corruption - 
> ArrayIndexOutOfBoundsException
>
> Hi experts,
>
> Still about this issue, is there any known bug which will cause payload file 
> corruption? The stack trace indicates that the first byte of input should be
> an Integer <= 32, but actually it is 110.
> Our customers seeing this kind of corruption several times, and all of the 
> corruption is from payload. Is there any possibility that the bytes put into 
> payload being incompatible with payload codec?
>
>
>   void readBlock(IndexInput in, byte[] encoded, int[] decoded) throws 
> IOException {
> final int numBits = in.readByte();
> assert numBits <= 32 : numBits;
>
> if (numBits == ALL_VALUES_EQUAL) {
>   final int value = in.readVInt();
>   Arrays.fill(decoded, 0, BLOCK_SIZE, value);
>   return;
> }
>
> final int encodedSize = encodedSizes[numBits];
> in.readBytes(encoded, 0, encodedSize);
>
>
> From: 马江 
> Reply-To: "java-user@lucene.apache.org" 
> Date: Tuesday, January 16, 2018 at 11:16 AM
> To: "java-user@lucene.apache.org" 
> Subject: [EXTERNAL] - Lucene 4.5.1 payload corruption - 
> ArrayIndexOutOfBoundsException
>
> Hi experts,
>
> Recently one of our customer continuously seeing 
> ArrayIndexOutOfBoundsException which is thrown from Lucene.
>
> Our production is full-text search engine built on top of Lucene, following 
> is the stack traces. The customer saying that they can reproduce the issue 
> even after re-index everything from scratch.
>
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 110
> at 
> org.apache.lucene.codecs.lucene41.ForUtil.readBlock(ForUtil.java:196)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.refillPositions(Lucene41PostingsReader.java:1284)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.skipPositions(Lucene41PostingsReader.java:1505)
> at 
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum.nextPosition(Lucene41PostingsReader.java:1548)
> at 
> org.apache.lucene.search.spans.TermSpans.skipTo(TermSpans.java:82)
> at 
> org.apache.lucene.search.spans.SpanScorer.advance(SpanScorer.java:63)
> at 
> org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:69)
> at 
> org.apache.lucene.search.ConjunctionScorer.nextDoc(ConjunctionScorer.java:100)
> at org.apache.lucene.search.Scorer.score(Scorer.java:64)
> at 
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:627)
> at com.xhive.lucene.executor.f.a(xdb:158)
> at com.xhive.lucene.executor.f.search(xdb:145)
> at com.xhive.lucene.subpath.e.a(xdb:313)
> at com.xhive.lucene.subpath.e.a(xdb:264)
> at com.xhive.lucene.subpath.e.a(xdb:183)
> at com.xhive.lucene.executor.v.executeExternally(xdb:253)
> at com.xhive.kernel.ay.externalIndexExecute(xdb:2791)
> at 
> com.xhive.core.index.ExternalIndex.executeExternally(xdb:485)
> at com.xhive.core.index.XhiveMultiPathIndex.a(xdb:306)
> at com.xhive.xquery.pathexpr.v$a.ci(xdb:124)
> at com.xhive.xquery.pathexpr.ad$a.cp(xdb:104)
> at com.xhive.xquery.pathexpr.ax.awP(xdb:39)
> at com.xhive.xquery.pathexpr.ax.(xdb:32)
> at com.xhive.xquery.pathexpr.av.a(xdb:424)
> at com.xhive.xquery.pathexpr.al$a.awk(xdb:61)
> at com.xhive.xquery.pathexpr.ag.awj(xdb:28)
> at com.xhive.xquery.pathexpr.al.Xo(xdb:26)
> at com.xhive.xquery.pathexpr.aj.(xdb:33)
> at com.xhive.xquery.pathexpr.al.(xdb:20)
> at com.xhive.xquery.pathexpr.av.a(xdb:462)
> at com.xhive.xquery.pathexpr.av.a(xdb:413)
> at com.xhive.xquery.pathexpr.av.a(xdb:276)
> at com.xhive.xquery.pathexpr.av.a(xdb:220)
>
>
> ==
> following is CheckIndex output of corrupted segment. The full output is 
> attached.
>
>
> Checkin

Re: indexing performance 6.6 vs 7.1

2018-01-18 Thread Robert Muir
Erick I don't think solr was mentioned here.

On Thu, Jan 18, 2018 at 8:03 AM, Erick Erickson  wrote:
> My first question is always "are you running the Solr CPUs flat out?".
> My guess in this case is that the indexing client is the same and the
> problem is in Solr, but it's worth checking whether the clients are
> just somehow not delivering docs as fast as they were before.
>
> My suspicion is that the indexing client hasn't changed, but it's
> worth checking.
>
> Best,
> Erick
>
> On Thu, Jan 18, 2018 at 2:23 AM, Rob Audenaerde
>  wrote:
>> Hi all,
>>
>> We recently upgraded from Lucene 6.6 to 7.1.  We see a significant drop in
>> indexing performace.
>>
>> We have a-typical use of Lucene, as we (also) index some database tables
>> and add all the values as AssociatedFacetFields as well. This allows us to
>> create pivot tables on search results really fast.
>>
>> These tables have some overlapping columns, but also disjoint ones.
>>
>> We anticipated a decrease in index size because of the sparse docvalues. We
>> see this happening, with decreases to ~50%-80% of the original index size.
>> But we did not expect an drop in indexing performance (client systems
>> indexing time increased with +50% to +250%).
>>
>> (Our indexing-speed used to be mainly bound by the speed the Taxonomy could
>> deliver new ordinals for new values, currently we are investigating if this
>> is still the case, will report later when a profiler run has been done)
>>
>> Does anyone know if this increase in indexing time is to be expected as
>> result of the sparse docvalues change?
>>
>> Kind regards,
>>
>> Rob Audenaerde
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Help regarding BM25Similarity

2018-01-04 Thread Robert Muir
You don't need to do any subclassing for this: just pass parameter b=0
to the constructor.
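
For example (the per-field wrapper is just one way to limit it to certain fields, and "title" is a made-up field name):

import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

// b = 0 disables length normalization; k1 is left at its default of 1.2
Similarity similarity = new PerFieldSimilarityWrapper() {
  private final Similarity noLengthNorm = new BM25Similarity(1.2f, 0.0f);
  private final Similarity standard = new BM25Similarity();   // defaults: k1 = 1.2, b = 0.75

  @Override
  public Similarity get(String field) {
    return "title".equals(field) ? noLengthNorm : standard;
  }
};
// set it both on IndexWriterConfig and on IndexSearcher so norms and scoring agree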

On Thu, Jan 4, 2018 at 10:58 AM, Parit Bansal  wrote:
> Hi,
>
> I am trying to tweak BM25Similarity for my use case wherein, I want to avoid
> the effects of field-length normalization for certain fields (return a
> constant value irrespective of how long was the document). Currently, both
> computeWeight and computeNorm methods are defined final in BM25Similarity.
>
> In ClassicSimilarity, same was possible by overriding the lengthNorm method.
> Is there a way around in BM25Similarity? Is there a possibility to change it
> to non-final methods in new releases?
>
> - Best
>
> Parit Bansal
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to use Hunspell dictionary to do the reverse of stemming ?

2017-10-24 Thread Robert Muir
On Tue, Oct 24, 2017 at 11:04 AM, julien Blaize  wrote:
> Hello,
>
> I am looking for a way to efficiently do the reverse of stemming.
> Example : if i give to the program the verb "drug" it will give me
> "drugged', "drugging", "drugs", "drugstore" etc...

To generate the list up-front (for all words), maybe look at the morph
generation code here and modify to your needs:
https://github.com/hunspell/hunspell/blob/master/src/tools/analyze.cxx
Then maybe try adding this to a lucene SynonymMap which will store
this in an FST with deduplication etc and may be reasonably efficient
(it's just a large synonym dictionary at that point).
If you generate to wordnet or solr synonyms format there are already
parsers for those, so that may be easiest.
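
For example, the solr-format route could look roughly like this (file name, analyzer and the synonym line are all placeholders):

import java.io.FileReader;
import java.io.Reader;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.synonym.SolrSynonymParser;
import org.apache.lucene.analysis.synonym.SynonymMap;

SynonymMap loadGeneratedForms() throws Exception {
  // generated-forms.txt holds lines like: drug => drugged, drugging, drugs, drugstore
  SolrSynonymParser parser = new SolrSynonymParser(true, true, new WhitespaceAnalyzer());
  try (Reader r = new FileReader("generated-forms.txt")) {
    parser.parse(r);
  }
  return parser.build();
}
// then in the analysis chain: new SynonymGraphFilter(tokenStream, map, true)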

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Accent insensitive search for greek characters

2017-10-24 Thread Robert Muir
Your greek transform stuff does not work because you use "Lower"
instead of casefolding.

If ICUFoldingFilter works for what you want, but you want to restrict
it to greek, then just restrict it to the greek region. See
FilteredNormalizer2 and UnicodeSet documentation. And look at how
ICUFoldingFilter is implemented in source code so you understand how
to instantiate an equivalent ICUNormalizer2Filter just with the greek
restriction.
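
A rough sketch of that (the utr30 instantiation mirrors what ICUFoldingFilter's own source does; "tokenizer" is whatever tokenizer you use, and [:Greek:] is just one possible restriction set):

import com.ibm.icu.text.FilteredNormalizer2;
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.UnicodeSet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUFoldingFilter;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;

// the same "utr30" data that ICUFoldingFilter loads internally
Normalizer2 utr30 = Normalizer2.getInstance(
    ICUFoldingFilter.class.getResourceAsStream("utr30.nrm"), "utr30", Normalizer2.Mode.COMPOSE);

// restrict folding to Greek only; freeze() the set so it is cheap to reuse
UnicodeSet greekOnly = new UnicodeSet("[:Greek:]").freeze();
Normalizer2 greekFolding = new FilteredNormalizer2(utr30, greekOnly);

TokenStream tok = new ICUNormalizer2Filter(tokenizer, greekFolding);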

On Tue, Oct 24, 2017 at 8:16 AM, Chitra  wrote:
> Hi,
>ICUTransformFilter is working fine for greek characters
> alone as per requirement. but one case it's breaking( σ & ς are the lower
> forms of Σ Sigma).
>
> *Example:*
>
> I indexed the terms πελάτης (indexed as πελατης) & πελάτηΣ (indexed as
> πελατης).I get the expected search results if I perform the search for
> πελάτηΣ (or) πελάτης (or) any combinations of upper case & lower case Greek
> characters. But if I search as πελατησ I won't get any search results.
>
> In Greek, σ & ς are the lower forms of Σ Sigma. And this case is solved in
> ICUFoldingFilter.
>
>
> Is ICU Transliterator rule formed right? Kindly look at the below code
>
>
> TokenStream tok = new ICUTransformFilter(tok,
> Transliterator.getInstance("Greek;
>> Lower; NFD; [:Nonspacing Mark:] Remove; NFC;"));
>
>
>
> Kindly help me to resolve this.
>
>
> Regards,
> Chitra

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: ClassicAnalyzer Behavior on accent character

2017-10-19 Thread Robert Muir
easy, don't use classictokenizer: use standardtokenizer instead.
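
Concretely, against the 4.x-style code quoted below, that roughly means swapping in the standard classes (constructors per that same 4.x API, so double-check against your version):

final StandardTokenizer src = new StandardTokenizer(getVersion(), reader);
src.setMaxTokenLength(StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);

TokenStream tok = new StandardFilter(getVersion(), src);
tok = new LowerCaseFilter(getVersion(), tok);
tok = new StopFilter(getVersion(), tok, stopwords);
tok = new ASCIIFoldingFilter(tok);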

On Thu, Oct 19, 2017 at 9:37 AM, Chitra  wrote:
> Hi,
>   I indexed a term 'ⒶeŘꝋꝒɫⱯŋɇ' (aeroplane) and the term was
> indexed as "er l n", some characters were trimmed while indexing.
>
> Here is my code
>
> protected Analyzer.TokenStreamComponents createComponents(final String
>> fieldName, final Reader reader)
>> {
>> final ClassicTokenizer src = new ClassicTokenizer(getVersion(),
>> reader);
>> src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>>
>> TokenStream tok = new ClassicFilter(src);
>> tok = new LowerCaseFilter(getVersion(), tok);
>> tok = new StopFilter(getVersion(), tok, stopwords);
>> tok = new ASCIIFoldingFilter(tok); // to enable AccentInsensitive
>> search
>>
>> return new Analyzer.TokenStreamComponents(src, tok)
>> {
>> @Override
>> protected void setReader(final Reader reader) throws
>> IOException
>> {
>>
>> src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>> super.setReader(reader);
>> }
>> };
>> }
>
>
>
> Am I missing anything? Is that expected behavior for my input or any reason
> behind such abnormal behavior?
>
> --
> Regards,
> Chitra

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to regulate native memory?

2017-08-30 Thread Robert Muir
From the lucene side, it only uses file mappings for reads and doesn't
allocate any anonymous memory.
The way lucene uses cache for reads won't impact your OOM
(http://www.linuxatemyram.com/play.html)

At the end of the day you are running out of memory on the system
either way, and your process might just look like a large target based
for the oom-killer, but that doesn't mean its necessarily your problem
at all.

I advise sticking with basic operating system tools like /proc and
free -m, reproduce the OOM kill situation, just like in that example
link above, and try to track down the real problem.


On Wed, Aug 30, 2017 at 11:43 PM, Erik Stephens
 wrote:
> Yeah, apologies for that long issue - the netty comments aren't related.  My 
> two comments near the end might be more interesting here:
>
> 
> https://github.com/elastic/elasticsearch/issues/26269#issuecomment-326060213
>
> To try to summarize, I looked to `/proc/$pid/smaps | grep indices` to 
> quantify what I think is mostly lucene usage.  Is that an accurate way to 
> quantify that?  It shows 51G with `-XX:MaxDirectMemorySize=15G`.  The heap is 
> 30G and the resident memory is reported as 82.5G.  That makes a bit of sense: 
> 30G + 51G + miscellaneous.
>
> `top` reports roughly 51G as shared which is suspiciously close to what I'm 
> seeing in /proc/$pid/smaps. Is it correct to think that if a process requests 
> memory and there is not enough "free", then the kernel will purge from its 
> cache in order to allocate that requested memory?  I'm struggling to see how 
> the kernel thinks there isn't enough free memory when so much is in its 
> cache, but that concern is secondary at this point.  My primary concern is 
> trying to regulate the overall footprint (shared with file system cache or 
> not) so that OOM killer not even part of the conversation in the first place.
>
> # grep Vm /proc/$pid/status
> VmPeak: 982739416 kB
> VmSize: 975784980 kB
> VmLck: 0 kB
> VmPin: 0 kB
> VmHWM:  86555044 kB
> VmRSS:  86526616 kB
> VmData: 42644832 kB
> VmStk:   136 kB
> VmExe: 4 kB
> VmLib: 18028 kB
> VmPTE:275292 kB
> VmPMD:  3720 kB
> VmSwap:0 kB
>
> # free -g
>   totalusedfree  shared  buff/cache   
> available
> Mem:125  54   1   1  69  
> 69
> Swap: 0   0   0
>
> Thanks for the reply!  Apologies if not apropos to this forum - just working 
> my way down the rabbit hole :)
>
> --
> Erik
>
>
>> On Aug 30, 2017, at 8:04 PM, Robert Muir  wrote:
>>
>> Hello,
>>
>> From the thread linked there, its not clear to me the problem relates
>> to lucene (vs being e.g. a bug in netty, or too many threads, or
>> potentially many other problems).
>>
>> Can you first try to determine to breakdown your problematic "RSS"
>> from the operating system? Maybe this helps determine if your issue is
>> with an anonymous mapping (ByteBuffer.allocateDirect) or file mapping
>> (FileChannel.map).
>>
>> WIth recent kernels you can break down RSS with /proc/pid/XXX/status
>> (RssAnon vs RssFile vs RssShmem):
>>
>>http://man7.org/linux/man-pages/man5/proc.5.html
>>
>> If your kernel is old you may have to go through more trouble (summing
>> up stuff from smaps or whatever)
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to regulate native memory?

2017-08-30 Thread Robert Muir
Hello,

From the thread linked there, it's not clear to me the problem relates
to lucene (vs being e.g. a bug in netty, or too many threads, or
potentially many other problems).

Can you first try to break down your problematic "RSS"
from the operating system? Maybe this helps determine if your issue is
with an anonymous mapping (ByteBuffer.allocateDirect) or file mapping
(FileChannel.map).

With recent kernels you can break down RSS with /proc/pid/XXX/status
(RssAnon vs RssFile vs RssShmem):

http://man7.org/linux/man-pages/man5/proc.5.html

If your kernel is old you may have to go through more trouble (summing
up stuff from smaps or whatever)

On Wed, Aug 30, 2017 at 9:58 PM, Erik Stephens  wrote:
> Our elasticsearch processes have been slowly consuming memory until a kernel 
> OOM kills it.  Details are here:
>
> https://github.com/elastic/elasticsearch/issues/26269 
> 
>
> To summarize:
>
> - Explicit GC is enabled
> - MaxDirectMemorySize is set
> - Total memory usage for the process is roughly heap (30G) + mmap'd 
> (unbounded) + 1-2G (meta space, threads, etc)
>
> The crowd is suggesting "Don't worry. You want to use all that memory."  I 
> understand that sentiment except for:
>
> - The process eventually gets OOM killed.
> - I need to support multiple processes on same machine and need a more 
> predictable footprint.
>
> It seems to be relatively common knowledge that direct byte buffers require a 
> GC to trigger their freedom.  However, full GC's are happening but not 
> resulting in a reduction of resident mmap'd memory.
>
> Any pointers to source code, settings, or tools are much appreciated.  Thanks!
>
> --
> Erik
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Automata and Transducer on Lucene 6

2017-04-18 Thread Robert Muir
On Tue, Apr 18, 2017 at 5:16 PM, Michael McCandless
 wrote:

>
> +1 to use the tests to learn how things work; I don't know of any guide /
> high level documentation for these low level classes, sorry.  Maybe write
> it up yourself and set it free somewhere online ;)

For the FST case there are some simple examples in the package level
docs, but i think the layout of javadocs html does not make this
obvious and they can be missed. Maybe they help to get started with /
can be fixed if they are out of date.

See 
http://lucene.apache.org/core/6_5_0/core/org/apache/lucene/util/fst/package-summary.html
(scroll to the bottom).
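
The gist of the example there is roughly this (for 6.x; double-check against the docs for your exact version, since these low-level APIs move around):

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

String[] inputValues = {"cat", "dog", "dogs"};   // inputs must be added in sorted order
long[] outputValues = {5, 7, 12};

PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
BytesRefBuilder scratchBytes = new BytesRefBuilder();
IntsRefBuilder scratchInts = new IntsRefBuilder();
for (int i = 0; i < inputValues.length; i++) {
  scratchBytes.copyChars(inputValues[i]);
  builder.add(Util.toIntsRef(scratchBytes.get(), scratchInts), outputValues[i]);
}
FST<Long> fst = builder.finish();

Long value = Util.get(fst, new BytesRef("dog"));   // 7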

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Altering Term Frequency in Similarity

2016-12-15 Thread Robert Muir
Maybe have a look at SynonymQuery:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/SynonymQuery.java

I think it does a similar thing to what you want, it sums up the
frequencies of the synonyms and passes that sum to the similarity
class as TF.
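
For a single field that is as simple as (field and terms made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;

// the terms are scored as one: their term frequencies are summed per document,
// and the maximum docFreq among them is used, before handing off to the Similarity
Query q = new SynonymQuery(new Term("body", "tv"), new Term("body", "television"));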

On Wed, Dec 14, 2016 at 2:27 PM, Mossaab Bagdouri
 wrote:
> Hi,
>
>
> I'm using Lucene 6.3.0, and trying to handle synonyms at query time.
>
> I think I've handled DF correctly with BlendedTermQuery (by returning the
> max DF of the synonyms). TTF is also handled by the same class.
>
> Now, I want to handle the term frequency. As far as I can tell, raw TF is
> given to the similarity class by score(int doc, float freq). Which class
> does provide that freq? Or what can I change to provide a different freq
> value, practically changing the document representation (e.g., freq[0] =
> freq[0] + freq[1]; freq[1] = 0);
>
> Regards,
> Mossaab

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Segment Corruption - ForUtil.readBlock AIOBE

2016-08-08 Thread Robert Muir
Can you run checkindex and include the output?
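
Something like (adjust the jar and index path to your install):

java -cp lucene-core-4.6.1.jar org.apache.lucene.index.CheckIndex /path/to/index/dir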

On Mon, Aug 8, 2016 at 2:36 AM, Ravikumar Govindarajan
 wrote:
> For some of the segments we received the following exception during merge
> as well as search. They look to be corrupt [Lucene 4.6.1 & Sun JDK
>  1.7.0_55]
>
> Is this a known bug? Any help is much appreciated
>
> The offending line of code is in ForUtil.readBlock() method...
>
> *final int encodedSize = encodedSizes[numBits];*
>
> java.lang.ArrayIndexOutOfBoundsException: 34 at
> org.apache.lucene.codecs.lucene41.ForUtil.readBlock(ForUtil.java:201) at
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.refillDocs(Lucene41PostingsReader.java:411)
> at
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsEnum.advance(Lucene41PostingsReader.java:536)
> at org.apache.lucene.search.TermScorer.advance(TermScorer.java:85) at
> org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:82)
> at
> org.apache.lucene.search.ConjunctionScorer.nextDoc(ConjunctionScorer.java:100)
> at
> org.apache.lucene.search.ConjunctionScorer.nextDoc(ConjunctionScorer.java:99)
> at org.apache.lucene.search.Scorer.score(Scorer.java:64)
>
> The numBits value also looks to be varied across different corrupt segments
>
> --
> Ravi

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problems with Lat / Long searches at minimum and maximum latitude and longitude

2016-06-12 Thread Robert Muir
See this part of the documentation:
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/geo/Polygon.java#L30

APIs take newPolygonQuery(Polygon...) which is treated efficiently as
a "multipolygon".

This is also what many standards (e.g. geojson) recommend, otherwise
the polygon can be ambiguous.

You can see this visually for examples like this one:
http://www.geonames.org/2017370/russian-federation.html
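
So for the box in your example, split it at the dateline and pass both halves (a sketch; "location" is a placeholder field name):

import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.geo.Polygon;
import org.apache.lucene.search.Query;

// eastern half: lon 179.0 .. 180.0
Polygon east = new Polygon(
    new double[] {-1.0, 1.0, 1.0, -1.0, -1.0},         // lats (first point == last point)
    new double[] {179.0, 179.0, 180.0, 180.0, 179.0}); // lons
// western half: lon -180.0 .. -179.0
Polygon west = new Polygon(
    new double[] {-1.0, 1.0, 1.0, -1.0, -1.0},
    new double[] {-180.0, -180.0, -179.0, -179.0, -180.0});

Query q = LatLonPoint.newPolygonQuery("location", east, west);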

On Sun, Jun 12, 2016 at 1:43 PM, Randy Tidd  wrote:
> Thanks for the help you’ve already given me on getting search with 
> LatLonPoint working, the basics are working great and with great performance.
>
> I did some testing on some edge cases and discovered that indexing and 
> searching for points at the minimum and maximum latitude and longitude are 
> not working as I expected.  I’d appreciate some feedback on this.
>
> Take for example the point (0.0, 180.0) which is in the Pacific ocean at 
> latitude = 0.0 and longitude = 180.0.  If we create a grid of 9 points 
> centered on this location (“tic tac toe” layout) with an offset of 0.1 the 
> coordinates would be:
>
> (0.1, -179.9), (0.0, -179.9), (-0.1, -179.9), (0.1, 180.0), (0.0, 180.0), 
> (-0.1, 180.0), (0.1, 179.9), (0.0, 179.9), (-0.1, 179.9)
>
> The max longitude is 180, so if you move 0.1 degree east the next longitude 
> is actually -179.9 rather than 180.1, it wraps around.
>
> If I then create a square that encompasses these points using an offset of 
> 1.0 (larger than the 0.1 for the indexed points) that polygon would have 
> these endpoints:
>
> (-1.0, 179.0}, (1.0, 179.0), (1.0, -179.0), (-1.0, -179.0), (-1.0, 179.0)
>
> The first and last point are the same so that this is a closed polygon.
>
> When I index these points with LatLonPoint and do a search with 
> LatLonPoint.newPolygonQuery() it does not return any results, which isn’t 
> what I expected.
>
> However, if I index a grid of points around (0.0, 0.0) and search with this 
> polygon, they are returned by this search, which I don’t believe is correct.  
> It is as if the search thinks that the point (0.0, 0.0) is between the points 
> (0.0, 179) and (0.0, -179) which I don’t think is correct.  Longitude 0.0 and 
> 180.0 are on opposite sides of the globe.
>
> I see the same behavior with data centered at these points:
>
> (90.0, -180.0)
> (90.0, 0.0)
> (90.0, 180.0)
> (0.0, -180.0)
> (0.0, 180.0)
> (-90.0, -180.0)
> (-90.0, 0.0)
> (-90.0, 180.0)
>
> However, using more “normal” points such as (44.4605, -110.8281) this test 
> works fine.
>
> So I’m wondering if there is a problem with the search handling polygons that 
> span the latitude and longitude boundaries, or if maybe I am just not 
> thinking about this right to begin with.  I can post sample code if that 
> would be helpful.  Thanks for any assistance / input.
>
> Randy
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene DirectSpellChecker strange behavior

2016-06-07 Thread Robert Muir
It's just a heuristic: it does not allow 2 edits
(insertion/deletion/substitution/transposition) to the word if the first
character differs (
https://github.com/apache/lucene-solr/blob/master/lucene/suggest/src/java/org/apache/lucene/search/spell/DirectSpellChecker.java#L411).
So when it goes back for n=2, it requires the first character to match.

At least at the time the thing was written, this has a very large impact on
performance, because otherwise too much of the term dictionary must be
inspected and it's much slower. The idea is, it won't hurt too much on
quality, for the same reasons that many of these string distance functions
incorporate a bias towards the matching prefix (e.g. jaro winkler).


On Tue, Jun 7, 2016 at 5:20 AM, Caroline Collet  wrote:

> Hello,
>
> I have a very strange behavior when I use the DirectSpellChecker of
> Lucene. I have set the prefixLength to 0. I have indexed only one item with
> one field : brand=samsung.
> I have tried to make requests with spelling mistakes inside.
>
> When I search for "smsng" I obtain "samsung" which is logical since I only
> have 2 corrections to make to obtain "samsung"
> When I search for "amsung" I obtain "samsung" since I have set the
> prefixLenght to 0
> But when I search "amung" which only has 2 errors, I do not obtain
> "samsung", I obtain nothing.
>
> I don't understand this behaviour, it is like no other correction is
> permitted if the first letter is misspelled.
>
> Did I miss some parameters of the spellchecker that could explain this
> behavior?
>
> I precise that I use :
> - Lucene 5.5.0
> - JRE 1.8
>
> Thank you in advance for taking time to answer my question,
> Bests regards,
> --
> [image: PERTIMM] 
>
> Caroline Collet
> Ingénieur développement
>
> Tel : +33 (0)1 80 04 82 89
> caroline.col...@pertimm.com
> http://www.pertimm.com/fr/
>
> Pertimm
> 51, boulevard Voltaire
> 92600 Asnières-Sur-Seine, France
>
>
>


Re: Cannot comment on Jira issues

2016-04-23 Thread Robert Muir
OK should really work now!

On Sat, Apr 23, 2016 at 10:37 AM, Andres de la Peña
 wrote:
> I'm still not able to comment, although I have tried to logout and login
> again.
>
> 2016-04-23 15:31 GMT+01:00 Robert Muir :
>
>> Can you try now? I added you to contributors groups.
>>
>> On Sat, Apr 23, 2016 at 10:26 AM, Andres de la Peña
>>  wrote:
>> > Hi,
>> >
>> > I would like to reply to the answer to my comment on LUCENE-7086
>> > <https://issues.apache.org/jira/browse/LUCENE-7086>. Could I be
>> temporary
>> > added to a group with more permissions? My username is adelapena.
>> >
>> > Thanks in advance,
>> >
>> > 2016-04-22 14:21 GMT+01:00 Steve Rowe :
>> >
>> >> Mạnh,
>> >>
>> >> I’ve added you to the LUCENE and SOLR projects as a contributor, so you
>> >> should now be able to create and comment on issues.
>> >>
>> >> --
>> >> Steve
>> >> www.lucidworks.com
>> >>
>> >> > On Apr 22, 2016, at 6:18 AM, Đạt Cao Mạnh 
>> >> wrote:
>> >> >
>> >> > Thanks uwe, my account at jira is : "caomanhdat"
>> >> >
>> >> > On Fri, Apr 22, 2016 at 5:16 PM Uwe Schindler 
>> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> there was a spam flood last night. Because of this, Apache
>> >> Infrastructure
>> >> >> disabled creating new issues and adding comments to existing issues
>> for
>> >> >> non-committers. We have temporary workarounds (e.g. add users
>> manually
>> >> to a
>> >> >> group with more permissions), but we first have to verify our
>> options as
>> >> >> this also does not scale.
>> >> >>
>> >> >> We have no idea about how long the Infrastructure team will keep the
>> >> >> current state.
>> >> >>
>> >> >> Can you send me you "correct" username in JIRA, so I can temporarily
>> add
>> >> >> you to the "Contributors" Group in the JIRA system? It would be good
>> to
>> >> >> close the additional account or mark it as "unused".
>> >> >>
>> >> >> Sorry,
>> >> >> Uwe
>> >> >>
>> >> >> -
>> >> >> Uwe Schindler
>> >> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> >> http://www.thetaphi.de
>> >> >> eMail: u...@thetaphi.de
>> >> >>
>> >> >>> -Original Message-
>> >> >>> From: Đạt Cao Mạnh [mailto:caomanhdat...@gmail.com]
>> >> >>> Sent: Friday, April 22, 2016 12:13 PM
>> >> >>> To: java-user@lucene.apache.org
>> >> >>> Subject: Cannot comment on Jira issues
>> >> >>>
>> >> >>> Recently, I cant comment on any jira issues include the one that i
>> >> >> created (
>> >> >>> https://issues.apache.org/jira/browse/LUCENE-6968).
>> >> >>> I tried to create a new account but the new one cannot comment too.
>> >> >>
>> >> >>
>> >> >> -
>> >> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >> >>
>> >> >>
>> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>> >
>> >
>> > --
>> > Andrés de la Peña
>> >
>> > Vía de las dos Castillas, 33, Ática 4, 3ª Planta
>> > 28224 Pozuelo de Alarcón, Madrid
>> > Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
>> > <https://twitter.com/StratioBD>*
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Andrés de la Peña
>
> Vía de las dos Castillas, 33, Ática 4, 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
> <https://twitter.com/StratioBD>*

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Cannot comment on Jira issues

2016-04-23 Thread Robert Muir
Can you try now? I added you to contributors groups.

On Sat, Apr 23, 2016 at 10:26 AM, Andres de la Peña
 wrote:
> Hi,
>
> I would like to reply to the answer to my comment on LUCENE-7086
> . Could I be temporary
> added to a group with more permissions? My username is adelapena.
>
> Thanks in advance,
>
> 2016-04-22 14:21 GMT+01:00 Steve Rowe :
>
>> Mạnh,
>>
>> I’ve added you to the LUCENE and SOLR projects as a contributor, so you
>> should now be able to create and comment on issues.
>>
>> --
>> Steve
>> www.lucidworks.com
>>
>> > On Apr 22, 2016, at 6:18 AM, Đạt Cao Mạnh 
>> wrote:
>> >
>> > Thanks uwe, my account at jira is : "caomanhdat"
>> >
>> > On Fri, Apr 22, 2016 at 5:16 PM Uwe Schindler  wrote:
>> >
>> >> Hi,
>> >>
>> >> there was a spam flood last night. Because of this, Apache
>> Infrastructure
>> >> disabled creating new issues and adding comments to existing issues for
>> >> non-committers. We have temporary workarounds (e.g. add users manually
>> to a
>> >> group with more permissions), but we first have to verify our options as
>> >> this also does not scale.
>> >>
>> >> We have no idea about how long the Infrastructure team will keep the
>> >> current state.
>> >>
>> >> Can you send me you "correct" username in JIRA, so I can temporarily add
>> >> you to the "Contributors" Group in the JIRA system? It would be good to
>> >> close the additional account or mark it as "unused".
>> >>
>> >> Sorry,
>> >> Uwe
>> >>
>> >> -
>> >> Uwe Schindler
>> >> H.-H.-Meier-Allee 63, D-28213 Bremen
>> >> http://www.thetaphi.de
>> >> eMail: u...@thetaphi.de
>> >>
>> >>> -Original Message-
>> >>> From: Đạt Cao Mạnh [mailto:caomanhdat...@gmail.com]
>> >>> Sent: Friday, April 22, 2016 12:13 PM
>> >>> To: java-user@lucene.apache.org
>> >>> Subject: Cannot comment on Jira issues
>> >>>
>> >>> Recently, I cant comment on any jira issues include the one that i
>> >> created (
>> >>> https://issues.apache.org/jira/browse/LUCENE-6968).
>> >>> I tried to create a new account but the new one cannot comment too.
>> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> >>
>> >>
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> --
> Andrés de la Peña
>
> Vía de las dos Castillas, 33, Ática 4, 3ª Planta
> 28224 Pozuelo de Alarcón, Madrid
> Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
> *

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BlendedTermQuery causing negative IDF?

2016-04-19 Thread Robert Muir
The scoring algorithm can't be expected to deal with totally bogus
(e.g. mathematically impossible) statistics, such as docFreq >
docCount. Many of them may fall apart. We should try to improve that
about BlendedTermQuery!

SynonymQuery should not really exist. It exists because of problems
like that: what BlendedTermQuery tries to do (fuse terms from multiple
fields) is more complicated. SynonymQuery only works on one field.
Statistics are always in-bounds, and all statistics except
docFreq are just what the situation would look like if all the terms
were synonyms at index-time: that is the whole goal. It sums up raw TF
values from the postings lists across all the synonyms into one
integer value before passing to the similarity. The only sheisty part
is really the docFreq = max(docFreq), but it's always in-bounds at
least and a consistent value. Otherwise it is scored exactly as an
index-time synonym with respect to all other stats. So e.g. this is a
lot closer to the motivation behind what BM25F does, but it should
behave well with any similarity since the task is easier.

Blending across fields makes things more complex: it seems like we should try to improve it.

On Tue, Apr 19, 2016 at 11:33 AM, Ahmet Arslan
 wrote:
> Thanks Dough for letting us know that Lucene's BM25 avoids negative IDF 
> values.
> I didn't know that.
>
> Markus, out of curiosity, why do you need BlendedTermQuery?
> I knew SynonymQuery is now part of query parser base, I think they do similar 
> things?
>
> Ahmet
>
>
>
>
> On Tuesday, April 19, 2016 5:33 PM, Doug Turnbull 
>  wrote:
> Lucene's BM25 avoids negatives scores for this by adding 1 inside the log
> term of BM25's IDF
>
> Compare this:
> https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/java/org/apache/lucene/search/similarities/BM25Similarity.java#L71
>
> to the Wikipedia article's BM25 IDF
> https://en.wikipedia.org/wiki/Okapi_BM25
>
> Markus another thing to add is that when Elasticsearch uses
> BlendedTermQuery, they add a lot of invariants that must be true. For
> example the fields must share the same analyzer. You may need to research
> what else happens in Elasticsearch outside BlendedTermQuery to fet this
> behavior to work.
>
> Another testing philosophy point: when I do this kind of work I like to
> isolate the Lucene behavior seperate from the Solr behavior. I might
> suggest creating a Lucene unit test to validate your assumptions around
> BlendedTermQuery. Just to help isolate the issues. Here's Lucene's tests
> for BlendedTermQuery as a basis
>
> https://github.com/apache/lucene-solr/blob/5e5fd662575105de88d8514b426bccdcb4c76948/lucene/core/src/test/org/apache/lucene/search/TestBlendedTermQuery.java
>
>
>
>
>
>
>
>
>
> On Tue, Apr 19, 2016 at 10:16 AM Ahmet Arslan 
> wrote:
>
>>
>>
>> Hi Markus,
>>
>> It is a known property of BM25. It produces negative scores for common
>> terms.
>> Most of the term-weighting models are developed for indices in which stop
>> words are eliminated.
>> Therefore, most of the term-weighting models have problems scoring common
>> terms.
>> By the way, DFI model does a decent job when handling common terms.
>>
>> Ahmet
>>
>>
>>
>> On Tuesday, April 19, 2016 4:48 PM, Markus Jelsma <
>> markus.jel...@openindex.io> wrote:
>> Hello,
>>
>> I just made a Solr query parser for BlendedTermQuery on Lucene 6.0 using
>> BM25 similarity and i have a very simple unit test to see if something is
>> working at all. But to my surprise, one of the results has a negative
>> score, caused by a negative IDF because docFreq is higher than docCount for
>> that term on that field. Here are the test documents:
>>
>> assertU(adoc("id", "1", "text", "rare term"));
>> assertU(adoc("id", "2", "text_nl", "less rare term"));
>> assertU(adoc("id", "3", "text_nl", "rarest term"));
>> assertU(commit());
>>
>> My query parser creates the following Lucene query:
>> BlendedTermQuery(Blended(text:rare text:term text_nl:rare text_nl:term))
>> which looks fine to me. But this is what i am getting back for issueing
>> that query on the above set of documents, the third document is the one
>> with a negative score.
>>
>> 
>>   
>> 3
>> 0.1805489
>>   
>> 2
>> 0.14785346
>>   
>> 1
>> -0.004004207
>> 
>> 
>>   {!blended fl=text,text_nl}rare term
>>   {!blended fl=text,text_nl}rare term
>>   BlendedTermQuery(Blended(text:rare text:term
>> text_nl:rare text_nl:term))
>>   Blended(text:rare text:term
>> text_nl:rare text_nl:term)
>>   
>> 
>> 0.1805489 = max plus 0.01 times others of:
>>   0.1805489 = weight(text_nl:term in 2) [], result of:
>> 0.1805489 = score(doc=2,freq=1.0 = termFreq=1.0
>> ), product of:
>>   0.18232156 = idf(docFreq=2, docCount=2)
>>   0.9902773 = tfNorm, computed from:
>> 1.0 = termFreq=1.0
>> 1.2 = parameter k1
>> 0.75 = parameter b
>> 2.5 = avgFieldLength
>> 2.56 = fieldLength
>> 
>> 
>> 0.14785345 = max p

Re: Lucene indexing throughput (and Mike's lucenebench charts)

2016-04-15 Thread Robert Muir
you won't see indexing improvements there because the dataset in
question is wikipedia and mostly indexing full text. I think it may
have one measly numeric field.

On Thu, Apr 14, 2016 at 6:25 PM, Otis Gospodnetić
 wrote:
> (replying to my original email because I didn't get people's replies, even
> though I see in the archives people replied)
>
> Re BJ and beast2 upgrade.  Yeah, I saw that, but
> * if there is no indexing throughput improvement after that, does that mean
> that those particular indexing tests happen to be disk bound and not CPU
> bound? (I'm assuming beast2 has more cores than the previous hardware
> oh, I see, 72 cores vs. only 20 indexing threads)
> * the metrics for GC times are sums across all CPUs, not averages per CPU?
> Would the latter be more useful?
>
> What I was fishing for was something in that indexing chart that would show
> me this little nugget:
>
> *Lucene 6 brings a major new feature called Dimensional Points: a new
> tree-based data structure which will be used for numeric, date, and
> geospatial fields. Compared to the existing field format, this new
> structure uses half the disk space, is twice as fast to index, and
> increases search performance by 25%.*
>
> How come the charts on
> http://home.apache.org/~mikemccand/lucenebench/indexing.html don't show the
> 2x faster indexing and various query performance charts don't show 25%
> improvement in search performance?
>
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
> On Thu, Apr 14, 2016 at 1:13 PM, Otis Gospodnetić <
> otis.gospodne...@gmail.com> wrote:
>
>> Hi,
>>
>> I was looking at Mike's
>> http://home.apache.org/~mikemccand/lucenebench/indexing.html secretly
>> hoping to spot some recent improvements in indexing throughput but
>> instead it looks like:
>>
>> * indexing throughput hasn't really gone up in the last ~5 years
>> * indexing was faster in 2014, but then dropped to pre-2014 speed in early
>> 2015
>> * indexing rate dropped some more in early 2016, and that seems to roughly
>> correlate to a *big* jump in Young GC in late 2015
>>
>> Does anyone know what happened in late 2015 that causes that big Young GC
>> jump?
>> Or does that big jump just look scary in that chart, but is not actually a
>> big concern in practice?
>>
>> Thanks,
>> Otis
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Depreciated IntField field in v6

2016-04-15 Thread Robert Muir
On Fri, Apr 15, 2016 at 11:48 AM, Greg Huber  wrote:
> Hello,
>
> I was using the IntField field to set the weight on my suggester.
> (LegacyIntField works)
>
> old:
>
> document.add(new IntField(
> FieldConstants.LUCENE_WEIGHT_LINES,
> catalogue.getSearchWeight(), Field.Store.YES));
>
> I tried to use the IntPoint but it does not seem to work:
>
> new:
>
> document.add(new IntPoint(FieldConstants.LUCENE_WEIGHT_LINES,
> catalogue.getSearchWeight()));
>
> Is this the correct way t do it now?  There is no Field.Store.YES?

It is equivalent to Field.Store.NO: it just indexes it for search. If
you want to store separately you need to also add a StoredField (can
be the same field name of course).
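
In other words, something like (reusing the names from your snippet):

import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;

document.add(new IntPoint(FieldConstants.LUCENE_WEIGHT_LINES,
    catalogue.getSearchWeight()));      // indexed, for exact/range queries
document.add(new StoredField(FieldConstants.LUCENE_WEIGHT_LINES,
    catalogue.getSearchWeight()));      // stored, so the value comes back with the document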

I opened this issue to improve the docs:
https://issues.apache.org/jira/browse/LUCENE-7223

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SloppyMath license

2015-09-19 Thread Robert Muir
call it public domain, call it attribution-only, whatever you like.

there is nothing incompatible with Apache 2, fdlibm was also used by
apache harmony for its math code.

On Sat, Sep 19, 2015 at 12:33 PM, Earl Hood  wrote:
> On Sat, Sep 19, 2015 at 11:14 AM, Robert Muir wrote:
>
>> There is nothing unusual about public domain code. If your lawyers do
>> not understand that, tell them to go back to school.
>
> Actually, the code in question is not in the public domain, despite that
> the term "public domain" is in the comments of SloppyMath.java:
>
>
> /* some code derived from jodk: http://code.google.com/p/jodk/ (apache 2.0)
>  * asin() derived from fdlibm: http://www.netlib.org/fdlibm/e_asin.c
> (public domain):
>  * 
> =
>  * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
>  *
>  * Developed at SunSoft, a Sun Microsystems, Inc. business.
>  * Permission to use, copy, modify, and distribute this
>  * software is freely granted, provided that this notice
>  * is preserved.
>  * 
> =
>  */
>
> Nowhere in e_asin.c is it mentioned the file has been released to the
> public domain.
>
> The 1993 block is the actual license from e_asin.c, but examination of
> it indicates that it should be perfectly fine to use in a Apache 2.0
> licensed or non-open-source/free software project.  Just leave the
> notice as-is.
>
> If the OP is still concerned, then the OP will have to follow-up with
> the specific legal concern SloppyMath.java allegedly raises.  Saying
> there is a licensing concern is of little help without providing
> specifics.
>
> --ewh
>
> P.S. IANAL
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SloppyMath license

2015-09-19 Thread Robert Muir
There is nothing unusual about public domain code. If your lawyers do
not understand that, tell them to go back to school.

On Sat, Sep 19, 2015 at 11:31 AM, Sergii Kabashniuk
 wrote:
> Hello
> Right now I'm working on approval to use lucene-core  in Eclipse projects.
> One of the things that worry Eclipse "Intellectual property team" is that
> SloppyMath.java
> class has unusual license. I mean may be it fits Apache 2 license
> requirements, but potentially it can be a source of licensing problems. For
> instance lucene-core 5.2.1 is approved to use in Eclipse projects but
>  without this class. So we need to repackage lucene-core and remove
> SloppyMath from it.
>
> My question is it possible to move this class out from lucene-core? I found
> that lucene-spatial3d (correct me if I'm wrong) can be close by spirit to
> this module.
>
> Sergii Kabashniuk

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Problems with toString at TermsQuery

2015-09-09 Thread Robert Muir
I think its a bug: https://issues.apache.org/jira/browse/LUCENE-6792

On Tue, Sep 8, 2015 at 10:35 AM, Ruslan Muzhikov  wrote:
> Hi!
> Sometimes TermsQuery.toString() method falls with exception:
>
> *Exception in thread "main" java.lang.AssertionError*
> * at org.apache.lucene.util.UnicodeUtil.UTF8toUTF16(UnicodeUtil.java:546)*
> * at org.apache.lucene.util.BytesRef.utf8ToString(BytesRef.java:149)*
> * at org.apache.lucene.queries.TermsQuery.toString(TermsQuery.java:190)*
> * at org.apache.lucene.search.Query.toString(Query.java:67)*
> * ...*
>
>
> Here is the example of such program:
>
> *public static void main(String[] args) {*
> *System.out.print(new TermsQuery(new Term("DATA", new
> BytesRef(toBytes(128)))).toString());*
> *}*
>
> *public static byte[] toBytes(int val) {*
> *byte[] b = new byte[4];*
> *for(int i = 3; i > 0; i--) {*
> *b[i] = (byte) val;*
> *val >>>= 8;*
> *}*
> *b[0] = (byte) val;*
> *return b;*
> *}*
>
>
> Is there any limits on BytesRef content?
>
> Thanks,
> Ruslan Muzhikov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Robert Muir
That makes no sense at all, it would make it slow as shit.

I am tired of repeating this:
Don't use BINARY docvalues
Don't use BINARY docvalues
Don't use BINARY docvalues

Use types like SORTED/SORTED_SET which will compress the term
dictionary and make use of ordinals in your application instead.
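
For example, with hypothetical field names just to show the field types:

```java
// Instead of: doc.add(new BinaryDocValuesField("category", new BytesRef("electronics")));
// use SORTED (single value per doc) or SORTED_SET (multiple values per doc);
// these dedupe and compress the values and give you ordinals at search time.
Document doc = new Document();
doc.add(new SortedDocValuesField("category", new BytesRef("electronics")));
doc.add(new SortedSetDocValuesField("tags", new BytesRef("cheap")));
doc.add(new SortedSetDocValuesField("tags", new BytesRef("clearance")));
```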



On Sat, Aug 8, 2015 at 10:19 AM, Olivier Binda  wrote:
> Greetings
>
> are there any plans to implement compression of the variable length bytes[]
> binary docValues,
> say in blocks of 16k like for stored values ?
>
> my .cfs file goes from 2MB to like 400k when I zip it
>
> Best regards,
> Olivier
>
>
>
> On 08/08/2015 02:32 PM, jamie wrote:
>>
>> Greetings
>>
>> Our app primarily uses Lucene for its intended purpose i.e. to search
>> across large amounts of unstructured text. However, recently our requirement
>> expanded to perform look-ups on specific documents in the index based on
>> associated custom defined unique keys. For our purposes, a unique key is the
>> string representation of a 128 bit murmur hash, stored in a Lucene field
>> named uid.  We are currently using the TermsFilter to lookup Documents in
>> the Lucene index as follows:
>>
>> List terms = new LinkedList<>();
>> for (String id : ids) {
>> terms.add(new Term("uid", id));
>> }
>> TermsFilter idFilter = new TermsFilter(terms);
>> ... search logic...
>>
>> At any time we may need to lookup say a couple of thousand documents. Our
>> problem is one of performance. On very large indexes with 30 million records
>> or more, the lookup can be excruciatingly slow. At this stage, its not
>> practical for us to move the data over to fit for purpose database, nor
>> change the uid field to a numeric type. I fully appreciate the fact that
>> Lucene is not designed to be a database, however, is there anything we can
>> do to improve the performance of these look-ups?
>>
>> Much appreciate
>>
>> Jamie
>>
>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: scanning whole index stored fields while using best compression mode

2015-06-03 Thread Robert Muir
On Wed, Jun 3, 2015 at 4:59 PM, Anton Zenkov  wrote:
> Reindexing. If I want to add new fields or change existing fields in the
> index I need to go through all documents of the index.
>

if your reindexing process needs all the docs, i dont think i can
really recommend a better way. a merge reader is the correct thing,
might be slower to initialize but really geared at doing just what you
are doing. in general across the codec api, merge readers will avoid
polluting internal caches (maybe one day OS cache too...), they expose
checkIntegrity() method and other things that seem appropriate for a
process like this.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: scanning whole index stored fields while using best compression mode

2015-06-03 Thread Robert Muir
On Wed, Jun 3, 2015 at 4:00 PM, Anton Zenkov  wrote:

>
> for (int i = 0; i < leafReader.maxDoc(); i++) {
> DocumentStoredFieldVisitor visitor = new DocumentStoredFieldVisitor();
> fieldsReader.visitDocument(i, visitor);
> visitor.getDocument();
>
> }
>
> }
>
> I was wondering if there is better way of doing this and if there are plans
> to make access to the faster document loading through some API. Should I
> try to come up with a patch for this?
>
> Thanks!
> Anton

I agree its slow, but what process other than merging really needs to
loop through all documents and read their stored fields? For merging,
having a 64KB buffer around doesn't impact users, because there is
just one thread and its short-lived. Keeping around 64KB ram
per-segment per-thread in general seems too heavy IMO.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IllegalArgumentException: docID must be >= 0 and < maxDoc=48736112 (got docID=2147483647)

2015-05-29 Thread Robert Muir
Hi Ahmet,

Its due to the use of sentinel values by your collector in its
priority queue by default.

TopScoreDocCollector warns about this, and if you turn on assertions
(-ea) you will hit them in your tests:

 * NOTE: The values {@link Float#NaN} and
 * {@link Float#NEGATIVE_INFINITY} are not valid scores.  This
 * collector will not properly collect hits with such
 * scores.
 */
public abstract class TopScoreDocCollector extends TopDocsCollector<ScoreDoc> {

I don't think a fix is simple, I only know of the following ideas:
* somehow sneaky use of NaN as sentinels instead of -Inf, to allow
-Inf to be used. It seems a bit scary!
* remove the sentinels optimization. I am not sure if collectors could
easily have the same performance without them.

To me, such scores seem always undesirable and only bugs, and the
current assertions are a good tradeoff.
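
If you can't easily avoid the log(0) in your similarity, the pragmatic
workaround is to clamp the score yourself before it reaches the collector
(variable names here are made up for illustration):

```java
// Illustrative guard: log(0) == -Infinity, which the sentinel-based
// priority queue in TopScoreDocCollector cannot handle.
float raw = (float) Math.log(termFrequency);
float score = (raw == Float.NEGATIVE_INFINITY) ? 0f : raw;
```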


On Fri, May 29, 2015 at 8:18 AM, Ahmet Arslan  wrote:
> Hello List,
>
> When a similarity returns NEGATIVE_INFINITY, hits[i].doc becomes 2147483647.
> Thus, exception is thrown in the following code:
>
> for (int i = 0; i < hits.length; i++) {
> int docId = hits[i].doc;
> Document doc = searcher.doc(docId);
> }
>
> I know it is awkward to return infinity (it comes from log(0)), but the
> exception looks equally
> awkward and uninformative.
>
> Do you think is this something improvable? Can we do better handling here?
>
> Thanks,
> Ahmet
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SortingAtomicReader alternate to Tim-Sort...

2015-04-22 Thread Robert Muir
On Tue, Apr 21, 2015 at 4:00 AM, Ravikumar Govindarajan
 wrote:

> b) CompressingStoredFieldsReader did not store the last decoded 32KB chunk.
> Our segments are already sorted before participating in a merge. On mostly
> linear merge, we ended up decoding the same chunk again and again. Simply
> storing the last chunk resulted in good speed-ups for us...

See also https://issues.apache.org/jira/browse/LUCENE-6131 where this
is solved for 5.0.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Customizing Regexp syntax in Lucene

2015-04-05 Thread Robert Muir
On Sun, Apr 5, 2015 at 5:08 PM, code fx9  wrote:
> Hi,
> We are using Lucene indirectly via ElasticSearch. We would like to use RE2
> syntax for running regex queries against Lucene. We are already using RE2
> syntax for other parts of our system, so not ability to use the same syntax
> is a deal-breaker for us.
>
> Recently Google has released a pure Java implementation of this library on
> GitHub. Will it be possible to actually use RE2/J library to run regex
> queries in Lucene? I understand that it might require customizing Lucene
> source code. Can you give me any idea how complex and time consuming such
> endeavor might be.
>
> RE2 Syntax: https://re2.googlecode.com/hg/doc/syntax.html
> RE2/J :https://github.com/google/re2j
>
> Thanks.

The only place in lucene that "knows" about syntax is RegexpQuery. It
only has logic for parsing that syntax into a state machine (Automaton
class), otherwise AutomatonQuery takes care of the execution.

Maybe you could create an Re2Query class that works in a similar way:
e.g. uses RE2/J library to parse the syntax into its state machine
representation and translates that to Automaton representation used by
Lucene.
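
Today the built-in path looks roughly like this (field and pattern are made
up); an Re2Query would do the same, except the Automaton would come from
translating RE2/J's compiled program, which is the part you'd have to write:

```java
// RegexpQuery is essentially sugar over this: parse the syntax into an
// Automaton, then let AutomatonQuery execute it against the indexed terms.
Automaton a = new RegExp("ab[0-9]+").toAutomaton();
Query q = new AutomatonQuery(new Term("body", "ab[0-9]+"), a);
```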

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: write.lock is not removed

2015-02-23 Thread Robert Muir
That's why locking didn't work correctly back then.

On Mon, Feb 23, 2015 at 8:18 AM, Just Spam  wrote:
> Any reason?
> I remember in 3.6 the lock was removed/deleted?
>
>
> 2015-02-23 14:13 GMT+01:00 Robert Muir :
>
>> It should not be deleted. Just don't mess with it.
>>
>> On Mon, Feb 23, 2015 at 7:57 AM, Just Spam  wrote:
>> > Hello,
>> >
>> > i am trying to index a file (Lucene 4.10.3) – in my opinion in the
>> correct
>> > way – will say:
>> >
>> > get the IndexWriter, Index the Doc and add them, prepare commit, commit
>> and
>> > finally{ close}.
>> >
>> >
>> >
>> >
>> >
>> > My writer is generated like so:
>> >
>> >
>> >
>> > private IndexWriter getDataIndexWriter() throws CorruptIndexException,
>> > LockObtainFailedException, IOException {
>> >
>> >if (dataWriter == null) {
>> >
>> >File f = new
>> > File(„C:/temp/index“);
>> >
>> >if (!f.exists()) {
>> >
>> >
>> f.mkdirs();
>> >
>> >}
>> >
>> >Directory indexDir =
>> > MMapDirectory.open(f);
>> >
>> >KeywordAnalyzer analyzer =
>> > AnalyzerFactory.getDataAnalyzer();
>> >
>> >IndexWriterConfig config =
>> > new IndexWriterConfig(AnalyzerFactory.LUCENE_VERSION, analyzer);
>> >
>> >dataWriter = new
>> > IndexWriter(indexDir, config);
>> >
>> > }
>> >
>> >return dataWriter;
>> >
>> > }
>> >
>> >
>> >
>> > Finally i check if static!=null and close the static (dataWriter).
>> >
>> >
>> >
>> > My problem is, the write.lock file just wont be deleted and i cant figure
>> > out why.
>> >
>> > Is there any way i can debug this further?
>> >
>> > Is there any possible way to see if processes are still accessing the
>> file?
>> >
>> > Am i entirely missing something?
>> >
>> >
>> > Kind Regards and thank you in advance
>> >
>> >
>> >
>> > Matthias
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: write.lock is not removed

2015-02-23 Thread Robert Muir
It should not be deleted. Just don't mess with it.
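
The thing that matters is that the lock is released when the writer is
closed; the write.lock file itself is allowed to stay on disk. Roughly
(reusing the names from your snippet):

```java
// Closing the writer releases the lock even though write.lock may remain.
IndexWriter writer = new IndexWriter(indexDir, config);
try {
  // ... addDocument / prepareCommit / commit ...
} finally {
  writer.close();
}
```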

On Mon, Feb 23, 2015 at 7:57 AM, Just Spam  wrote:
> Hello,
>
> i am trying to index a file (Lucene 4.10.3) – in my opinion in the correct
> way – will say:
>
> get the IndexWriter, Index the Doc and add them, prepare commit, commit and
> finally{ close}.
>
>
>
>
>
> My writer is generated like so:
>
>
>
> private IndexWriter getDataIndexWriter() throws CorruptIndexException,
> LockObtainFailedException, IOException {
>
>if (dataWriter == null) {
>
>File f = new
> File(„C:/temp/index“);
>
>if (!f.exists()) {
>
>f.mkdirs();
>
>}
>
>Directory indexDir =
> MMapDirectory.open(f);
>
>KeywordAnalyzer analyzer =
> AnalyzerFactory.getDataAnalyzer();
>
>IndexWriterConfig config =
> new IndexWriterConfig(AnalyzerFactory.LUCENE_VERSION, analyzer);
>
>dataWriter = new
> IndexWriter(indexDir, config);
>
> }
>
>return dataWriter;
>
> }
>
>
>
> Finally i check if static!=null and close the static (dataWriter).
>
>
>
> My problem is, the write.lock file just wont be deleted and i cant figure
> out why.
>
> Is there any way i can debug this further?
>
> Is there any possible way to see if processes are still accessing the file?
>
> Am i entirely missing something?
>
>
> Kind Regards and thank you in advance
>
>
>
> Matthias

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Document Ordering

2015-02-16 Thread Robert Muir
Have a look at SortingMergePolicy:

http://lucene.apache.org/core/4_10_0/misc/org/apache/lucene/index/sorter/SortingMergePolicy.html
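
Wiring it in looks roughly like this (sort field, analyzer and directory are
placeholders; the "id" field would need to be indexed so it is sortable, e.g.
as numeric docvalues):

```java
// Sketch for 4.10: merge segments so documents come out in "id" order,
// keeping docs with nearby keys adjacent on disk.
Sort sort = new Sort(new SortField("id", SortField.Type.LONG));
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0,
    new StandardAnalyzer(Version.LUCENE_4_10_0));
iwc.setMergePolicy(new SortingMergePolicy(iwc.getMergePolicy(), sort));
IndexWriter writer = new IndexWriter(dir, iwc);
```

Note this only controls ordering within merged segments; it is not a global
ordering across the whole index.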

On Mon, Feb 16, 2015 at 9:47 PM, Elliott Bradshaw  wrote:
> Hi,
>
> I'm interested in using Lucene to index binary objects with a specific
> document order, such that documents with the same key will be adjacent in
> the indexes.  This would be done with the intent to maximize the use of the
> disk block cache, since adjacent documents are more likely to be accessed
> together.  I realize that the document number is typically internal to
> Lucene, but is it possible to customize this ordering based on say the ID
> of the document?
>
> Just curious.  Thanks.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-13 Thread Robert Muir
heh, I just don't think that's the typical case. It's definitely extreme.

Even still, in many cases using the filesystem (properly warmed) with
compression might still be better. It depends how you are measuring
latency. Storing your whole index in gigabytes of heap RAM without any
compression on a huge heap has consequences too.

On Thu, Feb 12, 2015 at 4:52 PM, Benson Margulies  wrote:
> WHOOPS.
>
> First sentence was, until just before I clicked 'send',
>
> "Hardware has .5T of RAM. Index is relatively small  (20g) ..."
>
>
> On Thu, Feb 12, 2015 at 4:51 PM, Benson Margulies  
> wrote:
>> Robert,
>>
>> Let me lay out the scenario.
>>
>> Hardware has .5T of Index is relatively small. Application profiling
>> shows a significant amount of time spent codec-ing.
>>
>> Options as I see them:
>>
>> 1. Use DPF complete with the irritation of having to have this
>> spurious codec name in the on-disk format that has nothing to do with
>> the on-disk format.
>> 2. 'Officially' use the standard codec, and then use something like
>> AOP to intercept and encapsulate it with the DPF or something else
>> like it -- essentially, a do-it-myself alternative to convincing the
>> community here that this is a use case worthy of support.
>> 3. Find some way to move a significant amount of the data in question
>> out of Lucene altogether into something else which fits nicely
>> together with filling memory with a cache so that the amount of
>> codeccing drops below the threshold of interest.
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)

2015-02-12 Thread Robert Muir
On Thu, Feb 12, 2015 at 11:58 AM, McKinley, James T
 wrote:
> Hi Robert,
>
> Thanks for responding to my message.  Are you saying that you or others have 
> encountered problems running Lucene 4.8+ on the 64-bit Java SE 1.7 JVM with 
> G1 and was it on Windows or on Linux?  If so, where can I find out more?  I 
> only looked into the one bug because that was the only bug I saw on the 
> https://wiki.apache.org/lucene-java/JavaBugs page that was related to G1.  If 
> there are other Lucene on Java 1.7 with G1 related bugs how can I find them?  
> Also, are these failures something that would be triggered by running the 
> standard Lucene 4.8.1 test suite or are there other tests I should run in 
> order to reproduce these bugs?

You can't reproduce them easily. That is the nature of such bugs. When
i see the crashes, i generally try to confirm its not a lucene bug.
E.g. ill run it a thousand times with/without g1 and if only g1 fails,
i move on with life. There just isnt time.

Occasionally G1 frustrates me enough, ill go and open an issue, like
this one: https://issues.apache.org/jira/browse/LUCENE-6098

Thats a perfect example of what these bugs look like, horribly scary
failures that can cause bad things, and reproduce like 1/1000 times
with G1, essentially impossible to debug. They happen quite often in
our various jenkins servers, on both 32-bit and 64-bit, and even with
the most recent (e.g. 1.8.0_25 or 1.8.0_40-ea) jvms.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Robert Muir
On Thu, Feb 12, 2015 at 8:51 AM, Benson Margulies  wrote:
> On Thu, Feb 12, 2015 at 8:43 AM, Robert Muir  wrote:
>
>> Honestly i dont agree. I don't know what you are trying to do, but if
>> you want file format backwards compat working, then you need a
>> different FilterCodec to match each lucene codec.
>>
>> Otherwise your codec is broken from a back compat standpoint. Wrapping
>> the latest is an antipattern here.
>>
>
> I understand this logic. It leaves me wandering between:
>
> 1: My old desire to convince you that there should be a way to do
> DirectPostingFormat's caching without being a codec at all. Unfortunately,
> I got dragged away from the benchmarking that might have been persuasive.

Honestly, benchmarking won't persuade me. I think this is a trap and I
don't want more of these traps.
We already have RAMDirectory(Directory other) which is this exact same
trap. We don't need more duplicates of it.
But this Direct, man oh man is it even worse by far, because it uses
32 and 64 bits for things that really should typically only be like 8
bits with compression, so it just hogs up RAM.

There isnt a benchmark on this planet that can convince me it should
get any higher status. On the contrary, I want to send it into a deep
dark dungeon in siberia.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A codec moment or pickle

2015-02-12 Thread Robert Muir
Honestly i dont agree. I don't know what you are trying to do, but if
you want file format backwards compat working, then you need a
different FilterCodec to match each lucene codec.

Otherwise your codec is broken from a back compat standpoint. Wrapping
the latest is an antipattern here.
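
Concretely, the pattern is one small wrapper per codec generation, e.g.
(name and overrides are illustrative):

```java
// Each wrapper pins the exact Lucene codec it delegates to, and each one is
// registered in META-INF/services/org.apache.lucene.codecs.Codec.
public final class MyCodec49 extends FilterCodec {
  public MyCodec49() {
    super("MyCodec49", new Lucene49Codec());
  }
  // override postingsFormat() / docValuesFormat() / ... as needed
}
```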


On Thu, Feb 12, 2015 at 5:33 AM, Benson Margulies  wrote:
> Based on reading the same comments you read, I'm pretty doubtful that
> Codec.getDefault() is going to work. It seems to me that this
> situation renders the FilterCodec a bit hard to to use, at least given
> the 'every release deprecates a codec' sort of pattern.
>
>
>
> On Thu, Feb 12, 2015 at 3:20 AM, Uwe Schindler  wrote:
>> Hi,
>>
>> How about Codec.getDefault()? It does indeed not necessarily return the 
>> newest one (if somebody changes the default using Codec.setDefault()), but 
>> for your use case "wrapping the current default one", it should be fine?
>>
>> I have not tried this yet, but there might be a chicken-egg problem:
>> - Your codec will have a separate name and be listed in META-INF as service 
>> (I assume this). So it gets discovered by the Codec discovery process and is 
>> instantiated by that.
>> - On loading the Codec framework the call to codec.getDefault() might get in 
>> at a time where the codecs are not yet fully initialized (because it will 
>> instantiate your codec while loading the META-INF). This happens before the 
>> Codec class is itself fully statically initialized, so the default codec 
>> might be null...
>> So relying on Codec.getDefault() in constructors of filter codecs may not 
>> work as expected!
>>
>> Maybe try it out, was just an idea :-)
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>>> -Original Message-
>>> From: Benson Margulies [mailto:bimargul...@gmail.com]
>>> Sent: Thursday, February 12, 2015 2:11 AM
>>> To: java-user@lucene.apache.org
>>> Subject: A codec moment or pickle
>>>
>>> I have a class that extends FilterCodec. Written against Lucene 4.9, it 
>>> uses the
>>> Lucene49Codec.
>>>
>>> Dropped into a copy of Solr with Lucene 4.10, it discovers that this codec 
>>> is
>>> read-only in 4.10. Is there some way to code one of these to get 'the 
>>> default
>>> codec' and not have to chase versions?
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)

2015-02-11 Thread Robert Muir
d possibly resulting in index corruption", or something along those 
> lines.  It seems a shame to possibly scare new Lucene users away from using 
> G1GC with the 64-bit JVM given that it has better performance on large heaps 
> which are becoming more common today.
>
> FWIW,
> Jim
> 
> From: McKinley, James T [james.mckin...@cengage.com]
> Sent: Monday, February 09, 2015 11:00 AM
> To: java-user@lucene.apache.org
> Subject: RE: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)
>
> OK thanks Erick, I have put a story in our jira backlog to investigate the 
> G1GC issues with the Lucene test suite.  I don't know if we'll be able to 
> shed any light on the issue, but since we're using Lucene with Java 7 G1GC, I 
> guess we better investigate it.
>
> Jim
> 
> From: Erick Erickson [erickerick...@gmail.com]
> Sent: Saturday, February 07, 2015 2:22 PM
> To: java-user
> Subject: Re: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)
>
> The G1C1 issue reference by Robert Muir on the Wiki page is at a
> Lucene level. Lucene, of course, is critically important to Solr so
> from that perspective it is about Solr too.
>
> https://wiki.apache.org/lucene-java/JavaBugs
>
> And, I assume, it also applies to your custom app.
>
> FWIW,
> Erick
>
> On Fri, Feb 6, 2015 at 12:10 PM, McKinley, James T
>  wrote:
>> Just to be clear in case there was any confusion about my previous message 
>> regarding G1GC, we do not use Solr, my team works on a proprietary 
>> Lucene-based search engine.  Consequently, I can't really give any advice 
>> regarding Solr with G1GC, but for our uses (so far anyway), G1GC seems to 
>> work well with Lucene.
>>
>> Jim
>> 
>> From: Piotr Idzikowski [piotridzikow...@gmail.com]
>> Sent: Friday, February 06, 2015 5:35 AM
>> To: java-user@lucene.apache.org
>> Subject: Re: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)
>>
>> Hello.
>> A little bit delayed question. But recently I have found this articles:
>> https://wiki.apache.org/solr/SolrPerformanceProblems
>> https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
>>
>> Especially this part from first url:
>> *Using the ConcurrentMarkSweep (CMS) collector with tuning parameters is a
>> very good option for for Solr, but with the latest Java 7 releases (7u72 at
>> the time of this writing), G1 is looking like a better option, if the
>> -XX:+ParallelRefProcEnabled option is used.*
>>
>> How does it play with *"Do not, under any circumstances, run Lucene with
>> the G1 garbage collector."*
>> from https://wiki.apache.org/lucene-java/JavaBugs?
>>
>> Regards
>> Piotr
>>
>> On Tue, Jan 27, 2015 at 9:55 PM, McKinley, James T <
>> james.mckin...@cengage.com> wrote:
>>
>>> Hi Uwe,
>>>
>>> OK, thanks for the info.  We'll see if we can download the Lucene test
>>> suite and check it out.
>>>
>>> FWIW, we use G1GC in our production runtime (~70 12-16 core Cisco UCS and
>>> HP Gen7/Gen8 nodes with 20+ GB heaps using Java 7 and Lucene 4.8.1 with
>>> pairs of 30 index partitions with 15M-23M docs each) and have not
>>> experienced any VM crashes (well, maybe a couple, but not directly
>>> traceable to G1 to my knowledge).  We have found some undocumented pauses
>>> in G1 due to very large object arrays and filed a bug report which was
>>> confirmed and also affects CMS (we worked around this in our code using
>>> memory mapping of some files whose contents we previously held all in
>>> RAM).  I think the only index corruption we've ever seen was in our index
>>> creation workflow (~30 HP Gen7 nodes with 27GB heaps) but this was using
>>> Parallel GC since it is a batch system, so that corruption (which we've not
>>> seen recently and never found a cause for) was definitely not due to G1GC.
>>>
>>> G1GC has bugs as does CMS but we've found it to work pretty well so far in
>>> our runtime system.  Of course YMMV, thanks again for the info.
>>>
>>> Jim
>>> 
>>> From: Uwe Schindler [u...@thetaphi.de]
>>> Sent: Tuesday, January 27, 2015 3:02 PM
>>> To: java-user@lucene.apache.org
>>> Subject: RE: Lucene Version Upgrade (3->4) and Java JVM Versions(6->8)
>>>
>>> Hi.,
>>>
>>> About G1GC. We consistently see problems when runn

Re: SegmentCommitInfos and live/deleted files

2015-01-11 Thread Robert Muir
files are either per-segment or per-commit.

the first only returns per-segment files. this means it won't include
any per-commit files:
* segments_N itself
* generational .liv for deletes
* generational .fnm/.dvd/etc for docvalues updates.

the second includes per-commit files, too. It doesn't include segments_N
itself because your code passes 'false' for that.
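
In code, the two lists come from different places (a rough 4.x sketch,
exception handling omitted):

```java
// Per-segment files vs. per-commit files for the latest commit in `dir`.
SegmentInfos sis = new SegmentInfos();
sis.read(dir);
for (SegmentCommitInfo sci : sis) {
  Set<String> perSegment = sci.info.files();     // never contains .liv etc.
  Collection<String> perCommit = sci.files();    // includes .liv / dv update files
}
Collection<String> all = sis.files(dir, false);  // false => omit segments_N itself
```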

On Sun, Jan 11, 2015 at 10:49 AM, Varun Thacker
 wrote:
> I wanted to know whats the difference betwen the two ways that I am getting
> a list of all segment files belonging to a segment?
>
> method1 never returns .liv files.
>
> https://gist.github.com/vthacker/98065232c3d2da579700
>
> --
>
>
> Regards,
> Varun Thacker
> http://www.vthacker.in/

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: manually merging Directories

2014-12-30 Thread Robert Muir
I will revisit this and look at hardlinks as an optimization. If they
aren't supported, we can just fall back to copying.

Sorry it will not help your case, but it would improve the situation
and can be done safely.
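
The idea, roughly (plain JDK, not Lucene API):

```java
// Hard-link the file into the destination if the filesystem supports it,
// otherwise fall back to a plain copy.
static void linkOrCopy(Path src, Path dst) throws IOException {
  try {
    Files.createLink(dst, src);   // optional operation; may be unsupported
  } catch (UnsupportedOperationException | IOException e) {
    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
  }
}
```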

On Tue, Dec 30, 2014 at 12:55 PM, Shaun Senecal
 wrote:
> Excellent, this is pretty much exactly what I was looking for.  I agree with 
> you on the use of hard links as well.  Sadly, HDFS doesn't support hard links 
> yet (https://issues.apache.org/jira/browse/HDFS-3370), so even if this 
> feature is implemented, I wont be able to use it, but its still good to keep 
> this in mind for future reference.
>
>
> Thanks!
>
> Shaun
>
> ________
> From: Robert Muir 
> Sent: December 30, 2014 9:36 AM
> To: java-user
> Subject: Re: manually merging Directories
>
> FYI there is more discussion on
> https://issues.apache.org/jira/browse/LUCENE-4746
>
> In general, i don't like the idea that if things go wrong (which they
> will), that the input Directories would be left in a trashed state.
>
> To me, hard links would be the correct solution, but Files.createLink
> is an optional operation for a reason (I think it may require special
> privs on windows).
>
> On Tue, Dec 30, 2014 at 12:24 PM, Shaun Senecal
>  wrote:
>> Ya, I already have that set up.  Thanks for the heads-up though!
>>
>> 
>> From: Uwe Schindler 
>> Sent: December 30, 2014 5:22 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: manually merging Directories
>>
>> In addition, use NoMergePolicy to prevent automatic merging once the 
>> segments were added. :-)
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>>> -Original Message-
>>> From: Uwe Schindler [mailto:u...@thetaphi.de]
>>> Sent: Tuesday, December 30, 2014 2:20 PM
>>> To: 'java-user@lucene.apache.org'
>>> Subject: RE: manually merging Directories
>>>
>>> Hi Shaun,
>>>
>>> you can actually do this relatively simple. In fact, most of the files are 
>>> indeed
>>> copied as-is, so you can theoretically change the logic to make a simple
>>> rename. Files that cannot be copied unmodified and need to be changed by
>>> IndexWriter, will be handled as usual.
>>>
>>> You don't need to patch Lucene for this: IndexWriter calls
>>> Directory#copy(Directory to, String src, String dest, IOContext context) for
>>> those files that can be copied unmodified. What you need to do is: Just 
>>> care a
>>> oal.store.FilterDirectory that wraps the original FSDirectory and implement
>>> this copy method on it to just do a rename, like:
>>>
>>> public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
>>>   public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
>>> super(dir);
>>>   }
>>>
>>>   public void copy(Directory to, String src, String dest, IOContext context)
>>> throws IOException {
>>> if (!(to instanceof FSDirectory)) {
>>>  throw new IOException("This only works for target FSDirectories");
>>> final FSDirectory fromFS = (FSDirectory) this.getDelegate(), toFS =
>>> (FSDirectory) to;
>>> Files.move(fromFS.getDirectory().resolve(source),
>>> toFS.getDirectory().resolve(dest));
>>>   }
>>> }
>>>
>>> Please be aware that you have to wrap the "source" directory, because
>>> IndexWriter's copySegmentAsIs() call this method of the directory that’s
>>> passed to addIndexes(Directory). Something like:
>>>
>>> writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));
>>>
>>> After that all files, that were not copied unmodified, keep alive in the 
>>> source
>>> directory, but all those that are copied as-is will move and disappear from
>>> source directory.
>>>
>>> Uwe
>>>
>>> -
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>
>>> > -Original Message-
>>> > From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>>> > Sent: Tuesday, December 30, 2014 12:37 AM
>>> > To: Lucene Users
>>> > Subject: Re: manually merging Directories
>>> >
>>> > Hi Mike
>>> >
>>>

Re: manually merging Directories

2014-12-30 Thread Robert Muir
FYI there is more discussion on
https://issues.apache.org/jira/browse/LUCENE-4746

In general, i don't like the idea that if things go wrong (which they
will), that the input Directories would be left in a trashed state.

To me, hard links would be the correct solution, but Files.createLink
is an optional operation for a reason (I think it may require special
privs on windows).

On Tue, Dec 30, 2014 at 12:24 PM, Shaun Senecal
 wrote:
> Ya, I already have that set up.  Thanks for the heads-up though!
>
> 
> From: Uwe Schindler 
> Sent: December 30, 2014 5:22 AM
> To: java-user@lucene.apache.org
> Subject: RE: manually merging Directories
>
> In addition, use NoMergePolicy to prevent automatic merging once the segments 
> were added. :-)
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -Original Message-
>> From: Uwe Schindler [mailto:u...@thetaphi.de]
>> Sent: Tuesday, December 30, 2014 2:20 PM
>> To: 'java-user@lucene.apache.org'
>> Subject: RE: manually merging Directories
>>
>> Hi Shaun,
>>
>> you can actually do this relatively simple. In fact, most of the files are 
>> indeed
>> copied as-is, so you can theoretically change the logic to make a simple
>> rename. Files that cannot be copied unmodified and need to be changed by
>> IndexWriter, will be handled as usual.
>>
>> You don't need to patch Lucene for this: IndexWriter calls
>> Directory#copy(Directory to, String src, String dest, IOContext context) for
>> those files that can be copied unmodified. What you need to do is: Just care 
>> a
>> oal.store.FilterDirectory that wraps the original FSDirectory and implement
>> this copy method on it to just do a rename, like:
>>
>> public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
>>   public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
>> super(dir);
>>   }
>>
>>   public void copy(Directory to, String src, String dest, IOContext context)
>> throws IOException {
>> if (!(to instanceof FSDirectory)) {
>>  throw new IOException("This only works for target FSDirectories");
>> }
>> final FSDirectory fromFS = (FSDirectory) this.getDelegate(), toFS =
>> (FSDirectory) to;
>> Files.move(fromFS.getDirectory().resolve(source),
>> toFS.getDirectory().resolve(dest));
>>   }
>> }
>>
>> Please be aware that you have to wrap the "source" directory, because
>> IndexWriter's copySegmentAsIs() call this method of the directory that’s
>> passed to addIndexes(Directory). Something like:
>>
>> writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));
>>
>> After that all files, that were not copied unmodified, keep alive in the 
>> source
>> directory, but all those that are copied as-is will move and disappear from
>> source directory.
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -Original Message-
>> > From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>> > Sent: Tuesday, December 30, 2014 12:37 AM
>> > To: Lucene Users
>> > Subject: Re: manually merging Directories
>> >
>> > Hi Mike
>> >
>> > That's actually what I was looking at doing, I was just hoping there
>> > was a way to avoid the "copySegmentAsIs" step and simply replace it with a
>> "rename"
>> > operation on the file system.  It seemed like low hanging fruit, but
>> > Uwe and Erick have now told me that the segments have dependencies
>> > embedded in them somehow, so a simple rename operation wouldn't
>> > accomplish the same thing.  In the end, it may not be a big deal anyway.
>> >
>> >
>> > Thanks
>> >
>> > Shaun
>> >
>> >
>> > 
>> > From: Michael McCandless 
>> > Sent: December 29, 2014 2:43 PM
>> > To: Lucene Users
>> > Subject: Re: manually merging Directories
>> >
>> > Why not use IW.addIndexes(Directory[])?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler 
>> > wrote:
>> > > Hi,
>> > >
>> > > Why not simply leave each index directory on the searcher nodes as is:
>> > > Move all index directories (as mentioned by you) to a local disk and
>> > > access
>> > them using a MultiReader - there is no need to merge them if you have
>> > not enough resources. If you have enough CPU and IO power, just merge
>> > them as usual with IndexWriter.addIndexes(). But I don't understand
>> > you argument with I/O: If you copy the index files from HDFS to local
>> > disks already, how can this work without I/O? So you can merge them
>> anyways.
>> > >
>> > > Merging index files, simply by copying them all in one directory, is
>> > impossible, because the files reference each other by segment name
>> > (segments_n refers to them, also the segment ids are used all over).
>> > So You would need to change some index files already for merge to make
>> > the SegmentInfos structur

Re: Building non-core jar-files from lucene sources.

2014-12-02 Thread Robert Muir
If you run ant -p it will print targets and descriptions.

you want 'ant compile'.

In my opinion the default target should not be 'jar', but print this
list of targets instead, just like the top-level build file.

On Tue, Dec 2, 2014 at 12:09 PM, Badano Andrea  wrote:
> Hello,
>
> When I build lucene from source using these instructions:
>
>   https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/BUILD.txt
>
> I end up with only
>
>   ./build/core/lucene-core-4.10.2-SNAPSHOT.jar
>
> I would like to build lucene-analyzers-common-4.10.2.jar and 
> lucene-queryparser-4.10.2.jar as well.
>
> How do I do that? I have not been able to find any documentation that 
> describes this.
>
> Thanks,
>
> Andrea
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to configure lucene 4.x to read 3.x index files

2014-09-23 Thread Robert Muir
As reported in the issue, since 4.8 we do better checks when reading
this stuff in.

Unfortunately, 3.0-3.3 indexes had bugs in the way they encode the
deleted documents.

So for those indexes, we have to ignore the trailing garbage at the
end of the file.

On Tue, Sep 23, 2014 at 9:15 PM, Patrick Mi  wrote:
> Hi Robert/Uwe,
>
> I have tried v4.8 and v4.9 - not working either.
>
> V4.7.0, V4.7.1, v4.7.2 are good.
>
> Regards,
> Patrick
>
> -Original Message-
> From: Patrick Mi [mailto:patrick...@touchpoint.co.nz]
> Sent: Wednesday, 24 September 2014 12:24 p.m.
> To: 'java-user@lucene.apache.org'
> Subject: RE: How to configure lucene 4.x to read 3.x index files
>
> Hi Robert/Uwe,
>
> Thanks very much for the quick response.
>
> I have tried again with a different set of index(28k documents) generated
> from V3 too and that worked.
>
> But the one(30k documents) I tried indeed worked for the V3 but not V4.10.
> Maybe something in that index could cause problem in V4 but not v3.
>
> Also I have tried an earlier version v4.7 as Uwe suggested and V4.7 version
> works on the V3 index that V4.10 failed to open.
>
> Regards,
>
> Patrick
>
>
>
> -Original Message-
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Tuesday, 23 September 2014 11:52 p.m.
> To: java-user
> Subject: Re: How to configure lucene 4.x to read 3.x index files
>
> You should not have to configure anything.
>
> The exception should not happen: can I have this index to debug the issue?
>
> On Mon, Sep 22, 2014 at 11:07 PM, Patrick Mi
>  wrote:
>> Hi there,
>>
>> I understood that Lucene V4 could read 3.x index files by configuring
>> Lucene3xCodec but what exactly needs to be done here?
>>
>> I used DEMO code from V4.10.0 to generate v4 index files and could read
>> them
>> without problem. When I tried to read index files generated from V3 I got
>> the following errors:
>>
>> Exception in thread "main" org.apache.lucene.index.CorruptIndexException:
>> did not read all bytes from file: read 65 vs size 66 (resource:
>> BufferedChecksumIndexInput(MMapIndexInput(path="C:\indexes\v3\_1os1_5.del")))
>> at org.apache.lucene.codecs.CodecUtil.checkEOF(CodecUtil.java:252)
>> at
>> org.apache.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:363)
>> at
>> org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:91)
>> at
>> org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:116)
>> at
>> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
>> at
>> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:913)
>> at
>> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
>> at
>> org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
>> at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:95)
>>
>> My classpath includes the following jars from V4:
>> lucene-core-4.10.0.jar
>> lucene-analyzers-common-4.10.0.jar
>> lucene-queries-4.10.0.jar
>> lucene-queryparser-4.10.0.jar
>> lucene-facet-4.10.0.jar
>> lucene-expressions-4.10.0.jar
>>
>> Noticed that META-INF/services/org.apache.lucene.codecs.Codec ( part of
>> lucene-core-4.10.0.jar) contains the following lines:
>> org.apache.lucene.codecs.lucene40.Lucene40Codec
>> org.apache.lucene.codecs.lucene3x.Lucene3xCodec
>> org.apache.lucene.codecs.lucene41.Lucene41Codec
>> org.apache.lucene.codecs.lucene42.Lucene42Codec
>> org.apache.lucene.codecs.lucene45.Lucene45Codec
>> org.apache.lucene.codecs.lucene46.Lucene46Codec
>> org.apache.lucene.codecs.lucene49.Lucene49Codec
>> org.apache.lucene.codecs.lucene410.Lucene410Codec
>>
>> Does that mean Lucene3xCodec will be picked up automatically based on the
>> index files itself?
>>
>> Where is the API I could force the code to use V3 setting? IndexReader and
>> IndexSearcher don’t seem to have anywhere I can pass that in?
>>
>> Did some search but couldn't find the useful resources covered that. Much
>> appreciated if someone could point out the right direction.
>>
>> Regards,
>> Patrick
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to configure lucene 4.x to read 3.x index files

2014-09-23 Thread Robert Muir
I opened an issue with a patch for this:

https://issues.apache.org/jira/browse/LUCENE-5975

Thanks for reporting it!

On Mon, Sep 22, 2014 at 11:07 PM, Patrick Mi
 wrote:
> Hi there,
>
> I understood that Lucene V4 could read 3.x index files by configuring
> Lucene3xCodec but what exactly needs to be done here?
>
> I used DEMO code from V4.10.0 to generate v4 index files and could read them
> without problem. When I tried to read index files generated from V3 I got
> the following errors:
>
> Exception in thread "main" org.apache.lucene.index.CorruptIndexException:
> did not read all bytes from file: read 65 vs size 66 (resource:
> BufferedChecksumIndexInput(MMapIndexInput(path="C:\indexes\v3\_1os1_5.del")))
> at org.apache.lucene.codecs.CodecUtil.checkEOF(CodecUtil.java:252)
> at 
> org.apache.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:363)
> at
> org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:91)
> at 
> org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:116)
> at
> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
> at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:913)
> at
> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
> at 
> org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
> at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:95)
>
> My classpath includes the following jars from V4:
> lucene-core-4.10.0.jar
> lucene-analyzers-common-4.10.0.jar
> lucene-queries-4.10.0.jar
> lucene-queryparser-4.10.0.jar
> lucene-facet-4.10.0.jar
> lucene-expressions-4.10.0.jar
>
> Noticed that META-INF/services/org.apache.lucene.codecs.Codec ( part of
> lucene-core-4.10.0.jar) contains the following lines:
> org.apache.lucene.codecs.lucene40.Lucene40Codec
> org.apache.lucene.codecs.lucene3x.Lucene3xCodec
> org.apache.lucene.codecs.lucene41.Lucene41Codec
> org.apache.lucene.codecs.lucene42.Lucene42Codec
> org.apache.lucene.codecs.lucene45.Lucene45Codec
> org.apache.lucene.codecs.lucene46.Lucene46Codec
> org.apache.lucene.codecs.lucene49.Lucene49Codec
> org.apache.lucene.codecs.lucene410.Lucene410Codec
>
> Does that mean Lucene3xCodec will be picked up automatically based on the
> index files itself?
>
> Where is the API I could force the code to use V3 setting? IndexReader and
> IndexSearcher don’t seem to have anywhere I can pass that in?
>
> Did some search but couldn't find the useful resources covered that. Much
> appreciated if someone could point out the right direction.
>
> Regards,
> Patrick
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to configure lucene 4.x to read 3.x index files

2014-09-23 Thread Robert Muir
You should not have to configure anything.

The exception should not happen: can I have this index to debug the issue?

On Mon, Sep 22, 2014 at 11:07 PM, Patrick Mi
 wrote:
> Hi there,
>
> I understood that Lucene V4 could read 3.x index files by configuring
> Lucene3xCodec but what exactly needs to be done here?
>
> I used DEMO code from V4.10.0 to generate v4 index files and could read them
> without problem. When I tried to read index files generated from V3 I got
> the following errors:
>
> Exception in thread "main" org.apache.lucene.index.CorruptIndexException:
> did not read all bytes from file: read 65 vs size 66 (resource:
> BufferedChecksumIndexInput(MMapIndexInput(path="C:\indexes\v3\_1os1_5.del")))
> at org.apache.lucene.codecs.CodecUtil.checkEOF(CodecUtil.java:252)
> at 
> org.apache.lucene.codecs.lucene40.BitVector.<init>(BitVector.java:363)
> at
> org.apache.lucene.codecs.lucene40.Lucene40LiveDocsFormat.readLiveDocs(Lucene40LiveDocsFormat.java:91)
> at 
> org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:116)
> at
> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
> at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:913)
> at
> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
> at 
> org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
> at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:95)
>
> My classpath includes the following jars from V4:
> lucene-core-4.10.0.jar
> lucene-analyzers-common-4.10.0.jar
> lucene-queries-4.10.0.jar
> lucene-queryparser-4.10.0.jar
> lucene-facet-4.10.0.jar
> lucene-expressions-4.10.0.jar
>
> Noticed that META-INF/services/org.apache.lucene.codecs.Codec ( part of
> lucene-core-4.10.0.jar) contains the following lines:
> org.apache.lucene.codecs.lucene40.Lucene40Codec
> org.apache.lucene.codecs.lucene3x.Lucene3xCodec
> org.apache.lucene.codecs.lucene41.Lucene41Codec
> org.apache.lucene.codecs.lucene42.Lucene42Codec
> org.apache.lucene.codecs.lucene45.Lucene45Codec
> org.apache.lucene.codecs.lucene46.Lucene46Codec
> org.apache.lucene.codecs.lucene49.Lucene49Codec
> org.apache.lucene.codecs.lucene410.Lucene410Codec
>
> Does that mean Lucene3xCodec will be picked up automatically based on the
> index files itself?
>
> Where is the API I could force the code to use V3 setting? IndexReader and
> IndexSearcher don’t seem to have anywhere I can pass that in?
>
> Did some search but couldn't find the useful resources covered that. Much
> appreciated if someone could point out the right direction.
>
> Regards,
> Patrick
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Insufficient system resources exist to complete the requested service

2014-09-15 Thread Robert Muir
SimpleFSDirectory doesn't use memory mapping.

I'd check you don't have leaks of IndexReaders or similar. This error
happens on Windows when the process runs out of open file handles.
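
A typical pattern that avoids leaking handles:

```java
// Every open DirectoryReader holds file handles until it is closed.
try (DirectoryReader reader = DirectoryReader.open(directory)) {
  IndexSearcher searcher = new IndexSearcher(reader);
  // ... run searches ...
}   // handles released here, even on exceptions
```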

On Mon, Sep 15, 2014 at 3:52 AM, Michael McCandless
 wrote:
> Maybe your OS is running out of total virtual memory?  Try looking in
> task manager?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Sep 15, 2014 at 3:19 AM, Vincent Sevel  
> wrote:
>> Hi,
>>
>> I have had this issue :
>>
>>
>> java.io.FileNotFoundException: 
>> F:\logserver\index\INFRA-LOGSERVER2_UNIV_UNIV_DBIZ\cpp\cpp_D_2014-09-13\_xr_Lucene45_0.dvd
>>  (Insufficient system resources exist to complete the requested service)
>> at java.io.RandomAccessFile.open(Native Method)
>> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
>> at 
>> org.apache.lucene.store.FSDirectory$FSIndexInput.<init>(FSDirectory.java:382)
>> at 
>> org.apache.lucene.store.SimpleFSDirectory$SimpleFSIndexInput.<init>(SimpleFSDirectory.java:103)
>> at 
>> org.apache.lucene.store.SimpleFSDirectory.openInput(SimpleFSDirectory.java:58)
>> at org.apache.lucene.store.Directory.copy(Directory.java:185)
>> at 
>> org.apache.lucene.store.TrackingDirectoryWrapper.copy(TrackingDirectoryWrapper.java:50)
>> at 
>> org.apache.lucene.index.IndexWriter.createCompoundFile(IndexWriter.java:4671)
>> at 
>> org.apache.lucene.index.DocumentsWriterPerThread.sealFlushedSegment(DocumentsWriterPerThread.java:535)
>> at 
>> org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:502)
>> at 
>> org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:508)
>> at 
>> org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:380)
>> at 
>> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:472)
>> at 
>> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1534)
>> at 
>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1204)
>> at 
>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1185)
>> ...
>>
>> the os is : microsoft windows server 2008 R2 Standard service pack 1
>> the F: drive is on the SAN.
>> The index takes only 2.4 Gb on the disk.
>> And there was 700 Gb free on the drive.
>> I have been running lucene 4.7.2 for a few weeks.
>> Before that I was running 3.X for a long long time.
>> It is the first time I have had this.
>>
>>
>> Any clue?
>>
>> Here is a "dir" on the index folder:
>>
>>
>> F:\logserver\index\INFRA-LOGSERVER2_UNIV_UNIV_DBIZ\cpp\cpp_D_2014-09-13>dir
>> Volume in drive F is Data2
>> Volume Serial Number is BCE1-C2B3
>>
>> Directory of 
>> F:\logserver\index\INFRA-LOGSERVER2_UNIV_UNIV_DBIZ\cpp\cpp_D_2014-09-13
>>
>> 15.09.2014  01:01    <DIR>          .
>> 15.09.2014  01:01    <DIR>          ..
>> 15.09.2014  00:01                20 segments.gen
>> 15.09.2014  00:01             1'155 segments_fg
>> 13.09.2014  00:52        29'965'939 _33.fdt
>> 13.09.2014  00:52            39'350 _33.fdx
>> 13.09.2014  00:52             3'775 _33.fnm
>> 13.09.2014  00:52           925'296 _33.nvd
>> 13.09.2014  00:52                57 _33.nvm
>> 13.09.2014  00:52               451 _33.si
>> 13.09.2014  00:52        29'943'092 _33_Lucene41_0.doc
>> 13.09.2014  00:52        20'360'820 _33_Lucene41_0.pos
>> 13.09.2014  00:52       134'078'995 _33_Lucene41_0.tim
>> 13.09.2014  00:52         2'195'230 _33_Lucene41_0.tip
>> 13.09.2014  00:52         5'632'139 _33_Lucene45_0.dvd
>> 13.09.2014  00:52               181 _33_Lucene45_0.dvm
>> 13.09.2014  01:56        40'582'281 _7b.fdt
>> 13.09.2014  01:56            51'013 _7b.fdx
>> 13.09.2014  01:57             3'775 _7b.fnm
>> 13.09.2014  01:57         1'273'742 _7b.nvd
>> 13.09.2014  01:57                57 _7b.nvm
>> 13.09.2014  01:57               451 _7b.si
>> 13.09.2014  01:57        39'450'774 _7b_Lucene41_0.doc
>> 13.09.2014  01:57        26'965'828 _7b_Lucene41_0.pos
>> 13.09.2014  01:57       183'240'805 _7b_Lucene41_0.tim
>> 13.09.2014  01:57         2'987'652 _7b_Lucene41_0.tip
>> 13.09.2014  01:57         8'731'837 _7b_Lucene45_0.dvd
>> 13.09.2014  01:57               181 _7b_Lucene45_0.dvm
>> 13.09.2014  03:16        24'042'612 _bs.fdt
>> 13.09.2014  03:16            30'141 _bs.fdx
>> 13.09.2014  03:16             3'775 _bs.fnm
>> 13.09.2014  03:16           758'200 _bs.nvd
>> 13.09.2014  03:16                57 _bs.nvm
>> 13.09.2014  03:16               451 _bs.si
>> 13.09.2014  03:16        22'899'430 _bs_Lucene41_0.doc
>> 13.09.2014  03:16        15'990'453 _bs_Lucene41_0.pos
>> 13.09.2014  03:16       110'022'516 _bs_Lucene41_0.tim
>> 13.09.2014  03:16         1'754'233 _bs_Lucene41_0.tip
>> 13.09.2014  03:16         4'014'752 _bs_Lucene45_0.dvd
>> 13.09.2014  03:16               181 _bs_Lucene45_0.dvm
>> 13.09.2014  04:02        44'841'179 _

Re: BlockTreeTermsReader consumes crazy amount of memory

2014-09-10 Thread Robert Muir
Yes, there is also a safety check, but IMO it should be removed.

See the patch on the issue, the test passes now.

On Wed, Sep 10, 2014 at 9:31 PM, Vitaly Funstein  wrote:
> Seems to me the bug occurs regardless of whether the passed in newer reader
> is NRT or non-NRT. This is because the user operates at the level of
> DirectoryReader, not SegmentReader and modifying the test code to do the
> following reproduces the bug:
>
> writer.commit();
> DirectoryReader latest = DirectoryReader.open(writer, true);
>
> // This reader will be used for searching against commit point 1
> DirectoryReader searchReader = DirectoryReader.openIfChanged(latest,
> ic1); //  <=== Exception/Assertion thrown here
>
>
> On Wed, Sep 10, 2014 at 6:26 PM, Robert Muir  wrote:
>
>> Thats because there are 3 constructors in segmentreader:
>>
>> 1. one used for opening new (checks hasDeletions, only reads liveDocs if
>> so)
>> 2. one used for non-NRT reopen <-- problem one for you
>> 3. one used for NRT reopen (takes a LiveDocs as a param, so no bug)
>>
>> so personally i think you should be able to do this, we just have to
>> add the hasDeletions check to #2
>>
>> On Wed, Sep 10, 2014 at 7:46 PM, Vitaly Funstein 
>> wrote:
>> > One other observation - if instead of a reader opened at a later commit
>> > point (T1), I pass in an NRT reader *without* doing the second commit on
>> > the index prior, then there is no exception. This probably also hinges on
>> > the assumption that no buffered docs have been flushed after T0, thus
>> > creating new segment files, as well... unfortunately, our system can't
>> make
>> > either assumption.
>> >
>> > On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein 
>> > wrote:
>> >
>> >> Normally, reopens only go forwards in time, so if you could ensure
>> >>> that when you reopen one reader to another, the 2nd one is always
>> >>> "newer", then I think you should never hit this issue
>> >>
>> >>
>> >> Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
>> >> the use here case is as follows: I want to be able to search the index
>> at a
>> >> particular point in time, let's call it T0. To that end, I save the
>> state
>> >> at that time via a commit and take a snapshot of the index. After that,
>> the
>> >> index is free to move on, to another point in time, say T1 - and likely
>> >> does. The optimization we have been discussing (and this is what the
>> test
>> >> code I posted does) basically asks the reader to go back to point T0,
>> while
>> >> reusing as much of the state of the index from T1, as long as it is
>> >> unchanged between the two.
>> >>
>> >> That's what DirectoryReader.openIfChanged(DirectoryReader, IndexCommit)
>> is
>> >> supposed to do internally... or am I misinterpreting the
>> >> intent/implementation of it?
>> >>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BlockTreeTermsReader consumes crazy amount of memory

2014-09-10 Thread Robert Muir
Thats because there are 3 constructors in segmentreader:

1. one used for opening new (checks hasDeletions, only reads liveDocs if so)
2. one used for non-NRT reopen <-- problem one for you
3. one used for NRT reopen (takes a LiveDocs as a param, so no bug)

so personally i think you should be able to do this, we just have to
add the hasDeletions check to #2

On Wed, Sep 10, 2014 at 7:46 PM, Vitaly Funstein  wrote:
> One other observation - if instead of a reader opened at a later commit
> point (T1), I pass in an NRT reader *without* doing the second commit on
> the index prior, then there is no exception. This probably also hinges on
> the assumption that no buffered docs have been flushed after T0, thus
> creating new segment files, as well... unfortunately, our system can't make
> either assumption.
>
> On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein 
> wrote:
>
>> Normally, reopens only go forwards in time, so if you could ensure
>>> that when you reopen one reader to another, the 2nd one is always
>>> "newer", then I think you should never hit this issue
>>
>>
>> Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
>> the use here case is as follows: I want to be able to search the index at a
>> particular point in time, let's call it T0. To that end, I save the state
>> at that time via a commit and take a snapshot of the index. After that, the
>> index is free to move on, to another point in time, say T1 - and likely
>> does. The optimization we have been discussing (and this is what the test
>> code I posted does) basically asks the reader to go back to point T0, while
>> reusing as much of the state of the index from T1, as long as it is
>> unchanged between the two.
>>
>> That's what DirectoryReader.openIfChanged(DirectoryReader, IndexCommit) is
>> supposed to do internally... or am I misinterpreting the
>> intent/implementation of it?
>>
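
To make the scenario above concrete, here is a minimal sketch (not code
from this thread) of the T0/T1 pattern being described, assuming a
SnapshotDeletionPolicy is installed on the writer to pin the T0 commit:

```java
// dir is an existing Directory (assumption for this sketch)
SnapshotDeletionPolicy snapshotter =
    new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, new StandardAnalyzer());
iwc.setIndexDeletionPolicy(snapshotter);
IndexWriter writer = new IndexWriter(dir, iwc);

writer.commit();                             // the index state at T0
IndexCommit t0 = snapshotter.snapshot();     // keep the T0 commit point alive

// ... more indexing and deletes, then another commit: the index moves on to T1 ...

DirectoryReader atT1 = DirectoryReader.open(dir);
// go back to T0, reusing whatever is unchanged between T0 and T1
// (returns null if the reader already matches that commit)
DirectoryReader atT0 = DirectoryReader.openIfChanged(atT1, t0);
```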

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: 4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)

2014-09-10 Thread Robert Muir
Ian, it's a supported version. It wouldn't matter if it's 4.0 alpha or
beta anyway, because we support index back compat for those.

In your case, it's actually the final version. I will open an issue.

Thank you for reporting this!

On Wed, Sep 10, 2014 at 7:54 AM, Ian Lea  wrote:
> Yes, quite possible.  I do sometimes download and test beta versions.
>
> This isn't really a problem for me - it has only happened on test
> indexes that I don't care about, but there might be live indexes out
> there that are also affected and having them made unusable would be
> undesirable, to put it mildly.  A message saying "Unsupported version"
> would be much better.
>
>
> --
> Ian.
>
>
> On Wed, Sep 10, 2014 at 12:41 PM, Uwe Schindler  wrote:
>> Hi Ian,
>>
>> this index was created with the BETA version of Lucene 4.0:
>>
>> Segments file=segments_2 numSegments=1 version=4.0.0.2 format=
>>   1 of 1: name=_0 docCount=15730
>>
>> "4.0.0.2" was the index version number of Lucene 4.0-BETA. This is not a 
>> supported version and may not open correctly. In Lucene 4.10 we changed 
>> version handling and parsing version numbers a bit, so this may be the cause 
>> for the error.
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
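
As an aside, one way to see the segments file version (the 4.0.0.2 above)
and per-segment codec before touching an old index with a newer writer is
to run CheckIndex read-only. A minimal sketch, with the index path as a
placeholder:

```java
public static void main(String[] args) throws Exception {
  // Inspect the index without modifying it; the report includes the segments
  // file version (e.g. 4.0.0.2) and each segment's codec, as shown in this thread.
  Directory dir = FSDirectory.open(new File("/path/to/index"));  // placeholder path
  CheckIndex checker = new CheckIndex(dir);
  checker.setInfoStream(System.out);   // print the same report as the command-line tool
  CheckIndex.Status status = checker.checkIndex();
  System.out.println("index is clean? " + status.clean);
  dir.close();
}
```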
>>
>>
>>> -Original Message-
>>> From: Ian Lea [mailto:ian@gmail.com]
>>> Sent: Wednesday, September 10, 2014 1:01 PM
>>> To: java-user@lucene.apache.org
>>> Subject: 4.10.0: java.lang.IllegalStateException: cannot write 3x 
>>> SegmentInfo
>>> unless codec is Lucene3x (got: Lucene40)
>>>
>>> Hi
>>>
>>>
>>> On running a quick test after a handful of minor code changes to deal with
>>> 4.10 deprecations, a program that updates an existing index failed with
>>>
>>> Exception in thread "main" java.lang.IllegalStateException: cannot write 3x
>>> SegmentInfo unless codec is Lucene3x (got: Lucene40) at
>>> org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)
>>>
>>> and along the way did something to the index to make it unusable.
>>>
>>> Digging a bit deeper and working on a different old test index that was 
>>> lying
>>> around, and taking a backup first this time, this is reproducible.
>>>
>>> The working index:
>>>
>>> total 1036
>>> -rw-r--r-- 1 tril users 165291 Jan 18  2013 _0.fdt
>>> -rw-r--r-- 1 tril users 125874 Jan 18  2013 _0.fdx
>>> -rw-r--r-- 1 tril users   1119 Jan 18  2013 _0.fnm
>>> -rw-r--r-- 1 tril users 378015 Jan 18  2013 _0_Lucene40_0.frq
>>> -rw-r--r-- 1 tril users 350628 Jan 18  2013 _0_Lucene40_0.tim
>>> -rw-r--r-- 1 tril users  13988 Jan 18  2013 _0_Lucene40_0.tip
>>> -rw-r--r-- 1 tril users    311 Jan 18  2013 _0.si
>>> -rw-r--r-- 1 tril users     69 Jan 18  2013 segments_2
>>> -rw-r--r-- 1 tril users     20 Jan 18  2013 segments.gen
>>>
>>> and output from 4.10 CheckIndex
>>>
>>> Opening index @ index/
>>>
>>> Segments file=segments_2 numSegments=1 version=4.0.0.2 format=
>>>   1 of 1: name=_0 docCount=15730
>>> version=4.0.0.2
>>> codec=Lucene40
>>> compound=false
>>> numFiles=7
>>> size (MB)=0.987
>>> diagnostics = {os=Linux, os.version=3.1.0-1.2-desktop, source=flush,
>>> lucene.version=4.0.0 1394950 - rmuir - 2012-10-06 02:58:12, os.arch=amd64,
>>> java.version=1.7.0_10, java.vendor=Oracle Corporation}
>>> no deletions
>>> test: open reader.OK
>>> test: check integrity.OK
>>> test: check live docs.OK
>>> test: fields..OK [13 fields]
>>> test: field norms.OK [0 fields]
>>> test: terms, freq, prox...OK [53466 terms; 217447 terms/docs pairs; 
>>> 139382
>>> tokens]
>>> test: stored fields...OK [15730 total field count; avg 1 fields per 
>>> doc]
>>> test: term vectorsOK [0 total vector count; avg 0 term/freq 
>>> vector
>>> fields per doc]
>>> test: docvalues...OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0
>>> SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
>>>
>>> No problems were detected with this index.
>>>
>>>
>>> Now run this little program
>>>
>>> public static void main(final String[] _args) throws Exception {
>>> File index = new File(_args[0]);
>>> IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_4_10_0,
>>> new StandardAnalyzer());
>>> iwcfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
>>> Directory d = FSDirectory.open(index, new SimpleFSLockFactory(index));
>>> IndexWriter iw = new IndexWriter(d, iwcfg);
>>> Document doc1 = new Document();
>>> doc1.add(new StringField("type", "test", Field.Store.NO));
>>> iw.addDocument(doc1);
>>> iw.close();
>>> }
>>>
>>> and it fails with
>>>
>>> Exception in thread "main" java.lang.IllegalStateException: cannot write 3x
>>> SegmentInfo unless codec is Lucene3x (got: Lucene40) at
>>> org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)
>>> at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:524)
>>> at
>>> org.apache.lucene.index.SegmentInfos.prepareC

Re: 4.10.0: java.lang.IllegalStateException: cannot write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)

2014-09-10 Thread Robert Muir
Ian, this looks terrible, thanks for reporting this. Is there any
possible way I could have a copy of that "working" index to make it
easier to reproduce?

On Wed, Sep 10, 2014 at 7:01 AM, Ian Lea  wrote:
> Hi
>
>
> On running a quick test after a handful of minor code changes to deal
> with 4.10 deprecations, a program that updates an existing index
> failed with
>
> Exception in thread "main" java.lang.IllegalStateException: cannot
> write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)
> at org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)
>
> and along the way did something to the index to make it unusable.
>
> Digging a bit deeper and working on a different old test index that
> was lying around, and taking a backup first this time, this is
> reproducible.
>
> The working index:
>
> total 1036
> -rw-r--r-- 1 tril users 165291 Jan 18  2013 _0.fdt
> -rw-r--r-- 1 tril users 125874 Jan 18  2013 _0.fdx
> -rw-r--r-- 1 tril users   1119 Jan 18  2013 _0.fnm
> -rw-r--r-- 1 tril users 378015 Jan 18  2013 _0_Lucene40_0.frq
> -rw-r--r-- 1 tril users 350628 Jan 18  2013 _0_Lucene40_0.tim
> -rw-r--r-- 1 tril users  13988 Jan 18  2013 _0_Lucene40_0.tip
> -rw-r--r-- 1 tril users    311 Jan 18  2013 _0.si
> -rw-r--r-- 1 tril users     69 Jan 18  2013 segments_2
> -rw-r--r-- 1 tril users     20 Jan 18  2013 segments.gen
>
> and output from 4.10 CheckIndex
>
> Opening index @ index/
>
> Segments file=segments_2 numSegments=1 version=4.0.0.2 format=
>   1 of 1: name=_0 docCount=15730
> version=4.0.0.2
> codec=Lucene40
> compound=false
> numFiles=7
> size (MB)=0.987
> diagnostics = {os=Linux, os.version=3.1.0-1.2-desktop,
> source=flush, lucene.version=4.0.0 1394950 - rmuir - 2012-10-06
> 02:58:12, os.arch=amd64, java.version=1.7.0_10, java.vendor=Oracle
> Corporation}
> no deletions
> test: open reader.OK
> test: check integrity.OK
> test: check live docs.OK
> test: fields..OK [13 fields]
> test: field norms.OK [0 fields]
> test: terms, freq, prox...OK [53466 terms; 217447 terms/docs
> pairs; 139382 tokens]
> test: stored fields...OK [15730 total field count; avg 1 fields per 
> doc]
> test: term vectorsOK [0 total vector count; avg 0
> term/freq vector fields per doc]
> test: docvalues...OK [0 docvalues fields; 0 BINARY; 0
> NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
>
> No problems were detected with this index.
>
>
> Now run this little program
>
> public static void main(final String[] _args) throws Exception {
> File index = new File(_args[0]);
> IndexWriterConfig iwcfg = new IndexWriterConfig(Version.LUCENE_4_10_0,
> new StandardAnalyzer());
> iwcfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
> Directory d = FSDirectory.open(index, new SimpleFSLockFactory(index));
> IndexWriter iw = new IndexWriter(d, iwcfg);
> Document doc1 = new Document();
> doc1.add(new StringField("type", "test", Field.Store.NO));
> iw.addDocument(doc1);
> iw.close();
> }
>
> and it fails with
>
> Exception in thread "main" java.lang.IllegalStateException: cannot
> write 3x SegmentInfo unless codec is Lucene3x (got: Lucene40)
> at org.apache.lucene.index.SegmentInfos.write3xInfo(SegmentInfos.java:607)
> at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:524)
> at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:1017)
> at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4549)
> at 
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3062)
> at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3169)
> at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:915)
> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:986)
> at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:956)
> at t.main(t.java:25)
>
> and when run CheckIndex again get
>
>
> Opening index @ index/
>
> ERROR: could not read any segments file in directory
> java.nio.file.NoSuchFileException: /tmp/lucene/index/_0.si
> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at 
> sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:176)
> at java.nio.channels.FileChannel.open(FileChannel.java:287)
> at java.nio.channels.FileChannel.open(FileChannel.java:334)
> at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:196)
> at 
> org.apache.lucene.codecs.lucene40.Lucene40SegmentInfoReader.read(Lucene40SegmentInfoReader.java:52)
> at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:362)
> at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:458)
> at 
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:913)
> at 
> org.apache.lucene.i
