Re: Using Lucene for technical documentation

2020-11-23 Thread Erick Erickson
You might be able to get something “good enough” with one of the pattern 
tokenizers, see: https://lucene.apache.org/solr/guide/8_6/tokenizers.html.

Won’t be 100% of course.

And Paul’s comments are well taken, especially since your input will be 
inconsistent I’d guess. How much do you want to bet that the same document will 
have "the abort() function” in one paragraph and "the abort function” in the 
next with abort italicized?
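
If you do try the pattern route, here’s a minimal sketch (the regex and class name are 
mine, purely illustrative, and it needs the analyzers-common jar) that makes “abort()” 
and “abort” come out as the same term by keeping only the word characters:

    import java.io.StringReader;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.pattern.PatternTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class FunctionNameTokenizerDemo {
      public static void main(String[] args) throws Exception {
        // Group 1 keeps the identifier; a trailing "()" is swallowed by the match.
        Pattern p = Pattern.compile("(\\w+)(?:\\(\\))?");
        PatternTokenizer tok = new PatternTokenizer(p, 1);
        tok.setReader(new StringReader("the abort() function ... the abort function"));
        CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
        tok.reset();
        while (tok.incrementToken()) {
          System.out.println(term.toString()); // the, abort, function, the, abort, function
        }
        tok.end();
        tok.close();
      }
    }

As above, it won’t be 100%, but it normalizes the common “abort()” vs. “abort” case.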

Best,
Erick

> On Nov 23, 2020, at 2:42 AM, Trevor Nicholls  
> wrote:
> 
> the abort() function


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Migration Query

2020-11-22 Thread Erick Erickson
If you created your index with 7x, you don’t need to do anything; 8x will be 
able to operate with it. If you ever used 6x to index any docs, you must reindex 
completely by deleting the entire index and starting over, or index to a new 
collection and use collection aliasing to seamlessly switch.

Best,
Erick

> On Nov 22, 2020, at 7:48 AM, Adarsh Sunilkumar  
> wrote:
> 
> Hi Team,
> 
> Currently I am using Lucene 7.3, I want to upgrade to lucene 8.5.1. Should
> I do reindexing in this case ?
> Can I make use of backward codec jar without a reindex?
> 
> 
> 
> Thanks & Regards,
> Adarsh Sunilkumar.


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene Migration query

2020-11-20 Thread Erick Erickson
The IndexUpgraderTool does a forceMerge(1). If you have a large index,
that has its own problems, but it will work. The threshold for the issues is
5G. See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
I should emphasize that if you end up with a very large single segment as a
result, it’ll eventually shrink as it accumulates deleted (or updated) documents
and is merged away; it’ll just require a bunch of I/O amortized over time.
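
For reference, a minimal sketch of invoking it from code (the index path comes from the 
command line; the lucene-backward-codecs jar must be on the classpath):

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexUpgrader;
    import org.apache.lucene.store.FSDirectory;

    public class UpgradeIndex {
      public static void main(String[] args) throws Exception {
        // Rewrites all segments in the current format; as noted above, this is a forceMerge(1).
        new IndexUpgrader(FSDirectory.open(Paths.get(args[0]))).upgrade();
      }
    }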

IndexUpgraderTool will _not_ allow you to take an index originally created with
7x to be used in 9x. (Uwe, I’ve been telling people this for a long time, if 
I’ve
been lying please let me know!). Starting with Lucene 6, a version is written 
into
each segment. Upon merge, the lowest version stamp is preserved. Lucene
will refuse to open an index where _any_ segment has a version stamp X-2 or
older.

Best,
Erick

> On Nov 20, 2020, at 7:57 AM, Michael Sokolov  wrote:
> 
> I think running the upgrade tool would also be necessary to set you up for
> the next upgrade, when 9.0 comes along.
> 
> On Fri, Nov 20, 2020, 4:25 AM Uwe Schindler  wrote:
> 
>> Hi,
>> 
>>> Currently I am using Lucene 7.3, I want to upgrade to lucene 8.5.1.
>> Should
>>> I do reindexing in this case ?
>> 
>> No, you don't need that.
>> 
>>> Can I make use of backward codec jar without a reindex?
>> 
>> Yes, just add the JAR file to your classpath and it can read the indexes.
>> Updates written to the index will use the new codecs. To force a full
>> upgrade (rewrite all segments), invoke the IndexUpgrader class either from
>> your code or using the command line. But this is not needed, it just makes
>> sure that you can get rid of the backwards-codecs jar.
>> 
>> Uwe
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-13 Thread Erick Erickson
What does “final finished sizes” mean? After optimize, or just after finishing 
all indexing?
The former is what counts here.

And you provided no information on the number of deleted docs in the two cases. 
Is 
the number of deletedDocs the same (or close)? And does the q=*:* query
return the same numFound?

Finally, are you absolutely and totally sure that no other options changed? For 
instance, did you specify docValues=true for some field in one but not the other? Or 
stored=true, etc.? Are you using the same schema?

And you also haven’t provided information on what versions of Solr you’re talking 
about. You mention 7.7.2, but not the _other_ version of Solr. If you’re going from 
one major version to another, the defaults sometimes change, especially for docValues 
on primitive fields. I’d consider firing up Luke and examining the field definitions 
in detail.

Best,
Erick

> On Nov 13, 2020, at 12:16 AM, baris.ka...@oracle.com wrote:
> 
> Hi,-
> Thanks.
> These are final finished sizes in both cases.
> Best regards
> 
> 
>> On Nov 12, 2020, at 11:12 PM, Erick Erickson  wrote:
>> 
>> Yes, that issue is fixed. The “Resolution” tag is the key, it’s marked 
>> “fixed” and the version is 8.0
>> 
>> As for your other question, index size is a very imprecise number. How many 
>> deleted documents are there
>> in each case? Deleted documents take up disk space until the segments 
>> containing them are merged away.
>> 
>> Best,
>> Erick
>> 
>>> On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
>>> 
>>> https://urldefense.com/v3/__https://issues.apache.org/jira/browse/LUCENE-8448__;!!GqivPVa7Brio!I3RsAXIoDcPmpP_sc8C29vn8DcAXSvIgH7pvcxyDaBnfhdJAk24zPpQhqP035V1IJA$
>>>  
>>> 
>>> 
>>> Hi,-
>>> 
>>> is this issue fixed please? Could You please help me figure it out?
>>> 
>>> Best regards
>>> 
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: https://issues.apache.org/jira/browse/LUCENE-8448

2020-11-12 Thread Erick Erickson
Yes, that issue is fixed. The “Resolution” tag is the key, it’s marked “fixed” 
and the version is 8.0

As for your other question, index size is a very imprecise number. How many 
deleted documents are there
in each case? Deleted documents take up disk space until the segments 
containing them are merged away.

Best,
Erick

> On Nov 12, 2020, at 5:35 PM, baris.ka...@oracle.com wrote:
> 
> https://issues.apache.org/jira/browse/LUCENE-8448
> 
> 
> Hi,-
> 
>  is this issue fixed please? Could You please help me figure it out?
> 
> Best regards
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Which Lucene 8.5.X is recommended?

2020-11-12 Thread Erick Erickson
Always use the most recent point release. The only time we go from x.y.z to 
x.y.z+1 is if there are _significant_ problems. This is much different than 
going from x.y to x.y+1...

> On Nov 12, 2020, at 5:49 PM, baris.ka...@oracle.com wrote:
> 
> Hi,-
> 
>  is it best to use 8.5.2?
> 
> Best regards
> 
> 
> 
> Release 8.5.2
> Bug Fixes   (1)
> LUCENE-9350: Partial reversion of LUCENE-9068; holding levenshtein automata 
> on FuzzyQuery can end up blowing up query caches which use query objects as 
> cache keys, so building the automata is now delayed to search time again.
> (Alan Woodward, Mike Drob
> 
> 
> Release 8.5.1 [2020-04-16]
> Bug Fixes   (1)
> LUCENE-9300: Fix corruption of the new gen field infos when doc values 
> updates are applied on a segment created externally and added to the index 
> with IndexWriter#addIndexes(Directory).
> (Jim Ferenczi, Adrien Grand)
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: BooleanQuery: BooleanClause.Occur.MUST_NOT seems to require at least one BooleanClause.Occur.MUST

2020-11-06 Thread Erick Erickson
Nissim:

Here’s a good explanation of why it was designed this way
if you’d like details:

https://lucidworks.com/post/why-not-and-or-and-not/

Don’t be put off by the Solr title; it’s really about
BooleanQuery and BooleanClause.
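
If you just need the negative query to return hits, the usual workaround is to add a 
positive match-all clause; a minimal sketch (field and term are made up):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    Query q = new BooleanQuery.Builder()
        // positive clause: matches everything; FILTER contributes no score
        .add(new MatchAllDocsQuery(), BooleanClause.Occur.FILTER)
        // negative clause: exclude docs containing the term
        .add(new TermQuery(new Term("body", "unwanted")), BooleanClause.Occur.MUST_NOT)
        .build();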

Best,
Erick

> On Nov 6, 2020, at 8:17 AM, Adrien Grand  wrote:
> 
> Hi Nissim,
> 
> This is by design: boolean queries that don't have positive clauses like
> empty boolean queries or boolean queries that only consist of negative
> (MUST_NOT) clauses don't match any hits.
> 
> On Thu, Nov 5, 2020 at 9:07 PM Nissim Shiman 
> wrote:
> 
>> Hello Apache Lucene team members,
>> I have found that constructing a BooleanQuery with just
>> a BooleanClause.Occur.MUST_NOT will return no results.  It will return
>> results is if there is also a BooleanClause.Occur.MUST as part of the query
>> as well though.
>> 
>> 
>> I don't see this limitation with a BooleanQuery with just
>> a BooleanClause.Occur.MUST (i.e. results will return fine if they match).
>> 
>> Is this by design or is this an issue?
>> 
>> Thanks You,
>> Nissim Shiman
> 
> 
> 
> -- 
> Adrien


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Erick Erickson
This is going to be fairly painful. You need to keep a list 6.5M
items long, sorted.

Before diving in there, I’d really back up and ask what the use-case
is. Returning 6.5M docs to a user is useless, so I assume you’re doing
some kind of analytics maybe? In which case, and again
assuming you’re using Solr, Streaming Aggregation might
be a better option.

This really sounds like an XY problem. You’re trying to solve problem X
and asking how to accomplish it with Y. What I’m questioning
is whether Y (grouping) is a good approach or not. Perhaps if
you explained X there’d be a better suggestion.

Best,
Erick

> On Oct 9, 2020, at 8:19 AM, Dmitry Emets  wrote:
> 
> I have 12_000_000 documents, 6_500_000 groups
> 
> With sort: It takes around 1 sec without grouping, 2 sec with grouping and
> 12 sec with setAllGroups(true)
> Without sort: It takes around 0.2 sec without grouping, 0.6 sec with
> grouping and 10 sec with setAllGroups(true)
> 
> Thank you, Erick, I will look into it
> 
> пт, 9 окт. 2020 г. в 14:32, Erick Erickson :
> 
>> At the Solr level, CollapsingQParserPlugin see:
>> https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html
>> 
>> You could perhaps steal some ideas from that if you
>> need this at the Lucene level.
>> 
>> Best,
>> Erick
>> 
>>> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
>> dceccarel...@bloomberg.net> wrote:
>>> 
>>> Is the field that you are using to dedupe stored as a docvalue?
>>> 
>>> From: java-user@lucene.apache.org At: 10/09/20 12:18:04To:
>> java-user@lucene.apache.org
>>> Subject: Deduplication of search result with custom with custom sort
>>> 
>>> Hi,
>>> I need to deduplicate search results by specific field and I have no idea
>>> how to implement this properly.
>>> I have tried grouping with setGroupDocsLimit(1) and it gives me expected
>>> results, but has not very good performance.
>>> I think that I need something like DiversifiedTopDocsCollector, but
>>> suitable for collecting TopFieldDocs.
>>> Is there any possibility to achieve deduplication with existing lucene
>>> components, or do I need to implement my own
>> DiversifiedTopFieldsCollector?
>>> 
>>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Deduplication of search result with custom with custom sort

2020-10-09 Thread Erick Erickson
At the Solr level, CollapsingQParserPlugin see:
https://lucene.apache.org/solr/guide/8_6/collapse-and-expand-results.html

You could perhaps steal some ideas from that if you
need this at the Lucene level.
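
For reference, the Solr-side request looks roughly like this (groupId is a 
hypothetical field, and expand is optional):

    q=*:*&fq={!collapse field=groupId}&expand=true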

Best,
Erick

> On Oct 9, 2020, at 7:25 AM, Diego Ceccarelli (BLOOMBERG/ LONDON) 
>  wrote:
> 
> Is the field that you are using to dedupe stored as a docvalue? 
> 
> From: java-user@lucene.apache.org At: 10/09/20 12:18:04To:  
> java-user@lucene.apache.org
> Subject: Deduplication of search result with custom with custom sort
> 
> Hi,
> I need to deduplicate search results by specific field and I have no idea
> how to implement this properly.
> I have tried grouping with setGroupDocsLimit(1) and it gives me expected
> results, but has not very good performance.
> I think that I need something like DiversifiedTopDocsCollector, but
> suitable for collecting TopFieldDocs.
> Is there any possibility to achieve deduplication with existing lucene
> components, or do I need to implement my own DiversifiedTopFieldsCollector?
> 
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Exact sub-phrase matching?

2020-09-25 Thread Erick Erickson
Have you looked at edismax, pf2 and pf3?
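
As a sketch (field name and boosts are made up), the edismax request parameters would 
look roughly like this; pf2/pf3 boost documents where word pairs and triples from the 
query appear as phrases:

    defType=edismax&q=christmas tree skirt&qf=field&pf=field^5&pf2=field^3&pf3=field^4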

On Fri, Sep 25, 2020, 15:07 Gregg Donovan  wrote:

> Hello!
>
> I'm wondering what the state-of-the-art for matching exact sub phrases
> within Lucene is. As a bonus, I'd love to attach a boost to each of the
> subphrases matched (if possible).
>
> For example:
>
> doc 1: "field": "tree skirt  spring skirt 
> spring dress"
> doc 2: "field": "christmas tree skirt  winter skirt  gap> christmas dress"
> doc 3: "field" "skirt  spring dress  dress"
>
> query: christmas tree skirt
>
> This should match doc 1 and 2 but not doc 3. I'd like to also to score doc
> 2 higher for having a longer match. Ideally, I'd love to add a score to
> each of these phrases and use that at scoring time, too.
>
> Thanks!
>
> Gregg Donovan
> Senior Staff Software Engineer, Etsy.com
>


Re: I resurrected a 2013 project (Lucene 4.2) and I want to convert it to 8.6

2020-08-04 Thread Erick Erickson
Well, a _lot_ has changed since 4.x. Rather than look through the code, I’d
start with the reference guide and the upgrade notes and major changes
that accompany any release. 

As for “official dictionaries”, no there aren’t. “somewhere out on the web”
there are certainly various word lists you can download. The problem is
that almost every Solr installation is specialized. An e-commerce site better
have a lot of brand names. Insurance usages need medical terms. 
Chemistry… oh my aching head.
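
If it helps as a starting point, here’s a minimal sketch using the SpellChecker and 
PlainTextDictionary classes from the lucene-suggest module (file names and paths are 
made up; the word list is one word per line):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.PlainTextDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.FSDirectory;

    public class SpellDemo {
      public static void main(String[] args) throws Exception {
        try (FSDirectory spellDir = FSDirectory.open(Paths.get("/tmp/spell-index"));
             SpellChecker checker = new SpellChecker(spellDir)) {
          // build (or rebuild) the spelling index from a plain word list
          checker.indexDictionary(new PlainTextDictionary(Paths.get("/tmp/english-words.txt")),
              new IndexWriterConfig(new StandardAnalyzer()), true);
          for (String suggestion : checker.suggestSimilar("recieve", 5)) {
            System.out.println(suggestion);
          }
        }
      }
    }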

Best,
Erick

> On Aug 4, 2020, at 12:39 AM, Ali Akhtar  wrote:
> 
> You could probably google for a dictionary and download a text file. For
> English, there is Wordnet which has a java client for accessing it.
> 
> I think you would use a FuzzyQuery or QueryParser with a tilde (-) to
> indícate the terms you’d like to do the spellcheck for. This will find
> terms within a 2 edit distance.
> 
> 
> 
> On Tue, 4 Aug 2020 at 4:17 AM, Sébastien Dionne 
> wrote:
> 
>> hello,  first, there is a google forum or other site to see the questions
>> in the mailing-list ?
>> 
>> my project was using dictionary indexed + files that I wanted to check for
>> spelling errors + suggestions.
>> 
>> I try for fun to just update the maven dependencies and my code doesn't
>> compile.. it was expected :)
>> 
>> so I'll write it from scratch ..will be cleaner too.
>> 
>> I used dictionaries from wiktionary and I used a script to convert hunspell
>> dictionaries to wordlist format at that time.
>> 
>> There must be official dictionaries that I can used directly now ?
>> 
>> I found a project languagetool that have lot of dictionaries and they use
>> lucene + hunspell wrapper (native -> java), but it doesn't work on Windows
>> 10.
>> 
>> 
>> At my starting point, I want to create a little POC that use english/french
>> dictionaries and parse a file to check the spelling error.
>> 
>> After that, add custom dictionnaries + find suggestions + highlight the
>> word in the text.  That was I had with Lucene 4.2
>> 
>> any thought on what changes since 2013 ?  I'll start looking at the code
>> from github
>> 
>> thanks
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Ulimit recommendation for Apache Lucene 6.5.1

2020-07-14 Thread Erick Erickson
You’re probably OK, but why risk it? Setting the ulimits that
high doesn’t really negatively impact much of anything. Leaving
them low leaves you open to problems in the future.

I see no reason to keep them low, even in this situation.

Best,
Erick

> On Jul 14, 2020, at 5:10 PM, Ali Akhtar  wrote:
> 
> If you cache the IndexSearcher and only have a couple of segments, and it’s
> a read only system (indexing is done just once), would it still open a lot
> of files?
> 
> On Tue, 14 Jul 2020 at 7:05 PM, Erick Erickson 
> wrote:
> 
>> At least 65K. Yes, 65 thousand. Ditto for processes.
>> 
>>> On Jul 14, 2020, at 8:35 AM, Archana A M 
>> wrote:
>>> 
>>> Dear Team,
>>> 
>>> We are getting "too many open files" in server while trying to access
>>> apache Lucene cache.
>>> 
>>> Could someone please suggest the recommended open file limit while using
>>> apache Lucene in the application.
>>> 
>>> Please find the relevant details below
>>> 
>>> Lucene version - 6.5.1
>>> 
>>> current ulimit soft limit - 1024
>>> 
>>> current ulimit hard limit - 4096
>>> 
>>> Server - Jbosseap 7.1
>>> 
>>> Thanks
>>> 
>>> Archana
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Ulimit recommendation for Apache Lucene 6.5.1

2020-07-14 Thread Erick Erickson
At least 65K. Yes, 65 thousand. Ditto for processes.

> On Jul 14, 2020, at 8:35 AM, Archana A M  wrote:
> 
> Dear Team,
> 
> We are getting "too many open files" in server while trying to access
> apache Lucene cache.
> 
> Could someone please suggest the recommended open file limit while using
> apache Lucene in the application.
> 
> Please find the relevant details below
> 
> Lucene version - 6.5.1
> 
> current ulimit soft limit - 1024
> 
> current ulimit hard limit - 4096
> 
> Server - Jbosseap 7.1
> 
> Thanks
> 
> Archana


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Storing Json field in Lucene

2020-04-22 Thread Erick Erickson
"Is it good idea to store complete Json as string to Lucene DB. If we store as 
separate fields then we have around 30 fields. There will be 30 seeks to get 
complete stored fields”

This is not true. Under the covers, all the stored fields are compressed and 
stored as a blob and Lucene does the magic of un-compressing that blob and 
extracting the stored field when you ask for it.

Further, while you’re right that storing lots of things will bloat the index, 
that’s not very important. Stored data is kept in separate files (*.fdt, with a *.fdx index) in 
each segment and has little to no impact on search performance. That data is 
not accessed unless you ask for the field to be returned, i.e. it’s not part of 
the data used to get the top N documents. Say you have a search that has 
10,000,000 hits and return the top 10. _Only_ the stored data for those top 10 
hits is accessed, and that only after all the scoring is done.

I think this is premature optimization; try using the least-complex way of 
organizing your data and measure.
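
To make that concrete, a minimal sketch (field names and values are made up): keep the 
raw JSON in one stored-only field and index whatever needs to be searchable as 
separate, unstored fields:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.document.TextField;

    String jsonString = "{\"title\":\"Example\",\"price\":10}";
    String titleFromJson = "Example";

    Document doc = new Document();
    // stored only, never searched: the original JSON blob
    doc.add(new StoredField("_source", jsonString));
    // indexed for search, not stored (it can be reconstructed from _source)
    doc.add(new TextField("title", titleFromJson, Field.Store.NO));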

Best,
Erick

> On Apr 22, 2020, at 1:00 AM, ganesh m  wrote:
> 
> Is it good idea to store complete Json as string to Lucene DB. If we store as 
> separate fields then we have around 30 fields. There will be 30 seeks to get 
> complete stored fields


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene download page

2020-02-23 Thread Erick Erickson
No, 7.7.2 was a patch fix that _was_ released after 8.1.1.

> On Feb 22, 2020, at 2:49 PM, baris.ka...@oracle.com wrote:
> 
> Hi,-
> 
>  i hope everyone is doing great.
> 
> Lucene 7.7.2 is listed as released after Lucene 8.1.1 on this 
> page 
> https://lucene.apache.org/core/corenews.html#apache-lucenetm-841-available
> 
> I think the order may need to change there.
> 
> Best regards
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Searching number of tokens in text field

2019-12-30 Thread Erick Erickson
This comes up occasionally, it’d be a neat thing to add to Solr if you’re 
motivated. It gets tricky though.

- part of the config would have to be the name of the length field to put the 
result into, that part’s easy.

- The trickier part is “when should the count be incremented?”. For instance, 
say you add 15 synonyms for a particular word. Would that add 1 or 16 to the 
count? What about WordDelimiterGraphFilterFactory, that can output N tokens in 
place of one. Do stopwords count? What about shingles? CJK languages? The list 
goes on.

If you tackle this I suggest you open a JIRA for discussion, probably a Lucene 
JIRA ‘cause the folks who deal with Lucene would have the best feedback. And 
probably ignore most of the possible interactions with other filters and 
document that most users should just put it immediately after the tokenizer and 
leave it at that ;)

I can think of a few other options, but about the only thing that I think makes 
sense is something like “countTokensInTheSamePosition=true|false” (there’s 
_GOT_ to be a better name for that!), defaulting to false so you could control 
whether synonym expansion and WDGFF insertions incremented the count or not. 
And I suspect that if you put such a filter after WDGFF, you’d also want to 
document that it should go after FlattenGraphFilterFactory, but trust any 
feedback on a Lucene JIRA over my suspicion...
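
To make the idea concrete, here’s a minimal sketch along the lines Mike suggests 
further down in the thread (the class name and marker-token format are made up; this 
is not an existing Lucene filter): count positions and append a marker token at end of 
stream.

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public final class TokenCountFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PositionIncrementAttribute posIncAtt =
          addAttribute(PositionIncrementAttribute.class);
      private int count = 0;
      private boolean emitted = false;

      public TokenCountFilter(TokenStream in) {
        super(in);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
          if (posIncAtt.getPositionIncrement() > 0) {
            count++;            // count positions, so stacked synonyms add only 1
          }
          return true;
        }
        if (!emitted) {         // input exhausted: emit the length marker once
          emitted = true;
          clearAttributes();
          termAtt.setEmpty().append("__len_" + count);
          posIncAtt.setPositionIncrement(1);
          return true;
        }
        return false;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        count = 0;
        emitted = false;
      }
    }

You could then search for the marker term (e.g. "__len_3") to find three-word titles.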

Best,
Erick

> On Dec 29, 2019, at 7:57 PM, Matt Davis  wrote:
> 
> That is a clever idea.  I would still prefer something cleaner but this
> could work.  Thanks!
> 
> On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov  wrote:
> 
>> I don't know of any pre-existing thing that does exactly this, but how
>> about a token filter that counts tokens (or positions maybe), and then
>> appends some special token encoding the length?
>> 
>> On Sat, Dec 28, 2019, 9:36 AM Matt Davis  wrote:
>> 
>>> Hello,
>>> 
>>> I was wondering if it is possible to search for the number of tokens in a
>>> text field.  For example find book titles with 3 or more words.  I don't
>>> mind adding a field that is the number of tokens to the search index but
>> I
>>> would like to avoid analyzing the text two times.   Can Lucene search for
>>> the number of tokens in a text field?  Or can I get the number of tokens
>>> after analysis and add it to the Lucene document before/during indexing?
>>> Or do I need to analysis the text myself and add the field to the
>> document
>>> (analyze the text twice, once myself, once in the IndexWriter).
>>> 
>>> Thanks,
>>> Matt Davis
>>> 
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene index directory grows and shrinks

2019-11-04 Thread Erick Erickson
Here’s a neat visualization: 
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

The short form is this: 

- A “segment” is all the files with a particular prefix in your index 
directory, e.g.  _12ey1* is one segment
- Segments are created as documents are indexed and commits occur.
- Periodically, segments are “merged”, that is some number of segments are 
combined into a single new segment and then the old segments are deleted.
- During the merge, both the old and new segments occupy index space.
- Deleted documents continue to occupy disk space until the segments containing 
them are merged. NOTE: updating the same document deletes the old version and 
adds a new one, so that is a “deleted” document for this discussion.

So it’s quite common for deletes to accumulate until they are merged away. You 
have two sources of fluctuation:
1> deleted docs
2> the merging process.
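
If you want to see this from code rather than from the directory listing, here’s a 
minimal sketch against a recent Lucene API (the 3.6.0 that Jackrabbit bundles names 
these classes differently) that prints live vs. deleted docs per segment:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.store.FSDirectory;

    public class SegmentStats {
      public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             DirectoryReader reader = DirectoryReader.open(dir)) {
          for (LeafReaderContext ctx : reader.leaves()) {
            System.out.println(ctx.reader() + ": " + ctx.reader().numDocs()
                + " live docs, " + ctx.reader().numDeletedDocs() + " deleted docs");
          }
        }
      }
    }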

And in your case, I see one segment around 25G. That indicates your index has 
been optimized at some point, and also I’d guess you’re on Lucene prior to 
release 7.5, so whenever you optimize again, _all_ segments will be merged 
into a single new segment, meaning your index will _at least_ double in size 
temporarily.

Now, how this happens, you’d have to ask the jackrabbit folks since I don’t 
know that app either. 

For the gory details on optimize, see: 
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/.
 Even though that’s labeled Solr, it’s really about Lucene and the doc applies 
to anything that uses Lucene with the Tiered Merge Policy (which has been the 
default for some time). Although whether jackrabbit does anything with this I 
don’t have a clue.

Best,
Erick


> On Nov 4, 2019, at 11:19 AM, Raffaele Gambelli  wrote:
> 
> As far as you know, is this behaviour, which you called "typical", described 
> in depth somewhere?
> 
> It is fundamental for me to better understand it, and to know how big an 
> index can grow, so that I can allocate the right disk space.
> 
> Thank you very much
> 
> -Messaggio originale-
> Da: Raffaele Gambelli  
> Inviato: lunedì 4 novembre 2019 15:16
> A: java-user@lucene.apache.org
> Oggetto: R: Lucene index directory grows and shrinks
> 
> Thanks for your quick reply. I'm quite a beginner in Lucene concepts; 
> Jackrabbit hides almost everything about the way it uses Lucene internally.
> 
> Anyway here it is the size of each sub-directory in my index, please note the 
> bigger one, 25G,  is it normal?
> 
> ...repository/workspaces/default/index$ du -h .
> 2.5G   ./_12ey1
> 14M    ./_1dr9s
> 20M    ./_1dr8d
> 2.8G   ./_1b9pj
> 5.8M   ./_1drqc
> 19M    ./_1dr4q
> 2.5G   ./_17lmu
> 4.0M   ./_1drmx
> 11M    ./_1drbf
> 4.3M   ./_1drok
> 13M    ./_1drq1
> 40K    ./_1drqe
> 11M    ./_1drhc
> 260M   ./_1dr3g
> 664M   ./_1by44
> 2.5G   ./_14tet
> 281M   ./_1c4wj
> 25G    ./_zzgq
> 274M   ./_1d2nc
> 638M   ./_1ctf0
> 580K   ./_1drqf
> 304K   ./_1drqd
> 6.5M   ./_1dr6m
> 325M   ./_1djfp
> 37G
> 
> I tried also to download index directory to my local machine, to inspect them 
> with Luke which I know a bit, but for network problem the download always 
> interrupts.
> 
>> What is your segment size limit?
> 
> I don't know, where could I see that limit?
> 
>> Have you changed the default merge frequency or max segments configuration?
> 
> Merge frequency is the mergeFactor? If yes, I'm using the default, which is 10, 
> read here https://jackrabbit.apache.org/archive/wiki/JCR/Search_115513504.html
> 
> Max segment I don't know, where could I see it?
> 
> Bye
> 
> -Messaggio originale-
> Da: Sharma 
> Inviato: lunedì 4 novembre 2019 14:46
> A: java-user@lucene.apache.org
> Oggetto: Re: Lucene index directory grows and shrinks
> 
> These are typical symptoms of an index merge.
> 
> However, it is hard to predict more without knowing more data. What is your 
> segment size limit? Have you changed the default merge frequency or max 
> segments configuration? Would you have an estimate of ratio of number of 
> segments reaching max limit / total segments?
> 
> Atri
> 
> On Mon, Nov 4, 2019 at 7:12 PM Raffaele Gambelli  
> wrote:
>> 
>> Hi all,
>> 
>> I'm using Jackrabbit 2.18.0 which uses lucene-core 3.6.0.
>> 
>> I'm working on an application whose index directory has reached 37 GB; a 
>> few days ago, disk occupancy quickly reached 100% and then returned to 
>> pre-growth levels.
>> 
>> I believe that was caused by a rapid growth of Lucene index directory, 
>> looking for such an event I've found only this article describing 
>> something really similar 
>> https://helpx.adobe.com/uk/experience-manager/kb/lucene-index-director
>> y-growth.html
>> 
>> I would like to know more info about this behaviour, first of all can you 
>> confirm this growth and shrinkage?
>> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For a

Re: Getting a MaxBytesLengthExceededException for a TextField

2019-10-25 Thread Erick Erickson
Text-based fields indeed do not have that limit for the _entire_ field. They 
_do_ have that limit for any single token produced. So if your field contains, 
say, a base-64 encoded image that is not broken up into smaller tokens, you’ll 
still get this error.
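
One common mitigation, sketched below with Lucene’s CustomAnalyzer (in Solr the 
equivalent is a solr.LengthFilterFactory in the fieldType’s analyzer chain; the cap of 
8000 here is illustrative): drop any single token longer than the cap instead of 
rejecting the document.

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    public class CappedAnalyzerDemo {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            // silently drop tokens longer than 8000 chars (at most 3 bytes per Java char
            // in UTF-8, so comfortably below the 32766-byte term limit)
            .addTokenFilter("length", "min", "1", "max", "8000")
            .build();
        // pass this analyzer to your IndexWriterConfig
        System.out.println("analyzer ready: " + analyzer);
      }
    }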

Best,
Erick

> On Oct 25, 2019, at 4:28 AM, Marko Ćurlin  
> wrote:
> 
> Hi everyone,
> 
> I am getting an
> org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException, while
> trying to insert a list with 9 elements, of which one is 242905 bytes long,
> into Solr.  I am aware that StrField has a hard limit of slightly less than
> 32k. I am using a TextField that by my understanding hasn't got such a
> limit, as tested here
> 
> (taking into consideration that the field wasn't multivalued). So I'm
> wondering what is the correlation here, and how could it be solved? Below I
> have the error and the relevant part of the solr managed_schema. I am still
> new to Solr so take into account that there could be something obvious I am
> missing.
> 
> ERROR:
> 
> "error":{
>"metadata":[
>  "error-class","org.apache.solr.common.SolrException",
>  
> "root-error-class","org.apache.lucene.util.BytesRefHash$MaxBytesLengthExceededException",
>  
> "error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException",
>  
> "root-error-class","org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException"],
>"msg":"Async exception during distributed update: Error from
> server at http://solr-host:8983/solr/search_collection_xx: Bad Request
> \n\n request: http://solr-host:8983/solr/search_collection_xx \n\n
> Remote error message: Exception writing document id  to
> the index; possible analysis error: Document contains at least one
> immense term in field=\"text_field_name\" (whose UTF8 encoding is
> longer than the max length 32766), all of which were skipped.  Please
> correct the analyzer to not produce such terms.  The prefix of the
> first immense term is: '[115, 97, 115, 109, 101, 45, 100, 97, 109,
> 101, 46, 99, 111, 109, 47, 108, 121, 99, 107, 97, 47, 37, 50, 50, 37,
> 50, 48, 109, 101, 116]...', original message: bytes can be at most
> 32766 in length; got 242905. Perhaps the document has an indexed
> string field (solr.StrField) which is too large",
>"code":400}
> }
> 
> relevant managed_schema:
> 
> [most of the schema XML was stripped by the mail archiver; what remains shows a 
> field with multiValued="true" type="case_insensitive_text", another element with 
> multiValued="false", and an analyzer definition whose contents were lost]
> 
> 
> Best regards,
> Marko


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Classic QueryParser, StandardQueryParser, Quotes

2019-10-10 Thread Erick Erickson
1> Add &debug=query to the query and look at the parsed query returned. That’ll 
tell you a _lot_ about this kind of question.

2> look at the analysis page of the admin UI for the core and see how your 
field definition handles the tokens once they’re through <1>.

Best,
Erick

> On Oct 10, 2019, at 11:18 AM, Jochen Barth  wrote:
> 
> Dear reader,
> 
> I'm trying to test lucene 8.2.0 as key-value store;
> 
> I know that there are specialized ones like lmdb etc...
> 
> As key I have a StringField, keys can contain space(s), e. g. "a b". I know I 
> should use TermQuery.
> 
> But I've been playing with classic QueryParser, which does not match the 
> indexed keys,
> 
> not with »a backslash space b«
> 
> nor with »quote a space b quote«.
> 
> Now the funny part: The StandardQueryParser does work when querying "a b". It 
> does not match an additional key "a a b", so StandardQueryParser seems not to 
> do a phrase query despite both query parsers refer to the same syntax 
> description.
> 
> Does StandardQueryParser takes field type into account when building the 
> query?
> 
> Kind regards, Jochen
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Beginner Question: Tokenized and full phrase

2019-09-02 Thread Erick Erickson
In the Lucene context you simply have tokens. In the analyzed case (i.e. text), 
the tokens are whatever the analysis chain you construct splits the incoming stream 
into. In the string case the token is the entire input. That’s just the 
way it works.

You have two choices:

1> Use two fields, one text-based and one string based. Your query puts the 
search text against whichever one is appropriate. I’ll add that if you want to 
use limited analysis, say lowercasing the entire input string, use a text-based 
field with something like KeywordTokenizer + LowerCaseFilter rather than a 
string field.


2> Use a text field and do phrase searching when you want the whole thing to 
match. The flaw here is that if the text were “my dog has fleas” and you 
searched for “my dog” (as a phrase), you’d get a match. You can get around that 
by adding another field with the word count and then search something like “my 
dog” AND word_count:2.
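
To make 1> concrete at the Lucene level, a minimal sketch (field names and the 
analyzer wiring are mine, purely illustrative):

    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;

    public class TwoFieldIndexingDemo {
      public static void main(String[] args) throws Exception {
        Analyzer exact = CustomAnalyzer.builder()
            .withTokenizer("keyword")       // the whole value becomes a single token
            .addTokenFilter("lowercase")
            .build();
        // give title_exact its own analyzer; everything else uses StandardAnalyzer
        Analyzer perField = new PerFieldAnalyzerWrapper(new StandardAnalyzer(),
            Map.of("title_exact", exact));

        Document doc = new Document();
        doc.add(new TextField("title", "My dog has fleas", Field.Store.YES));       // word search
        doc.add(new TextField("title_exact", "My dog has fleas", Field.Store.NO));  // whole-value match
        // pass perField to IndexWriterConfig when creating the IndexWriter
        System.out.println(doc);
      }
    }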


Best,
Erick

> On Sep 2, 2019, at 4:38 AM, Roland Käser  wrote:
> 
> Hello, 
> 
> We use Lucene to index POJO's which are stored in the database. 
> The index primarily contains text fields. 
> 
> After some work with lucene I came across a strange restriction. 
> I can only assign string or text fields to the document to be indexed. 
> One only indexes the whole string, the other only the single words or tokens. 
> This results in the query finding only single words or the whole text, 
> depending on the field type used. 
> But we would need both, the search should find the whole text as well as 
> single words. 
> Even after a long analysis of the documentation and partly of the source 
> code, 
> I'm not sure how to achieve that in a clean way. 
> Could someone give me a tip on how to do this? 
> 
> Thanks 
> 
> Roland
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: find documents with big stored fields

2019-07-01 Thread Erick Erickson
Whoa.

First, it should be pretty easy to figure out what fields are large, just look 
at your input documents. The fdt files are really simple, they’re just the 
compressed raw data. Numeric fields, for instance, are just character data in 
the fdt files. We usually see about a 2:1 ratio. There’s no need to look into 
Lucene. You’d have fun getting the info from Lucene anyway since the way stored 
fields are compressed is on a document basis, not on a field basis. Oh, I guess 
you could probably get there through a bunch  of low-level Lucene code, but 
it’d be far faster to just look at your input.

Second, look at your schema. Why are  you storing  certain fields? In 
particular are you storing the _destination_ of any copyField? You don’t need 
to, nor should you.

Third, just changing stored=“true” to stored=“false” will _not_ change the 
index in any way until existing docs are re-indexed. When an existing document 
is re-indexed (or deleted), the doc is only marked as deleted in the segment it 
happens to be in. That data is not reclaimed until that segment is merged, 
which will happen sometime but not necessarily immediately.

Fourth, fdt files aren’t particularly germane to searching, just retrieving the 
result list. It’s not good to have the index be unnecessarily large, but the 
presence of all that stored data is (probably) minimally affecting search 
speed. When replicas go into full recovery, moving extra data lengthens the 
process, and if you’re returning large result lists, reading and decompressing 
lots of data (assuming you’re not returning only docValues fields) is added work. But 
in the usual case of returning 10-20 results it’s not that big a deal. 
I’d still remove unnecessary stored fields, but wouldn’t consider it urgent. 
Just change the definition and continue as normal, things will get smaller over 
time.

So “bottom line”
- I claim you can look at your documents and know, with a high degree of 
accuracy, what’s contributing to your fdt file size.
- You should check your schema to see if you’re doing any copyFields where the 
destination has stored=“true” and change those.
- You’ll have to re-index your docs to see the data size shrink. Note that what 
segments are merged is opaque, don’t expect the index to shrink until you’ve 
re-indexed quite a number of docs. New segments should have much smaller fdt 
files relative to the sum of the other files in that segment.

Best,
Erick

> On Jul 1, 2019, at 2:23 AM, Rob Audenaerde  wrote:
> 
> Hello,
> 
> We are currently trying to investigate an issue where in the index-size is
> disproportionally large for the number of documents. We see that the .fdt
> file is more than 10 times the regular size.
> 
> Reading the docs, I found that this file contains the fielddata.
> 
> I would like to find the documents and/or field names/contents with extreme
> sizes, so we can delete those from the index without needing to re-index
> all data.
> 
> What would be the best approach for this?
> 
> Thanks,
> Rob Audenaerde


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: explainOther SOLR concept?

2019-06-27 Thread Erick Erickson
It’s a Solr-only param for adding to debug=true….
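
A sketch of how it’s used (field, value and id are made up): explainOther takes a 
query that selects the document(s) you expected to see, and the debug output then 
explains those documents against the main query:

    q=name:something&debug=true&explainOther=id:12345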

> On Jun 27, 2019, at 12:11 PM, baris.ka...@oracle.com wrote:
> 
> Hi,-
> 
>  is explainOther a SOLR concept/parameter?
> 
> i think i can only find it in SOLR docs but not pure Lucene docs.
> 
> Best regards
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: how to find out each score contribution from booleanquery components

2019-06-27 Thread Erick Erickson
BTW, if you have the ID of the doc you _think_ should be returned
you can see why it wasn’t by using the explainOther parameter.

> On Jun 27, 2019, at 8:11 AM, András Péteri  
> wrote:
> 
> Hi Baris,
> 
> Explanation's output is hierarchical, and the leading "0.0" values you
> are seeing are the individual contributions of each boolean clause or
> any other nested query.
> 
> Going from bottom to top:
> 
> Term query on countryDFLT = 'states', but no term matched this value
> --> score is 0.0 for the term query "countryDFLT:states"
> Term query is wrapped into a 'must' clause, but the term query scored
> 0.0 --> score is 0.0 for the 'must' boolean clause
> "+countryDFLT:states"
> Term query on countryDFLT = 'united', but no term matched this value
> --> score is 0.0 for the term query "countryDFLT:united"
> Term query is wrapped into a 'must' clause, but the term query scored
> 0.0 --> score is 0.0 for the 'must' boolean clause
> "+countryDFLT:united"
> (The two 'should' clauses with boosts have been optimized out; if a
> single 'must' clause is present, they do not need to match at all,
> unless you have minShouldMatch set on the boolean query)
> Boolean query with two 'must' clauses did not match --> score is 0.0
> for the boolean query "+countryDFLT:states +countryDFLT:united
> (countryDFLT:uniten)^0.4202 (countryDFLT:statesir)^0.56"
> 
> ...and so on.
> 
> So Atri is correct, the index you are running this query on does not
> seem to have a document where either 'united' or 'states' has been
> indexed for field 'countryDFLT' (let alone both). Do the individual
> building blocks, eg. "countryDFLT:united" return any results?
> 
> On Thu, Jun 27, 2019 at 4:33 PM  wrote:
>> 
>> Hi,-
>> 
>> Any ideas on what might be happening?
>> 
>> maybe i am missing, is there an api to look into each contribution of
>> score into the total score from the booleanquery?
>> 
>> Best regards
>> 
>> 
>> 
>> On 6/26/19 2:29 PM, Baris Kazar wrote:
>>> All must queries (and the rest of course) work ok when i search MAINK, 
>>> MAINL, MAINQ,..., MAINT etc.. for street name
>>> with all consonants except S is used and all other fields are the same for 
>>> all queries (NASUA, HILLSBOROUGH, NEW HAMPSHIRE, UNITED STATES)
>>> 
>>> i.e., working means: the top result is correct with MAIN.
>>> 
>>> But with street names MAINS and MAINO (with vowels) I can't get MAIN as the top 
>>> result.
>>> 
>>> I have two theories:
>>> 
>>> either my query plan is too complex to handle MAINS (as there are some 
>>> other MAINS street in the index in other cities and states)
>>> so maybe i need to run each component of booleanquery separately and then 
>>> manually post process them.
>>> 
>>> or my query plan is still not good enough to catch MAIN when i search with 
>>> street MAINS, city NASUA, municipality HILLSBOROUGH, state NEW HAMPSHIRE, 
>>> country UNITED STATES
>>> where the first two are fuzzy as they have errors in them and the rest 
>>> is phrase query as they are correct
>>> 
>>> that is why i want to see each score from each of the component of the 
>>> booleanquery.
>>> so far i checked Lucene but could not find a way to see each contributing 
>>> score to the total score for each result hit document.
>>> 
>>> Best regards
>>> 
>>> 
>>> - Original Message -
>>> From: a...@apache.org
>>> To: java-user@lucene.apache.org
>>> Sent: Wednesday, June 26, 2019 1:09:36 PM GMT -05:00 US/Canada Eastern
>>> Subject: Re: how to find out each score contribution from booleanquery 
>>> components
>>> 
>>> It seems evident that multiple of your Must clauses are not matching any
>>> document, hence no results are being returned?
>>> 
>>> On Wed, 26 Jun 2019 at 6:51 PM,  wrote:
>>> 
 Sure, here is the query plan: (i cant run explain plan as it does not
 give me anything)
 
 [+streetDFLT:maink~2 (streetDFLT:"maine")^0.35, +cityDFLT:nasua~2
 (cityDFLT:"nasuh")^0.35, ++regionDFLT:"new-hampshire"
 (regionDFLT:"new-hammpshire")^0.98, ++countryDFLT:"united"
 (countryDFLT:"uniten")^0.4202 +countryDFLT:"states"
 (countryDFLT:"statesir")^0.56]
 
 
 explain plan gives:
 
 Explanation expl = is.explain(booleanQuery.build(), 10);
 System.out.println(expl);
 
 This prints:
 
 0.0 = Failure to meet condition(s) of required/prohibited clause(s)
0.0 = no match on required clause (+regionDFLT:new-hampshire
 (regionDFLT:new-hammpshire)^0.98)
  0.0 = Failure to meet condition(s) of required/prohibited clause(s)
0.0 = no match on required clause (regionDFLT:new-hampshire)
  0.0 = no matching term
0.0 = no match on required clause (+countryDFLT:united
 (countryDFLT:uniten)^0.4202 +countryDFLT:states
 (countryDFLT:statesir)^0.56)
  0.0 = Failure to meet condition(s) of required/prohibited clause(s)
0.0 = no match on required clause (countryDFLT:united)
  0.0 = no matching term
0.0 = n

Re: Index Optimization

2019-06-25 Thread Erick Erickson
Optimize is rarely useful. It can give some performance gains, but is quite an 
expensive operation. Pre Solr 7.5, optimizing had some behaviors that weren’t 
obvious, see: 
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Post 7.5, the behavior has changed.

If I were going to offer advice, you have two paths:
1> don’t optimize at all. This is my preference
2> optimize after every update, _assuming_ this means you update only daily at 
most.
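
If you do go with 2>, the Lucene-level call is forceMerge; a minimal sketch (the index 
path comes from the command line, the analyzer is a placeholder):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class OptimizeIndex {
      public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer = new IndexWriter(dir,
                 new IndexWriterConfig(new StandardAnalyzer()))) {
          writer.forceMerge(1);   // the "optimize": merge everything down to one segment
          // cheaper alternative: writer.forceMergeDeletes(); // only rewrite segments with many deletes
        }
      }
    }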

Best,
Erick

> On Jun 25, 2019, at 5:34 AM, Eduardo Costa Lopes 
>  wrote:
> 
> Hello folks, 
> 
> I got some Lucene indexes in my project, mostly of them are created once and 
> updated, not so frequently, about once a week or monthly. The indexes sizes 
> are about 20GB and as more inserts are done the indexes grow, so I'd like to 
> know what the best index optimization strategy or even it is really 
> necessary, since 99% of the time we do read operations. The documentation is 
> not clear in some aspects in this subject. If someone could give some tips, 
> I'll be very grateful. 
> 
> Best regrads, 
> Eduardo Lopes. 
> 
> 
> 
> 
> 
> -
> 
> 
> "Esta mensagem do SERVIÇO FEDERAL DE PROCESSAMENTO DE DADOS (SERPRO), empresa 
> pública federal regida pelo disposto na Lei Federal nº 5.615, é enviada 
> exclusivamente a seu destinatário e pode conter informações confidenciais, 
> protegidas por sigilo profissional. Sua utilização desautorizada é ilegal e 
> sujeita o infrator às penas da lei. Se você a recebeu indevidamente, queira, 
> por gentileza, reenviá-la ao emitente, esclarecendo o equívoco."
> 
> "This message from SERVIÇO FEDERAL DE PROCESSAMENTO DE DADOS (SERPRO) -- a 
> government company established under Brazilian law (5.615/70) -- is directed 
> exclusively to its addressee and may contain confidential data, protected 
> under professional secrecy rules. Its unauthorized use is illegal and may 
> subject the transgressor to the law's penalties. If you're not the addressee, 
> please send it back, elucidating the failure."


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: A possible Java exception message fix

2019-06-24 Thread Erick Erickson
What are you asking here? Indeed, Lucene 8 (and therefore Solr) will not open 
an index that has ever been touched by Lucene 6x or earlier. You must re-index 
into 8x.

You cannot spoof this with, for instance, IndexUpgraderTool and go from 
6->7->8. You must reindex from your system-of-record.

This has been discussed several times on the mailing list, please search the 
mail archives for a fuller discussion.

Best,
Erick

> On Jun 24, 2019, at 11:05 AM, baris.ka...@oracle.com wrote:
> 
> Ok i forgot to mention below that i was trying to run 6.6. index with Lucene 
> 8.1.1.
> 
> Best regards
> 
> 
> On 6/24/19 2:03 PM, baris.ka...@oracle.com wrote:
>> 
>> Index created with Lucene 6.6, and *I tried running the same index with Lucene 
>> 8.1*, and according to that, this error message might need an update: -> This 
>> version of Lucene only supports indexes created with release 7.0 and later.
>> 
>> The first part of the error message is consistent with the error: -> 6 
>> (needs to be between 7 and 9).
>> 
>> Hope this helps
>> 
>> baris
>> 
>> PS, here is the stack trace
>> 
>> 
>> org.apache.lucene.index.IndexFormatTooOldException: Format version is not 
>> supported (resource 
>> BufferedChecksumIndexInput(MMapIndexInput(path="/scratch/bkazar/auto_correct_index/index/segments_5"))):
>>  6 (needs to be between 7 and 9). *This version of Lucene only supports 
>> indexes created with release 6.0 and later.*
>> at 
>> org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:213)
>> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:305)
>> at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:289)
>> at 
>> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:64)
>> at 
>> org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61)
>> at 
>> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:680)
>> at 
>> org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84)
>> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:76)
>> at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64)
>> 
>>  (this should be enough for the trace)
>> 
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Live index upgrading

2019-06-21 Thread Erick Erickson
You’re exactly right that storing all the fields necessary to reconstruct the 
document is a way to not have to reindex from scratch. Of course that bloats 
your index, in large installations perhaps unacceptably.

bq. Is there a convenient place to store….

Lucene itself doesn’t preserve anything except the index across interruptions, 
so storing the counter somewhere is indicated.

So sounds like you’re set with how to go forward, good luck!

> On Jun 21, 2019, at 10:21 AM, David Allouche  wrote:
> 
> The bottom line for me, is that I am not going to upgrade to Lucene8 for a 
> while.
> 
> The index migration would either cause a service interruption, or would 
> require a little while  to implement.
> 
> I have more urgent technical debt to deal with.
> 
>> On 21 Jun 2019, at 19:11, David Allouche  wrote:
>> 
>> Unfortunately, I cannot assume SolrCloud, because our software predates Solr.
>> 
>> So I would either need to switch to Solr or reimplement a work-around for 
>> the lack of index migration. I am reluctant to switch to Solr because it 
>> increases the operational complexity.
>> 
>> I understand the argument: if the algorithm fₙ() used to derive index data 
>> iₙ from the raw data rₙ changes [iₙ=fₙ(rₙ)], the index data iₙ₊₁ may not be 
>> derivable from iₙ [∃n∄g \ iₙ=g(iₙ₊₁)].
>> 
>> On the application level, one could store non-tokenized content (I guess 
>> that's why ElasticSearch has .raw fields). And traverse the index. I already 
>> have index traversal code that I use for garbage collection of old entries. 
>> Use the non-tokenized content to build a new index. So the progress of the 
>> conversion could be recorded as the index into LeafReader.getLiveDocs().
>> 
>> https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/index/LeafReader.html#getLiveDocs--
>> 
>> Alternatively, since I do not have all the non-tokenized content in the 
>> index now, I could use the external document id to retrieve the original 
>> document text.
>> 
>> Is there a convenient place to store the getLiveDocs index across process 
>> interruptions? Or should I use something stupid like a file to store the 
>> counter?
>> 
>> That is still a lot of hassle, but I understand how it makes sense for 
>> Lucene to consider index migration should be handled up the stack. 
>> 
>> 
>>> On 21 Jun 2019, at 18:06, Erick Erickson  wrote:
>>> 
>>> Assuming SolrCloud, reindex from scratch into a new collection then use 
>>> collection aliasing when you were ready to switch. You don’t need to stop 
>>> your clients when you use CREATEALIAS.
>>> 
>>> Prior to writing the marker, Lucene would appear to work with older 
>>> indexes, but there would be subtle errors because the information needed to 
>>> score docs just wasn’t there.
>>> 
>>> Here are two quotes from people who know that crystalized the problem 
>>> Lucene faces for me:
>>> 
>>> From Robert Muir: 
>>> 
>>> “I think the key issue here is Lucene is an index not a database. Because 
>>> it is a lossy index and does not retain all of the user's data, its not 
>>> possible to safely migrate some things automagically. In the norms case 
>>> IndexWriter needs to re-analyze the text ("re-index") and compute stats to 
>>> get back the value, so it can be re-encoded. The function is y = f(x) and 
>>> if x is not available its not possible, so lucene can't do it.”
>>> 
>>> From Mike McCandless:
>>> 
>>> “This really is the difference between an index and a database: we do not 
>>> store, precisely, the original documents.  We store an efficient 
>>> derived/computed index from them.  Yes, Solr/ES can add database-like 
>>> behavior where they hold the true original source of the document and use 
>>> that to rebuild Lucene indices over time.  But Lucene really is just a 
>>> "search index" and we need to be free to make important improvements with 
>>> time.”
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Jun 21, 2019, at 7:10 AM, David Allouche  wrote:
>>>> 
>>>> Wow. That is annoying. What is the reason for this?
>>>> 
>>>> I assumed there was a smooth upgrade path, but apparently, by design, one 
>>>> has to rebuild the index at least once every two major releases.
>>>> 
>>>> So, my question becomes, what is the recommended way of dealing with 
>>>> reindex-from-scratch without service interruption? 
>&

Re: Live index upgrading

2019-06-21 Thread Erick Erickson
Assuming SolrCloud, reindex from scratch into a new collection then use 
collection aliasing when you were ready to switch. You don’t need to stop your 
clients when you use CREATEALIAS.
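
For reference, a rough sketch of the alias switch (collection and alias names are made 
up): reindex into, say, products_v2, then repoint the alias your clients query:

    /admin/collections?action=CREATEALIAS&name=products&collections=products_v2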

Prior to writing the marker, Lucene would appear to work with older indexes, 
but there would be subtle errors because the information needed to score docs 
just wasn’t there.

Here are two quotes from people who know that crystalized the problem Lucene 
faces for me:

From Robert Muir: 

“I think the key issue here is Lucene is an index not a database. Because it is 
a lossy index and does not retain all of the user's data, its not possible to 
safely migrate some things automagically. In the norms case IndexWriter needs 
to re-analyze the text ("re-index") and compute stats to get back the value, so 
it can be re-encoded. The function is y = f(x) and if x is not available its 
not possible, so lucene can't do it.”

From Mike McCandless:

“This really is the difference between an index and a database: we do not 
store, precisely, the original documents.  We store an efficient 
derived/computed index from them.  Yes, Solr/ES can add database-like behavior 
where they hold the true original source of the document and use that to 
rebuild Lucene indices over time.  But Lucene really is just a "search index" 
and we need to be free to make important improvements with time.”

Best,
Erick

> On Jun 21, 2019, at 7:10 AM, David Allouche  wrote:
> 
> Wow. That is annoying. What is the reason for this?
> 
> I assumed there was a smooth upgrade path, but apparently, by design, one has 
> to rebuild the index at least once every two major releases.
> 
> So, my question becomes, what is the recommended way of dealing with 
> reindex-from-scratch without service interruption? 
> 
> So I guess the upgrade path looks something like:
> - Create Lucene6 index
> - Update Lucene6 index
> - Create Lucene7 index
> - Separately keep track of which documents are indexed in Lucene7 and Lucene6 
> indexes
> - Make updates to Lucene6 index, concurrently build Lucene7 index from 
> scratch, user Lucene6 index for search.
> - When Lucene7 index is fully built, remove Lucene6 index and use Lucene7 
> index for search.
> 
> Rinse and repeat every major version.
> 
> Really, isn't there something simpler already to handle Lucene major version 
> upgrades?
> 
> 
>> On 17 Jun 2019, at 18:04, Erick Erickson  wrote:
>> 
>> Let’s back up a bit. What version of Lucene are you using? Starting with 
>> Lucene 8, any index that’s ever been touched by Lucene 6 will not open. It 
>> does not matter if the index has been completely rewritten. It does not 
>> matter if it’s been run through IndexUpgraderTool, which just does a 
>> forceMerge to 1 segment. A marker is preserved when a segment is created, 
>> and the earliest one is preserved across merges. So say you have two 
>> segments, one created with 6 and one with 7. The Lucene 6 marker is 
>> preserved when they are merged.
>> 
>> Now, if any segment has the Lucene 6 marker, the index will not be opened by 
>> Lucene.
>> 
>> If you’re using Lucene 7, then this error implies that one or more of your 
>> segments was created with Lucene 5 or earlier.
>> 
>> So you probably need to re-index from scratch on whatever version of Lucene 
>> you want to use.
>> 
>> Best,
>> Erick
>> 
>> 
>> 
>>> On Jun 17, 2019, at 8:41 AM, David Allouche  wrote:
>>> 
>>> Hello,
>>> 
>>> I use Lucene with PyLucene on a public-facing web application. We have a 
>>> moderately large index (~24M documents, ~11GB index data), with a constant 
>>> stream of new documents.
>>> 
>>> I recently upgraded to PyLucene 7.
>>> 
>>> When trying to test the new release of PyLucene 8, I encountered an 
>>> IndexFormatTooOld error because my index conversion from Lucene6 to Lucene7 
>>> was not complete.
>>> 
>>> I found IndexUpgrader, and I had a look at its implementation. I would very 
>>> much like to avoid putting down the service during the index upgrade, so I 
>>> believe I cannot use IndexUpgrader because I need the write lock to be held 
>>> by the web application to index new documents.
>>> 
>>> So I figure I could get the desired result with an 
>>> IndexWriter.forceMerge(1). But the documentation says "This is a horribly 
>>> costly operation, especially when you pass a small maxNumSegments; usually 
>>> you should only call this if the index is static (will no longer be 
>>> changed)." 
>>> https://lucene.apache.org/core/7_7_2/core/org/apache/

Re: Issue with lucene-core 3.4.0 and h2 database

2019-06-19 Thread Erick Erickson
Capitalization of the field name?

> On Jun 19, 2019, at 8:40 AM, Robert Damian  wrote:
> 
> Hello,
> 
> For a long time I've been using lucene as a search engine for my embedded
> h2 databases. I am using lucene-core 3.4.0, which I know is pretty old and
> h2 database 1.3.160.
> 
> One issue that I discovered recently is that lucene indexing doesn't seem
> to work properly. For example
> SELECT COUNT(*) FROM BURSUCUL.NORMA WHERE NORMADESCRIERE LIKE '%beton%'
> returns about 2000, but
> SELECT COUNT(*) FROM FTL_SEARCH_DATA("+beton", 0, 0)
> returns about 70 results, although the column "NORMADESCRIERE" in
> "BURSUCUL.NORMA" is indexed.
> 
> considering that, I tried to drop the indexes and recreate them but when I
> do:
> CALL FTL_CREATE_INDEX("BURSUCUL", "NORMA", "NORMACOD, NORMADESCRIERE")
> i get:
> Error: org.h2.jdbc.JdbcSQLException: Error creating or initializing trigger
> "FTL_NORMA" object, class "org.h2.fulltext.
> FullTextLucene$FullTextTrigger", cause: "java.sql.SQLException: Column not
> found: NORMACOD"; see root cause for details; SQL statement:
> CREATE
> TRIGGER IF NOT EXISTS "BURSUCUL"."FTL_NORMA" AFTER INSERT, UPDATE,
> DELETE, ROLLBACK ON "BURSUCUL"."NORMA" FOR EACH ROW CALL
> "org.h2.fulltext.FullTextLucene$FullTextTrigger" [90043-160]
> 
> My first thought was that it must be a typo, but I checked a few times and
> I know I wrote it well. Here is the structure of my table:
> 
> sql> SHOW COLUMNS FROM BURSUCUL.NORMA;
> FIELD| TYPE  | NULL | KEY | DEFAULT
> normaid  | integer(10)   | NO   | PRI | (NEXT VALUE FOR BURSUCUL.
> SYSTEM_SEQUENCE_A35B0D2E_EB1B_4869_B3C8_80767DBE85DE)
> normalinkarticol | integer(10)   | NO   | | NULL
> normacod | varchar(256)  | NO   | | NULL
> normaum  | varchar(256)  | NO   | | NULL
> normadescriere   | varchar(2048) | NO   | | NULL
> 
> 
> 
> Updating to a newer version of h2 database and lucene would be an issue
> since the application is already installed on a number of systems and that
> would mean finding a way to update deployed databases on the fly.
> 
> Please let me know if you have any suggestions regarding these issues.
> Thank you!


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Live index upgrading

2019-06-17 Thread Erick Erickson
Let’s back up a bit. What version of Lucene are you using? Starting with Lucene 
8, any index that’s ever been touched by Lucene 6 will not open. It does not 
matter if the index has been completely rewritten. It does not matter if it’s 
been run through IndexUpgraderTool, which just does a forceMerge to 1 segment. 
A marker is preserved when a segment is created, and the earliest one is 
preserved across merges. So say you have two segments, one created with 6 and 
one with 7. The Lucene 6 marker is preserved when they are merged.

Now, if any segment has the Lucene 6 marker, the index will not be opened by 
Lucene.

If you’re using Lucene 7, then this error implies that one or more of your 
segments was created with Lucene 5 or earlier.

So you probably need to re-index from scratch on whatever version of Lucene you 
want to use.

Best,
Erick



> On Jun 17, 2019, at 8:41 AM, David Allouche  wrote:
> 
> Hello,
> 
> I use Lucene with PyLucene on a public-facing web application. We have a 
> moderately large index (~24M documents, ~11GB index data), with a constant 
> stream of new documents.
> 
> I recently upgraded to PyLucene 7.
> 
> When trying to test the new release of PyLucene 8, I encountered an 
> IndexFormatTooOld error because my index conversion from Lucene6 to Lucene7 
> was not complete.
> 
> I found IndexUpgrader, and I had a look at its implementation. I would very 
> much like to avoid putting down the service during the index upgrade, so I 
> believe I cannot use IndexUpgrader because I need the write lock to be held 
> by the web application to index new documents.
> 
> So I figure I could get the desired result with an IndexWriter.forceMerge(1). 
> But the documentation says "This is a horribly costly operation, especially 
> when you pass a small maxNumSegments; usually you should only call this if 
> the index is static (will no longer be changed)." 
> https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#forceMerge-int-
> 
> And indeed, forceMerge tends be killed the kernel OOM killer on my 
> development VM. I want to avoid this failure mode in production. I could 
> increase the VM until it works, but I would rather have a less brutal 
> approach to upgrading a live index. Something that could run in the 
> background with reasonable amounts of anonymous memory.
> 
> What is the recommended approach to upgrading a live index?
> 
> How can I know from the code that the index needs upgrading at all? I could 
> add a manual knob to start an upgrade, but it would be better if it occurred 
> transparently when I upgrade PyLucene.
> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: FuzzyQuery- why is it ignored?

2019-06-13 Thread Erick Erickson
Shot in the dark: stemming. Whenever I see a problem with something ending in 
“s” (or “er” or “ing” or….) my first suspect is that stemming is turned on. In 
that case the token in the index that’s actually searched on is somewhat 
different than you expect.

The test is easy: just ensure your fieldType contains no stemmers. 
PorterStemmer is particularly aggressive, but to test this case I’d just 
remove all stemming, re-index and see if the results differ.
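
For reference, a stemming-free analysis chain at the Lucene level looks roughly 
like the following (a minimal sketch against the Lucene 7/8 API; the anonymous 
Analyzer is just illustrative -- in Solr terms it's a fieldType whose analyzer 
has a tokenizer and lowercase filter but no Porter/Snowball stemming filters):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Tokenize and lowercase only -- no stemming filters anywhere in the chain.
Analyzer noStemAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
    }
};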

Best,
Erick

> On Jun 13, 2019, at 7:26 AM, baris.ka...@oracle.com wrote:
> 
> Tomoko,-
> 
>  That is strange indeed.
> 
> Something is wrong when i use mains but maink, mainl, mainr,mainq, maint all 
> work ok any consonant at the end except s works in this case.
> 
> Case #3 had +contentDFLT:mains~2 but not +contentDFLT:"mains~2".
> 
> i am using fuzzy query with ~ from Query.builder and that is not PhraseQuery.
> 
> Similarly FuzzyQuery with input "mains" (it has to be lowercase since it does 
> not go through StandardAnalyzer) is also not PhraseQuery.
> 
> can there be a clearer sample case for ComplexPhraseQuery please in the docs?
> 
> did You also index "MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES" the 
> expected output in this case?
> 
> Thanks for spending time on this, i would like to thank everyone.
> 
> Best regards
> 
> 
> On 6/13/19 12:13 AM, Tomoko Uchida wrote:
>> Hi,
>> 
>>> Ok, i think only this very specific only "mains" has an issue.
>> It looks strange to me. I did some test locally.
>> 
>> 1. Indexed this text: "NASHUA NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED 
>> STATES".
>> 
>> 2a. This query string (just copied from your Case #3) worked correctly
>> for me as far as I can see.
>> +contentDFLT:mains~2 +contentDFLT:"nashua",
>> +contentDFLT:"new-hampshire", +contentDFLT:"united state"
>> 
>> 2b. However this query string got no results.
>> +contentDFLT:"mains~2", +contentDFLT:"nashua",
>> +contentDFLT:"new-hampshire", +contentDFLT:"united states"
>> It is an expected behaviour because the classic query parser does not
>> support fuzzy query inside phrase query (as far as I know).
>> 
>> I suspect you use fuzzy query operator (~) inside phrase query ("), as
>> the 2b case.
>> 
>> FYI: there is a special parser for such complex phrase query.
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_core_8-5F1-5F0_queryparser_org_apache_lucene_queryparser_complexPhrase_ComplexPhraseQueryParser.html&d=DwIFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=nlG5z5NcNdIbQAiX-BKNeyLlULCbaezrgocEvPhQkl4&m=ZcXpaSlwS5DegX76mHTb_6DH3P7noan1eeMXc-Vh5M8&s=FoIMlcjDO2b7Gut9XRx-NIBWiBQWItsj8IlylJC7Wkc&e=
>> 
>> Tomoko
>> 
>> 2019年6月13日(木) 6:16 :
>>> Ok, i think only this very specific only "mains" has an issue.
>>> 
>>> all i knew about Lucene was fine :) Great...
>>> 
>>> i have one more question:
>>> 
>>> which one is advised to use: FuzzyQuery or the Query.parser with search 
>>> string~ appended?
>>> 
>>> The second one will go through analyzer and make search string lowercase.
>>> 
>>> Best regards
>>> 
>>> 
>>> On 6/12/19 1:03 PM, baris.ka...@oracle.com wrote:
>>> 
>>> Hi again,-
>>> 
>>> this is really interesting and i hope i am missing something. Index small 
>>> cases all entries so case sensitivity is not an issue i think.
>>> 
>>> Case #1:
>>> 
>>> org.apache.lucene.queryparser.classic.QueryParser parser = new 
>>> org.apache.lucene.queryparser.classic.QueryParser(field, phraseAnalyzer) ;
>>> Query q1 = null;
>>> try {
>>> q1 = parser.parse("Main");
>>> } catch (ParseException e) {
>>> e.printStackTrace();
>>> }
>>> booleanQuery.add(q1, BooleanClause.Occur.MUST);
>>> booleanQuery.add(Utils.createPhraseQuery(phraseAnalyzer, field, 
>>> "NASHUA"), BooleanClause.Occur.MUST);
>>> booleanQuery.add(Utils.createPhraseQuery(phraseAnalyzer, field, 
>>> "NEW HAMPSHIRE"), BooleanClause.Occur.MUST);
>>> booleanQuery.add(Utils.createPhraseQuery(phraseAnalyzer, field, 
>>> "UNITED STATES"), BooleanClause.Occur.MUST);
>>> 
>>> 
>>> This brings with this:
>>> 
>>> query plan:
>>> 
>>> [+contentDFLT:main, +contentDFLT:"nashua", +contentDFLT:"new-hampshire", 
>>> +contentDFLT:"united states"]
>>> 
>>> testQuerySearch1 Time to compute: 0 seconds (copied answer after exec 
>>> finished)
>>> 
>>> Number of results: 12
>>> Name: Main Dunstable Rd
>>> Score: 41.204945
>>> ID: 12677400
>>> Country Code: US
>>> Coordinates: 42.72631, -71.50269
>>> Search Key: MAIN DUNSTABLE NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES
>>> 
>>> Name: Main St
>>> Score: 41.204945
>>> ID: 12681980
>>> Country Code: US
>>> Coordinates: 42.76416, -71.46681
>>> Search Key: MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES
>>> 
>>> Name: Main St
>>> Score: 41.204945
>>> ID: 12681973
>>> Country Code: US
>>> Coordinates: 42.75045, -71.4607
>>> Search Key: MAIN NASHUA HILLSBOROUGH NEW HAMPSHIRE UNITED STATES
>>> 
>>> Name: Main St
>>> Score: 41.204945
>>> ID: 12

Re: IntField to IntPoint

2019-06-07 Thread Erick Erickson
Omitting norms and the like only matters for text fields; primitives (numerics, 
boolean, string) don’t have any of that information.

You really have no choice but to re-index to jump from 4->7. Or, I should say 
you’re completely unsupported  and you will have to deal with any anomalies. I 
suppose if the only thing you care about is non-textual data you might be OK, 
but it's iffy at best.

And you’ll have to play low-level games with Lucene to rewrite the segments 
with points rather than ints.
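
For reference, the way a searchable, stored, sortable int is typically indexed 
in Lucene 7 looks roughly like this (a minimal sketch; the field name "price" 
is just an example):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StoredField;

Document doc = new Document();
int value = 42;
doc.add(new IntPoint("price", value));              // indexed: term/range queries
doc.add(new StoredField("price", value));           // stored: retrievable with the doc
doc.add(new NumericDocValuesField("price", value)); // docValues: sorting/faceting

// Range queries then go through the IntPoint factory methods, e.g.
// Query q = IntPoint.newRangeQuery("price", 10, 100);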

Good  luck!
Erick

I’ll wager that you’ll be faster to re-index, painful though it may be, than to 
write custom code to do this.

> On Jun 7, 2019, at 1:40 AM, Riccardo Tasso  wrote:
> 
> Thanks Erik for your answer.
> 
> Unfortunately I should migrate the index for time reasons. Maybe in a
> second moment we will have the opportunity to reindex.
> 
> Our use case is to classify documents in the index with lucene queries,
> hence we're not really interested in ranking or sorting (which could be
> relevant for the "norms case"). Do you think that migrating and reindexing
> only the numeric fields could compromise the results returned by any query
> (term, boolean, range, phrase, prefix)?
> 
> Il giorno mer 5 giu 2019 alle ore 17:41 Erick Erickson <
> erickerick...@gmail.com> ha scritto:
> 
>> You cannot upgrade more than one major version, you must re-index from
>> scratch. There’s a long discussion of why, but basically it’s summed up by
>> this quote from Robert Muir:
>> 
>> “I think the key issue here is Lucene is an index not a database. Because
>> it is a lossy index and does not retain all of the user's data, its not
>> possible to safely migrate some things automagically. In the norms case
>> IndexWriter needs to re-analyze the text ("re-index") and compute stats to
>> get back the value, so it can be re-encoded. The function is y = f(x) and
>> if x is not available its not possible, so lucene can't do it.”
>> 
>> This has always been true, before 8x it would just  fail silently as  you
>> have found. Solr/Lucene starts up but don’t  work quite as expected. As of
>> Lucene 8x, Lucene (and therefore Solr) will not even open an index that
>> has  _ever_ been touched by Lucene 6x, no matter what intervening steps
>> have been taken. Or in general,  Lucene/Solr X will  not  open indexes
>> touched by X-2, starting with 8x rather than behave unexpectedly.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 5, 2019, at 8:27 AM, Riccardo Tasso 
>> wrote:
>>> 
>>> Hello everybody,
>>> I have a (very big) lucene 4 index with documents using IntField. On that
>>> field, which should be stored and sortable, I should search and execute
>>> range queries.
>>> 
>>> I've tried to upgrade it from 4 to 7 with IndexUpgrader but I observed
>> that
>>> IntFields aren't searchable anymore.
>>> 
>>> Which is the most efficient way to convert IntFields to IntPoints, which
>>> are stored and sortable?
>>> 
>>> Thanks,
>>> Riccardo
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
>> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IntField to IntPoint

2019-06-06 Thread Erick Erickson



> On Jun 5, 2019, at 2:07 PM, Riccardo Tasso  wrote:
> 
> 
> Considering that the IndexUpgrader will efficiently do the most of the work
> I should investigate how to fill this gap, without reindexing from scratch.
> 
> 

This is actually a problem. IndexUpgraderTool creates a single massive segment, 
essentially an optimize. Here are the reasons that’s bad: 

https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

IndexUpgraderTool does _not_ respect the max segment size even now, so the 
article linked from the one above about how optimize may not be so bad in Solr 
7.5+ is irrelevant here.

Textual data is most sensitive to the changes in how Lucene works, other than 
deprecated types. I strongly recommend you bite the bullet and re-index from 
your  system of record.

Best,
Erick
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: IntField to IntPoint

2019-06-05 Thread Erick Erickson
You cannot upgrade more than one major version, you must re-index from scratch. 
There’s a long discussion of why, but basically it’s summed up by this quote 
from Robert Muir:

“I think the key issue here is Lucene is an index not a database. Because it is 
a lossy index and does not retain all of the user's data, its not possible to 
safely migrate some things automagically. In the norms case IndexWriter needs 
to re-analyze the text ("re-index") and compute stats to get back the value, so 
it can be re-encoded. The function is y = f(x) and if x is not available its 
not possible, so lucene can't do it.”

This has always been true; before 8x it would just fail silently, as you have 
found. Solr/Lucene starts up but doesn’t work quite as expected. As of Lucene 
8x, Lucene (and therefore Solr) will not even open an index that has _ever_ 
been touched by Lucene 6x, no matter what intervening steps have been taken. Or 
in general, starting with 8x, Lucene/Solr X will not open indexes touched by 
X-2 at all, rather than behave unexpectedly.

Best,
Erick

> On Jun 5, 2019, at 8:27 AM, Riccardo Tasso  wrote:
> 
> Hello everybody,
> I have a (very big) lucene 4 index with documents using IntField. On that
> field, which should be stored and sortable, I should search and execute
> range queries.
> 
> I've tried to upgrade it from 4 to 7 with IndexUpgrader but I observed that
> IntFields aren't searchable anymore.
> 
> Which is the most efficient way to convert IntFields to IntPoints, which
> are stored and sortable?
> 
> Thanks,
> Riccardo


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: About DuplicateFilter

2019-04-23 Thread Erick Erickson
How is the score being calculated? Because if it’s the usual scoring algorithm, 
there will be very few scores that are exactly identical. And the usual BM25 
scores really don’t mean the documents are “similar”.

This feels like an XY problem. How is “similarity” determined here?

Best,
Erick

> On Apr 22, 2019, at 9:44 PM, kongchao...@163.com wrote:
> 
> Hi!
> Here I have some questions about DuplicateFilter.
> I use Lucene to search news; a news document contains 
> 'id', 'title', 'content', 'pubtime', 'score' and so on. The 'score' value type is 
> Long, and the same 'score' means similar news.
> I want the search result set to keep only the first document when the 'score' is the same.
> The indexed entity is like below (items over 1,000,000,000):
> 
> id | title  | content  | pubtime    | score
> 1  | title1 | content1 | 2019-04-23 |
> 2  | title2 | content2 | 2019-04-23 |
> 3  | title3 | content3 | 2019-04-23 |
> 4  | title4 | content4 | 2019-04-23 |
> 5  | title5 | content5 | 2019-04-23 |
> 
> When I search news, I want the result set to contain just id=1 and id=2. How can 
> I do that? Please help me!
> 
> 
> kongchao...@163.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Why does Lucene 7.4.0 commit() Increase Memory Usage x2

2019-04-02 Thread Erick Erickson
Task manager is almost useless for this kind of measurement. You never quite 
know how much uncollected garbage is included in that total.

You can attach something like jconsole to the running Solr process and hit the 
“perform full GC” to get a more accurate number.

Or you can look at GCViewer opened on the GC logs to get something more useful.

Best,
Erick

> On Apr 2, 2019, at 4:50 AM, thturk  wrote:
> 
> I am watching via task manager.
> Now i tired to handle this with hard coded way. I create new index and  with
> commit in small index cost low memory. but i dont think that its good way to
> do this. Its getting harder to manage indexes. 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene migrate to 6.6.5 from 5.5.3

2019-03-28 Thread Erick Erickson
First I have to ask why not use something much more recent? 7.5 comes to mind.

There’s not enough information here to say anything at all about what your 
problem might or might not be. “It doesn’t work” provides little to diagnose. 
You might want to review:

https://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

> On Mar 27, 2019, at 10:39 PM, brahmam  wrote:
> 
> Hi Team,
> 
> we want to migrate from lucene 5.5.3 to 6.6.5.
> We see after upgraded to 6.6.5 we are not able to search the existing
> data(which was managed with 5.5.3), are we missing anything here during
> upgrade?
> 
> -- 
> Thanks & Regards,
> Sree


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Environmental Protection Agency: Stop Deforesting in Sri Lanka

2019-03-21 Thread Erick Erickson
This is an entirely inappropriate use of this list, do not do so again.

> On Mar 21, 2019, at 12:06 AM, bjchathura...@gmail.com wrote:
> 
> Hello there,
> 
> I just signed the petition "Environmental Protection Agency: Stop
> Deforesting in Sri Lanka" and wanted to see if you could help by adding
> your name.
> 
> Our goal is to reach 15,000 signatures and we need more support. You can
> read more and sign the petition here:
> 
> http://chng.it/vY78rzGf8G
> 
> Thanks!
> Janaka


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Fetching 1000 documents taking around 30ms

2019-03-02 Thread Erick Erickson
“Is this expected”

Yes. For each document, if there is any field with stored=true that does _not_ 
have docValues=true or is flagged as useDocValuesAsStored=false, there is
1> a disk seek to read the stored data from the fdt file
2> decompression of the data read in <1>, 16K block minimum.

So getting this all in 30 ms for 1,000 docs isn’t bad at all.

If (and only if) _all_ the values you ask for are docValues=true and 
useDocValuesAsStored=true then all the values will be returned from the 
in-memory docValues data.
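
At the Lucene level, pulling a value from docValues instead of stored fields 
looks roughly like this (a minimal sketch; 'reader' and 'hits' stand for your 
IndexReader and the returned ScoreDocs, and the single-valued numeric field 
"price" is just an example -- it must have been indexed with docValues):

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.ScoreDoc;

for (ScoreDoc hit : hits) {
    // Find the segment the global docID lives in.
    int idx = ReaderUtil.subIndex(hit.doc, reader.leaves());
    LeafReaderContext leaf = reader.leaves().get(idx);
    // A fresh iterator per hit keeps this correct even if hits aren't in docID order.
    NumericDocValues dv = DocValues.getNumeric(leaf.reader(), "price");
    if (dv.advanceExact(hit.doc - leaf.docBase)) {
        long value = dv.longValue();
        // ... use value ...
    }
}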

Best,
Erick



> On Mar 2, 2019, at 1:12 AM, Venkat Kranthi Chalasani 
>  wrote:
> 
> Hi,
> 
> I have an index of ~4M documents. My queries are running in 1-2ms but 
> fetching the top hits (~1000 documents) takes around 30ms. Is this expected? 
> 
> If it is, I was wondering if maintaining an application cache with docId as a 
> key is ok. I understand docIds are ephemeral as documents are added/deleted 
> but in our usage of lucene, we don’t add/delete documents. 
> 
> thanks
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Ignoring “de la” at index or search time

2019-02-24 Thread Erick Erickson
Case 1. Stopwords are irrelevant. If you search field:(a AND b) you're
asking if both appear in the field, and that's the only question. It
doesn't matter what other words are in the field. It doesn't matter whether
they're close to each other.

Case 2. Yep.

On Sun, Feb 24, 2019, 17:02 baris.kazar  wrote:

> There is PhraseQuery, too, but lets consider two cases:
>
> case1: that PhraseQuery is not being used:
> then should i add to standard filter’s stopwords also the french stopwords
> both at index & search times? can i just add them at search time and keep
> old friends index as it is?
>
> case2: that PhraseQuery being used:
> i guess i need to play with the “slops” and stopwords in this case will
> not help, right?
>
> Thanks
>
> > On Feb 24, 2019, at 2:25 PM, baris.kazar  wrote:
> >
> > That is not what i am looking for. Thanks.
> >
> > c b search string finds
> > a b
> > but how cant find
> > a de la b
> > so i will try french stopwords.
> > Doing that i am using 8 queries like the ones i mentioned.
> > Best
> >
> >> On Feb 24, 2019, at 1:19 PM, Erick Erickson 
> wrote:
> >>
> >> Phrase search is looking for words next to each other. A phrase search
> on the text “my dog has fleas” would succeed for “my dog” or “has fleas”
> but not “my fleas” since the words are not right next to each other. “my
> fleas”~3 would succeed because the “~3” indicates that the words can have
> intervening terms.
> >>
> >> Searching (dog AND fleas) would match no matter how many words were
> between the two.
> >>
> >> If you’re unclear about what phrase search .vs. non-phrase search
> means, some background research/ self-education are strongly recommended,
> such basic understanding of search is pretty much assumed.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Feb 24, 2019, at 9:25 AM, baris.kazar 
> wrote:
> >>>
> >>> i guess so
> >>> what is phrase search?
> >>> c b is searched do you expect a de la b?
> >>> Thanks
> >>>
> >>>> On Feb 24, 2019, at 10:49 AM, Erick Erickson 
> wrote:
> >>>>
> >>>> Not sure we’re talking about the same thing. I was talking
> specifically about _phrase_ searches. If all you want is the clause you
> just said, phrases are not involved at all and the presence or absence of
> intervening words is totally unnecessary. This assumes your field type
> tokenizes the input similar to the text_general field in the examples.
> Specifically _not_ “string” fields or fields that use KeywordTokenizer.
> >>>>
> >>>> q=name:(a AND b) OR name:b
> >>>>
> >>>> for instance. With a query like that it doesn’t matter in the least
> whether there are, or are not any words between “a” and “b”.
> >>>>
> >>>> All that may be obvious to you, but when I read your latest e-mail it
> occurred to me that we might not be talking about the same thing.
> >>>>
> >>>> Best,
> >>>> Erick
> >>>>
> >>>>> On Feb 23, 2019, at 7:33 PM, baris.kazar 
> wrote:
> >>>>>
> >>>>> In this case search string is c b
> >>>>> and then search query has 8 combos
> >>>>> including two cases with c b ~ which means find all containing c And
> b and c Or b ( two separate queries having ~ )
> >>>>> and then i can find a b but not a de la b without French stopwords.
> >>>>> Thanks
> >>>>>
> >>>>>> On Feb 23, 2019, at 6:52 PM, Erick Erickson <
> erickerick...@gmail.com> wrote:
> >>>>>>
> >>>>>> Lucene won’t ignore these unless you tell it to via stopwords.
> >>>>>>
> >>>>>> This is a problem no matter how you look at it. If you do put in
> stopwords, the word _positions_ are retained. In your example,
> >>>>>> word position
> >>>>>> a   1
> >>>>>> de 2
> >>>>>> la 3
> >>>>>> b   4
> >>>>>>
> >>>>>> If you remove “de” and “la” via stopwords, the positions are still:
> >>>>>>
> >>>>>> word position
> >>>>>> a   1
> >>>>>> b   4
> >>>>>>
> >>>>>> So searching for “a b” would fail in the second case unless you
>

Re: Ignoring “de la” at index or search time

2019-02-24 Thread Erick Erickson
Phrase search is looking for words next to each other. A phrase search on the 
text “my dog has fleas” would succeed for “my dog” or “has fleas” but not “my 
fleas” since the words are not right next to each other. “my fleas”~3 would 
succeed because the “~3” indicates that the words can have intervening terms.

Searching (dog AND fleas) would match no matter how many words were between the 
two.

If you’re unclear about what phrase search vs. non-phrase search means, some 
background research/self-education is strongly recommended; such a basic 
understanding of search is pretty much assumed.

Best,
Erick

> On Feb 24, 2019, at 9:25 AM, baris.kazar  wrote:
> 
> i guess so
> what is phrase search?
> c b is searched do you expect a de la b?
> Thanks
> 
>> On Feb 24, 2019, at 10:49 AM, Erick Erickson  wrote:
>> 
>> Not sure we’re talking about the same thing. I was talking specifically 
>> about _phrase_ searches. If all you want is the clause you just said, 
>> phrases are not involved at all and the presence or absence of intervening 
>> words is totally unnecessary. This assumes your field type tokenizes the 
>> input similar to the text_general field in the examples. Specifically _not_ 
>> “string” fields or fields that use KeywordTokenizer. 
>> 
>> q=name:(a AND b) OR name:b
>> 
>> for instance. With a query like that it doesn’t matter in the least whether 
>> there are, or are not any words between “a” and “b”.
>> 
>> All that may be obvious to you, but when I read your latest e-mail it 
>> occurred to me that we might not be talking about the same thing.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 23, 2019, at 7:33 PM, baris.kazar  wrote:
>>> 
>>> In this case search string is c b
>>> and then search query has 8 combos
>>> including two cases with c b ~ which means find all containing c And b and 
>>> c Or b ( two separate queries having ~ )
>>> and then i can find a b but not a de la b without French stopwords.
>>> Thanks
>>> 
>>>> On Feb 23, 2019, at 6:52 PM, Erick Erickson  
>>>> wrote:
>>>> 
>>>> Lucene won’t ignore these unless you tell it to via stopwords.
>>>> 
>>>> This is a problem no matter how you look at it. If you do put in 
>>>> stopwords, the word _positions_ are retained. In your example,
>>>> word position
>>>> a   1
>>>> de 2
>>>> la 3
>>>> b   4
>>>> 
>>>> If you remove “de” and “la” via stopwords, the positions are still:
>>>> 
>>>> word position
>>>> a   1
>>>> b   4
>>>> 
>>>> So searching for “a b” would fail in the second case unless you included 
>>>> “slop” as
>>>> “a b”~2
>>>> 
>>>> But let’s say you _do not_ have input with these stopwords, just “a b". 
>>>> The positions
>>>> will be 1 and 2 respectively. Here the user would expect “a b” to match 
>>>> this doc, but
>>>> not a doc with “a de la b” (unless they knew a lot about search!).
>>>> 
>>>> So maybe the right thing to do is let phrases have slop as a matter of 
>>>> course.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> 
>>>>> On Feb 23, 2019, at 11:07 AM, baris.kazar  wrote:
>>>>> 
>>>>> Thanks Erick there is a pattern i cant catch in my results such as:
>>>>> a de la b
>>>>> i catch “a b” though.
>>>>> I though Lucene might ignore those automatically while creating index.
>>>>> 
>>>>> 
>>>>>> On Feb 23, 2019, at 12:29 PM, Erick Erickson  
>>>>>> wrote:
>>>>>> 
>>>>>> Use stopwords, although it's becoming less of a concern, why do you think
>>>>>> you need to?
>>>>>> 
>>>>>>> On Sat, Feb 23, 2019, 08:42 baris.kazar  wrote:
>>>>>>> 
>>>>>>> Hi,-
>>>>>>> What is the (most efficient) way to
>>>>>>> ignore “de la” kinda connectors
>>>>>>> in a string at index or search time?
>>>>>>> Thanks
>>>>>>> 
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>> For

Re: Ignoring “de la” at index or search time

2019-02-24 Thread Erick Erickson
Not sure we’re talking about the same thing. I was talking specifically about 
_phrase_ searches. If all you want is the clause you just said, phrases are not 
involved at all and the presence or absence of intervening words is totally 
unnecessary. This assumes your field type tokenizes the input similar to the 
text_general field in the examples. Specifically _not_ “string” fields or 
fields that use KeywordTokenizer. 

q=name:(a AND b) OR name:b

for instance. With a query like that it doesn’t matter in the least whether 
there are, or are not any words between “a” and “b”.

All that may be obvious to you, but when I read your latest e-mail it occurred 
to me that we might not be talking about the same thing.

Best,
Erick

> On Feb 23, 2019, at 7:33 PM, baris.kazar  wrote:
> 
> In this case search string is c b
> and then search query has 8 combos
> including two cases with c b ~ which means find all containing c And b and c 
> Or b ( two separate queries having ~ )
> and then i can find a b but not a de la b without French stopwords.
> Thanks
> 
>> On Feb 23, 2019, at 6:52 PM, Erick Erickson  wrote:
>> 
>> Lucene won’t ignore these unless you tell it to via stopwords.
>> 
>> This is a problem no matter how you look at it. If you do put in stopwords, 
>> the word _positions_ are retained. In your example,
>> word position
>> a   1
>> de 2
>> la 3
>> b   4
>> 
>> If you remove “de” and “la” via stopwords, the positions are still:
>> 
>> word position
>> a   1
>> b   4
>> 
>> So searching for “a b” would fail in the second case unless you included 
>> “slop” as
>> “a b”~2
>> 
>> But let’s say you _do not_ have input with these stopwords, just “a b". The 
>> positions
>> will be 1 and 2 respectively. Here the user would expect “a b” to match this 
>> doc, but
>> not a doc with “a de la b” (unless they knew a lot about search!).
>> 
>> So maybe the right thing to do is let phrases have slop as a matter of 
>> course.
>> 
>> Best,
>> Erick
>> 
>> 
>>> On Feb 23, 2019, at 11:07 AM, baris.kazar  wrote:
>>> 
>>> Thanks Erick there is a pattern i cant catch in my results such as:
>>> a de la b
>>> i catch “a b” though.
>>> I though Lucene might ignore those automatically while creating index.
>>> 
>>> 
>>>> On Feb 23, 2019, at 12:29 PM, Erick Erickson  
>>>> wrote:
>>>> 
>>>> Use stopwords, although it's becoming less of a concern, why do you think
>>>> you need to?
>>>> 
>>>>> On Sat, Feb 23, 2019, 08:42 baris.kazar  wrote:
>>>>> 
>>>>> Hi,-
>>>>> What is the (most efficient) way to
>>>>> ignore “de la” kinda connectors
>>>>> in a string at index or search time?
>>>>> Thanks
>>>>> 
>>>>> -
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>> 
>>>>> 
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Ignoring “de la” at index or search time

2019-02-23 Thread Erick Erickson
Lucene won’t ignore these unless you tell it to via stopwords.

This is a problem no matter how you look at it. If you do put in stopwords, the 
word _positions_ are retained. In your example,
word position
a   1
de 2
la 3
b   4

If you remove “de” and “la” via stopwords, the positions are still:

word position
a   1
b   4

So searching for “a b” would fail in the second case unless you included “slop” 
as
“a b”~2

But let’s say you _do not_ have input with these stopwords, just “a b". The 
positions
will be 1 and 2 respectively. Here the user would expect “a b” to match this 
doc, but
not a doc with “a de la b” (unless they knew a lot about search!).

So maybe the right thing to do is let phrases have slop as a matter of course.
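
If you're building the query in code, slop is just a setting on PhraseQuery. A 
minimal sketch (field and terms are examples):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

PhraseQuery.Builder builder = new PhraseQuery.Builder();
builder.add(new Term("name", "a"));
builder.add(new Term("name", "b"));
builder.setSlop(2);              // allow up to 2 positions between the terms
Query query = builder.build();   // matches "a b" and also "a de la b"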

Best,
Erick


> On Feb 23, 2019, at 11:07 AM, baris.kazar  wrote:
> 
> Thanks Erick there is a pattern i cant catch in my results such as:
> a de la b
> i catch “a b” though.
> I though Lucene might ignore those automatically while creating index.
> 
> 
>> On Feb 23, 2019, at 12:29 PM, Erick Erickson  wrote:
>> 
>> Use stopwords, although it's becoming less of a concern, why do you think
>> you need to?
>> 
>>> On Sat, Feb 23, 2019, 08:42 baris.kazar  wrote:
>>> 
>>> Hi,-
>>> What is the (most efficient) way to
>>> ignore “de la” kinda connectors
>>> in a string at index or search time?
>>> Thanks
>>> 
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>> 
>>> 
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Ignoring “de la” at index or search time

2019-02-23 Thread Erick Erickson
Use stopwords, although it's becoming less of a concern, why do you think
you need to?

On Sat, Feb 23, 2019, 08:42 baris.kazar  wrote:

> Hi,-
> What is the (most efficient) way to
> ignore “de la” kinda connectors
> in a string at index or search time?
> Thanks
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Updating specific fields of huge docs

2019-02-13 Thread Erick Erickson
If (and only if) the fields you need to update are single-valued,
docValues=true, indexed=false, you can do in-place update of the DV
field only.
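
At the Lucene API level an in-place update looks roughly like this (a minimal 
sketch; 'writer' is your IndexWriter, the term and field names are just 
examples, and the field must be docValues-only as described above):

import org.apache.lucene.index.Term;

// No document is deleted and re-added; only the docValues for this field change.
writer.updateNumericDocValue(new Term("id", "doc-123"), "viewCount", 42L);
writer.commit();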

Otherwise, you'll probably have to split the docs up. The question is
whether you have evidence that reindexing is too expensive.

If you do need to split the docs up, you might find some of the
streaming capabilities useful for join kinds of operations if other
join options don't work out or you just prefer the streaming
alternative.

Best,
Erick

On Wed, Feb 13, 2019 at 11:43 AM Luís Filipe Nassif  wrote:
>
> Hi all,
>
> Lucene 7 still deletes and re-adds docs when an update operation is done,
> as I understood.
>
> When docs have dozens of fields and one of them is large text content
> (extracted by Tika) and if I need to update some other small fields, what
> is the best approach to not reindex that large text field?
>
> Any better way than splitting the index in two (metadata and text indexes)
> and using ParallelCompositeReader for searches?
>
> Thanks in advance,
> Luis

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SynonymGraphFilter can't consume an incoming graph

2019-02-10 Thread Erick Erickson
It's, well, undefined. As in nobody knows except that it'll be wrong.
And exactly what the results are may change with any given release.

Best,
Erick

On Sun, Feb 10, 2019 at 10:48 AM lambda.coder lucene
 wrote:
>
> Hello,
>
> The Javadocs of SynonymGraphFilter says that it can’t consume an incoming 
> graph and that the result will be undefined
>
> Is there any example that exhibits the limitations and what is meant by 
> undefined ?
>
>
> Regards
> Patrick
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Manifoldcf2.10 - Sending user-defined fields to solr

2019-01-09 Thread Erick Erickson
You'd probably get more knowledgeable info from the Manifold
folks, I don't know how many people on this list _also_ use
Mainfold...

Best,
Erick

On Wed, Jan 9, 2019 at 5:48 AM subasini  wrote:
>
> Hi
> I am using manifoldcf 2.10 and Solr 7.6.0.
> I can crawl my website and indexing done in Solr successfully.
> Now I want to send one key-value pair from manifoldcf which should appear in
> Solr.
> For different websites, the value will be different so that I can use the
> same for filtering in my solr query.
>
> I can see there is no "forced metadata" option in manifold_2.10.
> Can anybody guide me how can I send the data. Is there any setting in
> manifold mcf-crawler-ui/ console.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: is Document match Query

2018-12-17 Thread Erick Erickson
I'm not sure I understand, but why not just fire the queries off with
an fq of the document ID?

If you just need to know if any of N queries match the doc, you could
check several at once with a big OR clause.
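
Combining the N queries is just a BooleanQuery with SHOULD clauses. A minimal 
sketch ('queries' is assumed to be whatever list of candidate queries you have):

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

BooleanQuery.Builder builder = new BooleanQuery.Builder();
for (Query q : queries) {
    builder.add(q, BooleanClause.Occur.SHOULD);
}
Query anyMatch = builder.build();   // matches if at least one of the queries matches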

Best,
Erick
On Mon, Dec 17, 2018 at 5:06 AM Valentin Popov  wrote:
>
> Hello.
>
> I need implement a feature, that answer for a question: is a Document match a 
> Query.
>
> Right now, I’m implemented this such way:
>
> 1. Use RadDirectory
> 2. Index Document
> 3. Search used Query
> 4. If any doc match, this is mean Document match Query.
>
> Problem with this approach, it is too slow. Any way to detect, is Document 
> match Query without indexing and search?
>
> Thanks!
>
>
> Regards,
> Valentin
>
>
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SearcherManager not seeing changes in IndexWriteral and

2018-11-09 Thread Erick Erickson
You might be able to do this with a couple of threads in a single
program, but it's certainly up to you.

Best,
Erick
On Fri, Nov 9, 2018 at 7:47 AM Boris Petrov  wrote:
>
> Well, while debugging I put a bunch of println's which showed the
> expected order. And, besides, I've written the code, I know that writing
> to the index happens way before searching in it - the test makes sure of
> that.
>
> If you think there indeed might be some problem, I'll try to reproduce
> it in a small example. But I will try only the obvious (which perhaps
> you've tried a million times and have tests for) - I'll have one thread
> write to the index, another (which starts after the first) search in it
> and I'll create a bash script that runs the program until it fails (what
> I currently do with our test). I'll do this beginning of next week.
>
> Thank you for the support!
>
> On 11/9/18 5:37 PM, Erick Erickson wrote:
> > If it's hard to do in a single thread, how about timestamping the
> > events to insure that they happen in the expected order?
> >
> > That would verify the sequencing is happening as you expect and
> > _still_ not see the expected docs...
> >
> > Assuming it does fail, do you think you could reduce it to a
> > multi-threaded test case?
> >
> > Best,
> > Erick
> > On Fri, Nov 9, 2018 at 7:03 AM Boris Petrov  wrote:
> >> Well, it's a bit involved to try it in a single thread as I've
> >> oversimplified the example. But as far as I understand this should work,
> >> right? So something else is wrong? Committing the writer and then
> >> "maybeRefreshBlocking" should be enough to have the changes visible, yes?
> >>
> >> On 11/9/18 4:45 PM, Michael Sokolov wrote:
> >>> That should work, I think, but if you are serializing these threads so
> >>> that they cannot run concurrently, maybe try running both operations
> >>> in a single thread, at least as a test.
> >>> On Fri, Nov 9, 2018 at 9:16 AM Boris Petrov  wrote:
> >>>> If you mean the synchronization of the threads, it is not in the
> >>>> example, but Thread 2 is *started* after Thread 1 finished executing the
> >>>> code that I gave as an example. So there is happens-before between them.
> >>>> If you mean synchronization on the Lucene level - isn't that what
> >>>> "maybeRefreshBlocking" should do?
> >>>>
> >>>> On 11/9/18 3:29 PM, Michael Sokolov wrote:
> >>>>> I'm not seeing anything there that would synchronize, or serialize, the
> >>>>> read after the write and commit. Did you expect that for some reason?
> >>>>>
> >>>>> On Fri, Nov 9, 2018, 6:00 AM Boris Petrov  >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I'm using Lucene version 7.5.0. We have a test that does something 
> >>>>>> like:
> >>>>>>
> >>>>>> Thread 1:
> >>>>>>
> >>>>>> Field idStringField = new StringField("id", id,
> >>>>>> Field.Store.YES);
> >>>>>> Field contentsField = new TextField("contents", reader);
> >>>>>> Document document = new Document();
> >>>>>> document.add(idStringField);
> >>>>>> document.add(contentsField);
> >>>>>>
> >>>>>> writer.updateDocument(new Term(ID_FIELD, id), document);
> >>>>>> writer.flush(); // not sure this flush is needed?
> >>>>>> writer.commit();
> >>>>>>
> >>>>>> Thread 2:
> >>>>>>
> >>>>>> searchManager.maybeRefreshBlocking();
> >>>>>> IndexSearcher searcher = searchManager.acquire();
> >>>>>> try {
> >>>>>> QueryParser parser = new QueryParser("contents", 
> >>>>>> analyzer);
> >>>>>> Query luceneQuery = parser.parse(queryText);
> >>>>>> ScoreDoc[] hits = searcher.search(luceneQuery,
> >>>>>> 50).scoreDocs;
> >>>>>> } finally {
> >>>>>> searchManager.release(searcher);
> >>>>>> }
&

Re: SearcherManager not seeing changes in IndexWriteral and

2018-11-09 Thread Erick Erickson
If it's hard to do in a single thread, how about timestamping the
events to ensure that they happen in the expected order?

That would verify that the sequencing is happening as you expect even
though you _still_ don't see the expected docs...

Assuming it does fail, do you think you could reduce it to a
multi-threaded test case?

Best,
Erick
On Fri, Nov 9, 2018 at 7:03 AM Boris Petrov  wrote:
>
> Well, it's a bit involved to try it in a single thread as I've
> oversimplified the example. But as far as I understand this should work,
> right? So something else is wrong? Committing the writer and then
> "maybeRefreshBlocking" should be enough to have the changes visible, yes?
>
> On 11/9/18 4:45 PM, Michael Sokolov wrote:
> > That should work, I think, but if you are serializing these threads so
> > that they cannot run concurrently, maybe try running both operations
> > in a single thread, at least as a test.
> > On Fri, Nov 9, 2018 at 9:16 AM Boris Petrov  wrote:
> >> If you mean the synchronization of the threads, it is not in the
> >> example, but Thread 2 is *started* after Thread 1 finished executing the
> >> code that I gave as an example. So there is happens-before between them.
> >> If you mean synchronization on the Lucene level - isn't that what
> >> "maybeRefreshBlocking" should do?
> >>
> >> On 11/9/18 3:29 PM, Michael Sokolov wrote:
> >>> I'm not seeing anything there that would synchronize, or serialize, the
> >>> read after the write and commit. Did you expect that for some reason?
> >>>
> >>> On Fri, Nov 9, 2018, 6:00 AM Boris Petrov  >>>
>  Hi all,
> 
>  I'm using Lucene version 7.5.0. We have a test that does something like:
> 
>  Thread 1:
> 
>  Field idStringField = new StringField("id", id,
>  Field.Store.YES);
>  Field contentsField = new TextField("contents", reader);
>  Document document = new Document();
>  document.add(idStringField);
>  document.add(contentsField);
> 
>  writer.updateDocument(new Term(ID_FIELD, id), document);
>  writer.flush(); // not sure this flush is needed?
>  writer.commit();
> 
>  Thread 2:
> 
>  searchManager.maybeRefreshBlocking();
>  IndexSearcher searcher = searchManager.acquire();
>  try {
>  QueryParser parser = new QueryParser("contents", 
>  analyzer);
>  Query luceneQuery = parser.parse(queryText);
>  ScoreDoc[] hits = searcher.search(luceneQuery,
>  50).scoreDocs;
>  } finally {
>  searchManager.release(searcher);
>  }
> 
>  Thread 1 happens before Thread 2.
> 
>  Sometimes, only sometimes, the commit from thread 1 is not *immediately*
>  visible in Thread 2. If I put a "Thread.sleep(1000)" it always works.
>  Without it, sometimes the search is empty. I'm not sure if I'm doing
>  something wrong or this is a bug?
> 
>  Thanks!
> 
> 
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Static index, fastest way to do forceMerge

2018-11-03 Thread Erick Erickson
Do you really need exactly one segment? Or would, say, 5 be good enough?
You see where this is going: set maxSegments to 5 and you may be able to get
some parallelization...

On Fri, Nov 2, 2018, 14:17 Dawid Weiss wrote:
> Thanks for chipping in, Toke. A ~1TB index is impressive.
>
> Back of the envelope says reading & writing 900GB in 8 hours is
> 2*900GB/(8*60*60s) = 64MB/s. I don't remember the interface for our
> SSD machine, but even with SATA II this is only ~1/5th of the possible
> fairly sequential IO throughput. So for us at least, NVMe drives are
> not needed to have single-threaded CPU as bottleneck.
>
> The mileage will vary depending on the CPU -- if it can merge the data
> from multiple files at ones fast enough then it may theoretically
> saturate the bandwidth... but I agree we also seem to be CPU bound on
> these N-to-1 merges, a regular SSD is enough.
>
> > And +1 to the issue BTW.
>
> I agree. Fine-grained granularity here would be a win even in the
> regular "merge is a low-priority citizen" case. At least that's what I
> tend to think. And if there are spare CPUs, the gain would be
> terrific.
>
> Dawid
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Static index, fastest way to do forceMerge

2018-11-02 Thread Erick Erickson
The merge process is rather tricky, and there's nothing that I know of
that will use all resources available. In fact the merge code is
written to _not_ use up all the possible resources on the theory that
there should be some left over to handle queries etc.

Yeah, the situation you describe is indeed one of the few where
merging down to 1 segment makes sense. Out of curiosity, what kind of
performance gains do you see?

This applies to the default TieredMergePolicy (TMP):

1> there is a limit to the number of segments that can be merged at
once, so sometimes it can take more than one pass. If you have more
than 30 segments, it'll be multi-pass. You can try (and I haven't done
this personally) setting maxMergeAtOnceExplicit in your solrconfig.xml
to see if it helps. That only takes effect when you forceMerge.
There's a tricky bit of reflection that handles this; see the very end
of TieredMergePolicy.java for the parameters you can set.

2> As of Solr 7.5 (see LUCENE-7976) the default behavior has changed
from automatically merging down to 1 segment to respecting
"maxMergedSegmentMB" (default 5G). You will have to explicitly pass
maxSegments=1 to get the old behavior.
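
For a pure-Lucene setup the equivalent knobs look roughly like this (a minimal 
sketch against the Lucene 7.x API; the numbers are examples, not 
recommendations, and 'analyzer' is assumed to be yours):

import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

TieredMergePolicy tmp = new TieredMergePolicy();
tmp.setMaxMergeAtOnceExplicit(100);   // segments merged per pass during forceMerge

ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
cms.setMaxMergesAndThreads(6, 4);     // maxMergeCount, maxThreadCount

IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setMergePolicy(tmp);
iwc.setMergeScheduler(cms);

// Note: the final merge down to a single segment is still one merge running on
// one thread, so these settings mostly help the earlier passes.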

Best,
Erick
On Fri, Nov 2, 2018 at 3:13 AM Jerven Bolleman
 wrote:
>
> Dear Lucene Devs and Users,
>
> First of all thank you for this wonderful library and API.
>
> forceMerges are normally not recommended but we fall into one of the few
> usecases where it makes sense.
>
> In our use case we have a large index (3 actually) and we don't update
> them ever after indexing. i.e. we index all the documents and then never
> ever add another document to the index, nor are any deleted.
>
> It has proven beneficial for search performance to always foreMerge down
> to one segment. However, this takes significant time. Are there any
> suggestions on what kind of merge scheduler/policy settings will utilize
> the most of the available IO, CPU and RAM capacity? Currently we end up
> being single thread bound, leaving lots of potential cpu and bandwidth
> not used during the merge.
>
> e.g. we are looking for a MergeEvertyThing use all hardware policy and
> scheduler.
>
> We are currently on lucene 7.4 but nothing is stopping us from upgrading.
>
> Regards,
> Jerven
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene stops working

2018-11-02 Thread Erick Erickson
Is this custom code? What method? Can you show us a sample?

There's not enough information here to say much.
On Fri, Nov 2, 2018 at 7:38 AM egorlex  wrote:
>
> Hi, I am new in Lucene and i have strange problem. Lucene  stops working
> without any errors after some time. It works fine for 1 day or several
> hours. I did some investigation and found that 5 IndexReaders are opened in
> the search method, but they do not close when exiting the method..can it be
> a cause? Do I need to close the IndexReader after use?
> (Lucene version 7.3.0)
>
> I would be very grateful for the help.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Exception Details

2018-10-30 Thread Erick Erickson
No clue, org.compass.core is not part of Solr/Lucene, you'll have
to ask the compass folks.

Best,
Erick
On Tue, Oct 30, 2018 at 6:52 AM Veeraswami Pattapagalu
 wrote:
>
> Hi team,
>
>
> We have got following exception while doing indexing , please share us the
> information why we are getting this exception
>
>
> Search engine transaction not successfully started or already
> committed/rolledback
>
> org.compass.core.engine.SearchEngineException: Search engine transaction
> not successfully started or already committed/rolledback
>
> at
> org.compass.core.lucene.engine.LuceneSearchEngine.verifyWithinTransaction(LuceneSearchEngine.java:234)
>
> at
> org.compass.core.lucene.engine.LuceneSearchEngine.rollback(LuceneSearchEngine.java:265)
>
> at
> org.compass.core.transaction.LocalTransaction.doRollback(LocalTransaction.java:119)
>
> at
> org.compass.core.transaction.AbstractTransaction.rollback(AbstractTransaction.java:46)
>
>
> Thank you,
>
> Veeraswami.P

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Release the RAM

2018-10-25 Thread Erick Erickson
This really seems like an XY problem. What are you trying to
accomplish that makes you want to use RAMDirectory at all? Why I'm
asking:

1> RAMDirectory is quite special-purpose, very rarely is it something
you should use

2> Java doesn't collect garbage when you close an object that
references it; it'll be collected on some later GC cycle. You can
attach something like jconsole and hit the "force GC" button to see if
the memory is released.

3> bq. "when i want to initialize app via tomcat its loaded all memory
then write to disk which i dont want to have" Again, what is the
purpose here? What are you trying to accomplish _for the user_? I
frankly have no idea what "loaded all memory then write to disk"
means. Writing to disk doesn't happen unless you're updating
documents.

Best,
Erick
On Thu, Oct 25, 2018 at 6:33 AM thturk  wrote:
>
> I want to release ram when i want ,
> i have tired to close created reader, ram directory, and searcher given in
> below;
>
> ramDir = new RAMDirectory(FSDirectory.open(indexDir), IOContext.READ);
> reader = DirectoryReader.open(ramDir);
> searcher = new IndexSearcher(reader);
>
> searcher = null;
> reader.close();
> reader = null;
> ramDir.close();
> ramDir = null;
>
> but it didnt  released Ram ,and when i wantto initialize app via tomcat its
> loaded all memory then write to disk which i dont want to have
>
> Is there anyone know what is the best way for this .
> Thanks for your help.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: force deletes - terms enum still has deleted terms?

2018-09-28 Thread Erick Erickson
You might be hitting a rounding error. When this happens, how many
deleted documents are there in the remaining segments? 1?

The calculation for whether to merge the segment is:

double pctDeletes = 100.0 * ((double) deleted_docs_in_segment /
                             (double) doc_count_in_segment_including_deleted_docs);
if (pctDeletes > forceMergeDeletesPctAllowed) { merge the segment }

At any rate, calling findForcedMerges instead will purge all deleted
docs no matter what.

NOTE: as of 7.5, the behavior has changed in that both of these
methods will respect the maximum segment size by default. Prior to
7.5, either of these could produce a single segment for all the
segments that were merged (all of them in forceMerge, all with > n%
deleted docs in forceMergeDeletes). If you require a single segment to
result, you can specify the maxSegmentCount as 1.

See LUCENE-7976 for all the gory details of this change if you're curious
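
Purely at the Lucene level, a sketch of the "purge all deletes no matter what" 
path looks roughly like this ('dir' and 'analyzer' are assumed to be your 
Directory and Analyzer; the settings are examples only):

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

TieredMergePolicy tmp = new TieredMergePolicy();
tmp.setForceMergeDeletesPctAllowed(0.0);   // consider any segment that has deletes

IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setMergePolicy(tmp);

try (IndexWriter writer = new IndexWriter(dir, iwc)) {
    // forceMerge rewrites the selected segments completely, so no deleted docs
    // (or their terms) survive in the merged segments.
    writer.forceMerge(1);        // or writer.forceMergeDeletes(true)
    writer.commit();
}

Then reopen the reader (e.g. DirectoryReader.openIfChanged) before rebuilding 
the FST, as you're already doing.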

Best,
Erick
On Fri, Sep 28, 2018 at 5:41 AM Rob Audenaerde  wrote:
>
> Hi all,
>
> We build a FST on the terms of our index by iterating the terms of the
> readers for our fields, like this:
>
> for (final LeafReaderContext ctx : leaves) {
> final LeafReader leafReader = ctx.reader();
>
> for (final String indexField : indexFields) {
> final Terms terms =
> leafReader.terms(indexField);
> // If the field does not exist in this
> reader, then we get null, so check for that.
> if (terms != null) {
> final TermsEnum termsEnum =
> terms.iterator();
>
> However, it sometimes the building of the FST seems to find terms that are
> from documents that are deleted. This is what we expect, checking the
> javadocs.
>
> So, now we switched the IndexWriter to a config with a TieredMergePolicy
> with: setForceMergeDeletesPctAllowed(0).
>
> When calling indexWriter.forceMergeDeletes(true) we expect that there will
> be no more deletes. However, the deleted terms still sometimes appear. We
> use the DirectoryReader.openIfChanged() to refresh the reader before
> iterating the terms.
>
> Are we forgetting something?
>
> Thanks in advance.
> Rob Audenaerde

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Running query against a single document

2018-09-21 Thread Erick Erickson
bq. We would like to know if there is a way to test a query against a document
without creating an index. We were thinking that maybe we could use lucene
highlighter component
to achieve this,

I don't really understand this at all. How are you using the
highlighter component without creating an index? Custom code?

But that aside, there are dozens, if not hundreds of examples of this
in the Solr test code. You could write a Solr junit test, which
is "just some Java code" and run that.

To execute this within the test framework, you have two options:
1> from the top level "ant -Dtestcase=custom_test test", which takes a
long time to run
2> from solr/core "ant -Dtestcase=custom_test test-nocompile". You
have to have compiled your code of course for this to work.

BTW, if you skip all that and just use a Solr instance, one very
useful trick is to use &debug=true&debug.explainOther
(https://lucene.apache.org/solr/guide/6_6/common-query-parameters.html).
That will show you exactly how the doc was
scored _whether or not_ it would have been returned by the primary query.
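
For the Lucene-only route (MemoryIndex, mentioned below), a minimal sketch looks
roughly like this, assuming the lucene-memory module is on the classpath and
"query" is whatever Query (including a SpanQuery) you want to test:

MemoryIndex mi = new MemoryIndex();
Analyzer analyzer = new StandardAnalyzer();
mi.addField("field", "eglise saint quentin", analyzer);
float score = mi.search(query);   // 0.0f means the query did not match this document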

Best,
Erick
On Fri, Sep 21, 2018 at 6:16 AM Tom Mortimer  wrote:
>
> Hi,
>
> Have you considered using MemoryIndex?
>
> cheers,
> Tom
>
>
> tel +44 8700 118334 : mobile +44 7876 741014 : skype tommortimer
>
>
> On Fri, 21 Sep 2018 at 13:58, Aurélien MAZOYER 
> wrote:
>
> > Hi,
> >
> > We would like to know if there is a way to test a query against a document
> > without creating an index. We were thinking that maybe we could use lucene
> > highlighter component
> > to achieve this, but it seems it doesn't work as expected with complex
> > queries.
> > For example, we create a SpanQuery (+spanFirst(field:saint, 1)
> > +spanNear([field:saint, field:quentin], 0, true)) and we tested it against
> > two documents :
> > D1={field=eglise saint quentin}
> > D2={field=saint quentin deladadoupa}
> > We expect to get these entries from the highlighter :
> > D1 eglise saint quentin
> > D2 saint quentin deladadoupa
> > But we got
> > eglise saint quentin for D1, which is unexpected from our
> > perspective because it doesn't match our SpanQuery.
> > Do you have any ideas if this approach is correct or if we better use some
> > other way to achieve this functionality.
> > FYI we use Lucene 6.5.1.
> >
> > Thank you for your help,
> >
> > Regards,
> >
> > Aurelien and Andrey
> > Tchiota GMBH
> >

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to access DocValues inside a customized collector?

2018-09-20 Thread Erick Erickson
What Luke are you using? I think this one is being maintained:
https://github.com/DmitryKey/luke

The Terms component directly accesses the indexed data and can be used
to poke around in the indexed data.

I'll skip the accessing DocValues as I have to go back and look every time.
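
For reference, a minimal sketch of the per-leaf access described in the question
(Lucene 7.x API; the field name "myField" is a placeholder):

public class MyCollector extends SimpleCollector {
  private BinaryDocValues dv;

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    dv = context.reader().getBinaryDocValues("myField"); // null if this segment has no values
  }

  @Override
  public void collect(int doc) throws IOException {
    if (dv != null && dv.advanceExact(doc)) {  // doc is the segment-local docID
      BytesRef value = dv.binaryValue();
      // ... use value ...
    }
  }

  @Override
  public boolean needsScores() { return false; }
}
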
On Thu, Sep 20, 2018 at 6:23 PM Lisheng Zhang  wrote:
>
> we need to use binary DocValues (in a customized collector) added during
> indexing, i first tested in standard TopScoreDocCollector, it seems that we
> need to:
>
> LeafReaderContext => reader() => get binary iterator => advanced to correct
> location
>
> Is this the correct way or actually we have a better API (since we already
> in that docId it seems to me that the binary DocValues should be readily
> available?
>
> Also do we have a way to see directly indexed data (Luke seems obsolete,
> Marple does not work with lucene 7.4.0 yet)?
>
> Thanks very much for helps, Lisheng

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: MultiPhraseQuery

2018-09-18 Thread Erick Erickson
bq. i wish the Javadocs has examples like PhraseQuery Javadocs gave.

This is where someone coming into the examples for the first time is
invaluable, javadoc patches are most welcome! It can be hard to back
off enough to remember what the confusing bits are when you wrote the
code ;)
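
For reference, the prefix-expansion step the quoted question asks about could
look roughly like this (a sketch only; "body" and the "app" prefix come from the
javadoc example, and indexReader/searcher are assumed to exist):

MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
builder.add(new Term("body", "microsoft"));

Set<Term> expanded = new LinkedHashSet<>();
BytesRef prefix = new BytesRef("app");
for (LeafReaderContext ctx : indexReader.leaves()) {
  Terms terms = ctx.reader().terms("body");
  if (terms == null) continue;                        // field absent in this segment
  TermsEnum te = terms.iterator();
  if (te.seekCeil(prefix) == TermsEnum.SeekStatus.END) continue;
  do {
    BytesRef t = te.term();
    if (!StringHelper.startsWith(t, prefix)) break;   // walked past the "app" range
    expanded.add(new Term("body", BytesRef.deepCopyOf(t)));
  } while (te.next() != null);
}
builder.add(expanded.toArray(new Term[0]));           // all these terms act as an OR at this position
MultiPhraseQuery mpq = builder.build();
TopDocs hits = searcher.search(mpq, 20);
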
On Tue, Sep 18, 2018 at 1:56 PM  wrote:
>
> Any suggestions please?
> Two main questions:
> - how do synonyms get utilized by MultiPhraseQuery?
> - how do we get second token "app" applied to the example on
> MultiPhraseQuery javadocs page? (and how do we get Terms[] array from
> Terms object?)
>
> Now three questions :)
>
> i wish the Javadocs has examples like PhraseQuery Javadocs gave.
>
> Best
>
> On 9/18/18 4:45 PM, baris.ka...@oracle.com wrote:
> > Trying to implement the example on
> > https://lucene.apache.org/core/6_6_1/core/org/apache/lucene/search/MultiPhraseQuery.html
> >
> > // A generalized version of PhraseQuery, with the possibility of
> > adding more than one term at the same position that are treated as a
> > disjunction (OR). To use this class to search for the phrase
> > "Microsoft app*" first create a Builder and use
> >
> > // MultiPhraseQuery.Builder.add(Term) on the term "microsoft"
> > (assuming lowercase analysis), then find all terms that have "app" as
> > prefix using LeafReader.terms(String), seeking to "app" then iterating
> > and collecting terms until there is no longer that prefix,
> >
> > // and finally use MultiPhraseQuery.Builder.add(Term[]) to add them.
> > MultiPhraseQuery.Builder.build() returns the fully constructed (and
> > immutable) MultiPhraseQuery.
> >
> >
> > IndexSearcher is = new IndexSearcher(indexReader);
> >
> > MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
> > builder.add(new Term("body", "one"), 0);
> >
> > Terms terms = LeafReader.terms("body"); // will this be slow? and how
> > do we incorporate token/word "app" here?
> >
> > // i STILL dont see how to get individual Term objects from terms
> > object and plus do i need to declare LeafReader object?
> >
> > Term[] termArr = new Term[k]; // i will get this filled via using
> > Terms.iterator
> > builder.add(termArr);
> > MultiPhraseQuery mpq = builder.build();
> > TopDocs hits = is.search(mpq, 20);// 20 hits
> >
> >
> > Best regards
> >
> >
> > On 9/18/18 4:16 PM, baris.ka...@oracle.com wrote:
> >> Hi,-
> >>
> >>  how does MultiPhraseQuery treat synonyms?
> >>
> >> is the following possible?
> >>
> >> ... (created index with synonyms and indexReader object has the index)
> >>
> >> IndexSearcher is = new IndexSearcher(indexReader);
> >>
> >> MultiPhraseQuery.Builder builder = new MultiPhraseQuery.Builder();
> >> builder.add(new Term("body", "one"), 0);
> >> builder.add(new Term("body", "two"), 1);
> >> MultiPhraseQuery mpq = builder.build();
> >> TopDocs hits = is.search(mpq, 20);// 20 hits
> >>
> >> Best regards
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Subscribe to lucene user list

2018-09-17 Thread Erick Erickson
http://lucene.apache.org/solr/community.html#mailing-lists-irc
On Mon, Sep 17, 2018 at 6:12 AM Anupam Rastogi  wrote:
>
> Hi,
>
>I would like to subscribe to Lucene user list.
>Thanks for all the help.
>
> Thanks,
> Anupam Rastogi

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any way to improve document fetching performance?

2018-08-28 Thread Erick Erickson
bq. It seems store field is not perform well.

Stored fields perform exactly as intended. Consider the situation
where very large text fields are stored. Making those into something
like docValues would be a very poor tradeoff, even if it were
possible. Not to mention highlighting etc. There are circumstances
where fetching from docValues actually has poorer overall performance
than using stored=true.

That said, the ability to use docValues fields in place of stored
(subject to certain restrictions that you should take the time to
understand) does indeed blur the distinction.

It's really a matter of choosing the use-case that supports the
use-case you require, there's no one-size-fits-all way to go about it.

Encoding/decoding in your own binary format? Will you be able to use
those values for things like faceting, grouping and sorting (which is
what docValues were designed to enhance)?
On Tue, Aug 28, 2018 at 2:11 AM alex stark  wrote:
>
> I simply tried MultiDocValues.getBinaryValues to fetch results by doc value; it
> improves things a lot, 2000 results take only 5 ms. I even encoded all the
> returnable fields into binary doc values and then decoded them, and the results
> are also good enough. It seems stored fields do not perform well. In our
> scenario (which I think is more common nowadays), the search phase should return
> as many results as possible so that the rank phase can re-sort the results with
> a machine learning algorithm (on other clusters). Fetching performance is also
> important.
>
> ---- On Tue, 28 Aug 2018 00:11:40 +0800 Erick Erickson wrote ----
> Don't use that call. You're exactly right, it goes out to disk, reads the doc,
> decompresses it (16K blocks minimum per doc IIUC) all just to get the field.
> 2,000 in 50ms actually isn't bad for all that work ;). This sounds like an XY
> problem. You're asking how to speed up fetching docs, but not telling us
> anything about _why_ you want to do this. Fetching 2,000 docs is not generally
> what Solr was built for, it's built for returning the top N where N is usually
> < 100, most frequently < 20. If you want to return lots of documents' data you
> should seriously look at putting the fields you want in docValues=true fields
> and pulling from there. The entire Streaming functionality is built on this and
> is quite fast.
> Best, Erick
>
> On Mon, Aug 27, 2018 at 7:35 AM  wrote:
> > can you post your query string?
> > Best
> >
> > On 8/27/18 10:33 AM, alex stark wrote:
> > > In same machine, no net latency. When I reduce to 500 limit, it takes 20ms,
> > > which is also slower than I expected. Btw, indexing is stopped.
> > > ---- On Mon, 27 Aug 2018 22:17:41 +0800  wrote ----
> > > yes, it should be less than a ms actually for those type of files. Index
> > > and search on the same machine? No net latency in between? Best
> > > On 8/27/18 10:14 AM, alex stark wrote:
> > > > quite small, just several simple short text stored fields. The total
> > > > index size is around 1 GB (2m docs).
> > > > ---- On Mon, 27 Aug 2018 22:12:07 +0800  wrote ----
> > > > Alex,- how big are those docs? Best regards
> > > > On 8/27/18 10:09 AM, alex stark wrote:
> > > > > Hello experts, I am wondering is there any way to improve document
> > > > > fetching performance; it appears to me that fetching from stored fields
> > > > > is quite slow. I simply tested using indexSearcher.doc() to get 2000
> > > > > documents, which takes 50ms. Is there any idea to improve that?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Any way to improve document fetching performance?

2018-08-27 Thread Erick Erickson
Don't use that call. You're exactly right, it goes out to disk, reads
the doc, decompresses it (16K blocks minimum per doc IIUC) all just to
get the field. 2,000 in 50ms actually isn't bad for all that work ;).

This sounds like an XY problem. You're asking how to speed up fetching
docs, but not telling us anything about _why_ you want to do this.
Fetching 2,000 docs is not generally what Solr was built for, it's
built for returning the top N where N is usually < 100, most
frequently < 20.

If you want to return lots of documents' data you should seriously
look at putting the fields you want in docValues=true fields and
pulling from there. The entire Streaming functionality is built on
this and is quite fast.
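
For the record, the doc-values route looks roughly like this (a sketch; "payload"
is a made-up field name that would have been indexed as a BinaryDocValuesField):

ScoreDoc[] hits = topDocs.scoreDocs.clone();
Arrays.sort(hits, (a, b) -> Integer.compare(a.doc, b.doc)); // iterators only move forward
BinaryDocValues dv = MultiDocValues.getBinaryValues(reader, "payload");
for (ScoreDoc sd : hits) {
  if (dv.advanceExact(sd.doc)) {
    BytesRef value = dv.binaryValue();
    // ... decode value here instead of calling searcher.doc(sd.doc) ...
  }
}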

Best,
Erick
On Mon, Aug 27, 2018 at 7:35 AM  wrote:
>
> can you post your query string?
>
> Best
>
>
> On 8/27/18 10:33 AM, alex stark wrote:
> > In same machine, no net latency. When I reduce to 500 limit, it takes 20ms,
> > which is also slower than I expected. Btw, indexing is stopped.
> > ---- On Mon, 27 Aug 2018 22:17:41 +0800  wrote ----
> > yes, it should be less than a ms actually for those type of files. Index and
> > search on the same machine? No net latency in between? Best
> > On 8/27/18 10:14 AM, alex stark wrote:
> > > quite small, just several simple short text stored fields. The total index
> > > size is around 1 GB (2m docs).
> > > ---- On Mon, 27 Aug 2018 22:12:07 +0800  wrote ----
> > > Alex,- how big are those docs? Best regards
> > > On 8/27/18 10:09 AM, alex stark wrote:
> > > > Hello experts, I am wondering is there any way to improve document
> > > > fetching performance; it appears to me that fetching from stored fields
> > > > is quite slow. I simply tested using indexSearcher.doc() to get 2000
> > > > documents, which takes 50ms. Is there any idea to improve that?
>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Question about usage of LuceneTestCase

2018-08-22 Thread Erick Erickson
bq. My understanding at this point is (though it may be a repeat of your words,)
first we should find out the combinations behind the failures.
If there are any particular patterns, there could be bugs, so we should fix
it.

You don't really have to figure out exactly what the combinations are,
just execute the test with the "reproduce with" flags set, cut/paste
the error message at the root of your local Solr source tree in a
command prompt.

ant test  -Dtestcase=CommitsImplTest
-Dtests.method=testGetSegmentAttributes -Dtests.seed=35AF58F652536895
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=de
-Dtests.timezone=Africa/Kigali -Dtests.asserts=true
-Dtests.file.encoding=UTF-8

That should reproduce exactly the same results from random() and
(hopefully) reliably reproduce the problem. Not sure how to mavenize
it, but you shouldn't need to if you have Solr locally. If it fails
every time, you can debug. I've had some luck just defining the
tests.seed in my IDE and running the test there (I use IntelliJ, but
I'm sure Eclipse and Netbeans etc. have an equivalent way to do
things). If just setting the seed as a sysvar in your IDE doesn't do
the trick, you can always define all of them in the IDE.

Even setting all the sysvars in the IDE doesn't always work. That is
executing the entire test from the command line can consistently fail
but defining all the sysvars in the IDE succeeds. But when it does
fail in the IDE it makes things _much_ easier ;)

Second question:

I doubt there's any real point in exercising Luke on non-FS based
indexes, so disabling the randomization of the filesystem seems fine.

See SolrTestCaseJ4, the "useFactory" method. You can do something like
this in your test:

@BeforeClass
public static void beforeTriLevelCompositeIdRoutingTest() throws Exception {
  useFactory(null); // uses Standard or NRTCaching, FS based anyway.
}

or even:

useFactory("solr.StandardDirectoryFactory");

I'm not sure about useFactory("org.apache.solr.core.HdfsDirectoryFactory");

Or if you're really adventurous:

@BeforeClass
public static void beforeTriLevelCompositeIdRoutingTest() throws Exception {
  switch (random().nextInt(2)) {
    case 0:
      useFactory(null); // uses Standard or NRTCaching, FS based anyway.
      break;
    case 1:
      useFactory("org.apache.solr.core.HdfsDirectoryFactory");
      break;
    // I guess whatever else you wanted...
  }
}


Frankly in this case I'd:

1> see if executing the full reproduce line consistently fails and if so
2> try using the above to disable other filesystems. If that
consistently succeeds, consider it done.

Since Luke is intended to be used on an existing index I don't see
much use in randomizing for edge cases. But that pre-supposes that
it's a problem with some of the directory implementations of course...

Best,
Erick

On Wed, Aug 22, 2018 at 8:13 AM, Tomoko Uchida
 wrote:
> Can I ask one more question.
>
> 4> If MIke's intuition that it's one of the file system randomizations
> that occasionally gets hit _and_ you determine that that's an invalid
> test case (and for Luke requiring that the FS-basesd tests are all
> that are necessary may be fine) I'm pretty sure you you can disable
> that randomization for your specific tests.
>
> As you may know, Luke calls relatively low Lucene APIs (such as
> o.a.l.u.IndexCommit or SegmentInfos) to show commit points, segment files,
> etc. ("Commits" tab do this.)
> I am not sure about when we could/should disable randomization, could you
> give me any cues for this? Or, real test cases that disable randomization
> are helpful for me, I will search Lucene/Solr code base.
>
> Thanks,
> Tomoko
>
> On Wed, Aug 22, 2018 at 21:58, Tomoko Uchida wrote:
>
>> Thanks for your kind explanations,
>>
>> sorry of course I know what is the randomization seed,
>> but your description and instruction is exactly what I wanted.
>>
>> > The randomization can cause different
>> > combinations of "stuff" to happen. Say the locale is randomized to
>> > Turkish and a token is also randomly generated that breaks _only_ with
>> > that combination. You'd never explicitly be able to test all of those
>> > kinds of combinations, thus the random() function. And there may be
>> > many calls to random() by the time a test is run.
>>
>> My understanding at this point is (though it may be a repeat of your
>> words,)
>> first we should find out the combinations behind the failures.
>> If there are any particular patterns, there could be bugs, so we should
>> fix it.
>>
>> Thanks,
>> Tomoko
>>
>> On Wed, Aug 22, 2018 at 14:59, Erick Erickson wrote:
>>
>>> The pseudo-random generato

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Erick Erickson
The pseudo-random generator in the Lucene test framework is used to
randomize lots of test conditions, we're talking about the file system
implementation here, but there are lots of others. Whenever you see a
call to random().whatever, that's the call to the framework's method.

But here's the thing. The randomization can cause different
combinations of "stuff" to happen. Say the locale is randomized to
Turkish and a token is also randomly generated that breaks _only_ with
that combination. You'd never explicitly be able to test all of those
kinds of combinations, thus the random() function. And there may be
many calls to random() by the time a test is run.

Here's the key. When "seeded" with the same number, the calls to
random() produce the exact same output every time. So say with seed1 I
get
nextInt() - 1
nextInt() - 67
nextBool() - true

Whenever I use 1 as the seed, I'll get exactly the above. However, if
I use 2 as a seed, I might get
nextInt() - 93
nextInt() - 63
nextBool() - false

So the short form is

1. randomization is used to try out various combinations.

2. using a particular seed guarantees that the randomization is repeatable.

3.  when a test fails with a particular seed, running the test with
the _same_ seed will produce the same conditions, hopefully allowing
that particular error resulting from that particular combination to be
reproduced reliably (and fixed).

4. at least that's the theory and in practice it works quite well.
There is no _guarantee_ that the test will fail using the same seed,
sometimes the failures are a result of subtle timing etc, which is not
under control of the randomization. I breathe a sigh of relief,
though, when a test _does_ reproduce with a particular seed 'cause
then I have a hope of knowing the issue is actually fixed ;).


Best,
Erick

On Tue, Aug 21, 2018 at 3:56 PM, Tomoko Uchida
 wrote:
> Thanks a lot for your information & insights,
>
> I will try to reproduce the errors and investigate the results.
> And, maybe I should learn more about internal of the test framework,
> I'm not familiar with it and still do not understand what does "seed" means
> exactly in this context.
>
> Regards,
> Tomoko
>
> 2018年8月22日(水) 1:05 Erick Erickson :
>
>> Couple of things (and I know you've been around for a while, so pardon
>> me if it's all old hat to you):
>>
>> 1> if you run the entire "reproduce with" line and can get a
>> consistent failure, then you are half way there, nothing is as
>> frustrating as not getting failures reliably. The critical bit is
>> often the -Dtests.seed. As Michael mentioned, there are various
>> randomizations done for _many_ things in Lucene tests using a random
>> generator.  tests.seed, well, seeds that generator so it produces the
>> same numbers every time it's run with that seed. You'll see lots of
>> calls to a static ramdom() method calls. I'll add that if you want to
>> use randomness in your tests, use that method and do _not_ use a local
>> instance of Java's Random.
>>
>> 2> MIke: You say IntelliJ succeeds. But that'll use a new random()
>> seed. Once you run a test, in the upper right (on my version at
>> least), IntelliJ will show you a little box with the test name and you
>> can "edit configurations" on it. I often have luck by editing the
>> configuration and adding the test seed to the "VM option" box for the
>> test, just the "-Dtests.seed=35AF58F652536895" part. You can add all
>> of the -D flags in the "reproduce with" line if you want, but often
>> just the seed works for me. If that works and you track it down, do
>> remember to take that seed _out_ of the "VM options" box rather than
>> forget it as I have done ;)
>>
>> 3> Mark Miller's beasting script can be used to run a zillion tests
>> over night: https://gist.github.com/markrmiller/dbdb792216dc98b018ad
>>
>> 4> If MIke's intuition that it's one of the file system randomizations
>> that occasionally gets hit _and_ you determine that that's an invalid
>> test case (and for Luke requiring that the FS-basesd tests are all
>> that are necessary may be fine) I'm pretty sure you you can disable
>> that randomization for your specific tests.
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 21, 2018 at 7:47 AM, Tomoko Uchida
>>  wrote:
>> > Hi, Mike
>> >
>> > Thanks for sharing your experiments.
>> >
>> >> CommitsImplTest.testListCommits
>> >> CommitsImplTest.testGetCommit_generation_notfound
>> >> CommitsImplTest.testG

Re: Question about usage of LuceneTestCase

2018-08-21 Thread Erick Erickson
Couple of things (and I know you've been around for a while, so pardon
me if it's all old hat to you):

1> if you run the entire "reproduce with" line and can get a
consistent failure, then you are half way there, nothing is as
frustrating as not getting failures reliably. The critical bit is
often the -Dtests.seed. As Michael mentioned, there are various
randomizations done for _many_ things in Lucene tests using a random
generator.  tests.seed, well, seeds that generator so it produces the
same numbers every time it's run with that seed. You'll see lots of
calls to a static random() method. I'll add that if you want to
use randomness in your tests, use that method and do _not_ use a local
instance of Java's Random (see the small sketch after this list).

2> Mike: You say IntelliJ succeeds. But that'll use a new random()
seed. Once you run a test, in the upper right (on my version at
least), IntelliJ will show you a little box with the test name and you
can "edit configurations" on it. I often have luck by editing the
configuration and adding the test seed to the "VM option" box for the
test, just the "-Dtests.seed=35AF58F652536895" part. You can add all
of the -D flags in the "reproduce with" line if you want, but often
just the seed works for me. If that works and you track it down, do
remember to take that seed _out_ of the "VM options" box rather than
forget it as I have done ;)

3> Mark Miller's beasting script can be used to run a zillion tests
over night: https://gist.github.com/markrmiller/dbdb792216dc98b018ad

4> If Mike's intuition that it's one of the file system randomizations
that occasionally gets hit _and_ you determine that that's an invalid
test case (and for Luke requiring that the FS-based tests are all
that are necessary may be fine) I'm pretty sure you can disable
that randomization for your specific tests.
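
A tiny sketch of what the random() usage from point 1 looks like in practice
(the class and values here are made up, purely to illustrate using the
framework's generator rather than a local Random):

public class MyComponentTest extends LuceneTestCase {
  public void testRandomizedInput() throws Exception {
    // random() is the framework's seeded generator; re-running with the same
    // -Dtests.seed reproduces exactly these values.
    int docCount = random().nextInt(100) + 1;
    boolean useCompoundFile = random().nextBoolean();
    // ... build an index of docCount docs and assert on it ...
  }
}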

Best,
Erick

On Tue, Aug 21, 2018 at 7:47 AM, Tomoko Uchida
 wrote:
> Hi, Mike
>
> Thanks for sharing your experiments.
>
>> CommitsImplTest.testListCommits
>> CommitsImplTest.testGetCommit_generation_notfound
>> CommitsImplTest.testGetSegments
>> DocumentsImplTest.testGetDocumentFIelds
>
> I also found CommitsImplTest and DocumentsImplTest fail frequently,
> especially CommitsImplTest is unhappy with lucene test framework (I pointed
> that in my previous post.)
>
>> I wonder if this is somehow related to running mvn from command line vs
> running in IntelliJ since previously I was doing the latter
>
> In my personal experience, when I was running those suspicious tests on
> IntelliJ IDEA, they were always green - but I am not sure that `mvn test`
> is the cause.
>
> Thanks,
> Tomoko
>
> On Tue, Aug 21, 2018 at 22:53, Michael Sokolov wrote:
>
>> I was running these luke tests a bunch and found the following tests fail
>> intermittently; pretty frequently. Once I @Ignore them I can get a
>> consistent pass:
>>
>>
>> CommitsImplTest.testListCommits
>> CommitsImplTest.testGetCommit_generation_notfound
>> CommitsImplTest.testGetSegments
>> DocumentsImplTest.testGetDocumentFIelds
>>
>> I did not attempt to figure out why the tests were failing, but to do that,
>> I would:
>>
>> Run repeatedly until you get a failure -- save the test "seed" from this
>> run that should be printed out in the failure message Then you should be
>> able to reliably reproduce this failure by re-running with system property
>> "tests.seed" set to that value. This is used to initialize the
>> randomization that LuceneTestCase does.
>>
>> My best guess is that the failures may have to do with randomly using some
>> Directory implementation or other Lucene feature that Luke doesn't properly
>> handle?
>>
>> Hmm I was trying this again to see if I could get an example, and strangely
>> these tests are no longer failing for me after several runs, when
>> previously they failed quite often. I wonder if this is somehow related to
>> running mvn from command line vs running in IntelliJ since previously I was
>> doing the latter
>>
>> -Mike
>>
>> On Tue, Aug 21, 2018 at 9:01 AM Tomoko Uchida <
>> tomoko.uchida.1...@gmail.com>
>> wrote:
>>
>> > Hello,
>> >
>> > Could you give me some advice or comments about usage of LuceneTestCase.
>> >
>> > Some of our unit tests extending LuceneTestCase fail by assertion error
>> --
>> > sometimes, randomly.
>> > I suppose we use LuceneTestCase in inappropriate way, but cannot find out
>> > how to fix it.
>> >
>> > Here is some information about failed tests.
>> >
>> >  * The full test code is here:
>> >
>> >
>> https://github.com/DmitryKey/luke/blob/master/src/test/java/org/apache/lucene/luke/models/commits/CommitsImplTest.java
>> >  * We run tests by `mvn test` on Mac PC or Travis CI (oracle jdk8/9/10,
>> > openjdk 8/9/10), assertion errors occur regardless of platform or jdk
>> > version.
>> >  * Stack trace of an assertion error is at the end of this mail.
>> >
>> > Any advice are appreciated. Please tell me if more information is needed.
>> >
>> > Thanks,
>> > Tomoko
>> >
>> >
>> > ---

Re: Question about threading in search

2018-08-17 Thread Erick Erickson
Please don't optimize to 1 segment unless you can afford to do it
quite regularly, see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

(NOTE: this is changing as of 7.5, see:
https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/).

bq. It seems that search is sometimes faster with multiple segments.

In addition to what Toke said, this may just be an autowarming
problem. Measurements mean little/nothing unless they're performed on
a warmed-up index since there's quite a bit of reading from disk into
the heap and OS memory space that's required. You may just be seeing
that.

Best,
Erick


On Fri, Aug 17, 2018 at 2:26 AM, Toke Eskildsen  wrote:
> On Sat, 2017-09-02 at 18:33 -0700, Peilin Yang wrote:
>> we're comparing two different indexes on the same collection - one
>> with lots of different segments (default settings), and one with a
>> force merged into one segment. It seems that search is sometimes
>> faster with multiple segments.
>
> If you are using Lucene 7+ and if some of the fields you are requesting
> as part of your search result are stored as DocValues, you might have
> encountered a performance regression with the streaming API:
> https://issues.apache.org/jira/browse/LUCENE-8374
>
> One peculiar effect of this issue is that fewer larger segments gets
> slower DocValues retrieval, compared to more smaller segments. So a
> force merge to 1 segment can result in worse performance.
>
> - Toke Eskildsen, the Royal Danish Library, Denmark
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: testing with system properties

2018-08-09 Thread Erick Erickson
See TestSolrXml.java for an example of:

@Rule
public TestRule solrTestRules = RuleChain.outerRule(new SystemPropertiesRestoreRule());
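
With that rule in place, a test can set a property freely and rely on it being
restored afterwards (a sketch; the property name is made up):

public void testWithTemporaryProperty() {
  System.setProperty("my.feature.flag", "true");  // rolled back after the test by the rule
  // ... exercise code that reads the property ...
}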

Best,
Erick

On Thu, Aug 9, 2018 at 2:33 PM, Michael Sokolov  wrote:
> I ran into a need to test some functionality that relies on system
> properties. Writing the test was error-prone because the properties persist
> across the jvm so if you set them in a test they leak across to other tests
> unless you are careful about @After methods. It occurred to me it would be
> nice if LuceneTestCase would detect this and yell. It could save all the
> system properties before each test (or at least each test class) and see if
> they are restored at the end.  I don't know if this arises much in Lucene,
> but maybe in Solr?
>
> -Mike

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene index file gets corrupted while creating index with 2 nodes.

2018-07-31 Thread Erick Erickson
There is no chance anyone will try to change the code for 3.6, so
raising a JIRA is pointless.

see: 
http://lucene.472066.n3.nabble.com/Issues-with-locked-indices-td4339180.html

Uwe is very knowledgeable in this area, so I'd strongly recommend you
follow his advice.

Best,
Erick

On Tue, Jul 31, 2018 at 2:33 AM, Bhavin Chheda  wrote:
> Hi,
>
>
> The lucene index file gets corrupted during loadtest of 15 min :-  creating
> the index with 2 nodes and 60 concurrent users.
>
> I am using Lucene 3.6 version. The index is created in NFS.
>
> Please let me know does lucene create index works on multiple nodes with
> NFS.
>
>
>
> The error exception is org.apache.lucene.LockObtainFailedException: Lock
> obtain timed out.
>
>
>
>
> Please let me know - can I post above error in Issue Tracker -
> lucene.apache.org.
>
>
> Regards,
>
> Bhavin

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WildcardQuery question

2018-07-23 Thread Erick Erickson
This just implements ReversedWildcardFilter which, while it lives in Solr,
should be readily adaptable to Lucene-only use.

In general you want to do some trick like this. Otherwise it doesn't
scale well as conceptually Lucene has to enumerate
_all_ the terms to assemble the actual list of contained terms. I.e.
say you are searching for

*what

There's no way to know what terms match unless you look at them all,
which doesn't scale when the number of terms
gets huge.

You might be able to do something with ngrams too.
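
If index-time tricks aren't an option, the classic QueryParser can also be told
to accept a leading wildcard directly, at the cost of exactly that full term
enumeration (a sketch; "body" and the analyzer are placeholders):

QueryParser parser = new QueryParser("body", analyzer);
parser.setAllowLeadingWildcard(true);   // off by default
Query q = parser.parse("*what");        // now legal, but potentially slow on big term dictionaries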

Best,
Erick

On Mon, Jul 23, 2018 at 12:07 PM, Evert Wagenaar
 wrote:
> Thanks Eric,
>
> I see only Solr documents in there. My solution is 100% Lucene.
>
> Regards,
>
> Evert
>
> On Mon, Jul 23, 2018 at 7:56 PM Erick Erickson 
> wrote:
>
Take a look at ReversedWildcardFilterFactory:
>>
>> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html
>>
>> Best,
>> Erick
>>
>> On Mon, Jul 23, 2018 at 7:53 AM, Evert Wagenaar
>>  wrote:
>> > Hello all,
>> >
>> > I have a WebApp (see http://ejwagenaar.com/index.php/Lingoweb/) which
>> makes
>> > extensive use of wildcardquery. I want to enable the first character(s)
>> > too. How can I enable this?
>> >
>> > Many thanks,
>> >
>> > Evert Wagenaar
>> > --
>> > Sent from Gmail IPad
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>> --
> Sent from Gmail IPad

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: WildcardQuery question

2018-07-23 Thread Erick Erickson
Take a look at ReversedWildcardFilterFactory:

https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html

Best,
Erick

On Mon, Jul 23, 2018 at 7:53 AM, Evert Wagenaar
 wrote:
> Hello all,
>
> I have a WebApp (see http://ejwagenaar.com/index.php/Lingoweb/) which makes
> extensive use of wildcardquery. I want to enable the first character(s)
> too. How can I enable this?
>
> Many thanks,
>
> Evert Wagenaar
> --
> Sent from Gmail IPad

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene stringField EnglishAnalyzer search not working

2018-07-20 Thread Erick Erickson
Why so complicated? Boosts do you no good,
you're only trying to find one document. Boosts influence
the score of documents in the ranking, but there's
only one.

I suspect if you looked at the debug form of the parsed query,
you'd find it pretty unexpected. You say it works  with
text fields, but I also suspect it's not doing quite what
you expect.

And it's all unnecessary and overly expensive anyway.
Just use a TermQuery. Term queries are very stupid and
do _no_ analysis of the term provided, they expect that
you already have done all that. They're very, very
efficient.
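
In code that's roughly (a sketch, reusing the field and value from the question below):

Query q = new TermQuery(new Term("uuid", product.getId()));  // no parser, no analysis
TopDocs hits = searcher.search(q, 1);                        // a unique id matches at most one doc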

Best,
Erick

On Fri, Jul 20, 2018 at 9:30 AM, egorlex  wrote:
> Hi Erick,
>
> Thanks for reply,
>
> I need to find a document by uuid (a randomly generated field).
>
> Example of create field in document:
>
> Document document = new Document();
> document.add(new StringField("uuid", product.getId() , Field.Store.YES));
>
> Example of query to search:
>
> Map<String, Float> boosts = new HashMap<>();
>   boosts.put("uuid", 1.0f);
>   boosts.put("name", 1.0f);
>
> String[] fields = new String[]{"uuid",  "name"};
>
> MultiFieldQueryParser multiQP = new MultiFieldQueryParser(fields, engAnalyzer, boosts);
> Query parsedQuery = multiQP.parse(queryString);
>
> TopDocs hits = searcher.search(parsedQuery, maxResult, sort);
>
> I want to find the document by uuid but the result is always empty.
>
> Everything works if I create the document with a text field:
> document.add(new TextField("uuid", product.getId(), Field.Store.YES));
> But I need an exact match... this approach is good for name but not for uuid.
>
> Lucene version 7.3.0
>
> Thanks.
>
>
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Lucene stringField EnglishAnalyzer search not working

2018-07-20 Thread Erick Erickson
Please provide specific examples of what you mean.
along with the fieldType you tried, an example of
what the input at index time for the field, and examples
of what searches "didn't work". What exactly did
you expect to happen that didn't?

You might review:
https://wiki.apache.org/solr/UsingMailingLists

But at a blind guess, I'd expect you want a
solr.TextField-based field with
KeywordTokenizer and LowercaseFilter.

There are usually examples in the distributed schemas, but
it's impossible to tell in your case because you haven't
told us what version of Solr you're using.

Best,
Erick

On Fri, Jul 20, 2018 at 4:22 AM, egorlex  wrote:
> Hi, please help..
>
> I need an exact match for my search on one field. I made it a StringField,
> but it is not working: no search results. I tried lowercasing but still got no
> results in search. I tried a text field and it works, but it is not an exact
> match.
> I use EnglishAnalyzer.
>
> Thanks!
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Lucene-Java-Users-f532864.html
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Grant Ingersoll's 2009 blog article- is there a newer version?

2018-07-05 Thread Erick Erickson
Maybe look at the Solr payload code to see how to do it in Lucene?

But yeah, that article is quite out of date.
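
As a rough pointer for the Lucene 6.x API mentioned below, a payload-scored term
query looks something like this (a sketch only; the index-time payload setup,
e.g. with DelimitedPayloadTokenFilter, is assumed and not shown):

SpanTermQuery term = new SpanTermQuery(new Term("body", "abort"));
PayloadScoreQuery q = new PayloadScoreQuery(term, new AveragePayloadFunction());
TopDocs hits = searcher.search(q, 10);  // scores now factor in the decoded payloads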

On Thu, Jul 5, 2018 at 8:23 AM,   wrote:
> Thanks, I saw these posts but Grant's article is based on Lucene.
>
> I am not using Solr. Many classes in that article do not exist in the latest
> versions of Lucene, like version 6.1.
>
> For instance, BoostingTermQuery does not exist in 6.1, and the way docs are
> indexed is also different in 6.1.
>
> There is a new class PayloadScoreQuery but there are no examples like this
> great article showing how to put them together.
>
> Best regards
>
>
> On 7/5/18 11:18 AM, Ishan Chattopadhyaya wrote:
>>
>> Try these, maybe?
>>
>>
>> https://lucidworks.com/2017/09/14/solr-payloads/
>>
>> http://www.textsearch.io/?p=5
>>
>> On Thu, Jul 5, 2018 at 8:26 PM,  wrote:
>>
>>> Hi,-
>>>   Is there a newer version of this great article from Mr. Grant
>>> Ingersoll?
>>>
>>>
>>> https://lucidworks.com/2009/08/05/getting-started-with-payloads/
>>> Thanks
>>>
>>> This article is based on Lucene 2.9.
>>> Best regards
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Size of Document

2018-07-04 Thread Erick Erickson
I think we're not talking about the same thing.

You asked "How can I calculate the total size of a Lucene Document"...

I was responding to the Terry's comment "In the document types I
usually index (.pdf, .docx/.doc, .eml), there exists a metadata field
called "stream_size" that contains the size of the document on disk. "

Two totally different beasts. One is the source document, the other is
what you choose to put into the index from that document. Not to even
mention that you could, for instance, choose to index only the title
and throw everything else away so the size of the raw document on disk
doesn't seem useful for your case.
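
If the number you actually need is "bytes handed to the writer" rather than the
size of the source file, a rough, hedged approximation is to sum the field values
on the Document before adding it (this ignores index structures, doc values and
compression, so it is only an estimate):

long approxBytes = 0;
for (IndexableField f : doc.getFields()) {
  String s = f.stringValue();
  if (s != null) approxBytes += s.getBytes(StandardCharsets.UTF_8).length;
  BytesRef b = f.binaryValue();
  if (b != null) approxBytes += b.length;
}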

Best,
Erick

On Wed, Jul 4, 2018 at 9:24 AM, Chris Bamford  wrote:
> Hi Erick
>
> Yes, size on disk is what I’m after as it will feed into an eventual 
> calculation regarding actual bytes written (not interested in the source data 
> document size, just real disk usage).
> Thanks
>
> Chris
>
> Sent from my iPhone
>
>> On 4 Jul 2018, at 17:08, Erick Erickson  wrote:
>>
>> But does size on disk help? If the doc has a zillion
>> images in it, those aren't part of the resulting index
>> (I'm excluding stored data here)
>>
>>> On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen  wrote:
>>> In the document types I usually index (.pdf, .docx/.doc, .eml), there
>>> exists a metadata field called "stream_size" that contains the size of
>>> the document on disk.  You don't have to compute it.  Thus, when you
>>> retrieve each document you can pull out the contents of this field and,
>>> if you like, include it in each hitlist entry.
>>>
>>>
>>>> On 07/04/2018 05:26 AM, Chris and Helen Bamford wrote:
>>>> Hi there,
>>>>
>>>> How can I calculate the total size of a Lucene Document that I'm about
>>>> to write to an index so I know how many bytes I am writing please?  I
>>>> need it for some external metrics collection.
>>>>
>>>> Thanks
>>>>
>>>> - Chris
>>>>
>>>> -
>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>>
>>>>
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Size of Document

2018-07-04 Thread Erick Erickson
But does size on disk help? If the doc has a zillion
images in it, those aren't part of the resulting index
(I'm excluding stored data here)

On Wed, Jul 4, 2018 at 7:49 AM, Terry Steichen  wrote:
> In the document types I usually index (.pdf, .docx/.doc, .eml), there
> exists a metadata field called "stream_size" that contains the size of
> the document on disk.  You don't have to compute it.  Thus, when you
> retrieve each document you can pull out the contents of this field and,
> if you like, include it in each hitlist entry.
>
>
> On 07/04/2018 05:26 AM, Chris and Helen Bamford wrote:
>> Hi there,
>>
>> How can I calculate the total size of a Lucene Document that I'm about
>> to write to an index so I know how many bytes I am writing please?  I
>> need it for some external metrics collection.
>>
>> Thanks
>>
>> - Chris
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: lucene hadoop index

2018-06-11 Thread Erick Erickson
I think you're far more likely to find people who know the details on
the Hadoop mailing list

On Mon, Jun 11, 2018 at 2:13 AM, Yonghui Zhao  wrote:
> I found there was
> "org.apache.hadoop.contrib.index.lucene.FileSystemDirectory" for lucene in
> hadoop old version.
>
> http://www.massapi.com/class/org/apache/hadoop/contrib/index/lucene/FileSystemDirectory.html
>
>
> But I don't find this in recent hadoop code base.
>
> Is there any plugin support  new lucene hadoop index?

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Deletions in NRTCachingDirectory

2018-05-14 Thread Erick Erickson
" I have a use case of bulk deletions with solr and want to understand
using soft commits will help or not."

Will help with what? You haven't told us what the problem you're
worried about is. This
might help:
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
and
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/

Best,
Erick

On Mon, May 14, 2018 at 1:10 AM, Shivam Omar
 wrote:
> Hi,
>
> I need to understand whether deletions get put in memory by the 
> NRTCachingDirectory or not. I have a use case of bulk deletions with Solr and 
> want to understand whether using soft commits will help or not. Please suggest.
>
> Shivam
> DISCLAIMER
> This email and any files transmitted with it are intended solely for the 
> person or the entity to whom they are addressed and may contain information 
> which is Confidential and Privileged. Any misuse of the information contained 
> in this email, including but not limited to retransmission or dissemination 
> of the said information by person or entities other than the intended 
> recipient is unauthorized and strictly prohibited. If you are not the 
> intended recipient of this email, please delete this email and contact the 
> sender immediately.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: SortingMergePolicy is removed in 7.2.1?

2018-04-10 Thread Erick Erickson
I found it in 
.../solr/core/src/java/org/apache/solr/index/SortingMergePolicy.java
and the associated factory in
.../solr/core/src/java/org/apache/solr/index/SortingMergePolicyFactory.java
so I'm not sure what you're having trouble with

Best,
Erick

On Tue, Apr 10, 2018 at 4:56 AM, Yonghui Zhao  wrote:
> I can't find this class now. What is its replacement?
>
> Thanks!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Storage of indexed and stored fields (Space and Performance)

2018-03-15 Thread Erick Erickson
Stored data is kept in separate segment files (*.fdt and *.fdx). As
such they have no measurable impact on query time. All the data for
executing searches is kept in other extensions in each segment and
accessed separately.

Adding stored data does increase the size on disk by roughly 50% of
the number of bytes stored (i.e. if I have a field with 128 bytes, the
_stored_ portion of the data will occupy roughly 64 bytes) and will
add some I/O but by and large the effects can be ignored.
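
For the example data in the question below, the distinction looks like this at
index time (a sketch):

Document doc = new Document();
doc.add(new StringField("metal", "Gold", Field.Store.YES));     // indexed and stored
doc.add(new StoredField("partition", "Partition1"));            // stored only: lives in .fdt/.fdx
doc.add(new StringField("schema", "Schema1", Field.Store.NO));  // indexed only: searchable, not returned
writer.addDocument(doc);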

Best,
Erick

On Thu, Mar 15, 2018 at 6:48 AM, Rajnish kamboj
 wrote:
> Hi
>
>
>
> How are indexed and stored fields treated by Lucene w.r.t space and
> performance?
>
> Is there any performance hit with stored fields which are indexed?
>
>
>
> Lucene Version: 5.3.1
>
>
>
> Assumption:
>
> Stored fields are just simple strings (not huge documents)
>
>
>
> Example:
>
> Data: [101, Gold]; [102, Silver]; [103, Gold]
>
> Additional Data: Stored and indexed as well: Partition1, Partition2,
> Schema1, Version1 etc. depending on data
>
>
>
> Index:
>
> Gold: 101 (Partition1, Schema1, Version1) , 103 (Partition2, Schema1,
> Version1)
>
> Silver: 102 (Partition1, Schema1, Version1)
>
> Partition1: 101 (Partition1, Schema1, Version1), 102 (Partition1, Schema1,
> Version1)
>
> Partition2: 103 (Partition1, Schema1, Version1)
>
> Schema1: 101 (Partition1, Schema1, Version1), 102 (Partition1, Schema1,
> Version1), 103 (Partition1, Schema1, Version1)
>
> Version1: 101 (Partition1, Schema1, Version1), 102 (Partition1, Schema1,
> Version1), 103 (Partition1, Schema1, Version1)
>
>
>
>
> Is it how the index will look like? i.e. stored fields will be replicated
> with each data field?
>
>
>
>
>
> Thanks & Regards
>
> Rajnish

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: getting Lucene Docid from inside score()

2018-03-10 Thread Erick Erickson
I was thinking this was a Solr question rather than a Lucene one so
the [docid] bit doesn't apply if you're in the lucene code. If you
_are_ really going from solr, just put [docid] in your Solr "fl" list.
Look in the Solr ref guide for an explanation:
https://lucene.apache.org/solr/guide/6_6/transforming-result-documents.html

If you _are_ doing this in the Lucene code, Isn't what you want just
the "doc" member variable of a ScoreDoc?

Best,
Erick


On Sat, Mar 10, 2018 at 4:43 AM, dwaipayan@gmail.com
 wrote:
> Hi Erick,
>
> Many thanks for your reply and explanation.
>
> I really want this to work. The good news for me is, the index is static, 
> there is no chance of any modification of the index.
>
>> Luke and the like are using a point-in-time snapshot of the index.
>
> I want to get that lucene-assigned docid, the same id that is returned, after 
> performing a search(), in the form of topDocs.scoreDocs.
> ScoreDoc[] hits;
> indexSearcher.search(luceneQuery, collector);
> topDocs = collector.topDocs();
> hits = topDocs.scoreDocs;
> System.out.println(hits[0].doc);   // I want this docid 
> inside score()
>
>> If you still want to get the internal ID, just specify the
>> pseudo-field [docid], as: "fl=id,[docid]"
>
> I didn't get your suggestion properly. Can you please explain a little? I 
> will be waiting for you reply.
>
> With regards,
>
> Dwaipayan..
>
> On 2018/03/09 20:04:59, Erick Erickson  wrote:
>> You almost certainly do _not_ want this unless you are absolutely and
>> totally sure that your index does not change between the time you ask
>> for for the internal Lucene doc ID and the time you use it. No docs
>> may be added. No forceMerges are done. In fact, I'd go so far as to
>> say you shouldn't open any new searchers.
>>
>> Here's the reason. Say I have a single segment index with internal doc
>> IDs 1, 2, 3, 4, 5. Say I delete docs 2 and 3. Now say I optimize, the
>> new segment has IDs 1, 2, 3. This a simplification to illustrate that
>> _whenever_ a segment gets rewritten for any reason, internal Lucene
>> doc IDs may change. All this goes on in the background and you have no
>> control over when.
>>
>> Docs may even get renumbered relative to each other. Let's claim that
>> your SOlr ID is doc1 and its associated internal ID is 1. doc100 has
>> internal id 100. Segment merging could assign doc1 an id of 200 and
>> doc100 an id of 150. You just don't know.
>>
>> Luke and the like are using a point-in-time snapshot of the index.
>>
>> If you still want to get the internal ID, just specify the
>> pseudo-field [docid], as: "fl=id,[docid]"
>>
>> Best,
>> Erick
>>
>> On Fri, Mar 9, 2018 at 3:50 AM, dwaipayan@gmail.com
>>  wrote:
>> > Thank you very much for your reply. Yes, I really want this (for
>> > implementing a retrieval function that extends the LMDir function).
>> > Precisely, I want the document numbering same as that we see in
>> > Lucene-Index-Viewers like Luke.
>> >
>> > I am not sure what you meant by "segment offset, held by a leaf reader"..
>> > Can you please explain a little, exactly when and what I need to do?
>> >
>> > Many thanks.
>> >
>> > On 2018/03/09 11:25:44, Michael Sokolov  wrote:
>> >> Are you sure you want this? Lucene docids aren't generally useful outside 
>> >> a
>> >> narrow internal context. They can change over time for example.
>> >>
>> >> But if you do, it sounds like maybe what you are seeing is the per segment
>> >> docid. To get a global one you have to add the segment offset, held by a
>> >> leaf reader.
>> >>
>> >> On Mar 9, 2018 5:06 AM, "Dwaipayan Roy"  wrote:
>> >>
>> >> > While searching, I want to get the lucene assigned docid (that starts 
>> >> > from
>> >> > 0 to the number of documents -1) of a document having a particular query
>> >> > term.
>> >> >
>> >> > From inside the score(), printing 'doc' or calling docId() is returning 
>> >> > a
>> >> > docid which, I think, is the internal docid of a segment in which the
>> >> > document is indexed. However, I want to have the lucene assigned docid. 
>> >> > How
>> >> > to do that?
>> >> >
>> >> > Dwaipayan..
>> >> >
>> >>

Re: getting Lucene Docid from inside score()

2018-03-09 Thread Erick Erickson
You almost certainly do _not_ want this unless you are absolutely and
totally sure that your index does not change between the time you ask
for for the internal Lucene doc ID and the time you use it. No docs
may be added. No forceMerges are done. In fact, I'd go so far as to
say you shouldn't open any new searchers.

Here's the reason. Say I have a single segment index with internal doc
IDs 1, 2, 3, 4, 5. Say I delete docs 2 and 3. Now say I optimize, the
new segment has IDs 1, 2, 3. This a simplification to illustrate that
_whenever_ a segment gets rewritten for any reason, internal Lucene
doc IDs may change. All this goes on in the background and you have no
control over when.

Docs may even get renumbered relative to each other. Let's claim that
your SOlr ID is doc1 and its associated internal ID is 1. doc100 has
internal id 100. Segment merging could assign doc1 an id of 200 and
doc100 an id of 150. You just don't know.

Luke and the like are using a point-in-time snapshot of the index.

If you still want to get the internal ID, just specify the
pseudo-field [docid], as: "fl=id,[docid]"

Best,
Erick

On Fri, Mar 9, 2018 at 3:50 AM, dwaipayan@gmail.com
 wrote:
> Thank you very much for your reply. Yes, I really want this (for
> implementing a retrieval function that extends the LMDir function).
> Precisely, I want the document numbering same as that we see in
> Lucene-Index-Viewers like Luke.
>
> I am not sure what you meant by "segment offset, held by a leaf reader"..
> Can you please explain a little, exactly when and what I need to do?
>
> Many thanks.
>
> On 2018/03/09 11:25:44, Michael Sokolov  wrote:
>> Are you sure you want this? Lucene docids aren't generally useful outside a
>> narrow internal context. They can change over time for example.
>>
>> But if you do, it sounds like maybe what you are seeing is the per segment
>> docid. To get a global one you have to add the segment offset, held by a
>> leaf reader.
>>
>> On Mar 9, 2018 5:06 AM, "Dwaipayan Roy"  wrote:
>>
>> > While searching, I want to get the lucene assigned docid (that starts from
>> > 0 to the number of documents -1) of a document having a particular query
>> > term.
>> >
>> > From inside the score(), printing 'doc' or calling docId() is returning a
>> > docid which, I think, is the internal docid of a segment in which the
>> > document is indexed. However, I want to have the lucene assigned docid. How
>> > to do that?
>> >
>> > Dwaipayan..
>> >
>>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: [EXTERNAL] - Re: Is docvalue sorted by value?

2018-03-06 Thread Erick Erickson
OK, you're asking a different question I think.

See SOLR-5730 and SOLR-8621, particularly SOLR-5730. This will work
only a single field which you decide at index time. You can still sort
by any field at the same expense as now, but since your docs are
ordered by one field the early termination part won't be applicable to
other fields.
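
At the Lucene level the same idea is exposed through
IndexWriterConfig.setIndexSort (6.2+). A minimal sketch, assuming every
document carries a NumericDocValuesField("timestamp", ...); the analyzer and
index path are illustrative:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

static IndexWriter sortedWriter() throws java.io.IOException {
  IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
  // Segments are written pre-sorted by this one field; searches sorted the
  // same way can terminate early instead of visiting every matching doc.
  iwc.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
  return new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), iwc);
}
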

Best,
Erick

On Mon, Mar 5, 2018 at 6:28 PM, Tony Ma  wrote:
> Hi Erick,
>
> I raise this question is about the sorting scenario as you mentioned in #2.
>
> If the hit docs are about 100, and my query just want top 2. If the values 
> are not sorted, it has to iterate all 100 docs and find top2 in a priority 
> queue. If the values are already sorted, it just need to iterate first 2. If 
> the query is unselective, the hit doc might be huge, pre-sort or not will 
> have big differences.
>
> I understand your thinking that if the doc values are not persisted with doc 
> id sequence, it is unable to retrieve field value by doc id.
>
> Actually, I am just wondering how lucene handle the sorting scenario, is 
> iterating all values of all docs unavoidable?
>
>
> On 3/6/18, 6:50 AM, "Erick Erickson"  wrote:
>
> I think there are two issues here that are being conflated
> 1> _within_ a document, i.e. for a multi-valued field the values are
> stored as Dominik says as a SORTED_SET. Not only will they be returned
> (if you return from docValues rather than stored) in lexical order,
> but identical values will be collapsed
>
> 2> across multiple documents, the question about  "...persisted with
> order of values, not document id..." really makes no sense. The point
> of DocValues is to answer the question "for document X what is the
> value of field Y". X here is the _internal_ document ID. Now consider
> a search. There are two documents that are hits, doc 35 and doc 198
> (internal lucene doc ID). To sort them by field Y you have to know
> what the value in that field is for those two docs is. How would
> "pre-ordering" the values help here? If I have the _values_ in order,
> I have no clue what docs are associated with them. That question is
> what the "inverted index" is there to answer.
>
> So I have doc 35 and 198. Think of DocValues as a large array indexed
> by internal doc id. To know how these two docs sort all I have to do
> is index into the array. It's slightly more complicated than that, but
> conceptually that's what happens.
>
> Best,
> Erick
>
> On Mon, Mar 5, 2018 at 11:29 AM, Dominik Safaric
>  wrote:
> >> So, can doc values be persisted with order of values, not document id? 
> This should be fast in sort scenario that the values are pre-ordered instead 
> of scan/sort at runtime.
> >
> >
> > No, unfortunately doc values cannot be persisted in order. Lucene 
> stores these values internally as a DocValuesType.SORTED_SET, where the values 
> are being stored using for example Long.compareTo().
> >
> > If you'd like to retrieve the values in insertion order, use stored 
> fields instead of doc values. Then you might access the values in order 
> using the LeafReader's document function. However, beware that this may induce 
> performance issues because it requires loading the document from disk.
> >
> > If you require to store and retrieve multiple numeric values per 
> document in order, you might consider using PointValues. PointValues are 
> internally indexed with KD-trees. But, beware that PointValues have a limited 
> dimensionality, in terms that you can for example store values in 8 
> dimensions, each of max 16 bytes.
> >
> >> On 5 Mar 2018, at 15:33, Tony Ma  wrote:
> >>
> >> Per my understanding, doc values (binary doc values / numeric doc 
> values) are stored with sequence of document id. Sorted numeric doc values 
> just means if a document has multiple values, the values will be sorted for 
> same document, but for different documents, the value is still ordered by 
> document id. Is that true?
> >> So, can doc values be persisted with order of values, not document id? 
> This should be fast in sort scenario that the values are pre-ordered instead 
> of scan/sort at runtime.
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Is docvalue sorted by value?

2018-03-05 Thread Erick Erickson
I think there are two issues here that are being conflated
1> _within_ a document, i.e. for a multi-valued field the values are
stored as Dominik says as a SORTED_SET. Not only will they be returned
(if you return from docValues rather than stored) in lexical order,
but identical values will be collapsed

2> across multiple documents, the question about  "...persisted with
order of values, not document id..." really makes no sense. The point
of DocValues is to answer the question "for document X what is the
value of field Y". X here is the _internal_ document ID. Now consider
a search. There are two documents that are hits, doc 35 and doc 198
(internal lucene doc ID). To sort them by field Y you have to know
what the value in that field is for those two docs is. How would
"pre-ordering" the values help here? If I have the _values_ in order,
I have no clue what docs are associated with them. That question is
what the "inverted index" is there to answer.

So I have doc 35 and 198. Think of DocValues as a large array indexed
by internal doc id. To know how these two docs sort all I have to do
is index into the array. It's slightly more complicated than that, but
conceptually that's what happens.
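
In current (7.x) Lucene that "array" is exposed as a forward-only iterator
per segment, but the lookup is the same idea. A rough sketch, with the field
name "price" purely illustrative:

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

// Get the "column" for the field once per segment...
static NumericDocValues column(LeafReader leaf) throws IOException {
  return DocValues.getNumeric(leaf, "price");
}

// ...then look up each candidate doc (in increasing doc-id order).
static long sortValue(NumericDocValues column, int segmentDocId) throws IOException {
  return column.advanceExact(segmentDocId) ? column.longValue() : 0L;  // 0L stands in for "missing"
}
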

Best,
Erick

On Mon, Mar 5, 2018 at 11:29 AM, Dominik Safaric
 wrote:
>> So, can doc values be persisted with order of values, not document id? This 
>> should be fast in sort scenario that the values are pre-ordered instead of 
>> scan/sort at runtime.
>
>
> No, unfortunately doc values cannot be persisted in order. Lucene stores these 
> values internally as a DocValuesType.SORTED_SET, where the values are being 
> stored using for example Long.compareTo().
>
> If you'd like to retrieve the values in insertion order, use stored fields 
> instead of doc values. Then you might access the values in order using the 
> LeafReader's document function. However, beware that this may induce performance 
> issues because it requires loading the document from disk.
>
> If you require to store and retrieve multiple numeric values per document in 
> order, you might consider using PointValues. PointValues are internally 
> indexed with KD-trees. But, beware that PointValues have a limited 
> dimensionality, in terms that you can for example store values in 8 
> dimensions, each of max 16 bytes.
>
>> On 5 Mar 2018, at 15:33, Tony Ma  wrote:
>>
>> Per my understanding, doc values (binary doc values / numeric doc values) 
>> are stored with sequence of document id. Sorted numeric doc values just 
>> means if a document has multiple values, the values will be sorted for same 
>> document, but for different documents, the value is still ordered by 
>> document id. Is that true?
>> So, can doc values be persisted with order of values, not document id? This 
>> should be fast in sort scenario that the values are pre-ordered instead of 
>> scan/sort at runtime.
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom Similarity

2018-02-08 Thread Erick Erickson
As of Solr 6.6, payload support has been added to Solr, see:
SOLR-1485. Before that, it was much more difficult, see:
https://lucidworks.com/2014/06/13/end-to-end-payload-example-in-solr/
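
For the Lucene-level half of the question quoted below (folding the three
per-term weights into a score), here is a rough sketch of reading payloads
straight off the postings. It assumes Lucene 7.x, that the field was indexed
with payloads, and that each payload is the literal text "w1|w2|w3"; if a
float-encoding payload filter was used at index time the parsing has to
change accordingly:

import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// SUM over occurrences of one query term in one doc of LOG(w1 + w2 + w3).
static double termScore(LeafReader leaf, String field, String term, int doc) throws IOException {
  Terms terms = leaf.terms(field);
  if (terms == null) return 0;
  TermsEnum te = terms.iterator();
  if (!te.seekExact(new BytesRef(term))) return 0;
  PostingsEnum postings = te.postings(null, PostingsEnum.PAYLOADS);
  if (postings.advance(doc) != doc) return 0;           // term not present in this doc
  double score = 0;
  for (int i = 0; i < postings.freq(); i++) {
    postings.nextPosition();                            // payloads hang off positions
    BytesRef payload = postings.getPayload();
    if (payload == null) continue;
    String[] w = payload.utf8ToString().split("\\|");
    score += Math.log(Double.parseDouble(w[0]) + Double.parseDouble(w[1]) + Double.parseDouble(w[2]));
  }
  return score;
}

Wrapping that in a custom query/Scorer (or a PayloadScoreQuery with a custom
PayloadFunction) is then a matter of where you want to hook it in.
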

Best,
Erick

On Thu, Feb 8, 2018 at 8:36 AM, Ahmet Arslan  wrote:
>
>
> Hi Roy,
>
>
> In order to activate payloads during scoring, you need to do two separate 
> things at the same time:
> * use a payload aware query type: org.apache.lucene.queries.payloads.*
> * use payload aware similarity
>
> Here is an old post that might inspire you :  
> https://lucidworks.com/2009/08/05/getting-started-with-payloads/
>
>
> Ahmet
>
>
>
> On Saturday, January 27, 2018, 5:43:36 PM GMT+3, Dwaipayan Roy 
>  wrote:
>
>
>
>
>
> Thanks for your replies. But still, I am not sure about the way to do the
> thing. Can you please provide me with an example code snippet or, link to
> some page where I can find one?
>
> Thanks..
>
> On Tue, Jan 16, 2018 at 3:28 PM, Dwaipayan Roy 
> wrote:
>
>> I want to make a scoring function that will score the documents by the
>> following function:
>> given Q = {q1, q2, ... }
>> score(D,Q) =
>>for all qi:
>>  SUM of {
>>  LOG { weight_1(qi) + weight_2(qi) + weight_3(qi) }
>>  }
>>
>> I have stored weight_1, weight_2 and weight_3 for all term of all
>> documents as payload, with payload delimiter = | (pipe) during indexing.
>>
>> However, I am not sure on how to integrate all the weights during
>> retrieval. I am sure that I have to @Override some score() but not sure
>> about the exact class.
>>
>> Please help me here.
>>
>> Best,
>> Dwaipayan..
>
>>
>>
>
>
> --
> Dwaipayan Roy.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: indexing performance 6.6 vs 7.1

2018-01-18 Thread Erick Erickson
Robert:

Ah, right. I keep confusing my gmail lists
"lucene dev"
and
"lucene list"

Siiih.
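
For anyone wanting to act on Adrien's suggestion quoted below, enabling the
IndexWriter info stream at the Lucene level is a one-liner; a minimal sketch,
with the analyzer and index path purely illustrative:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.PrintStreamInfoStream;

static IndexWriter verboseWriter() throws java.io.IOException {
  IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
  iwc.setInfoStream(new PrintStreamInfoStream(System.out)); // flush and merge timings land in this log
  return new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), iwc);
}
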



On Thu, Jan 18, 2018 at 9:18 AM, Adrien Grand  wrote:
> If you have sparse data, I would have expected index time to *decrease*,
> not increase.
>
> Can you enable the IW info stream and share flush + merge times to see
> where indexing time goes?
>
> If you can run with a profiler, this might also give useful information.
>
> Le jeu. 18 janv. 2018 à 11:23, Rob Audenaerde  a
> écrit :
>
>> Hi all,
>>
>> We recently upgraded from Lucene 6.6 to 7.1.  We see a significant drop in
>> indexing performance.
>>
>> We have an atypical use of Lucene, as we (also) index some database tables
>> and add all the values as AssociatedFacetFields as well. This allows us to
>> create pivot tables on search results really fast.
>>
>> These tables have some overlapping columns, but also disjoint ones.
>>
>> We anticipated a decrease in index size because of the sparse docvalues. We
>> see this happening, with decreases to ~50%-80% of the original index size.
>> But we did not expect a drop in indexing performance (client systems
>> indexing time increased with +50% to +250%).
>>
>> (Our indexing-speed used to be mainly bound by the speed the Taxonomy could
>> deliver new ordinals for new values, currently we are investigating if this
>> is still the case, will report later when a profiler run has been done)
>>
>> Does anyone know if this increase in indexing time is to be expected as
>> result of the sparse docvalues change?
>>
>> Kind regards,
>>
>> Rob Audenaerde
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: indexing performance 6.6 vs 7.1

2018-01-18 Thread Erick Erickson
My first question is always "are you running the Solr CPUs flat out?".
My guess in this case is that the indexing client is the same and the
problem is in Solr, but it's worth checking whether the clients are
just somehow not delivering docs as fast as they were before.

My suspicion is that the indexing client hasn't changed, but it's
worth checking.

Best,
Erick

On Thu, Jan 18, 2018 at 2:23 AM, Rob Audenaerde
 wrote:
> Hi all,
>
> We recently upgraded from Lucene 6.6 to 7.1.  We see a significant drop in
> indexing performance.
>
> We have an atypical use of Lucene, as we (also) index some database tables
> and add all the values as AssociatedFacetFields as well. This allows us to
> create pivot tables on search results really fast.
>
> These tables have some overlapping columns, but also disjoint ones.
>
> We anticipated a decrease in index size because of the sparse docvalues. We
> see this happening, with decreases to ~50%-80% of the original index size.
> But we did not expect a drop in indexing performance (client systems
> indexing time increased with +50% to +250%).
>
> (Our indexing-speed used to be mainly bound by the speed the Taxonomy could
> deliver new ordinals for new values, currently we are investigating if this
> is still the case, will report later when a profiler run has been done)
>
> Does anyone know if this increase in indexing time is to be expected as
> result of the sparse docvalues change?
>
> Kind regards,
>
> Rob Audenaerde

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Maven snapshots

2018-01-09 Thread Erick Erickson
Maven support is not officially part of the project, it's maintained on a
"when someone interested gets to it" basis.

So the short answer is "no, you shouldn't expect those to be absolutely
current"

contributions welcome ;)

Best,
Erick

On Tue, Jan 9, 2018 at 6:36 AM, Armins Stepanjans 
wrote:

> Hi,
>
> I'm not sure I understand your question.
>
> There should be no confusion about setting a Maven snapshot dependency in
> the pom file, as you can specify version with
> 8.0-SNAPSHOT (substituting 8.0 with the version you
> want).
>
> However, in the case you are looking for a particular version of Lucene,
> you should check out the archives of released versions here:
> http://archive.apache.org/dist/lucene/java/
>
> Is there a particular reason you want the snapshot of 7.2 or 7.3?
>
> Regards,
> Armīns
>
> On Tue, Jan 9, 2018 at 4:13 PM, Terry Smith  wrote:
>
> > Guys,
> >
> > I'm just following up in case this question slipped between the cracks.
> >
> > Should I expect the apache snapshots maven repository to be current for
> > Lucene 7.x and 8? Specifically, I don't see snapshot releases for 7.2 or
> > 7.3 and it looks like the 8.0 snapshot releases are pretty stale.
> >
> > Thanks,
> >
> > --Terry
> >
> >
> >
> >
> > On Fri, Jan 5, 2018 at 11:06 AM, Terry Smith  wrote:
> >
> > > Hi,
> > >
> > > I'm not seeing snapshot releases on the maven repository for 7.2 or
> 7.3.
> > > Is this on purpose?
> > >
> > > https://repository.apache.org/content/groups/snapshots/org/
> > > apache/lucene/lucene-core/
> > >
> > > --Terry
> > >
> > >
> >
>


Re: Comparing two indexes for equality - Finding non stored fieldNames per document

2018-01-02 Thread Erick Erickson
Luke has some capabilities to look at the index at a low level,
perhaps that could give you some pointers. I think you can pull
the older branch from here:
https://github.com/DmitryKey/luke

or:
https://code.google.com/archive/p/luke/

NOTE: This is not a part of Lucene, but an independent project
so it won't have the same labels.
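
For the underlying question (dumping the indexed field names per document),
the segment-walking approach Dawid describes below can be sketched roughly as
follows, assuming Lucene 7.x; joining the per-index doc ids back to the
stored "path" key is left out:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;

// For every document (keyed by top-level doc id), the set of fields with at least one indexed term.
static Map<Integer, SortedSet<String>> indexedFieldsPerDoc(IndexReader reader) throws IOException {
  Map<Integer, SortedSet<String>> result = new HashMap<>();
  for (LeafReaderContext ctx : reader.leaves()) {
    LeafReader leaf = ctx.reader();
    for (FieldInfo fi : leaf.getFieldInfos()) {
      if (fi.getIndexOptions() == IndexOptions.NONE) continue;  // field is not indexed
      Terms terms = leaf.terms(fi.name);
      if (terms == null) continue;
      TermsEnum te = terms.iterator();
      PostingsEnum postings = null;
      while (te.next() != null) {                               // walk every term of this field
        postings = te.postings(postings, PostingsEnum.NONE);
        for (int d = postings.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = postings.nextDoc()) {
          result.computeIfAbsent(ctx.docBase + d, k -> new TreeSet<>()).add(fi.name);
        }
      }
    }
  }
  return result;
}
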

Best,
Erick

On Tue, Jan 2, 2018 at 2:06 AM, Dawid Weiss  wrote:
> Ok. I think you should look at the Java API -- this will give you more
> clarity of what is actually stored in the index
> and how to extract it. The thing (I think) you're missing is that an
> inverted index points in the "other" direction (from a given value to
> all documents that contained it). So unless you "store" that value
> with the document as a stored field, you'll have to "uninvert" the
> index yourself.
>
> Dawid
>
> On Tue, Jan 2, 2018 at 10:05 AM, Chetan Mehrotra
>  wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>
>> Okie. So this would require a deeper understanding of the index format.
>> Would have a look. To start with I was just looking for a way to dump
>> indexed field names per document and nothing more
>>
>> /foo/bar|status, lastModified
>> /foo/baz|status, type
>>
>> Where path is the stored field (primary key) and the rest are the
>> sorted field names. Such a file can then be generated for both indexes
>> and diffed after sorting.
>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>
>> This is because of way indexing logic is given access to the Node
>> hierarchy. Would try to provide a brief explanation
>>
>> Jackrabbit Oak provides a hierarchical storage in a tree form where
>> sub trees can be of specific type.
>>
>> /content/dam/assets/december/banner.png
>>   - jcr:primaryType = "app:Asset"
>>   + jcr:content
>> - jcr:primaryType = "app:AssetContent"
>> + metadata
>>   - status = "published"
>>   - jcr:lastModified = "2009-10-9T21:52:31"
>>   - app:tags = ["properties:orientation/landscape",
>> "marketing:interest/product"]
>>   - comment = "Image for december launch"
>>   - jcr:title = "December Banner"
>>   + xmpMM:History
>> + 1
>>   - softwareAgent = "Adobe Photoshop"
>>   - author = "David"
>> + renditions (nt:folder)
>>   + original (nt:file)
>> + jcr:content
>>   - jcr:data = ...
>>
>> To access this content Oak provides a NodeStore/NodeState api [1]
>> which provides way to access the children. The default indexing logic
>> uses this api to read the content to be indexed and uses index rules
>> which allow to index content via relative path. For e.g. it would
>> create a Lucene field status which maps to
>> jcr:content/metadata/@status (for an index rule for nodes of type
>> app:Asset).
>>
>> This mode of access proved to be slow over remote storage like Mongo
>> especially for the full reindexing case. So we implemented a newer approach
>> where all content was dumped in a flat file (1 node per line) ->
>> sorted file and then have a NodeState impl over this flat file. This
>> changes the way how relative paths work and thus there may be some
>> potential bugs in newer implementation.
>>
>> Hence we need to validate that indexing using new api produces same
>> index as using the stable api. In such a case both indexes would have a
>> document for "/content/dam/assets/december/banner.png", but if the newer
>> impl had some bug then it may not have indexed the "status" field.
>>
>> So I am looking for a way to map all fieldNames for a given
>> document. The actual indexed content would be the same if both indexes have
>> the "status" field indexed, so we only need to validate fieldnames per
>> document. Something like
>>
>> Thanks for reading all this if you have read so far :)
>>
>> Chetan Mehrotra
>> [1] 
>> https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java
>>
>>
>> On Tue, Jan 2, 2018 at 2:10 PM, Dawid Weiss  wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>>
>>> Dawid
>>>
>>
>> -

Re: solr 7.0: What causes the segment to flush

2017-10-17 Thread Erick Erickson
bq:  Is there a way to not write to disk continuously and only write the file...

Not if we're talking about the transaction log. The design is for the
transaction log in particular to continuously get updates flushed to
it, otherwise you could not replay the transaction log upon restart
and have any hope of not losing data.

And one other thing you want to be aware of: Having such long delays
between commits will mean that in the event of ungraceful shut-down
(pull the plug, kill -9 and the like) the _entire_ set of documents
sent since the last time you did a commit will be replayed from the
tlog before the replica accepts incoming requests.

As for <2> I haven't a clue.

Best,
Erick

On Tue, Oct 17, 2017 at 8:40 AM, Nawab Zada Asad Iqbal  wrote:
> I take back my comment from yesterday. I assumed that the file being written
> is a segment; however, after letting Solr run for the night, I see that the
> segment is flushed at the expected size: 1945MB (so that file which I
> observed was still open for writing).
> Now, I have two other questions:-
>
> 1. Is there a way to not write to disk continuously and only write the file
> when segment is flushed?
>
> 2. With 6.5: i had ramBufferSizeMB=20G and limiting the threadCount to 12
> (since LUCENE-6659
> ,
> there is no configuration for indexing thread count, so I did a local
> workaround to limit the number of threads in code); I had very good write
> throughput. But with 7.0, I am getting comparable throughput only at
> indexing threadcount > 50. What could be wrong ?
>
>
> Thanks @Erick, I checked the commit settings, both soft and hard commits
> are off.
>
>
>
>
> On Tue, Oct 17, 2017 at 3:47 AM, Amrit Sarkar 
> wrote:
>
>> >
>> > In 7.0, i am finding that the file is written to disk very early on
>> > and it is being updated every second or so. Had something changed in 7.0
>> > which is causing it?  I tried something similar with solr 6.5 and i was
>> > able to get almost a GB size files on disk.
>>
>>
>> Interesting observation, Nawab, with ramBufferSizeMB=20G, you are getting
>> 20GB segments on 6.5 or less? a GB?
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Tue, Oct 17, 2017 at 12:48 PM, Nawab Zada Asad Iqbal 
>> wrote:
>>
>> > Hi,
>> >
>> > I have  tuned  (or tried to tune) my settings to only flush the segment
>> > when it has reached its maximum size. At the moment,I am using my
>> > application with only a couple of threads (i have limited to one thread
>> for
>> > analyzing this scenario) and my ramBufferSizeMB=2 (i.e. ~20GB). With
>> > this, I assumed that my file sizes on the disk will be at in the order of
>> > GB; and no segments will be flushed until the segment's in memory size is
>> > 2GB. In 7.0, i am finding that the file is written to disk very early on
>> > and it is being updated every second or so. Had something changed in 7.0
>> > which is causing it?  I tried something similar with solr 6.5 and i was
>> > able to get almost a GB size files on disk.
>> >
>> > How can I control it to not write to disk until the segment has reached
>> its
>> > maximum permitted size (1945 MB?) ? My write traffic is 'new only' (i.e.,
>> > it doesn't delete any document) , however I also found following
>> infostream
>> > logs, which incorrectly say 'delete=true':
>> >
>> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-887) [   x:filesearch]
>> > o.a.s.c.S.Request [filesearch]  webapp=/solr path=/update
>> > params={commit=false} status=0 QTime=21
>> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-889) [   x:filesearch]
>> > o.a.s.u.LoggingInfoStream [DW][qtp761960786-889]: anyChanges?
>> > numDocsInRam=4434 deletes=true hasTickets:false
>> pendingChangesInFullFlush:
>> > false
>> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-889) [   x:filesearch]
>> > o.a.s.u.LoggingInfoStream [IW][qtp761960786-889]: nrtIsCurrent:
>> infoVersion
>> > matches: false; DW changes: true; BD changes: false
>> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-889) [   x:filesearch]
>> > o.a.s.c.S.Request [filesearch]  webapp=/solr path=/admin/luke
>> > params={show=index&numTerms=0&wt=json} status=0 QTime=0
>> >
>> >
>> >
>> > Thanks
>> > Nawab
>> >
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: run in eclipse error

2017-10-17 Thread Erick Erickson
Anyone can raise a JIRA and submit a patch, it's then up to one of the
committers to pick it up and commit to the code lines. You have to
create an ID of course.

See: https://issues.apache.org/jira/
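
To make the pattern Uwe describes below concrete, here is the shape of the
problem and of the fix, with throwaway class names (Outer/Helper are
hypothetical, not real Solr classes):

// Outer.java -- the anti-pattern is a second top-level class declared beside the public one:
//
//   public class Outer { ... }
//   class Helper { ... }   // package-private sibling; tools can't reliably map Helper.class back to Outer.java
//
// The fix is to nest it (or give it its own source file):
public class Outer {
  static class Helper {
  }
}
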

On Tue, Oct 17, 2017 at 5:04 AM, Mike Sokolov  wrote:
> Checkstyle has a onetoplevelclass rule that would enforce this
>
> On October 17, 2017 3:45:01 AM EDT, Uwe Schindler  wrote:
>>Hi,
>>
>>this has nothing to do with the Java version. I generally ignore this
>>Eclipse-failure as I only develop in Eclipse, but run from command
>>line. The reason for this behaviour is a problem with Eclipse's
>>resource management/compiler with the way how some classes in Solr
>>(especially facet component) are setup.
>>
>>In general, it is nowadays a no-go to have so called "non-inner"
>>pkg-private classes. These are classes which share the same source code
>>file, but are not nested in the main class. Instead they appear next to
>>each other in the source file. This is a relic from Java 1.0 and should
>>really no longer be used!
>>
>>Unfortunately some Solr developers still create such non-nested
>>classes. Whenever I see them I change them to be static inner classes.
>>The problem with the bug caused by this is that Eclipse randomly fails
>>(it depends on the order how it compiles). The problem is that Eclipse
>>(but also other tools) cannot relate the non-inner class file to a
>>source file and therefore cannot figure out when it needs to be
>>recompiled.
>>
>>BTW. The same problem applies to other build system like javac and Ant
>>when it needs to compile. When you change such a non-nested
>>class, it fails to compile in most cases unless you do "ant
>>clean". The problem is again, that the compiler cannot relate the class
>>files to source code files!
>>
>>We should really fix those classes to be static and inner - or place
>>them in separate source files. I am looking to find a solution to
>>detect this with forbiddenapis or our Source Code Regexes, if anybody
>>has an idea: tell me!
>>
>>Uwe
>>
>>-
>>Uwe Schindler
>>Achterdiek 19, D-28357 Bremen
>>http://www.thetaphi.de
>>eMail: u...@thetaphi.de
>>
>>> -Original Message-
>>> From: 380382...@qq.com [mailto:380382...@qq.com]
>>> Sent: Tuesday, October 17, 2017 4:43 AM
>>> To: java-user 
>>> Subject: run in eclipse error
>>>
>>> i am trying to run solr in eclipse. but got the error "The type
>>> FacetDoubleMerger is already defined". i don't know why. Whether it
>>is jdk
>>> version wrong?
>>> Does git master need to use java9 for development?
>>>
>>>
>>> 380382...@qq.com
>>
>>
>>-
>>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: run in eclipse error

2017-10-16 Thread Erick Erickson
bq: Does git master need to use java9 for development

i can at least answer that with "no". Java8 is the current standard for master.

No clue what's going on with Eclipse though, I use IntelliJ

That class is part of Solr so Java 9 is probably not germane.

Best,
Erick

On Mon, Oct 16, 2017 at 7:42 PM, 380382...@qq.com <380382...@qq.com> wrote:
> i am trying to run solr in eclipse. but got the error "The type 
> FacetDoubleMerger is already defined". i don't know why. Whether it is jdk 
> version wrong?
> Does git master need to use java9 for development?
>
>
> 380382...@qq.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom Query & reading plongs used by a custom Scorer

2017-10-06 Thread Erick Erickson
docValues are the first thing I'd look at. What you've done is an
anti-pattern for scoring because it reads the stored data from disk
and decompresses it to read the value; as you say, costly.

Getting it from a docValues field, OTOH, will read the value(s)
directly from MMapDirectory space, i.e. the OSs memory space. As an
aside, this is why Streaming only works with DV fields.

Two cautions though in terms of differences between DV and stored when
you have more than 1 term. The underlying structure is a sorted set,
therefore:
1> the contents are ordered by "natural" order rather than insertion order.
2> multiple identical values are collapsed into a single value.

So storing 1, 3, 99, 4, 4, 4, 4, 2, 3 will be returned a s 1, 2, 3, 4, 99

And another caution: you'll have to re-index completely when you add
docValues=true to your field definition, I'd start with a new
collection.
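
A rough sketch of the docValues read path for this case, assuming Lucene 7.x
iterator-style doc values and a multi-valued long field named "weights" (the
name and the per-value math are placeholders; in a real Scorer you would hold
one iterator per leaf and only ever advance it forward):

import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedNumericDocValues;

static double customScore(LeafReaderContext context, int docId) throws IOException {
  SortedNumericDocValues weights = DocValues.getSortedNumeric(context.reader(), "weights");
  double score = 0;
  if (weights.advanceExact(docId)) {
    int count = weights.docValueCount();
    for (int i = 0; i < count; i++) {
      long v = weights.nextValue(); // ascending value order, not insertion order (see cautions above)
      score += v;                   // stand-in for the real per-value scoring math
    }
  }
  return score;
}
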

Best,
Erick

On Fri, Oct 6, 2017 at 7:32 AM, Dominik Safaric
 wrote:
> I've implemented a custom Query whose responsibilities are as follows.
> First, using an instance of a PointValues.IntersectVisitor classifying
> documents as hit or not using a plong value. Secondly, calculating custom
> scores using another document field, specified in the mapping as plongs.
> The later is expected to calculate the custom score using an array of longs
> comprised of 46 values.
>
> The problem I am having is performance wise. Namely for calculating the
> custom score I'm retrieving the values of the field using
> LeafReader.document(docId()) which is a costly process. What alternatives
> are there for reading plongs using a LeafReader and DocIdSetIterator within
> a custom Scorer implementation?
>
> Thanks in advance.
> Dominik

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: solr7.0.1: TestControlledRealTimeReopenThread stalled forever

2017-10-03 Thread Erick Erickson
Whew!

Thanks for letting us know.

Erick

On Tue, Oct 3, 2017 at 1:12 PM, Nawab Zada Asad Iqbal  wrote:
> Actually, it seems that one of my local changes is causing the halting
> issue. I am debugging it now. Sorry for noise.
>
> On Tue, Oct 3, 2017 at 12:08 PM, Nawab Zada Asad Iqbal 
> wrote:
>
>> Hi,
>>
>> I am using solr 7.0.1 and following lucene tests seem to run forever. Is
>> anyone else seeing this problem ?
>>
>> Thanks
>> Nawab
>>
>>
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:20:56,
>> stalled for 67.1s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:21:32,
>> stalled for  127s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J2 PID(57732@localhost): 2017-10-03T11:21:44,
>> stalled for 70.5s at: TestDocValuesIndexing.testMixedTypesDifferentThreads
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:21:44,
>> stalled for 71.2s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:21:56,
>> stalled for  127s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:22:32,
>> stalled for  187s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:22:44,
>> stalled for  131s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J2 PID(57732@localhost): 2017-10-03T11:22:44,
>> stalled for  130s at: TestDocValuesIndexing.testMixedTypesDifferentThreads
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:22:56,
>> stalled for  187s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:23:32,
>> stalled for  247s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:23:44,
>> stalled for  191s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J2 PID(57732@localhost): 2017-10-03T11:23:44,
>> stalled for  191s at: TestDocValuesIndexing.testMixedTypesDifferentThreads
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:23:56,
>> stalled for  247s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:24:33,
>> stalled for  307s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J2 PID(57732@localhost): 2017-10-03T11:24:45,
>> stalled for  251s at: TestDocValuesIndexing.testMixedTypesDifferentThreads
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:24:45,
>> stalled for  251s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:24:57,
>> stalled for  307s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:25:33,
>> stalled for  367s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J2 PID(57732@localhost): 2017-10-03T11:25:45,
>> stalled for  311s at: TestDocValuesIndexing.testMixedTypesDifferentThreads
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:25:45,
>> stalled for  311s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:25:57,
>> stalled for  367s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:26:33,
>> stalled for  427s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J2 PID(57732@localhost): 2017-10-03T11:26:45,
>> stalled for  371s at: TestDocValuesIndexing.testMixedTypesDifferentThreads
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:26:45,
>> stalled for  371s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:26:57,
>> stalled for  427s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:27:33,
>> stalled for  487s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J2 PID(57732@localhost): 2017-10-03T11:27:45,
>> stalled for  431s at: TestDocValuesIndexing.testMixedTypesDifferentThreads
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:27:45,
>> stalled for  431s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J1 PID(57733@localhost): 2017-10-03T11:27:57,
>> stalled for  487s at: TestControlledRealTimeReopenThread.
>> testControlledRealTimeReopenThread
>>[junit4] HEARTBEAT J3 PID(57731@localhost): 2017-10-03T11:28:33,
>> stalled for  547s at: TestIndexManyDocuments.test
>>[junit4] HEARTBEAT J0 PID(57730@localhost): 2017-10-03T11:28:45,
>> stalled for  491s at: TestIndexWriterCommit.testCommitThreadSafety
>>[junit4] HEARTBEAT J2

Re: Still using lucene 2.3, is compatible with java 8?

2017-09-16 Thread Erick Erickson
I doubt anyone has tested it. I'd compile it under Java 8 and see if
all of the tests run.

Best,
Erick

On Sat, Sep 16, 2017 at 7:41 AM, Lisheng Zhang  wrote:
> Hi, in one of our product we are still using lucene 2.3, is lucene 2.3
> compatible with java 1.8?
>
> Thanks very much for helps, Lisheng

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Need to unsub from lucene groups.

2017-09-10 Thread Erick Erickson
See: http://lucene.apache.org/solr/community.html, the "unsubscribe"
section. If you have problems, look at the "Problems" link. Note, you
_must_ use the exact same e-mail you originally subscribed with.

Best,
Erick

On Sun, Sep 10, 2017 at 8:51 AM, Khurram Shehzad
 wrote:
> Hi,
>
> Please someone tell me how to unsubscribe from
>
>
> java-user@lucene.apache.org
>
>
> I tried to email on java-user-unsubscr...@lucene.apache.org several time but 
> no use.
>
>
> Regards,
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Encryption at lucene index

2017-08-11 Thread Erick Erickson
Encrypting the _tokens_ inevitably leads to reduced capabilities BTW.
Trivial example:
I have these tokens in my index
run
runner
running
runs

Any non-trivial encryption algorithm will not encrypt the first three
letters "run" identically across those tokens, so searching for run* simply
won't work.
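
Purely as an illustration of the point (not a recommendation), a sketch of
the kind of analyzer-chain filter being discussed; the Cipher is assumed to
be deterministic and initialized elsewhere, and the hex encoding is an
arbitrary choice:

import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Replaces every token with its (deterministic) ciphertext. Exact matches still work,
// but run/runner/running no longer share a prefix afterwards, so run* and other
// wildcard, prefix and range queries stop matching anything useful.
public final class EncryptTokenFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final Cipher cipher;  // assumed initialized elsewhere with a fixed key

  public EncryptTokenFilter(TokenStream input, Cipher cipher) {
    super(input);
    this.cipher = cipher;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    try {
      byte[] clear = new String(termAtt.buffer(), 0, termAtt.length()).getBytes(StandardCharsets.UTF_8);
      String hex = new BigInteger(1, cipher.doFinal(clear)).toString(16);
      termAtt.setEmpty().append(hex);
    } catch (GeneralSecurityException e) {
      throw new IOException(e);
    }
    return true;
  }
}
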

As you can see, there's quite a bit of back-and-forth with that JIRA
and it has pretty much been abandoned.

Best,
Erick

On Thu, Aug 10, 2017 at 11:17 PM, Kumaran Ramasubramanian
 wrote:
> Hi Ishan, thank you :-)
>
> -
> -
> Kumaran R
>
>
>
> On Mon, Aug 7, 2017 at 10:53 PM, Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
>> Harry Ochiai (Hitachi) has some index encryption solution,
>> https://www.slideshare.net/maggon/securing-solr-search-data-in-the-cloud
>> I think it is proprietary, but I'm not sure. Maybe more googling might help
>> find the exact page where his solution is described.
>>
>> On Mon, Aug 7, 2017 at 9:59 PM, Kumaran Ramasubramanian <
>> kums@gmail.com>
>> wrote:
>>
>> > Hi Erick, i want to encrypt some fields of an document which has personal
>> > identifiable information ( both indexed and stored data)... for eg:
>> email,
>> > mobilenumber etc.. i am able to find LUCENE-6966 alone while googling
>> it..
>> > any related pointers in solr or latest lucene version?
>> >
>> >
>> > -
>> > -
>> > Kumaran R
>> >
>> > On Mon, Aug 7, 2017 at 9:52 PM, Erick Erickson 
>> > wrote:
>> >
>> > > No, since you haven't defined what you want to encrypt, what your
>> > > requirements are, what you hope to get out of "encryption" etc.
>> > >
>> > > Put the index on an encrypting filesystem and forget about it if you
>> > > possibly can, because anything else is a significant amount of work.
>> > > To encrypt the searchable tokens on a per-user basis in memory is a
>> > > _lot_ of work. It depends on your security needs.
>> > >
>> > > Otherwise, as I said, please ask specific questions as the topic is
>> > > quite large, much too large to conduct a seminar through the user's
>> > > list.
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Mon, Aug 7, 2017 at 9:07 AM, Kumaran Ramasubramanian
>> > >  wrote:
>> > > > Hi Erick,
>> > > >
>> > > > Thanks for the information. Any pointers about encryption options
>> > in
>> > > > solr?
>> > > >
>> > > >
>> > > > --
>> > > > Kumaran R
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Aug 7, 2017 at 9:17 PM, Erick Erickson <
>> > erickerick...@gmail.com>
>> > > > wrote:
>> > > >
>> > > >> Encryption in Solr has a bunch of ramifications. Do you care about
>> > > >>
>> > > >> - encryption at rest or in memory?
>> > > >> - encrypting the _searchable_ tokens?
>> > > >> - encrypting the searchable tokens per-user?
>> > > >> - encrypting the stored data (which a filter won't do BTW).
>> > > >>
>> > > >> It's actually a fairly complex topic the discussion at LUCENE-6966
>> > > >> outlines much of it. Please ask specific questions as you research
>> the
>> > > >> topic. One  per-user encryption package that I know of is by Hitachi
>> > > >> Solutions (commercial) and it explicitly does _not_ support, for
>> > > >> instance, wildcards (there are other limitations too). See:
>> > > >> http://www.hitachi-solutions.com/securesearch/
>> > > >>
>> > > >> Most of the time when people ask for encryption they soon discover
>> > > >> it's much more difficult than they imagine and settle for just
>> putting
>> > > >> the indexes on an encrypting file system. When they move beyond that
>> > > >> it gets complex and you'd be well advised to consult with Solr
>> > > >> security experts.
>> > > >>
>> > > >> Best,
>> > > >> Erick
>> > > >>
>> > > >> On Sun, Aug 6, 2017 at 11:30 PM, Kumaran Ramasubramanian
>> > > >>  wrote:
>> > > >> > Hi All,
>> > > >> >
>> > > 

Re: Encryption at lucene index

2017-08-07 Thread Erick Erickson
No, since you haven't defined what you want to encrypt, what your
requirements are, what you hope to get out of "encryption" etc.

Put the index on an encrypting filesystem and forget about it if you
possibly can, because anything else is a significant amount of work.
To encrypt the searchable tokens on a per-user basis in memory is a
_lot_ of work. It depends on your security needs.

Otherwise, as I said, please ask specific questions as the topic is
quite large, much too large to conduct a seminar through the user's
list.

Best,
Erick

On Mon, Aug 7, 2017 at 9:07 AM, Kumaran Ramasubramanian
 wrote:
> Hi Erick,
>
> Thanks for the information. Any pointers about encryption options in
> solr?
>
>
> --
> Kumaran R
>
>
>
> On Mon, Aug 7, 2017 at 9:17 PM, Erick Erickson 
> wrote:
>
>> Encryption in Solr has a bunch of ramifications. Do you care about
>>
>> - encryption at rest or in memory?
>> - encrypting the _searchable_ tokens?
>> - encrypting the searchable tokens per-user?
>> - encrypting the stored data (which a filter won't do BTW).
>>
>> It's actually a fairly complex topic; the discussion at LUCENE-6966
>> outlines much of it. Please ask specific questions as you research the
>> topic. One  per-user encryption package that I know of is by Hitachi
>> Solutions (commercial) and it explicitly does _not_ support, for
>> instance, wildcards (there are other limitations too). See:
>> http://www.hitachi-solutions.com/securesearch/
>>
>> Most of the time when people ask for encryption they soon discover
>> it's much more difficult than they imagine and settle for just putting
>> the indexes on an encrypting file system. When they move beyond that
>> it gets complex and you'd be well advised to consult with Solr
>> security experts.
>>
>> Best,
>> Erick
>>
>> On Sun, Aug 6, 2017 at 11:30 PM, Kumaran Ramasubramanian
>>  wrote:
>> > Hi All,
>> >
>> >
>> > After looking at all below discussions, i have one doubt which may be
>> silly
>> > or novice but i want to throw this to lucene user list.
>> >
>> > if we have encryption layer included in our analyzer's flow of filters
>> like
>> > EncryptionFilter to control field-level encryption. what are the
>> > consequences ? am i missing anything basic?
>> >
>> > Thanks in advance..
>> >
>> >
>> > Related links:
>> >
>> > https://issues.apache.org/jira/browse/LUCENE-2228 : AES Encrypted
>> Directory
>> > - in lucene 3.x
>> >
>> > https://issues.apache.org/jira/browse/LUCENE-6966 :  Codec for
>> index-level
>> > encryption - at codec level, to have control on which column / field have
>> >  personal identifiable information
>> >
>> > https://security.stackexchange.com/questions/53/is-a-lucene-search-
>> index-effectively-a-backdoor-for-field-level-encryption
>> >
>> >
>> > A decent encrypting algorithm will not produce, say, the same first
>> portion
>> >> for two tokens that start with the same letters. So wildcard searches
>> won't
>> >> work. Consider "runs", "running", "runner". A search on "run*" would be
>> >> expected to match all three, but wouldn't unless the encryption were so
>> >> trivial as to be useless. Similar issues arise with sorting. "More Like
>> >> This" would be unreliable. There are many other features of a robust
>> search
>> >> engine that would be impacted, and an index with encrypted terms would
>> be
>> >> useful for only exact matches, which usually results in a poor search
>> >> experience.
>> >
>> >
>> > https://stackoverflow.com/questions/36604551/adding-
>> encryption-to-solr-lucene-indexes
>> >
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Kumaran R
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Encryption at lucene index

2017-08-07 Thread Erick Erickson
Encryption in Solr has a bunch of ramifications. Do you care about

- encryption at rest or in memory?
- encrypting the _searchable_ tokens?
- encrypting the searchable tokens per-user?
- encrypting the stored data (which a filter won't do BTW).

It's actually a fairly complex topic; the discussion at LUCENE-6966
outlines much of it. Please ask specific questions as you research the
topic. One  per-user encryption package that I know of is by Hitachi
Solutions (commercial) and it explicitly does _not_ support, for
instance, wildcards (there are other limitations too). See:
http://www.hitachi-solutions.com/securesearch/

Most of the time when people ask for encryption they soon discover
it's much more difficult than they imagine and settle for just putting
the indexes on an encrypting file system. When they move beyond that
it gets complex and you'd be well advised to consult with Solr
security experts.

Best,
Erick

On Sun, Aug 6, 2017 at 11:30 PM, Kumaran Ramasubramanian
 wrote:
> Hi All,
>
>
> After looking at all below discussions, i have one doubt which may be silly
> or novice but i want to throw this to lucene user list.
>
> if we have encryption layer included in our analyzer's flow of filters like
> EncryptionFilter to control field-level encryption. what are the
> consequences ? am i missing anything basic?
>
> Thanks in advance..
>
>
> Related links:
>
> https://issues.apache.org/jira/browse/LUCENE-2228 : AES Encrypted Directory
> - in lucene 3.x
>
> https://issues.apache.org/jira/browse/LUCENE-6966 :  Codec for index-level
> encryption - at codec level, to have control on which column / field have
>  personal identifiable information
>
> https://security.stackexchange.com/questions/53/is-a-lucene-search-index-effectively-a-backdoor-for-field-level-encryption
>
>
> A decent encrypting algorithm will not produce, say, the same first portion
>> for two tokens that start with the same letters. So wildcard searches won't
>> work. Consider "runs", "running", "runner". A search on "run*" would be
>> expected to match all three, but wouldn't unless the encryption were so
>> trivial as to be useless. Similar issues arise with sorting. "More Like
>> This" would be unreliable. There are many other features of a robust search
>> engine that would be impacted, and an index with encrypted terms would be
>> useful for only exact matches, which usually results in a poor search
>> experience.
>
>
> https://stackoverflow.com/questions/36604551/adding-encryption-to-solr-lucene-indexes
>
>
>
>
>
>
> --
> Kumaran R

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: 答复: local variable name question

2017-08-06 Thread Erick Erickson
Usually there's no very good reason; it's just that with a bunch of people
who have more or less time, more or less pressure, and different habits, the
variable names end up reflecting how whoever wrote them was thinking about
the issue at the time.

Generally when working on a bit of code if the names are confusing whoever
picks it up next can rename them.

Best,
Erick

On Sun, Aug 6, 2017 at 2:16 AM, 马可阳  wrote:

> And this:
>
>  Similarity.SimWeight stats
>
>
>
> I bet there are more these things. Just out of curiosity.
>
>
>
>
>
>
>
>
>
>
> *From:* 马可阳
> *Sent:* August 6, 2017, 17:14
> *To:* 'java-user@lucene.apache.org'
> *Subject:* local variable name question
>
>
>
> In code I can see this:
>
>  final TermContext termState
>
>
>
> while it is an instance of TermContext, why not name it termContext rather
> than termState? If termState is the more descriptive name, why not change the
> class name to TermState?
>
>
>
>
>
>
>
>
>
>


Re: Lucene 6.6: "Too many open files"

2017-07-31 Thread Erick Erickson
No, nothing's changed fundamentally. But you say:

"We have some batch indexing scripts, which
flood the solr servers with indexing requests (while keeping open-searcher
false)"

What is your commit interval? Regardless of whether openSearcher is false
or not, background merging continues apace with every commit. By any chance
did you change your merge policy (or not copy the one from 4x to 6x)? Shot
in the dark...

Best,
Erick

On Mon, Jul 31, 2017 at 7:15 PM, Nawab Zada Asad Iqbal  wrote:
> Hi,
>
> I am upgrading from solr4.5 to solr6.6 and hitting this issue during
> complete reindexing scenario.  We have some batch indexing scripts, which
> flood the solr servers with indexing requests (while keeping open-searcher
> false) for many hours and then perform one commit. This used to work fine
> with 4.5, but with 6.6, i get 'Too many open files' within a couple of
> minutes. I have checked that "ulimit" is same between old and new servers.
>
> Has something fundamentally changed in recent lucene versions, which keeps
> file descriptors around for a longer time?
>
>
> Here is a sample error message:
> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:749)
> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:763)
> at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3206)
> at
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:644)
> at
> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:93)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:68)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1894)
> at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1871)
> at
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:160)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:68)
> at
> org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:68)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:62)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)

Re: Maintaining sorting order (stored fields vs DocValue fields) while upgrading Lucene version

2017-06-29 Thread Erick Erickson
1>  Is it correct that stored fields can only be sorted on if they become a
DocValue field in 5.x

no. Indexed-only fields can still be used to sort. DocValues are just more
efficient at load time and don't consume as much of the Java heap.
Essentially the latter can be thought of as moving the "uninverted"
structure from heap to MMap space.

That said, I can't think of any _good_ reason to continue to sort on
indexed="true" docValues="false" fields. Use DocValues.

2> When "updating" stored fields to DocValue fields , is it required to
update all documents in the index at the same time?

Yes. I'm assuming here you're talking about changing the schema definition
to include docValues="true". In general I advocate re-indexing everything
when upgrading major versions. Technically, if you want to do some
"interesting" things with low-level Lucene you can upgrade your index; Uwe
Schindler outlined the process. I copied what he said but don't understand
it ;).

I've seen some situations where people will define a _new_ field with both,
gradually re-index and when all the docs have been updated switch to using
the new field. That assumes that it's just impossible to reindex all at
once.

The question I have to ask... Why upgrade just to 5x? Solr is releasing 7.0
very shortly. I can't think of a really good reason not to jump to 6x
unless you have heavy customizations and the like. Even in that case you'll
have to upgrade eventually. And if you wind up re-indexing everything
anyway, it seems like stopping at 5x is unnecessary.

Best,
Erick

On Thu, Jun 29, 2017 at 6:45 PM, Florian Buetow 
wrote:

>
> Hi,
>
>
>
> I am in the process of updating a large index from Lucene 4.x to 5.x and
> have two questions related to the sorting order.
>
>
>
> 1. Is it correct that stored fields can only be sorted on if they become a
> DocValue field in 5.x?
>
> 2. When "updating" stored fields to DocValue fields , is it required to
> update all documents in the index at the same time?
>
>
>
> Thank you in advance for your help.
>
>
>
> Best regards
>
> Florian
>
>
>
>

