Re: [jira] [Commented] (LUCENE-8635) Lazy loading Lucene FST offheap using mmap

2019-01-16 Thread Michael Sokolov
I used the wikimedium2m data set for the second set of tests (the first test was on a tiny index - 10k docs) -- at least I think I did! I am kind of new to the benchmarking game. I ran the benchmarks with python src/python/localrun.py -source wikimedium2m, and I can see that the index dir is 861M.

Re: Congratulations to the new Lucene/Solr PMC chair, Cassandra Targett

2018-12-31 Thread Michael Sokolov
Heavy is the head that wears the crown - congrats and thank you! And here's to a peaceful transition of power in the new year :) On Mon, Dec 31, 2018 at 1:39 PM Dawid Weiss wrote: > > Congratulations, Cassandra! > > On Mon, Dec 31, 2018 at 7:04 PM Gus Heck wrote: > > > > Congratulations :) > > >

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-20 Thread Michael Sokolov
This is a great idea. It would also be compelling to modify the term frequency using this deboosting so that stacked indexed terms can be weighted according to their closeness to the original term. On Tue, Nov 20, 2018, 2:19 PM jim ferenczi Sorry for the late reply, > > > So perhaps one way forwa

Re: [GitHub] lucene-solr issue #500: LUCENE-8517: do not wrap FixedShingleFilter with con...

2018-11-19 Thread Michael Sokolov
Oh! got it - We run our tests and other release machinery etc against a single JDK, and it is currently Java 8. I will precommit with Java 8 then. Presumably at some future date JDK11 becomes the system of record? Historically how long have we waited after a new Java release before shifting over? O

Re: [jira] [Commented] (LUCENE-8548) Reevaluate scripts boundary break in Nori's tokenizer

2018-10-26 Thread Michael Sokolov
I agree w/Robert: let's not reinvent solutions to problems that are solved elsewhere. In an ideal world, wouldn't you want to be able to delegate tokenization of latin script portions to StandardTokenizer? I know that's not possible today, and I wouldn't derail the work here to try to make it happen since it wo

Re: [jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-26 Thread Michael Sokolov
In case it wasn't clear, I am +1 for Alan's plan. We can always restore offset-alterations here if at some future date we figure out how to do it correctly. On Fri, Oct 26, 2018 at 6:08 AM Michael Sokolov wrote: > The current situation is that it is impossible to apply offsets cor

Re: [jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2018-10-26 Thread Michael Sokolov
The current situation is that it is impossible to apply offsets correctly in a TokenFilter. It seems to work OK most of the time, but truly correct behavior relies on prior components in the chain not having altered the length of tokens, which some of them occasionally do. For complete correctness
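
The failure mode described above can be illustrated with a toy model. This is only a sketch and not Lucene's API: `Token`, `trim_filter`, and `split_filter` are hypothetical stand-ins for a token carrying offsets, a filter that shortens the term, and a later filter that splits it. Once an earlier filter changes the term's length without adjusting offsets, a later filter that derives sub-token offsets from the term text points at the wrong characters.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str   # current term text (may have been altered by filters)
    start: int  # offset of the token's first character in the original text
    end: int    # offset one past the token's last character


def trim_filter(tok):
    """Hypothetical filter: strips whitespace from the term, keeps offsets as-is."""
    return Token(tok.term.strip(), tok.start, tok.end)


def split_filter(tok, sep="-"):
    """Hypothetical filter: splits the term and derives offsets from tok.start.

    This is only correct if the term still has the same length and
    alignment as the original text span it came from.
    """
    parts, pos = [], 0
    for piece in tok.term.split(sep):
        parts.append(Token(piece, tok.start + pos, tok.start + pos + len(piece)))
        pos += len(piece) + len(sep)
    return parts


text = "  wi-fi "
parts = split_filter(trim_filter(Token(text, 0, len(text))))
for p in parts:
    # Compare each sub-token's term with the text its offsets actually cover.
    print(repr(p.term), "->", repr(text[p.start:p.end]))
```

In this sketch the sub-token terms are right, but their offsets no longer cover the matching characters in the original text, because the leading whitespace removed by the first filter was never accounted for.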

Re: Does ConcurrentMergeScheduler actually do smaller merges first?

2018-10-10 Thread Michael Sokolov
If maxMergeCount was 2, you could get into a situation with three large merges I think; the largest would be paused, but the others could still take > 10 mins to complete. Are you sure that your observation is at odds with what the document says the scheduler is doing? On Wed, Oct 10, 2018 at 2:28

Re: [jira] [Commented] (LUCENE-8516) Make WordDelimiterGraphFilter a Tokenizer

2018-09-30 Thread Michael Sokolov
My current usage of this filter requires it to be a filter, since I need to precede it with other filters. I think the idea of not touching offsets preserves more flexibility, and since the offsets are already unreliable, we wouldn't be losing much. On Sun, Sep 30, 2018, 11:32 AM Alan Woodward (JI

Re: Closing a JIRA issue

2018-08-31 Thread Michael Sokolov
iven you that role, Michael, please see if you see > the Resolve button now. > > Cassandra > > On Fri, Aug 31, 2018 at 11:09 AM Uwe Schindler wrote: > >> Hi, >> >> When back in office, I will check the project roles of Lucene and Solr >> Jira projects. &

Re: Closing a JIRA issue

2018-08-31 Thread Michael Sokolov
> So, if you do not see it, the permissions may be in play. I will leave > the issue as is, to let the discrepancy to be figured out. > > Regards, > Alex. > > On 29 August 2018 at 15:56, Michael Sokolov wrote: > > This old issue was still assigned to me: > > https:/

Closing a JIRA issue

2018-08-29 Thread Michael Sokolov
This old issue was still assigned to me: https://issues.apache.org/jira/browse/LUCENE-3318. I had worked on it seven years ago, but it is no longer relevant today, and I'd like to close it, but I don't see any UI affordance for doing that in JIRA. Am I missing permissions? Is the issue in some weir

Re: javadoc linting on JDK10+

2018-08-29 Thread Michael Sokolov
Michael Sokolov wrote: > I am trying to run ant precommit (on master) and it fails for me with this > message: > > -ecj-javadoc-lint-unsupported: > > BUILD FAILED > /home/ > ANT.AMAZON.COM/sokolovm/workspace/lbench/lucene_baseline/lucene/common-build.xml:2076: > Lintin

javadoc linting on JDK10+

2018-08-29 Thread Michael Sokolov
I am trying to run ant precommit (on master) and it fails for me with this message: -ecj-javadoc-lint-unsupported: BUILD FAILED /home/ANT.AMAZON.COM/sokolovm/workspace/lbench/lucene_baseline/lucene/common-build.xml:2076: Linting documentation with ECJ is not supported on this Java version (unkno

Re: benchmark drop for PrimaryKey

2018-08-24 Thread Michael Sokolov
ke into account the fact that the default > codec changed. However, I did not add backward-codecs.jar to the classpath, > you should rebuild the index that you use for benchmarking so that it uses > the Lucene80 codec instead of Lucene70. > > Le ven. 24 août 2018 à 02:03, Michael Sokolov a

Re: benchmark drop for PrimaryKey

2018-08-23 Thread Michael Sokolov
@ def run(): - idFieldPostingsFormat='Lucene50', + idFieldPostingsFormat='FST50', On Thu, Aug 23, 2018 at 5:52 PM Michael Sokolov wrote: > OK thanks. I guess this benchmark must be run on a large-enough

Re: benchmark drop for PrimaryKey

2018-08-23 Thread Michael Sokolov
OK thanks. I guess this benchmark must be run on a large-enough index that it doesn't fit entirely in RAM already anyway? When I ran it locally using the vanilla benchmark instructions, I believe the generated index was quite small (wikimedium10k). At any rate, I don't have any specific use case y

LUCENE-765

2018-08-23 Thread Michael Sokolov
Can I interest someone in reviewing my patch for https://issues.apache.org/jira/browse/LUCENE-765? It's additional javadoc in the index package. I was rooting around for some low-impact helpful thing to do here, and found this on a list of "newdev" issues. It's fairly high-level but should be h

benchmark drop for PrimaryKey

2018-08-23 Thread Michael Sokolov
I happened to stumble across this chart https://home.apache.org/~mikemccand/lucenebench/PKLookup.html showing a pretty drastic drop in this benchmark on 5/13. I looked at the commits between the previous run and this one and did some investigation, trying to do some git bisect to find the problem u

Re: [jira] [Commented] (LUCENE-2562) Make Luke a Lucene/Solr Module

2018-08-16 Thread Michael Sokolov
Oh! Nice -- I'll have a look. I had started tinkering with my own, but it would be nice if it already existed - thanks! On Thu, Aug 16, 2018 at 10:42 AM Tomoko Uchida (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/LUCENE-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:co

Re: Synonyms + autoGeneratePhraseQueries

2018-07-26 Thread Michael Sokolov
Did you mean q=oow in your example? As written, I don't see how there is a problem. On Thu, Jul 26, 2018 at 8:41 AM Andrea Gazzarini wrote: > Hi, still fighting with synonyms, I have another question. > I'm not understanding the role, and the effect, of the > "autoGeneratePhraseQueries" attribut

Re: SynonymGraphFilter followed by StopFilter

2018-07-26 Thread Michael Sokolov
> In general I’d avoid index-time synonyms in lucene because synonyms can create graphs (eg if a single term gets expanded to several terms), and we can’t index graphs correctly. I wonder what it would take to address this. I guess the blast radius of adding a token "width" could be pretty large.

Re: [jira] [Reopened] (LUCENE-8389) Could not limit Lucene's memory consumption

2018-07-09 Thread Michael Sokolov
Can you run a mirror instance and swap traffic, performing reindexing on an online system, and then bring it online when complete? On Sun, Jul 8, 2018, 7:46 PM changchun huang (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/LUCENE-8389?page=com.atlassian.jira.plugin.system.issue

Re: [jira] [Created] (LUCENE-8389) Could not limit Lucene's memory consumption

2018-07-06 Thread Michael Sokolov
You should really try asking on an Atlassian support forum since Jira is their project and they support it. This bug database is for tracking issues about Lucene itself. Also please note that Lucene 3 is many years old now, and no longer receiving bug fixes. The current version is 7, soon to be 8,

Re: [jira] [Created] (LUCENE-8319) A Time-limiting collector that works with CollectorManagers

2018-05-18 Thread Michael Sokolov
Would it make sense to change TimeExceededException so it extends CollectionTerminatedException? On Wed, May 16, 2018 at 4:29 PM, Tony Xu (JIRA) wrote: > Tony Xu created LUCENE-8319: > --- > > Summary: A Time-limiting collector that works with > Collector
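
A toy sketch of what the question above is getting at, written in Python rather than Java and not using Lucene's classes: the two exception classes below are hypothetical analogues that only borrow the names, and the "time budget" is simplified to a hit count. The point is that if the time-limit exception were a subclass, any existing handler for the base exception would also catch it, so the search loop could stop collecting one segment and move on without a separate handler.

```python
class CollectionTerminatedException(Exception):
    """Toy analogue: collection of the current segment should stop early."""


class TimeExceededException(CollectionTerminatedException):
    """Toy analogue: the budget ran out (the subclass relationship is the proposal)."""


def collect_segment(docs, hits, budget):
    """Append docs to hits until the (simplified) budget is exhausted."""
    for doc in docs:
        if len(hits) >= budget:
            raise TimeExceededException("budget exhausted")
        hits.append(doc)


def search(segments, budget):
    """Collect across segments; one handler covers both exception types."""
    hits = []
    for docs in segments:
        try:
            collect_segment(docs, hits, budget)
        except CollectionTerminatedException:
            # Also catches TimeExceededException because it is a subclass.
            continue
    return hits


print(search([[1, 2, 3], [4, 5]], budget=4))
```

The trade-off in this sketch is that the caller can no longer tell a timeout from an ordinary early termination unless it checks the exception type explicitly.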

Re: [jira] [Commented] (LUCENE-8273) Add a BypassingTokenFilter

2018-04-24 Thread Michael Sokolov
+1 On Tue, Apr 24, 2018 at 9:58 AM, Alan Woodward (JIRA) wrote: > > [ https://issues.apache.org/jira/browse/LUCENE-8273?page= > com.atlassian.jira.plugin.system.issuetabpanels:comment- > tabpanel&focusedCommentId=16449897#comment-16449897 ] > > Alan Woodward commented on LUCENE-8273: > -

Re: [jira] [Commented] (LUCENE-8248) Rename MergePolicyWrapper to FilterMergePolicy and override all of MergePolicy

2018-04-13 Thread Michael Sokolov
yes, thanks! On Fri, Apr 13, 2018 at 7:05 PM, Michael McCandless (JIRA) wrote: > > [ https://issues.apache.org/jira/browse/LUCENE-8248?page= > com.atlassian.jira.plugin.system.issuetabpanels:comment- > tabpanel&focusedCommentId=16438060#comment-16438060 ] > > Michael McCandless commented on

Re: [jira] [Commented] (LUCENE-8248) Make MergePolicy.setMaxCFSSegmentSizeMB final

2018-04-10 Thread Michael Sokolov
Ah true that would be messy! I'll update the patch. On Tue, Apr 10, 2018 at 7:26 PM, Michael McCandless (JIRA) wrote: > > [ https://issues.apache.org/jira/browse/LUCENE-8248?page= > com.atlassian.jira.plugin.system.issuetabpanels:comment- > tabpanel&focusedCommentId=16433177#comment-16433177

Re: [jira] [Commented] (LUCENE-8240) Support different analysis per field instance

2018-04-05 Thread Michael Sokolov
Ok that was actually my first implementation. It was a lot messier. I'll follow up with details when I get back to a keyboard On Thu, Apr 5, 2018, 9:09 AM Adrien Grand (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/LUCENE-8240?page=com.atlassian.jira.plugin.system.issuetabpanels

WordDelimiterFilter javadocs are off base

2018-04-04 Thread Michael Sokolov
The javadocs for both WDF and WDGF include a pretty detailed discussion about the proper use of the "combinations" parameter, but no such parameter exists. I don't know the history here, but it sounds as if the docs might be referring to some previous incarnation of this filter, perhaps in the cont

Re: [jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-03-04 Thread Michael Sokolov
Perhaps Robert is a fan of Object.clone() On Feb 28, 2018 9:59 AM, "Bruno Roustant (JIRA)" wrote: > > [ https://issues.apache.org/jira/browse/LUCENE-8159?page= > com.atlassian.jira.plugin.system.issuetabpanels:comment- > tabpanel&focusedCommentId=16380407#comment-16380407 ] > > Bruno Roustan

Re: pro coding style

2012-12-01 Thread Michael Sokolov
On 12/1/2012 7:59 AM, Per Steffensen wrote: It is all about information - git has it, SVN doesn't. And my logical sense tells me that it has to be git and not github! :-) Now tell me that I am stupid :-) This kind of information (merge tracking) has been in svn since 1.5 (see http://subversio

XmlCharFilter

2011-06-14 Thread Michael Sokolov
I work with a lot of XML data sources and have needed to implement an analysis chain for Solr/Lucene that accepts XML. In the course of doing that, I found I needed something very much like HTMLCharFilter, but that does standard XML parsing (understands XML entities defined in an internal or ex

Re: Solr Config XML DTD's

2011-05-04 Thread Michael Sokolov
I'm not sure you will find anyone wanting to put in this effort now, but another suggestion for a general approach might be: 1 very basic static analysis to catch what you can - this should be a pretty minimal effort only given what can reasonably be achieved 2 throw runtime errors as Hoss sa

Re: Re: Solr Config XML DTD's

2011-05-01 Thread Michael Sokolov
My first post too - but if I can offer a suggestion - there are more modern XML validation technologies available than DTD. I would heartily recommend RelaxNG/Compact notation (see http://relaxng.org/compact-tutorial-20030326.html) - you can generate Relax from a DTD, but it is more expressive
