[jira] [Commented] (LUCENE-8368) facet by polygon

2018-06-24 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521787#comment-16521787
 ] 

Robert Muir commented on LUCENE-8368:
-

I also see 4M hits/sec if, instead of MatchAllDocsQuery on the OSM data, I use 
newBoxQuery(London) to better approximate the worst case. Besides removing 
the binary search, it would be really good to support [~jpountz]'s Grid to do 
this. It is wasteful to descend the tree except in edge cases. Conceptually 
that Grid already just associates an integer value with each sub-box, so it 
might as well be the int of the polygon if the polygon fully contains the 
sub-box, but Java makes that hard... I will try to think on that.

> facet by polygon
> 
>
> Key: LUCENE-8368
> URL: https://issues.apache.org/jira/browse/LUCENE-8368
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>    Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-8368.patch
>
>
> This can give some nice flexibility if you are working with search results on 
> a map. Of course, if everything about your use case is static, it's better to 
> compute this up front and index string values, but that's not always the case. 
> It can also be helpful if your polygons change often, since you don't have to 
> reindex.
> Polygon2D already supports multipolygons, but today it only returns a boolean 
> value. This patch adds a {{find}} method that returns the id of the polygon 
> that actually matched, or -1 if nothing matched. {{contains}} is then just 
> written as {{find >= 0}}.
> Then we can solve the problem with just some sugar over the existing range 
> faceting, since each multipolygon is just a range of ids coming back from 
> {{find}}. E.g. if you were faceting by country, you might have ~200 countries 
> with 100,000 total polygons, and polygons 22,000-32,000 might correspond to 
> Canada.
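The id-range sugar described above can be sketched in plain Java. Everything here (the class, the toy boundaries, the `facetLabel` method) is illustrative, not Lucene's actual API: it just shows how a polygon id returned by a `find()`-style method maps to a facet label when each label owns a contiguous id range.

```java
import java.util.Arrays;

public class PolygonFacetSketch {
    // Hypothetical: each "country" owns a contiguous range of polygon ids.
    // starts[i] is the first polygon id belonging to label i (sorted).
    static final int[] starts = {0, 22000, 32000};
    static final String[] labels = {"USA", "Canada", "Mexico"};

    /** Maps a polygon id (as a find()-style method might return) to its label. */
    static String facetLabel(int polygonId) {
        if (polygonId < 0) return null;            // no polygon matched
        int idx = Arrays.binarySearch(starts, polygonId);
        if (idx < 0) idx = -idx - 2;               // insertion point - 1
        return labels[idx];
    }

    public static void main(String[] args) {
        System.out.println(facetLabel(25000));     // an id inside Canada's range
        System.out.println(facetLabel(-1));        // point matched no polygon
    }
}
```

The counting side of faceting then just increments a counter per label for each hit, exactly as range faceting already does.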



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8368) facet by polygon

2018-06-24 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521755#comment-16521755
 ] 

Robert Muir commented on LUCENE-8368:
-

On the OSM benchmark it can facet-by-London-borough at ~20M hits/sec on my 
machine, better than I thought it would do.

There is some smelly stuff to figure out first, though:
* the LongValues abstraction used here doesn't support SortedNumeric; that's a 
problem for LatLonDV. It also seems to be an issue for the numeric range facet 
classes here.
* reusing the range code means less code, but we really shouldn't be doing 
binary search, since our ranges are 100% dense.
* maybe it's not so great in the API to force construction of a Polygon2D in 
every query? But when I benchmarked with the borough polygons (33 polygons, 
186,318 total vertices), this didn't seem to matter either.
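The dense-ranges point is that when ranges tile the id space with no gaps, a precomputed table replaces binary search entirely. A minimal sketch under assumed names (nothing here is Lucene API; `boundaries` is toy data):

```java
public class DenseRangeLookup {
    // Hypothetical dense ranges over polygon ids 0..totalIds-1:
    // ids in [boundaries[i], boundaries[i+1]) all map to facet ordinal i.
    static int[] buildTable(int[] boundaries, int totalIds) {
        int[] ordinalById = new int[totalIds];
        for (int ord = 0; ord + 1 < boundaries.length; ord++) {
            for (int id = boundaries[ord]; id < boundaries[ord + 1]; id++) {
                ordinalById[id] = ord;   // later lookups are O(1), no search
            }
        }
        return ordinalById;
    }

    public static void main(String[] args) {
        // ids 0-2 -> ordinal 0, ids 3-4 -> ordinal 1, ids 5-7 -> ordinal 2
        int[] table = buildTable(new int[] {0, 3, 5, 8}, 8);
        System.out.println(table[4]);
    }
}
```

The table costs one int per polygon id, which is the usual trade for removing a per-hit log-factor.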







[jira] [Commented] (LUCENE-8368) facet by polygon

2018-06-24 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521706#comment-16521706
 ] 

Robert Muir commented on LUCENE-8368:
-

I didn't yet hook up the OSM benchmark for this; it would be good to know. If 
it's too sluggish, we can emphasize things like RandomSamplingFacetsCollector 
and fastMatchQuery more in the javadocs, which will help.







[jira] [Updated] (LUCENE-8368) facet by polygon

2018-06-24 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-8368:

Attachment: LUCENE-8368.patch







[jira] [Created] (LUCENE-8368) facet by polygon

2018-06-24 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-8368:
---

 Summary: facet by polygon
 Key: LUCENE-8368
 URL: https://issues.apache.org/jira/browse/LUCENE-8368
 Project: Lucene - Core
  Issue Type: New Feature
  Components: modules/facet
Reporter: Robert Muir


This can give some nice flexibility if you are working with search results on a 
map. Of course, if everything about your use case is static, it's better to 
compute this up front and index string values, but that's not always the case. 
It can also be helpful if your polygons change often, since you don't have to 
reindex.

Polygon2D already supports multipolygons, but today it only returns a boolean 
value. This patch adds a {{find}} method that returns the id of the polygon that 
actually matched, or -1 if nothing matched. {{contains}} is then just written as 
{{find >= 0}}.

Then we can solve the problem with just some sugar over the existing range 
faceting, since each multipolygon is just a range of ids coming back from 
{{find}}. E.g. if you were faceting by country, you might have ~200 countries 
with 100,000 total polygons, and polygons 22,000-32,000 might correspond to 
Canada.







[jira] [Commented] (LUCENE-8367) Make per-dimension drill down optional for each facet dimension

2018-06-21 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520016#comment-16520016
 ] 

Robert Muir commented on LUCENE-8367:
-

Looking at the code, it seems those ValueSources already call equals() etc. on 
the ranges today, so it's good you fixed that.

Should DoubleRange.equals() compare bits for safety, like Double.equals() does? 
Otherwise, with ==, it's a bit smelly and buggy (-0 vs 0 and so on).
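The -0 vs 0 (and NaN) pitfalls with == are easy to demonstrate; Double.doubleToLongBits is what Double.equals() compares internally. A small self-contained illustration:

```java
public class DoubleEqualsSketch {
    /** Bitwise comparison, as Double.equals() does internally. */
    static boolean bitsEquals(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        System.out.println(0.0 == -0.0);                        // true: == conflates signed zeros
        System.out.println(bitsEquals(0.0, -0.0));              // false: sign bits differ
        System.out.println(Double.NaN == Double.NaN);           // false: NaN is never == itself
        System.out.println(bitsEquals(Double.NaN, Double.NaN)); // true: bits canonicalized
    }
}
```

So an equals() built on == would treat DoubleRange(0.0, ...) and DoubleRange(-0.0, ...) as equal while giving them distinct behavior elsewhere, and would never match a range containing NaN endpoints against itself.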

> Make per-dimension drill down optional for each facet dimension
> ---
>
> Key: LUCENE-8367
> URL: https://issues.apache.org/jira/browse/LUCENE-8367
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Major
> Attachments: LUCENE-8367.patch
>
>
> Today, when you index a {{FacetField}} with path {{foo/bar}}, we index two 
> drill-down terms onto the document: {{foo}} and {{foo/bar}}.
> But I suspect some users (like me!) don't need to drill down on just {{foo}} 
> (effectively "find all documents that have any value for this facet 
> dimension"), so I added an option to {{FacetsConfig}} to let you specify 
> per-dimension whether you need that drill-down (defaults to true, matching 
> current behavior).
> I also added {{hashCode}} and {{equals}} to the {{LongRange}} and 
> {{DoubleRange}} classes in the facets module, and improved {{CheckIndex}} a bit 
> to print the total percentage of deletions across the index.






[jira] [Resolved] (LUCENE-8366) upgrade to icu 62.1

2018-06-21 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-8366.
-
   Resolution: Fixed
Fix Version/s: 7.5
   trunk

master: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/2ea416ee
branch_7x: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/5b5b09c8

> upgrade to icu 62.1
> ---
>
> Key: LUCENE-8366
> URL: https://issues.apache.org/jira/browse/LUCENE-8366
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>    Reporter: Robert Muir
>Priority: Major
> Fix For: trunk, 7.5
>
> Attachments: LUCENE-8366.patch
>
>
> This gives Unicode 11 support.
> Also, emoji tokenization is simpler, and it provides a way to get better 
> tokenization for emoji added in the future.






[jira] [Commented] (LUCENE-7314) Graduate InetAddressPoint and LatLonPoint to core

2018-06-21 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519825#comment-16519825
 ] 

Robert Muir commented on LUCENE-7314:
-

+1 patch looks good.

> Graduate InetAddressPoint and LatLonPoint to core
> -
>
> Key: LUCENE-7314
> URL: https://issues.apache.org/jira/browse/LUCENE-7314
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7314.patch
>
>
> Maybe we should graduate these fields (and related queries) to core for 
> Lucene 6.1?






[jira] [Commented] (LUCENE-8364) Refactor and clean up core geo api

2018-06-21 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519478#comment-16519478
 ] 

Robert Muir commented on LUCENE-8364:
-

Also, there is another problem with having Polygon create Polygon2D 
data structures: there is not a one-to-one relationship between the two.

Anything using this should create the Polygon2D explicitly itself, because the 
relationship is many-to-one:

{code}
/** Builds a Polygon2D from a multipolygon */
public static Polygon2D create(Polygon... polygons) {
{code}

This is really important, since for multipolygons it builds a two-stage tree. We 
don't want to encourage users to create these things for individual polygons and 
combine them with BooleanQuery or something like that; it will result in code 
that runs in linear time.

> Refactor and clean up core geo api
> --
>
> Key: LUCENE-8364
> URL: https://issues.apache.org/jira/browse/LUCENE-8364
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nicholas Knize
>Priority: Major
> Attachments: LUCENE-8364.patch
>
>
> The core geo API is quite disorganized and confusing. For example, there is 
> {{Polygon}} for creating an instance of polygon vertices and holes, and 
> {{Polygon2D}} for computing relations between points and polygons. There are 
> also a {{PolygonPredicate}} and a {{DistancePredicate}} in {{GeoUtils}} for 
> computing point-in-polygon and point-distance relations, respectively, and a 
> {{GeoRelationUtils}} utility class which is no longer used for anything. This 
> disorganization is due to the organic growth of the simple {{LatLonPoint}} 
> indexing and search features, and a little TLC is needed to clean up the API 
> to make it more approachable and easier to understand.






[jira] [Commented] (LUCENE-8361) Make TestRandomChains check that filters preserve positions

2018-06-21 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519406#comment-16519406
 ] 

Robert Muir commented on LUCENE-8361:
-

Both those limit filters are broken and buggy by default: they won't consume 
all the tokens unless you pass a special boolean.

> Make TestRandomChains check that filters preserve positions
> ---
>
> Key: LUCENE-8361
> URL: https://issues.apache.org/jira/browse/LUCENE-8361
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Assignee: Alan Woodward
>Priority: Minor
> Attachments: LUCENE-8361.patch
>
>
> Follow-up of LUCENE-8360: it is a bit disappointing that we only found this 
> issue because of a newly introduced token filter. I wonder whether we might 
> be able to make TestRandomChains detect more bugs by verifying that the sum 
> of position increments is preserved through the whole analysis chain.
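The invariant proposed here — a filter may merge or drop tokens, but the total of the position increments should be preserved — can be illustrated without Lucene. The increment arrays below are hand-made toy data standing in for what a real TokenStream would report:

```java
import java.util.stream.IntStream;

public class PosIncInvariantSketch {
    /** Sum of position increments over a token stream's tokens. */
    static int sum(int[] posIncs) {
        return IntStream.of(posIncs).sum();
    }

    public static void main(String[] args) {
        // "quick brown fox": a stopword filter drops "brown" but folds its
        // increment into the next token, so total positions are preserved.
        int[] before = {1, 1, 1};   // quick(+1) brown(+1) fox(+1)
        int[] after  = {1, 2};      // quick(+1) fox(+2)
        System.out.println(sum(before) == sum(after));   // invariant holds

        int[] buggy = {1, 1};       // a filter that silently swallowed a position
        System.out.println(sum(before) == sum(buggy));   // invariant violated
    }
}
```

A TestRandomChains-style check would compute the first sum from the raw tokenizer output and the second after the random filter chain, failing the test on mismatch.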






[jira] [Updated] (LUCENE-8366) upgrade to icu 62.1

2018-06-20 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-8366:

Attachment: LUCENE-8366.patch







[jira] [Created] (LUCENE-8366) upgrade to icu 62.1

2018-06-20 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-8366:
---

 Summary: upgrade to icu 62.1
 Key: LUCENE-8366
 URL: https://issues.apache.org/jira/browse/LUCENE-8366
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Robert Muir


This gives Unicode 11 support.

Also, emoji tokenization is simpler, and it provides a way to get better 
tokenization for emoji added in the future.






Re: Lucene/Solr 8.0

2018-06-20 Thread Robert Muir
How can the end user actually use the biggest new feature, impacts and
BMW? As far as I can tell, the issue to actually implement the
necessary API changes (IndexSearcher/TopDocs/etc.) is still open and
unresolved, although there are some interesting ideas on it. This
seems like a really big missing piece; without a proper API, the stuff
is not really usable. I also can't imagine the API being introduced in
a follow-up minor release, because it would be too invasive.

On Mon, Jun 18, 2018 at 1:19 PM, Adrien Grand  wrote:
> Hi all,
>
> I would like to start discussing releasing Lucene/Solr 8.0. Lucene 8 already
> has some good changes around scoring, notably cleanups to
> similarities[1][2][3], indexing of impacts[4], and an implementation of
> Block-Max WAND[5] which, once combined, allow to run queries faster when
> total hit counts are not requested.
>
> [1] https://issues.apache.org/jira/browse/LUCENE-8116
> [2] https://issues.apache.org/jira/browse/LUCENE-8020
> [3] https://issues.apache.org/jira/browse/LUCENE-8007
> [4] https://issues.apache.org/jira/browse/LUCENE-4198
> [5] https://issues.apache.org/jira/browse/LUCENE-8135
>
> In terms of bug fixes, there is also a bad relevancy bug[6] which is only in
> 8.0 because it required a breaking change[7] to be implemented.
>
> [6] https://issues.apache.org/jira/browse/LUCENE-8031
> [7] https://issues.apache.org/jira/browse/LUCENE-8134
>
> As usual, doing a new major release will also help age out old codecs, which
> in-turn make maintenance easier: 8.0 will no longer need to care about the
> fact that some codecs were initially implemented with a random-access API
> for doc values, that pre-7.0 indices encoded norms differently, or that
> pre-6.2 indices could not record an index sort.
>
> I also expect that we will come up with ideas of things to do for 8.0 as we
> feel that the next major is getting closer. In terms of planning, I was
> thinking that we could target something like october 2018, which would be
> 12-13 months after 7.0 and 3-4 months from now.
>
> From a Solr perspective, the main change I'm aware of that would be worth
> releasing a new major is the Star Burst effort. Is it something we want to
> get in for 8.0?
>
> Adrien




[jira] [Commented] (LUCENE-8364) Refactor and clean up core geo api

2018-06-20 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518764#comment-16518764
 ] 

Robert Muir commented on LUCENE-8364:
-

Also, the relate/relatePoint changes to Polygon are a big performance trap: this 
class exists solely as a thing to pass to queries. We shouldn't dynamically 
build large data structures here, or add complexity such as the caching it now 
has. I really think this doesn't belong.







[jira] [Commented] (LUCENE-8364) Refactor and clean up core geo api

2018-06-20 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518757#comment-16518757
 ] 

Robert Muir commented on LUCENE-8364:
-

Just looking, I have a few concerns:
* what is the goal of all the new abstractions? Abstractions have a significant 
cost, and I don't think we should be building a geo library here. We should 
just make the searches and related features work.
* why does Polygon have new methods such as relate() and relatePoint() that are 
not used anywhere? We shouldn't add unnecessary stuff like that; we should keep 
this minimal.
* the hashCode/equals on Polygon2D is unnecessary. It is an implementation 
detail, and such methods should not be used. For example, all queries just use 
equals() with the Polygon.
* methods like maxLon() on Polygon are unnecessary. These are already final 
variables, so we don't need to wrap them in methods. Additionally, such method 
names don't follow standard Java conventions; they just add noise.
* some of the checks, e.g. in Polygon, are unnecessary. We don't need 
checkVertexIndex when the user already gets a correct exception 
(IndexOutOfBoundsException).
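To illustrate the last point: a plain Java array access already throws a descriptive exception, so a manual checkVertexIndex-style guard adds nothing. A toy sketch (the class and data are made up, not Lucene code):

```java
public class BoundsCheckSketch {
    static final double[] lats = {52.1, 52.2, 52.3};   // toy vertex latitudes

    /** No manual index check needed: the JVM supplies one. */
    static double vertexLat(int i) {
        return lats[i];   // throws ArrayIndexOutOfBoundsException on a bad i
    }

    public static void main(String[] args) {
        System.out.println(vertexLat(1));
        try {
            vertexLat(7);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("JVM already reports the bad index: " + e.getMessage());
        }
    }
}
```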

Maybe it would be easier to split up the proposed changes so they are easier to 
review, especially any proposed new abstract classes: I want to make sure we 
really get value out of any abstraction, given its high cost.

> Refactor and clean up core geo api
> --
>
> Key: LUCENE-8364
> URL: https://issues.apache.org/jira/browse/LUCENE-8364
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nicholas Knize
>Priority: Major
> Attachments: LUCENE-8364.patch
>
>
> The core geo API is quite disorganized and confusing. For example there is 
> {{Polygon}} for creating an instance of polygon vertices and holes and 
> {{Polygon2D}} for computing relations between points and polygons. There is 
> also a {{PolygonPredicate}} and {{DistancePredicate}} in {{GeoUtils}} for 
> computing point in polygon and point distance relations, respectively, and a 
> {{GeoRelationUtils}} utility class which is no longer used for anything. This 
> disorganization is due to the organic improvements of simple {{LatLonPoint}} 
> indexing and search features and a little TLC is needed to clean up api to 
> make it more approachable and easy to understand. 






Re: Status of solr tests

2018-06-15 Thread Robert Muir
can we disable this bot already?

On Fri, Jun 15, 2018, 7:25 PM Martin Gainty  wrote:

> Erick-
>
> it appears that style misapplications, which may be categorised as INFO,
> are mixed in with SEVERE errors
>
> Would it make sense to filter the errors based on severity ?
>
>
> https://docs.oracle.com/javase/7/docs/api/java/util/logging/Level.html
> if you know Severity you can triage the SEVERE errors before working down
> to INFO errors
>
> WDYT?
> Martin
> __
>
>
>
>
> --
> *From:* Erick Erickson 
> *Sent:* Friday, June 15, 2018 1:05 PM
> *To:* dev@lucene.apache.org; Mark Miller
> *Subject:* Re: Status of solr tests
>
> Mark (and everyone).
>
> I'm trying to be somewhat conservative about what I BadApple, at this
> point it's only things that have failed every week for the last 4.
> Part of that conservatism is to avoid BadApple'ing tests that are
> failing and _should_ fail.
>
> I'm explicitly _not_ delving into any of the causes at all at this
> point, it's overwhelming until we reduce the noise as everyone knows.
>
> So please feel totally free to BadApple anything you know is flakey,
> it won't intrude on my turf ;)
>
> And since I realized I can also report tests that have _not_ failed in
> a month that _are_ BadApple'd, we can be a little freer with
> BadApple'ing tests since there's a mechanism for un-annotating them
> without a lot of tedious effort.
>
> FWIW.
>
> On Fri, Jun 15, 2018 at 9:09 AM, Mark Miller 
> wrote:
> > There is an okay chance I'm going to start making some improvements here
> as
> > well. I've been working on a very stable set of tests on my starburst
> branch
> > and will slowly bring in test fixes over time (I've already been making
> some
> > on that branch for important tests). We should currently be defaulting to
> > tests.badapples=false on all solr test runs - it's a joke to try and get
> a
> > clean run otherwise, and even then somehow 4 or 5 tests that fail
> somewhat
> > commonly have so far avoided Erick's @BadApple hack and slash. They are
> bad
> > appled on my dev branch now, but that is currently where any time I have
> is
> > spent rather than on the main dev branches.
> >
> > Also, too many flakey tests are introduced because devs are not beasting
> or
> > beasting well before committing new heavy tests. Perhaps we could add
> some
> > docs around that.
> >
> > We have built-in beasting support; we need to emphasize that a couple of
> > passes on a new test is not sufficient to test its quality.
> >
> > - Mark
> >
> > On Fri, Jun 15, 2018 at 9:46 AM Erick Erickson 
> > wrote:
> >>
> >> (Sg) All very true. You're not alone in your frustration.
> >>
> >> I've been trying to at least BadApple tests that fail consistently, so
> >> another option could be to disable BadApple'd tests. My hope has been
> >> to get to the point of being able to reliably get clean runs, at least
> >> when BadApple'd tests are disabled.
> >>
> >> From that point I want to draw a line in the sand and immediately
> >> address tests that fail that are _not_ BadApple'd. At least then we'll
> >> stop getting _worse_. And then we can work on the BadApple'd tests.
> >> But as David says, that's not going to be any time soon. It's been a
> >> couple of months that I've been trying to just get the tests
> >> BadApple'd without even trying to fix any of them.
> >>
> >> It's particularly pernicious because with all the noise we don't see
> >> failures we _should_ see.
> >>
> >> So I don't have any good short-term answer either. We've built up a
> >> very large technical debt in the testing. The first step is to stop
> >> adding more debt, which is what I've been working on so far. And
> >> that's the easy part
> >>
> >> Siigghh
> >>
> >> Erick
> >>
> >>
> >> On Fri, Jun 15, 2018 at 5:29 AM, David Smiley  >
> >> wrote:
> >> > (Sigh) I sympathize with your points Simon.  I'm +1 to modify the
> >> > Lucene-side JIRA QA bot (Yetus) to not execute Solr tests.  We can and
> >> > are
> >> > trying to improve the stability of the Solr tests but even
> >> > optimistically
> >> > the practical reality is that it won't be good enough anytime soon.
> >> > When we
> >> > get there, we can reverse this.
> >> >
> >> > On Fri, Jun 15, 2018 at 3:32 AM Simon Willnauer
> >> > 
> >> > wrote:
> >> >>
> >> >> folks,
> >> >>
> >> >> I got more active working on IndexWriter and Soft-Deletes etc. in the
> >> >> last couple of weeks. It's a blast again and I really enjoy it. The
> >> >> one thing that is IMO not acceptable is the status of solr tests. I
> >> >> tried so many times to get them 

[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2018-06-12 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510220#comment-16510220
 ] 

Robert Muir commented on LUCENE-8041:
-

{quote}
+1 to make term vectors consistent across the index; it has always been strange 
that Lucene allows this.  Maybe open a separate issue for that?
{quote}

This issue's description specifically asks why there is an iterator at all; 
that's why I explained it.

But I am also concerned about this issue because I don't think it's a real 
bottleneck for anyone. I don't want us doing anything risky that could 
potentially hurt ordinary users for some esoteric abuse case with a million 
fields: it would be better to just stay with TreeMap.

It is fine to sort a list in the constructor, or to use a LinkedHashMap. This 
won't hurt ordinary users; it will just cost more RAM for abuse cases, so I am 
fine with it. I really don't want to see sneaky optimizations trying to avoid 
sorts or any of that; it does not belong here. This needs to be simple, clear, 
and safe. Instead, any serious effort should go into removing the problematic 
API (the term vectors stuff); then it can be even simpler, since we won't need 
two data structures.

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Priority: Major
> Attachments: LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  Many Field implementations are impacted, in part because Fields 
> is the base class of FieldsProducer.
> As an aside, I hope Fields goes away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this, then why should the terms-index part of our API have it?  
> If we did this, then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> so it is not necessarily sorted?
> Perhaps the fix can be a relatively simple conversion to LinkedHashMap in many 
> cases, if we can assume that when we initialize these internal maps we consume 
> them in sorted order to begin with.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2018-06-11 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509023#comment-16509023
 ] 

Robert Muir commented on LUCENE-8041:
-

{quote}
That sounds like the cart leading the horse (letting how CheckIndex works 
today prevent us from remaking how we want Lucene to be tomorrow). Can't we 
just relax what CheckIndex checks here – e.g. have it check, but report a 
warning if only some docs have TVs and others don't, which is generally not 
normal? I think that's what you're getting at, but I'm not sure. I've only 
looked at CheckIndex in passing.
{quote}

That's absolutely not the case at all. The user is allowed to do this, hence 
CheckIndex must validate it. Please don't make CheckIndex the bad guy here; 
it's not. The problem is that IndexWriter allows users to do this in the first 
place.

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Priority: Major
> Attachments: LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.






[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2018-06-11 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16508692#comment-16508692
 ] 

Robert Muir commented on LUCENE-8041:
-

This has the downside that it sorts all fields on every call to iterator(). My 
concern is mainly that it will introduce performance problems down the line, 
ones that are difficult to find/debug because of Java's syntactic sugar around 
iterator(). Especially if someone is using MultiFields (the slow-wrapper crap), 
they will be doing a bunch of sorts on each segment, then merging those, all 
hidden behind a single call to iterator().

I still feel the best option would be to remove this map entirely: then you can 
be sure there aren't traps. The only thing blocking this is the fact that 
term-vector options are configurable per-doc, which doesn't make sense anyway.
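The hidden-cost trap is easy to see in a stand-alone sketch. The class and method names below are hypothetical, not Lucene's: the point is that for-each sugar calls iterator() on every loop, so any sort done inside iterator() silently reruns per loop (and per segment under a MultiFields-style wrapper), whereas sorting once up front makes iterator() cheap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class FieldsIterDemo {
    // Hypothetical Fields-like class that sorts its field names once,
    // instead of on every iterator() call.
    static class Fields implements Iterable<String> {
        private final Map<String, Object> terms = new HashMap<>();
        private List<String> sortedNames; // computed lazily, exactly once

        void add(String field) {
            terms.put(field, new Object()); // stand-in for a Terms instance
            sortedNames = null;             // invalidate the cached order
        }

        @Override
        public Iterator<String> iterator() {
            if (sortedNames == null) {      // sort once, not per call
                List<String> names = new ArrayList<>(terms.keySet());
                Collections.sort(names);
                sortedNames = names;
            }
            return sortedNames.iterator();
        }
    }

    public static void main(String[] args) {
        Fields fields = new Fields();
        fields.add("body");
        fields.add("author");
        // Each for-each below calls iterator(); with a per-call sort,
        // that cost would be paid on every loop, invisibly.
        for (String f : fields) {
            System.out.println(f); // author, then body
        }
    }
}
```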

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Smiley
>Priority: Major
> Attachments: LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields to go away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.






[jira] [Commented] (LUCENE-8165) ban Arrays.copyOfRange with forbidden APIs

2018-06-07 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504650#comment-16504650
 ] 

Robert Muir commented on LUCENE-8165:
-

I think these commits may fix LUCENE-8164 too: I haven't yet re-run the test 
to see if it now hits the exception.

> ban Arrays.copyOfRange with forbidden APIs
> --
>
> Key: LUCENE-8165
> URL: https://issues.apache.org/jira/browse/LUCENE-8165
> Project: Lucene - Core
>  Issue Type: Bug
>    Reporter: Robert Muir
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: LUCENE-8165.patch, LUCENE-8165_copy_of.patch, 
> LUCENE-8165_copy_of_range.patch, LUCENE-8165_start.patch, 
> LUCENE-8165_start.patch
>
>
> This method is no good, because instead of throwing AIOOBE for bad bounds, it 
> will silently fill with zeros (essentially silent corruption). Unfortunately 
> it is used in quite a few places, so replacing it with e.g. System.arraycopy may 
> uncover some interesting surprises.
> See LUCENE-8164 for motivation.
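The zero-fill trap described above is easy to demonstrate with a minimal stand-alone sketch (not Lucene code): Arrays.copyOfRange accepts a `to` bound past the end of the source array and silently pads the result with zeros, while System.arraycopy with the same bad bounds fails fast.

```java
import java.util.Arrays;

public class CopyOfRangeDemo {
    public static void main(String[] args) {
        byte[] src = {1, 2, 3};

        // 'to' (6) is past the end of src (length 3), but no exception is
        // thrown: the tail of the result is silently zero-filled.
        byte[] padded = Arrays.copyOfRange(src, 1, 6);
        System.out.println(Arrays.toString(padded)); // [2, 3, 0, 0, 0]

        // System.arraycopy with the same out-of-bounds length fails fast:
        try {
            byte[] dst = new byte[5];
            System.arraycopy(src, 1, dst, 0, 5);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("arraycopy threw: " + e.getClass().getSimpleName());
        }
    }
}
```

For index file formats, the zero-filled bytes are indistinguishable from valid data, which is why the description calls this "essentially silent corruption".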






[jira] [Commented] (LUCENE-8165) ban Arrays.copyOfRange with forbidden APIs

2018-06-07 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504639#comment-16504639
 ] 

Robert Muir commented on LUCENE-8165:
-

Thanks Adrien (also for branching first!). I'm sorry I was slow and held up 
getting the fixes in: I was worried about the risks of the changes too.

> ban Arrays.copyOfRange with forbidden APIs
> --
>
> Key: LUCENE-8165
> URL: https://issues.apache.org/jira/browse/LUCENE-8165
> Project: Lucene - Core
>  Issue Type: Bug
>    Reporter: Robert Muir
>Priority: Major
> Fix For: master (8.0), 7.5
>
> Attachments: LUCENE-8165.patch, LUCENE-8165_copy_of.patch, 
> LUCENE-8165_copy_of_range.patch, LUCENE-8165_start.patch, 
> LUCENE-8165_start.patch
>
>
> This method is no good, because instead of throwing AIOOBE for bad bounds, it 
> will silently fill with zeros (essentially silent corruption). Unfortunately 
> it is used in quite a few places, so replacing it with e.g. System.arraycopy may 
> uncover some interesting surprises.
> See LUCENE-8164 for motivation.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-06-07 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504554#comment-16504554
 ] 

Robert Muir commented on LUCENE-8273:
-

If ShingleFilter is the buggy one, it should be banned from the test for sure, 
and an issue opened for its bugginess.

It's not a test-coverage concern: this can't replace unit tests. It exists to 
find new buggy interactions between the analysis components, like this one here.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8326) More Like This Params Refactor

2018-06-06 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16503137#comment-16503137
 ] 

Robert Muir commented on LUCENE-8326:
-

I feel the same way as before: we shouldn't split up a class in a 
user-impacting way just because it's 1000 lines of code. To the user it does 
not matter; they just see one class with getters and setters, and it's easy.

If it really needs to be split up, can we try to do it in a way that doesn't 
impact users, e.g. move some code into package-private implementation classes?

> More Like This Params Refactor
> --
>
> Key: LUCENE-8326
> URL: https://issues.apache.org/jira/browse/LUCENE-8326
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/query/scoring
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: LUCENE-8326.patch, LUCENE-8326.patch, LUCENE-8326.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> More Like This can be refactored to improve code readability, test 
> coverage, and maintenance.
> The scope of this Jira issue is to start the More Like This refactor with the 
> More Like This params.
> This Jira will not improve the current More Like This, just keep the same 
> functionality with refactored code.
> Other Jira issues will follow, improving overall code readability, test 
> coverage, and maintenance.






[jira] [Resolved] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-04 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-7960.
-
   Resolution: Fixed
Fix Version/s: master (8.0)
   7.4

Thank you [~iwesp] !

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Assignee: Robert Muir
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Created] (LUCENE-8348) Remove [Edge]NgramTokenizer min/max defaults consistent with Filter

2018-06-04 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-8348:
---

 Summary: Remove [Edge]NgramTokenizer min/max defaults consistent 
with Filter
 Key: LUCENE-8348
 URL: https://issues.apache.org/jira/browse/LUCENE-8348
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
 Environment: LUCENE-7960 fixed a good deal of trappiness here for the 
token filters: there are no longer ridiculous default min/max values such as 1,2.

Also, javadocs were enhanced to present a clear warning about using large 
ranges: it seems to surprise people that min=small, max=huge eats up a ton of 
resources, but it is really like creating (huge-small) separate n-gram indexes, 
so of course it's expensive.

Finally, it keeps the typical, more efficient fixed-size n-gram case easy, vs. 
forcing someone to use an unintuitive min=X,max=X range.

We should improve the tokenizers in the same way.
Reporter: Robert Muir
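The "min=small, max=huge" cost above is simple arithmetic: a term of length L produces L-n+1 n-grams of size n, so a wide [min, max] range multiplies the output roughly (max-min+1) times. A small stand-alone sketch (illustrative only, not Lucene code):

```java
public class GramCountDemo {
    // Number of n-grams a single term of length termLen produces over
    // gram sizes [min, max] (inclusive).
    static long gramCount(int termLen, int min, int max) {
        long total = 0;
        for (int n = min; n <= max; n++) {
            if (n <= termLen) {
                total += termLen - n + 1; // a term of length L has L-n+1 n-grams of size n
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Fixed gram size (min == max): linear in term length.
        System.out.println(gramCount(10, 2, 2));  // 9
        // Wide range: roughly one index's worth of grams per size in the range.
        System.out.println(gramCount(10, 1, 10)); // 55
    }
}
```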









[jira] [Commented] (LUCENE-7690) TestSimpleTextPointsFormat.testWithExceptions() failure

2018-06-04 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501124#comment-16501124
 ] 

Robert Muir commented on LUCENE-7690:
-

Sorry for the wrong messages: my dyslexia in the commit message.

> TestSimpleTextPointsFormat.testWithExceptions() failure
> ---
>
> Key: LUCENE-7690
> URL: https://issues.apache.org/jira/browse/LUCENE-7690
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Steve Rowe
>Assignee: Michael McCandless
>Priority: Major
> Fix For: 6.5, 7.0
>
>
> Reproducing branch_6x seed from 
> [https://jenkins.thetaphi.de/job/Lucene-Solr-6.x-MacOSX/690/]:
> {noformat}
>[junit4] Suite: 
> org.apache.lucene.codecs.simpletext.TestSimpleTextPointsFormat
>[junit4] IGNOR/A 0.02s J0 | TestSimpleTextPointsFormat.testRandomBinaryBig
>[junit4]> Assumption #1: 'nightly' test group is disabled (@Nightly())
>[junit4]   2> NOTE: reproduce with: ant test  
> -Dtestcase=TestSimpleTextPointsFormat -Dtests.method=testWithExceptions 
> -Dtests.seed=CCE1E867577CFFF6 -Dtests.slow=true -Dtests.locale=uk-UA 
> -Dtests.timezone=Asia/Qatar -Dtests.asserts=true 
> -Dtests.file.encoding=ISO-8859-1
>[junit4] ERROR   0.93s J0 | TestSimpleTextPointsFormat.testWithExceptions 
> <<<
>[junit4]> Throwable #1: java.lang.IllegalStateException: this writer 
> hit an unrecoverable error; cannot complete forceMerge
>[junit4]>  at 
> __randomizedtesting.SeedInfo.seed([CCE1E867577CFFF6:6EB2741BD8F2B00C]:0)
>[junit4]>  at 
> org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1931)
>[junit4]>  at 
> org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1881)
>[junit4]>  at 
> org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:429)
>[junit4]>  at 
> org.apache.lucene.index.BasePointsFormatTestCase.verify(BasePointsFormatTestCase.java:701)
>[junit4]>  at 
> org.apache.lucene.index.BasePointsFormatTestCase.testWithExceptions(BasePointsFormatTestCase.java:224)
>[junit4]>  at java.lang.Thread.run(Thread.java:745)
>[junit4]> Caused by: org.apache.lucene.index.CorruptIndexException: 
> Problem reading index from 
> MockDirectoryWrapper(NIOFSDirectory@/Users/jenkins/workspace/Lucene-Solr-6.x-MacOSX/lucene/build/codecs/test/J0/temp/lucene.codecs.simpletext.TestSimpleTextPointsFormat_CCE1E867577CFFF6-001/tempDir-001
>  lockFactory=org.apache.lucene.store.NativeFSLockFactory@4d6de658) 
> (resource=MockDirectoryWrapper(NIOFSDirectory@/Users/jenkins/workspace/Lucene-Solr-6.x-MacOSX/lucene/build/codecs/test/J0/temp/lucene.codecs.simpletext.TestSimpleTextPointsFormat_CCE1E867577CFFF6-001/tempDir-001
>  lockFactory=org.apache.lucene.store.NativeFSLockFactory@4d6de658))
>[junit4]>  at 
> org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.java:140)
>[junit4]>  at 
> org.apache.lucene.index.SegmentReader.(SegmentReader.java:74)
>[junit4]>  at 
> org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
>[junit4]>  at 
> org.apache.lucene.index.ReadersAndUpdates.getReaderForMerge(ReadersAndUpdates.java:617)
>[junit4]>  at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4293)
>[junit4]>  at 
> org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3940)
>[junit4]>  at 
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:588)
>[junit4]>  at 
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:626)
>[junit4]> Caused by: java.io.FileNotFoundException: a random 
> IOException (_0.inf)
>[junit4]>  at 
> org.apache.lucene.store.MockDirectoryWrapper.maybeThrowIOExceptionOnOpen(MockDirectoryWrapper.java:575)
>[junit4]>  at 
> org.apache.lucene.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:744)
>[junit4]>  at 
> org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
>[junit4]>  at 
> org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1072)
>[junit4]>  at 
> org.apache.lucene.codecs.simpletext.SimpleTextFieldInfosFormat.read(SimpleTextFieldInfosFormat.java:73)
>[junit4]>  at 
> org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.j

discuss: stop adding 'via' from CHANGES.txt entries (take two)

2018-06-04 Thread Robert Muir
I raised this issue a few years ago, and no consensus was reached. [1]

I'm asking if we can take the time to revisit the issue. Back then, in
the Subversion days, you had "patch-uploaders" and "contributors".
With git now, I believe the situation is even a bit more extreme,
because the committer is the contributor and the Lucene "committer"
was really just the "pusher".

On the other hand, there were some reasons against removing this
before. In particular, some mentioned that it conveyed meaning about
who might be the best person to ping about a particular area of the
code. If this is still the case, I'd ask that we discuss alternative
ways it could be accomplished (such as a wiki page, perhaps linked
from HowToContribute, that people can edit).

I wrote a new summary/argument inline, but see the linked thread for
the previous discussion:


In the past CHANGES.txt entries from a contributor have also had the
name of the committer with a 'via' entry.

e.g.:

LUCENE-1234: optimized FooBar. (Jane Doe via Joe Schmoe).

I propose we stop adding the committer name ("via Joe Schmoe"). It seems
to diminish the value of the contribution. It reminds me of a
professor adding themselves as a second author by default, or something
like that. If someone really wants to know who committed the change, I
think it's fair that they look at the version control history.

1. 
http://mail-archives.apache.org/mod_mbox/lucene-dev/201206.mbox/%3CCAOdYfZW65MXrzyRPsvBD0C6c4X%2BLuQX4oVec%3DyR_PCPgTQrnhQ%40mail.gmail.com%3E




[jira] [Assigned] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-06-04 Thread Robert Muir (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir reassigned LUCENE-7960:
---

Assignee: Robert Muir

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Assignee: Robert Muir
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-8165) ban Arrays.copyOfRange with forbidden APIs

2018-06-01 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497871#comment-16497871
 ] 

Robert Muir commented on LUCENE-8165:
-

Yeah, let's split it up.

> ban Arrays.copyOfRange with forbidden APIs
> --
>
> Key: LUCENE-8165
> URL: https://issues.apache.org/jira/browse/LUCENE-8165
> Project: Lucene - Core
>  Issue Type: Bug
>    Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-8165_copy_of_range.patch, 
> LUCENE-8165_start.patch, LUCENE-8165_start.patch
>
>
> This method is no good, because instead of throwing AIOOBE for bad bounds, it 
> will silently fill with zeros (essentially silent corruption). Unfortunately 
> it is used in quite a few places, so replacing it with e.g. System.arraycopy may 
> uncover some interesting surprises.
> See LUCENE-8164 for motivation.






[jira] [Commented] (LUCENE-8165) ban Arrays.copyOfRange with forbidden APIs

2018-05-31 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497498#comment-16497498
 ] 

Robert Muir commented on LUCENE-8165:
-

This looks good, thanks! I had forgotten about this issue, great to have more 
progress.

> ban Arrays.copyOfRange with forbidden APIs
> --
>
> Key: LUCENE-8165
> URL: https://issues.apache.org/jira/browse/LUCENE-8165
> Project: Lucene - Core
>  Issue Type: Bug
>    Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-8165_copy_of_range.patch, 
> LUCENE-8165_start.patch, LUCENE-8165_start.patch
>
>
> This method is no good, because instead of throwing AIOOBE for bad bounds, it 
> will silently fill with zeros (essentially silent corruption). Unfortunately 
> it is used in quite a few places, so replacing it with e.g. System.arraycopy may 
> uncover some interesting surprises.
> See LUCENE-8164 for motivation.






[jira] [Commented] (LUCENE-8342) Can we enforce more properties on fields like uniqueness

2018-05-31 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497263#comment-16497263
 ] 

Robert Muir commented on LUCENE-8342:
-

+1: it would be a big improvement to validate this in IndexWriter, merging, 
CheckIndex, etc.

> Can we enforce more properties on fields like uniqueness 
> 
>
> Key: LUCENE-8342
> URL: https://issues.apache.org/jira/browse/LUCENE-8342
> Project: Lucene - Core
>  Issue Type: New Feature
>Affects Versions: master (8.0)
>Reporter: Simon Willnauer
>Priority: Major
>
> This is a spin-off from LUCENE-8335 where we discuss adding a boolean to 
> FieldInfo to check if the field is used for soft deletes. This has been a 
> very delicate line we drew in the past, but if we take a step back and think 
> about how we'd design a feature like IW#updateDocument(Term, Document) today, 
> would we allow passing different fields to this API? It's just one example; 
> storing floats and ints in the same DV field is a different one. I 
> personally think it would be a good idea to be more strict on that end, and I 
> wonder what others think.






[jira] [Commented] (LUCENE-8335) Do not allow changing soft-deletes field

2018-05-31 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497261#comment-16497261
 ] 

Robert Muir commented on LUCENE-8335:
-

As a followup, I think let's support CheckIndex validation (e.g. LUCENE-8341). 
I am happy to see LUCENE-8342 opened up; that's probably an easier win than int 
vs. float anyway, but it addresses the kind of concerns I had here. I think 
it's important to also enforce stuff for typical use cases as well.

> Do not allow changing soft-deletes field
> 
>
> Key: LUCENE-8335
> URL: https://issues.apache.org/jira/browse/LUCENE-8335
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.4, master (8.0)
>Reporter: Nhat Nguyen
>Assignee: Simon Willnauer
>Priority: Minor
> Attachments: LUCENE-8335.patch
>
>
> Today we do not enforce an index to use a single soft-deletes field. A user 
> can create an index with one soft-deletes field, then open an IW with another 
> field or add an index with a different soft-deletes field. This should not be 
> allowed, and the error should be reported to users as soon as possible.






[jira] [Commented] (LUCENE-8341) Record soft deletes in SegmentCommitInfo

2018-05-31 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497237#comment-16497237
 ] 

Robert Muir commented on LUCENE-8341:
-

Yes, I agree it is a shame CheckIndex cannot enforce the check yet. But there 
are many other such possible checks (ones I would argue are more 
mainstream/bigger wins) that it also can't do, for similar reasons. That is why 
I made the point on LUCENE-8335; let's just figure it out there.

>  Record soft deletes in SegmentCommitInfo
> -
>
> Key: LUCENE-8341
> URL: https://issues.apache.org/jira/browse/LUCENE-8341
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.4, master (8.0)
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-8341.patch, LUCENE-8341.patch, LUCENE-8341.patch
>
>
>  This change adds the number of documents that are soft-deleted but
> not hard-deleted to the segment commit info. This is the last step
> towards making soft deletes as powerful as hard deletes, since now the
> number of documents can be read from commit points without opening a
> full-blown reader. This also allows merge policies to make decisions
> without requiring an NRT reader to get the relevant statistics. This
> change doesn't enforce any field to be used as soft deletes, and the
> statistic is maintained per segment.






[jira] [Commented] (LUCENE-8335) Do not allow changing soft-deletes field

2018-05-30 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495109#comment-16495109
 ] 

Robert Muir commented on LUCENE-8335:
-

Mike, I think nearly all of your arguments could equally be made for preventing 
the change of an IntPoint field to a FloatPoint. But as it is now, any "schema" 
stuff in Lucene is so minimal that it doesn't know the difference. In the past, 
the reasoning has been to keep it minimal and leave that to the consuming app. 
I don't think it makes sense for it to track only expert use cases.

> Do not allow changing soft-deletes field
> 
>
> Key: LUCENE-8335
> URL: https://issues.apache.org/jira/browse/LUCENE-8335
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.4, master (8.0)
>Reporter: Nhat Nguyen
>Assignee: Simon Willnauer
>Priority: Minor
> Attachments: LUCENE-8335.patch
>
>
> Today we do not enforce an index to use a single soft-deletes field. A user 
> can create an index with one soft-deletes field, then open an IW with another 
> field or add an index with a different soft-deletes field. This should not be 
> allowed, and the error should be reported to users as soon as possible.






[jira] [Commented] (LUCENE-8335) Do not allow changing soft-deletes field

2018-05-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491970#comment-16491970
 ] 

Robert Muir commented on LUCENE-8335:
-

I don't think Lucene needs to enforce this. From my perspective it's just a 
docvalues field. Given that Lucene doesn't even know the difference between an 
integer and a float field, I don't think it should be tracking expert-only 
features for Elasticsearch.

> Do not allow changing soft-deletes field
> 
>
> Key: LUCENE-8335
> URL: https://issues.apache.org/jira/browse/LUCENE-8335
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.4, master (8.0)
>Reporter: Nhat Nguyen
>Assignee: Simon Willnauer
>Priority: Minor
> Attachments: LUCENE-8335.patch
>
>
> Today we do not enforce an index to use a single soft-deletes field. A user 
> can create an index with one soft-deletes field, then open an IW with another 
> field or add an index with a different soft-deletes field. This should not be 
> allowed, and the error should be reported to users as soon as possible.






[jira] [Commented] (LUCENE-8330) Detach IndexWriter from MergePolicy

2018-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16489155#comment-16489155
 ] 

Robert Muir commented on LUCENE-8330:
-

+1

> Detach IndexWriter from MergePolicy
> ---
>
> Key: LUCENE-8330
> URL: https://issues.apache.org/jira/browse/LUCENE-8330
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.4, master (8.0)
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-8330.patch, LUCENE-8330.patch, LUCENE-8330.patch
>
>
>  This change introduces a new MergePolicy.MergeContext interface
> that is easy to mock and cuts over all instances of IW to MergeContext.
> Since IW now implements MergeContext the cut over is straight forward.
> This reduces the exposed API available in MP dramatically and allows
> efficient testing without relying on IW to improve the coverage and
> testability of our MP implementations.






[jira] [Commented] (LUCENE-8330) Detach IndexWriter from MergePolicy

2018-05-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488912#comment-16488912
 ] 

Robert Muir commented on LUCENE-8330:
-

Can we remove hasDeletions from the interface? It makes the interface 
harder to implement and seems redundant, since it's just numDeletesToMerge > 0. I never 
understood the crazy info.info.dir == writer.getDirectory() check from before, but I 
didn't have the time to change it to an assert and see why it's really needed by tests.  
At least if it's really needed, the javadocs of the interface should explain 
enough so we understand why it has this crazy check. 

In general though, the cleanup will be great for tests.
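The redundancy pointed out above can be sketched with a toy interface (this is not the real MergePolicy.MergeContext): hasDeletions needs no separate abstract method, since it follows from numDeletesToMerge.

```java
// Toy sketch of the suggestion, not the actual MergePolicy.MergeContext:
// deriving hasDeletions shrinks the abstract surface of the interface.
interface MergeContextSketch {
    int numDeletesToMerge();

    // derived rather than abstract: true exactly when there is something to reclaim
    default boolean hasDeletions() {
        return numDeletesToMerge() > 0;
    }
}
```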

> Detach IndexWriter from MergePolicy
> ---
>
> Key: LUCENE-8330
> URL: https://issues.apache.org/jira/browse/LUCENE-8330
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 7.4, master (8.0)
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-8330.patch
>
>
>  This change introduces a new MergePolicy.MergeContext interface
> that is easy to mock and cuts over all instances of IW to MergeContext.
> Since IW now implements MergeContext the cut over is straight forward.
> This reduces the exposed API available in MP dramatically and allows
> efficient testing without relying on IW to improve the coverage and
> testability of our MP implementations.






[jira] [Commented] (LUCENE-8326) More Like This Params Refactor

2018-05-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488345#comment-16488345
 ] 

Robert Muir commented on LUCENE-8326:
-

First looking at the API change, it would be good to understand the goals. 

This change wraps 8 or 9 existing setters such as {{setMinTermLen}} with a 
"configuration class". There is also another class related to boosts. But 
everything is still just as mutable as before, so from my perspective it only 
adds indirection/abstraction, which is undesirable.

If we want to make MLT immutable or something like that, we should first figure 
out if that's worth it. From my perspective, I'm not sold on this for 
MoreLikeThis itself, since it's lightweight and stateless, and since I can't see 
a way for MoreLikeThisQuery to cache efficiently.

On the other hand MoreLikeThisQuery is kind of a mess, but that isn't addressed 
by the refactoring. Really all queries should be immutable for caching 
purposes, and should all have correct equals/hashCode: but it seems like a 
lost cause with MoreLikeThisQuery since it does strange stuff in rewrite: it's 
not really a per-segment thing. Because of how the query works, it's not obvious 
to me if/how we can improve it with immutability...

Also, currently MoreLikeThisQuery doesn't accept a MoreLikeThis as a parameter or 
anything; it only uses one internally. So as it stands (also with this patch) 
it still has a "duplicate" API for all the parameters, which isn't great.

So if we want to change the API for this stuff, we should figure out 
what the goals are. If it's just to, say, consolidate the API between MoreLikeThis 
and MoreLikeThisQuery, I can buy into that (although I have never used the 
latter myself, only the former). However, the other queries use builders for 
such purposes, so that's probably something to consider.

For the Solr changes, my only comment is that instead of running actual 
queries, isn't it good enough to just test that a given configuration produces a 
correct MLT object? Otherwise the test seems fragile from my perspective.

> More Like This Params Refactor
> --
>
> Key: LUCENE-8326
> URL: https://issues.apache.org/jira/browse/LUCENE-8326
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: LUCENE-8326.patch, LUCENE-8326.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> More Like This can be refactored to improve code readability, test 
> coverage and maintenance.
> The scope of this Jira issue is to start the More Like This refactor with the 
> More Like This params.
> This Jira will not improve the current More Like This but just keep the same 
> functionality with refactored code.
> Other Jira issues will follow improving the overall code readability, test 
> coverage and maintenance.






[jira] [Commented] (LUCENE-8326) More Like This Params Refactor

2018-05-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488317#comment-16488317
 ] 

Robert Muir commented on LUCENE-8326:
-

Sorry, it may have been my fault. I checked "patch attached" when moving the 
issue, to let the automated checks run.

> More Like This Params Refactor
> --
>
> Key: LUCENE-8326
> URL: https://issues.apache.org/jira/browse/LUCENE-8326
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: LUCENE-8326.patch, LUCENE-8326.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> More Like This can be refactored to improve code readability, test 
> coverage and maintenance.
> The scope of this Jira issue is to start the More Like This refactor with the 
> More Like This params.
> This Jira will not improve the current More Like This but just keep the same 
> functionality with refactored code.
> Other Jira issues will follow improving the overall code readability, test 
> coverage and maintenance.






[jira] [Commented] (LUCENE-8325) smartcn analyzer can't handle SURROGATE char

2018-05-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487158#comment-16487158
 ] 

Robert Muir commented on LUCENE-8325:
-

+1, thank you for fixing this.

> smartcn analyzer can't handle SURROGATE char
> 
>
> Key: LUCENE-8325
> URL: https://issues.apache.org/jira/browse/LUCENE-8325
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: chengpohi
>Priority: Minor
>  Labels: newbie, patch
> Attachments: handle_surrogate_char_for_smartcn_2018-05-23.patch
>
>
> This issue is from [https://github.com/elastic/elasticsearch/issues/30739]
> smartcn analyzer can't handle SURROGATE char, Example:
>  
>  
> {code:java}
> Analyzer ca = new SmartChineseAnalyzer(); 
> String sentence = "\uD862\uDE0F"; // 訏 a surrogate char 
> TokenStream tokenStream = ca.tokenStream("", sentence); 
> CharTermAttribute charTermAttribute = 
> tokenStream.addAttribute(CharTermAttribute.class); 
> tokenStream.reset(); 
> while (tokenStream.incrementToken()) { 
> String term = charTermAttribute.toString(); 
> System.out.println(term); 
> } 
> {code}
>  
> The above code snippet will output: 
>  
> {code:java}
> ? 
> ? 
> {code}
>  
>  and I have created a *PATCH* to try to fix this; please help review (since 
> *smartcn* only supports *GBK* chars, it just handles this as a *single 
> char*).
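For context on why this input is tricky, a plain JDK demonstration (nothing smartcn-specific): the two chars in the snippet above form a single supplementary code point, so any per-char tokenizer sees two surrogate halves rather than one character.

```java
// Plain JDK behavior: "\uD862\uDE0F" is one supplementary code point
// (U+28A0F) encoded as a surrogate pair, i.e. two UTF-16 chars.
class SurrogateDemo {
    static int charCount(String s) {
        return s.length(); // counts UTF-16 code units
    }
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length()); // counts Unicode code points
    }
}
```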






[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries

2018-05-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483907#comment-16483907
 ] 

Robert Muir commented on LUCENE-8311:
-

Yeah, I was thinking more along the lines of LowPhrase (still exact scoring). 
Sloppy is a whole nother beast :)

> Leverage impacts for phrase queries
> ---
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8311.patch
>
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.
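The idea above can be sketched with plain collections (assumed shapes, mapping norm value to a tf upper bound; the real impacts API differs): in any document, an exact phrase's frequency cannot exceed the smallest term frequency among its terms, so per unique norm value we keep the minimum.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-maps sketch (norm -> max term frequency per term), not the Lucene
// impacts API. Assumes every term carries the same norm keys (same field).
// The phrase frequency is at most the minimum tf of its terms per norm.
class PhraseImpactsSketch {
    static Map<Integer, Integer> upperBounds(List<Map<Integer, Integer>> perTerm) {
        Map<Integer, Integer> bounds = new HashMap<>(perTerm.get(0));
        for (Map<Integer, Integer> impacts : perTerm.subList(1, perTerm.size())) {
            for (Map.Entry<Integer, Integer> e : impacts.entrySet()) {
                bounds.merge(e.getKey(), e.getValue(), Math::min);
            }
        }
        return bounds;
    }
}
```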






Re: [More Like This] I would like to contribute

2018-05-22 Thread Robert Muir
On Tue, May 22, 2018 at 7:03 AM, Alessandro Benedetti 
wrote:

> Hi Robert,
> thanks for the feedback.
> I read your comment last year and I agreed completely.
> So I started step by step, with the refactor first.
>
> The first contribution is isolating a part of the refactor, so no
> functional change in the algorithms nor a complete refactor in place.
> I basically tried to decompose the refactor into unitary pull requests as
> small as possible.
> It just focused on the MLT parameters first, to reduce the size of the
> original MoreLikeThis class (relegating the parameter-modelling
> responsibility to a separate class).
>
> https://issues.apache.org/jira/browse/SOLR-12299
>
> The reason I used SOLR is because the refactor affects some Solr
> components using the MLT.
> But I agree with you, it can (should) be moved to LUCENE ( I tried via
> JIRA but I don't think I have the right permissions).
>
> Should I just create a new JIRA issue completely ( closing the SOLR one)
> or some JIRA admin can directly move the Jira to the LUCENE project ?
>
>
I moved it: https://issues.apache.org/jira/browse/LUCENE-8326


[jira] [Moved] (LUCENE-8326) More Like This Params Refactor

2018-05-22 Thread Robert Muir (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir moved SOLR-12299 to LUCENE-8326:


 Security: (was: Public)
  Component/s: (was: MoreLikeThis)
Lucene Fields: New,Patch Available
  Key: LUCENE-8326  (was: SOLR-12299)
  Project: Lucene - Core  (was: Solr)

> More Like This Params Refactor
> --
>
> Key: LUCENE-8326
> URL: https://issues.apache.org/jira/browse/LUCENE-8326
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Alessandro Benedetti
>Priority: Major
> Attachments: SOLR-12299.patch, SOLR-12299.patch, SOLR-12299.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> More Like This can be refactored to improve code readability, test 
> coverage and maintenance.
> The scope of this Jira issue is to start the More Like This refactor with the 
> More Like This params.
> This Jira will not improve the current More Like This but just keep the same 
> functionality with refactored code.
> Other Jira issues will follow improving the overall code readability, test 
> coverage and maintenance.






[jira] [Commented] (LUCENE-8311) Leverage impacts for phrase queries

2018-05-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483764#comment-16483764
 ] 

Robert Muir commented on LUCENE-8311:
-

I wonder if it's difficult to test with another similarity such as a DFR model? 
I'm only asking because I'm a little concerned that the bogus way we compute 
"phrase IDF" for BM25Similarity & ClassicSimilarity is getting in your way. 

All the other models use a more sane approach (scores like a disjunction 
internally). BM25 carried along the brain damage of ClassicSimilarity just 
because it was trying to minimize differences, but not for any particular good 
reason.

> Leverage impacts for phrase queries
> ---
>
> Key: LUCENE-8311
> URL: https://issues.apache.org/jira/browse/LUCENE-8311
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8311.patch
>
>
> Now that we expose raw impacts, we could leverage them for phrase queries.
> For instance for exact phrases, we could take the minimum term frequency for 
> each unique norm value in order to get upper bounds of the score for the 
> phrase.






[jira] [Commented] (LUCENE-8325) smartcn analyzer can't handle SURROGATE char

2018-05-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483731#comment-16483731
 ] 

Robert Muir commented on LUCENE-8325:
-

Patch looks good, but I would change two things:

The constant SURROGATE_PAIR should just be renamed SURROGATE (since it's a char 
and not an int).
The change to HHMMSegmenter.getCharTypes() to walk code points seems confusing, 
because it means the returned array would have some slots uninitialized (and 
CharType 0 = DELIMITER). I don't think this method needs to walk code points; it 
can just call getCharType() on every char like before.
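The suggested shape can be sketched like this (signatures are illustrative, not the actual HHMMSegmenter code): walking chars rather than code points guarantees every slot of the returned array is assigned.

```java
import java.util.function.IntUnaryOperator;

// Illustrative sketch, not the real HHMMSegmenter: calling the classifier on
// every char leaves no slot of the result uninitialized, so no slot silently
// defaults to 0 (which would read as CharType.DELIMITER).
class CharTypesSketch {
    static int[] charTypes(String sentence, IntUnaryOperator charType) {
        int[] types = new int[sentence.length()];
        for (int i = 0; i < sentence.length(); i++) {
            types[i] = charType.applyAsInt(sentence.charAt(i));
        }
        return types;
    }
}
```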


> smartcn analyzer can't handle SURROGATE char
> 
>
> Key: LUCENE-8325
> URL: https://issues.apache.org/jira/browse/LUCENE-8325
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: chengpohi
>Priority: Minor
>  Labels: newbie, patch
> Attachments: handle-surrogate-char-for-smartcn.patch
>
>
> This issue is from [https://github.com/elastic/elasticsearch/issues/30739]
> smartcn analyzer can't handle SURROGATE char, Example:
>  
>  
> {code:java}
> Analyzer ca = new SmartChineseAnalyzer(); 
> String sentence = "\uD862\uDE0F"; // 訏 a surrogate char 
> TokenStream tokenStream = ca.tokenStream("", sentence); 
> CharTermAttribute charTermAttribute = 
> tokenStream.addAttribute(CharTermAttribute.class); 
> tokenStream.reset(); 
> while (tokenStream.incrementToken()) { 
> String term = charTermAttribute.toString(); 
> System.out.println(term); 
> } 
> {code}
>  
> The above code snippet will output: 
>  
> {code:java}
> ? 
> ? 
> {code}
>  
>  and I have created a *PATCH* to try to fix this; please help review (since 
> *smartcn* only supports *GBK* chars, it just handles this as a *single 
> char*).






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483721#comment-16483721
 ] 

Robert Muir commented on LUCENE-7960:
-

Looks good. Thank you for making the updates.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch, 
> LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.
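The proposed option can be illustrated with a standalone sketch (keepShortTerms is the proposed name; this is not the filter's actual code): terms shorter than minGramSize normally vanish, and with the option enabled the original term passes through unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of the proposed behavior, not the real filter.
class NGramSketch {
    static List<String> ngrams(String term, int minGramSize, int maxGramSize,
                               boolean keepShortTerms) {
        List<String> out = new ArrayList<>();
        if (term.length() < minGramSize) {
            if (keepShortTerms) {
                out.add(term); // preserve the too-short original term
            }
            return out; // default behavior: the term is dropped entirely
        }
        for (int n = minGramSize; n <= Math.min(maxGramSize, term.length()); n++) {
            for (int i = 0; i + n <= term.length(); i++) {
                out.add(term.substring(i, i + n));
            }
        }
        return out;
    }
}
```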






Re: [More Like This] I would like to contribute

2018-05-22 Thread Robert Muir
For proposed api, behavior changes or refactoring to these classes, I
really recommend using LUCENE issues for those instead of SOLR ones.
Otherwise they can get missed.

As far as feedback goes, personally I tried to give it on LUCENE-7498 a year ago
but wasn't sure what happened as further comments dropped off. As I
mentioned there, I definitely think changing the algorithm to MoreLikeThis
is a big deal and really shouldn't be mixed in with refactorings or api
changes: it makes for too much to worry about at once. Just changing the
algorithm is a big deal: since this class supports blind relevance feedback
it means we can do some rough measurements with relevance tests before
doing that. As I have personally not seen the BM25 algorithm used for these
purposes anywhere, that's why I was concerned/curious about performance.

On Mon, May 21, 2018 at 7:23 AM, Alessandro Benedetti 
wrote:

> Hi gents,
> I have spent some time in the last year or so working on the Lucene More
> Like This ( and related Solr components ) .
>
> Initially I just wanted to improve it, adding BM25[1] but then I noted a
> lot of areas of possible improvements.
>
> I started then with a refactor of the functionality with these objectives
> in mind :
>
> 1) make the MLT more readable
> 2) make the MLT more modular and easy to extend
> 3) make the MLT more tested
>
> *This is just a start, I want to invest significant time with my company
> to work on the functionality and contribute it back.*
>
> I split my effort into small pull requests to make review and possible
> contribution easy.
>
> Unfortunately I didn't get much feedback so far.
> The More Like This functionality seems mostly abandoned.
> I tried also to contact one of the last committers that apparently got
> involved in the developments ( Mark Harwood mharw...@apache.org ), but I
> had no luck.
>
> This is the current Jira issue, which starts with a first small refactor +
> tests :
>
> https://issues.apache.org/jira/browse/SOLR-12299
>
> I would love to contribute it and much more, but I need some feedback and
> review ( unfortunately I am not a committer yet).
>
> Let me know what I can do to speed up the process from my side.
>
> Regards
>
> [1] https://issues.apache.org/jira/browse/LUCENE-7498
>
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
>


[jira] [Commented] (LUCENE-8312) Leverage impacts for SynonymQuery

2018-05-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480733#comment-16480733
 ] 

Robert Muir commented on LUCENE-8312:
-

I didn't mean to imply it had to be solved on this issue, just revisited in the 
future (especially if we want to use this approach for e.g. PhraseQuery). But 
the factored-out interface looks good!

> Leverage impacts for SynonymQuery
> -
>
> Key: LUCENE-8312
> URL: https://issues.apache.org/jira/browse/LUCENE-8312
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8312.patch, LUCENE-8312.patch
>
>
> Now that we expose raw impacts, we could leverage them for synonym queries.
> It would be a matter of summing up term frequencies for each unique norm 
> value.
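The summing idea can be sketched with plain maps (assumed shapes, mapping norm value to a tf upper bound; the real impacts API differs): a SynonymQuery scores its terms as one pseudo-term, so per unique norm value the combined bound is the sum of each synonym's term frequency.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-maps sketch (norm -> max term frequency), not the Lucene impacts
// API: synonyms score as a single term, so tf upper bounds add up per norm.
class SynonymImpactsSketch {
    static Map<Integer, Integer> summed(List<Map<Integer, Integer>> perTerm) {
        Map<Integer, Integer> sums = new HashMap<>();
        for (Map<Integer, Integer> impacts : perTerm) {
            for (Map.Entry<Integer, Integer> e : impacts.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return sums;
    }
}
```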






Re: [JENKINS-EA] Lucene-Solr-7.x-Windows (64bit/jdk-11-ea+5) - Build # 597 - Still Unstable!

2018-05-18 Thread Robert Muir
I am also +1 for throwing IllegalStateException in the WindowsFS
constructor if Constants.WINDOWS == true. This way the failure will be
100% clear next time.
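The fail-fast idea can be sketched as a standalone guard (this is not the actual WindowsFS code; Constants.WINDOWS is the flag Lucene's test framework uses):

```java
// Standalone sketch of the proposal, not the actual WindowsFS constructor:
// throw immediately when instantiated on a real Windows machine, where the
// mock filesystem is known not to work, so the failure mode is obvious.
class WindowsFSGuard {
    static void checkNotRealWindows(boolean runningOnWindows) {
        if (runningOnWindows) { // e.g. Constants.WINDOWS in the test framework
            throw new IllegalStateException(
                "WindowsFS mock filesystem does not work on Windows itself");
        }
    }
}
```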

On Fri, May 18, 2018 at 9:42 AM, Robert Muir <rcm...@gmail.com> wrote:
WindowsFS does not work on Windows:
> https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L168-L169
>
> But these new tests instantiate WindowsFS explicitly without an assume
> for Constants.WINDOWS, creating the failures...
>
> On Fri, May 18, 2018 at 9:33 AM, Policeman Jenkins Server
> <jenk...@thetaphi.de> wrote:
>> Build: https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Windows/597/
>> Java: 64bit/jdk-11-ea+5 -XX:-UseCompressedOops -XX:+UseG1GC
>>
>> 16 tests failed.
>> FAILED:  
>> org.apache.lucene.store.TestHardLinkCopyDirectoryWrapper.testRenameWithHardLink
>>
>> Error Message:
>> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\source.txt
>>  -> 
>> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\target.txt
>>
>> Stack Trace:
>> java.nio.file.AccessDeniedException: 
>> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\source.txt
>>  -> 
>> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\target.txt
>> at 
>> __randomizedtesting.SeedInfo.seed([C04EE088CC1F0598:3E534FF9C8898063]:0)
>> at 
>> java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89)
>> at 
>> java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
>> at 
>> java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:298)
>> at 
>> java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:288)
>> at 
>> org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147)
>> at 
>> org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147)
>> at 
>> org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147)
>> at org.apache.lucene.mockfile.WindowsFS.move(WindowsFS.java:133)
>> at java.base/java.nio.file.Files.move(Files.java:1413)
>> at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:303)
>> at 
>> org.apache.lucene.store.TestHardLinkCopyDirectoryWrapper.testRenameWithHardLink(TestHardLinkCopyDirectoryWrapper.java:114)
>> at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
>> Method)
>> at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at 
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>> at 
>> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1737)
>> at 
>> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:934)
>> at 
>> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:970)
>> at 
>> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:984)
>> at 
>> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
>> at 
>> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
>> at 
>> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
>> at 
>> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
>> at 
>> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
>> at 
>> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>> at 
>> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:

Re: [JENKINS-EA] Lucene-Solr-7.x-Windows (64bit/jdk-11-ea+5) - Build # 597 - Still Unstable!

2018-05-18 Thread Robert Muir
WindowsFS does not work on Windows:
https://github.com/apache/lucene-solr/blob/e2521b2a8baabdaf43b92192588f51e042d21e97/lucene/test-framework/src/java/org/apache/lucene/util/TestRuleTemporaryFilesCleanup.java#L168-L169

But these new tests instantiate WindowsFS explicitly without an assume
for Constants.WINDOWS, creating the failures...

On Fri, May 18, 2018 at 9:33 AM, Policeman Jenkins Server
 wrote:
> Build: https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Windows/597/
> Java: 64bit/jdk-11-ea+5 -XX:-UseCompressedOops -XX:+UseG1GC
>
> 16 tests failed.
> FAILED:  
> org.apache.lucene.store.TestHardLinkCopyDirectoryWrapper.testRenameWithHardLink
>
> Error Message:
> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\source.txt
>  -> 
> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\target.txt
>
> Stack Trace:
> java.nio.file.AccessDeniedException: 
> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\source.txt
>  -> 
> C:\Users\jenkins\workspace\Lucene-Solr-7.x-Windows\lucene\build\misc\test\J1\temp\lucene.store.TestHardLinkCopyDirectoryWrapper_C04EE088CC1F0598-001\tempDir-009\target.txt
> at 
> __randomizedtesting.SeedInfo.seed([C04EE088CC1F0598:3E534FF9C8898063]:0)
> at 
> java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:89)
> at 
> java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
> at java.base/sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:298)
> at 
> java.base/sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:288)
> at 
> org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147)
> at 
> org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147)
> at 
> org.apache.lucene.mockfile.FilterFileSystemProvider.move(FilterFileSystemProvider.java:147)
> at org.apache.lucene.mockfile.WindowsFS.move(WindowsFS.java:133)
> at java.base/java.nio.file.Files.move(Files.java:1413)
> at org.apache.lucene.store.FSDirectory.rename(FSDirectory.java:303)
> at 
> org.apache.lucene.store.TestHardLinkCopyDirectoryWrapper.testRenameWithHardLink(TestHardLinkCopyDirectoryWrapper.java:114)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.base/java.lang.reflect.Method.invoke(Method.java:564)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1737)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:934)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:970)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:984)
> at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
> at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
> at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
> at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
> at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:943)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:829)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:879)
> at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:890)
> at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
> at 
> 

[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480559#comment-16480559
 ] 

Robert Muir commented on LUCENE-8273:
-

is the TokenBuffer class in the patch actually used?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, 
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.
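The intent can be sketched with a token-level predicate (illustrative shapes; the real ConditionalTokenFilter operates on TokenStream state, not plain strings): the wrapped filter runs only when the condition holds, e.g. only for hyphenated terms.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Illustrative sketch only; the actual API wraps TokenStreams. The wrapped
// "filter" is applied only to tokens matching the condition, all other
// tokens are passed through untouched.
class ConditionalFilterSketch {
    static String apply(String token, Predicate<String> condition,
                        UnaryOperator<String> filter) {
        return condition.test(token) ? filter.apply(token) : token;
    }
}
```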






[jira] [Commented] (LUCENE-8321) Allow composite readers to have more than 2B documents

2018-05-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480519#comment-16480519
 ] 

Robert Muir commented on LUCENE-8321:
-

Also I think the IW accounting needs to stay. Considering we can reasonably 
merge segments of ~1B docs, I think it makes sense to up the limit to 16B 
or so, but anything higher gets into trappy territory. I strongly feel it can't be 
"unlimited" as long as a single segment is limited.

But I'm concerned this small increase is worth the complexity cost: both on 
users and on the code: it certainly won't make things any simpler. Also I can 
see people complaining about what seems like an "arbitrary" limit in the code, 
even though its no more arbitrary than 2B. But we could try it out and see what 
it looks like?

> Allow composite readers to have more than 2B documents
> --
>
> Key: LUCENE-8321
> URL: https://issues.apache.org/jira/browse/LUCENE-8321
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> I would like to start discussing removing the limit of ~2B documents that we 
> have for indices, while still enforcing it at the segment level for practical 
> reasons.
> Postings, stored fields, and all other codec APIs would keep working on 
> integers to represent doc ids. Only top-level doc ids and numbers of 
> documents would need to move to a long. I say "only" because we now mostly 
> consume indices per-segment, but there is still a number of places where we 
> identify documents by their top-level doc ID like {{IndexReader#document}}, 
> top-docs collectors, etc.
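The split between per-segment int doc IDs and a top-level long ID described above can be sketched as follows. This is a hypothetical model, not Lucene's actual API: each segment keeps an int doc count, and only the cumulative doc bases become longs.

```java
import java.util.Arrays;

// Toy sketch: map a top-level long doc ID back to a (segment ordinal,
// per-segment int doc ID) pair via binary search over cumulative bases.
public class CompositeDocIdSketch {
  final long[] docBases; // docBases[i] = sum of maxDoc over segments < i

  CompositeDocIdSketch(int[] segmentMaxDocs) {
    docBases = new long[segmentMaxDocs.length];
    long base = 0;
    for (int i = 0; i < segmentMaxDocs.length; i++) {
      docBases[i] = base;
      base += segmentMaxDocs[i];
    }
  }

  int segmentFor(long globalDocId) {
    int idx = Arrays.binarySearch(docBases, globalDocId);
    return idx >= 0 ? idx : -idx - 2; // insertion point minus one
  }

  int localDocId(long globalDocId) {
    return (int) (globalDocId - docBases[segmentFor(globalDocId)]);
  }
}
```

With two segments of 3 and 5 docs, global doc 4 resolves to segment 1, local doc 1.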






[jira] [Commented] (LUCENE-8312) Leverage impacts for SynonymQuery

2018-05-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480498#comment-16480498
 ] 

Robert Muir commented on LUCENE-8312:
-

I like the idea of factoring out the DISI, so that the particular search just 
wraps ImpactsEnum. But it's more than a little awkward that ImpactsEnum extends 
PostingsEnum for such wrapping, because none of the PostingsEnum methods are 
actually needed: maybe this can be revisited?

> Leverage impacts for SynonymQuery
> -
>
> Key: LUCENE-8312
> URL: https://issues.apache.org/jira/browse/LUCENE-8312
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8312.patch
>
>
> Now that we expose raw impacts, we could leverage them for synonym queries.
> It would be a matter of summing up term frequencies for each unique norm 
> value.
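The merging step described above can be sketched as follows. This is a toy model, not Lucene's actual impacts API: synonym terms share the field's norms, so a frequency upper bound per norm can be obtained by summing each term's maximum frequency for that norm value.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy sketch: merge per-term impact maps (norm -> max term frequency)
// by summing frequencies for each unique norm value.
public class SynonymImpactsSketch {
  static Map<Long, Long> mergeImpacts(List<Map<Long, Long>> perTermImpacts) {
    Map<Long, Long> merged = new TreeMap<>();
    for (Map<Long, Long> term : perTermImpacts) {
      term.forEach((norm, freq) -> merged.merge(norm, freq, Long::sum));
    }
    return merged;
  }
}
```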






[jira] [Commented] (LUCENE-8321) Allow composite readers to have more than 2B documents

2018-05-18 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480479#comment-16480479
 ] 

Robert Muir commented on LUCENE-8321:
-

I have thought about this; I am personally against the idea because we won't be 
able to merge segments that large, hence creating a really big trap.

> Allow composite readers to have more than 2B documents
> --
>
> Key: LUCENE-8321
> URL: https://issues.apache.org/jira/browse/LUCENE-8321
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> I would like to start discussing removing the limit of ~2B documents that we 
> have for indices, while still enforcing it at the segment level for practical 
> reasons.
> Postings, stored fields, and all other codec APIs would keep working on 
> integers to represent doc ids. Only top-level doc ids and numbers of 
> documents would need to move to a long. I say "only" because we now mostly 
> consume indices per-segment, but there is still a number of places where we 
> identify documents by their top-level doc ID like {{IndexReader#document}}, 
> top-docs collectors, etc.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478208#comment-16478208
 ] 

Robert Muir commented on LUCENE-8273:
-

{quote}
In {{TestConditionalFilter}}, converted most {{CannedTokenStream}}'s to 
{{MockTokenizer}}'s, which causes error {{IllegalStateException: end() called 
in wrong state=END!}} - I'm guessing you already know about this and are 
working on it
{quote}

Good approach. TestRandomChains is rather inefficient (basically an integration 
test), and it's best to always make failures reproduce with simpler unit tests.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch, 
> LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8313) SimScorer simplifications

2018-05-16 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477484#comment-16477484
 ] 

Robert Muir commented on LUCENE-8313:
-

+1

> SimScorer simplifications
> -
>
> Key: LUCENE-8313
> URL: https://issues.apache.org/jira/browse/LUCENE-8313
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8313.patch
>
>
> This is a follow-up to recent changes that already started simplifying 
> SimScorer, we can now:
>  - remove the notion of field from SimScorer (might help for LUCENE-8216)
>  - remove logic from LeafSimScorer to compute an upper bound of scores for 
> entires segments, this should be done on top of impacts now.






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-15 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476817#comment-16476817
 ] 

Robert Muir commented on LUCENE-7960:
-

{quote}
*) Even though I'm watching this issue, I'm not getting mails from Jira. Is 
this intentional for non-commiters?
{quote}

As far as I know, JIRA doesn't consider any roles. This is what the 
configuration says:

|Issue Commented| * All Watchers
 * Current Assignee
 * Reporter
 * Single Email Address (dev at lucene.apache.org)|

I added you to the Contributors group so you can assign issues: maybe that 
helps. But it could be something SMTP-related or some other problem. Did you 
get any notifications when Shawn mentioned you on this issue?

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472733#comment-16472733
 ] 

Robert Muir commented on LUCENE-8273:
-

Maybe ConditionalTokenFilter's toString() can be improved, which would help in 
debugging chains like that.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8308) migrate KeywordRepeatFilter to conditional tokenstreams

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472261#comment-16472261
 ] 

Robert Muir commented on LUCENE-8308:
-

Also maybe terminally deprecate KeywordRepeatFilter; I doubt we can preserve 
compat here 100%: that is what major versions are for. If there were an "else" 
condition, it would make it very easy for the user to migrate.

> migrate KeywordRepeatFilter to conditional tokenstreams
> ---
>
> Key: LUCENE-8308
> URL: https://issues.apache.org/jira/browse/LUCENE-8308
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
>
> we should deprecate KeywordAttribute in favor of LUCENE-8273 which gives the 
> analysis chain a real "if".
> But this isn't straightforward unless we address the KeywordRepeatFilter 
> which sends the token "both ways" down the branch condition. Maybe it can be 
> handled as a subclass.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472235#comment-16472235
 ] 

Robert Muir commented on LUCENE-8273:
-

I opened LUCENE-8308, I think we have to work our way through that one first. 
Thanks for the work here, great improvement.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8308) migrate KeywordRepeatFilter to conditional tokenstreams

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472229#comment-16472229
 ] 

Robert Muir commented on LUCENE-8308:
-

The other idea I have in mind is to extend LUCENE-8273 to support "else" 
essentially. Then this repeater case would be more transparent/obvious: it 
fires for both the "if" and the "else".

> migrate KeywordRepeatFilter to conditional tokenstreams
> ---
>
> Key: LUCENE-8308
> URL: https://issues.apache.org/jira/browse/LUCENE-8308
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/analysis
>Reporter: Robert Muir
>Priority: Major
>
> we should deprecate KeywordAttribute in favor of LUCENE-8273 which gives the 
> analysis chain a real "if".
> But this isn't straightforward unless we address the KeywordRepeatFilter 
> which sends the token "both ways" down the branch condition. Maybe it can be 
> handled as a subclass.






[jira] [Created] (LUCENE-8308) migrate KeywordRepeatFilter to conditional tokenstreams

2018-05-11 Thread Robert Muir (JIRA)
Robert Muir created LUCENE-8308:
---

 Summary: migrate KeywordRepeatFilter to conditional tokenstreams
 Key: LUCENE-8308
 URL: https://issues.apache.org/jira/browse/LUCENE-8308
 Project: Lucene - Core
  Issue Type: Task
  Components: modules/analysis
Reporter: Robert Muir


we should deprecate KeywordAttribute in favor of LUCENE-8273 which gives the 
analysis chain a real "if".

But this isn't straightforward unless we address the KeywordRepeatFilter which 
sends the token "both ways" down the branch condition. Maybe it can be handled 
as a subclass.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-11 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472184#comment-16472184
 ] 

Robert Muir commented on LUCENE-8273:
-

Thanks for debugging the failure: TestRandomChains is cruel, but it works. 
Should we open a followup issue to clean up the analyzers and stuff? This is 
something I can help with.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8304) Add TermFrequencyQuery

2018-05-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470677#comment-16470677
 ] 

Robert Muir commented on LUCENE-8304:
-

It is inefficient because it takes as input something like "the" AND freq > 50. 
Currently this is executed like a very slow conjunction (the index is not used).

Maybe it would help to explain the use-case for this query better.
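The slow path described above can be sketched as a linear scan over postings. This is a toy model (the Posting type is illustrative); impacts could speed it up by skipping blocks whose maximum frequency is below the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: walk a term's postings and keep only documents whose
// term frequency meets the threshold. Without index support, every
// posting must be visited, like a very slow conjunction.
public class MinFreqSketch {
  record Posting(int docId, int freq) {}

  static List<Integer> docsWithMinFreq(List<Posting> postings, int minFreq) {
    List<Integer> hits = new ArrayList<>();
    for (Posting p : postings) {
      if (p.freq() >= minFreq) {
        hits.add(p.docId());
      }
    }
    return hits;
  }
}
```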

> Add TermFrequencyQuery
> --
>
> Key: LUCENE-8304
> URL: https://issues.apache.org/jira/browse/LUCENE-8304
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8304.patch
>
>
> This has come up a few times when writing query parsers.  It would be useful 
> to have a query that returned documents that match a term with a particular 
> frequency - eg, all docs where "patent" appears at least five times.






[jira] [Commented] (LUCENE-8304) Add TermFrequencyQuery

2018-05-10 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470226#comment-16470226
 ] 

Robert Muir commented on LUCENE-8304:
-

Right, I don't think it makes sense to backport this query to 7.x because it 
has no hope of being efficient there.

> Add TermFrequencyQuery
> --
>
> Key: LUCENE-8304
> URL: https://issues.apache.org/jira/browse/LUCENE-8304
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8304.patch
>
>
> This has come up a few times when writing query parsers.  It would be useful 
> to have a query that returned documents that match a term with a particular 
> frequency - eg, all docs where "patent" appears at least five times.






[jira] [Commented] (LUCENE-8304) Add TermFrequencyQuery

2018-05-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469107#comment-16469107
 ] 

Robert Muir commented on LUCENE-8304:
-

Yeah, I mean I don't think it has to block the issue (we could put the query in 
the sandbox as-is and then figure this out), but it would be pretty cool to try 
to make use of the new ImpactsEnum, at least to speed up advance()?

> Add TermFrequencyQuery
> --
>
> Key: LUCENE-8304
> URL: https://issues.apache.org/jira/browse/LUCENE-8304
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8304.patch
>
>
> This has come up a few times when writing query parsers.  It would be useful 
> to have a query that returned documents that match a term with a particular 
> frequency - eg, all docs where "patent" appears at least five times.






[jira] [Commented] (LUCENE-8304) Add TermFrequencyQuery

2018-05-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468945#comment-16468945
 ] 

Robert Muir commented on LUCENE-8304:
-

Can we implement this so that it works more efficiently, based on term impacts 
instead?

> Add TermFrequencyQuery
> --
>
> Key: LUCENE-8304
> URL: https://issues.apache.org/jira/browse/LUCENE-8304
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8304.patch
>
>
> This has come up a few times when writing query parsers.  It would be useful 
> to have a query that returned documents that match a term with a particular 
> frequency - eg, all docs where "patent" appears at least five times.






[jira] [Commented] (LUCENE-8303) Make LiveDocsFormat only responsible for (de)serialization of live docs

2018-05-09 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468822#comment-16468822
 ] 

Robert Muir commented on LUCENE-8303:
-

Nice cleanup! It's also good that it doesn't cast the incoming bits to a 
FixedBitSet when writing; that wasn't right at all...

> Make LiveDocsFormat only responsible for (de)serialization of live docs
> ---
>
> Key: LUCENE-8303
> URL: https://issues.apache.org/jira/browse/LUCENE-8303
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8303.patch
>
>
> We could simplify live docs by only making the format responsible from 
> reading/writing a Bits instance that represents live docs while today the 
> format is also involved to delete documents since it needs to be able to 
> provide mutable bits.






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468292#comment-16468292
 ] 

Robert Muir commented on LUCENE-7960:
-

{quote}
I made the min/max parameters required on the factory because the constructor 
without any size parameters is deprecated. Is this something you don't like at 
all, or something you would only want to see in master?
{quote}

What does "not making that change in the backport to 7x" mean?
As I suggested above: consider making the patch against master fully backwards 
compatible. We can review it, then it can be committed and merged cleanly and 
safely back to 7.x. After that, remove the deprecations in master in a separate 
dedicated commit.

It seems like more work, but I think it's less work than trying a shortcut, 
because you can have confidence you don't break anything. "Making changes 
during backports" seems like trouble, and a confusing patch makes code review 
hard. The current one is confusing because it isn't really appropriate for 
either master (it has deprecations) or 7.x (it breaks backwards compatibility).


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468161#comment-16468161
 ] 

Robert Muir commented on LUCENE-7960:
-

Also, for the full ctor that allows a range, I really still think it needs some 
wording, a warning of sorts, that a big range is really the same 
(space/time-wise) as indexing the content N different ways. It may also be good 
to mention that passing {{true}} for preserveOriginal is like indexing the 
content yet another time.

The ctor that just takes a fixed "n" for the n-gram doesn't need such warnings; 
it's pretty safe.
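The warning above can be made concrete with a back-of-the-envelope count (the class name here is illustrative): an n-gram range [minGram, maxGram] emits one gram per (size, start offset) pair, so a wide range multiplies the indexed data roughly N-fold.

```java
// Toy calculation: number of grams an n-gram range produces for a
// single token of the given length.
public class NGramCostSketch {
  static int gramCount(int len, int minGram, int maxGram) {
    int count = 0;
    for (int n = minGram; n <= Math.min(maxGram, len); n++) {
      count += len - n + 1; // grams of size n in a token of length len
    }
    return count;
  }
}
```

For a 10-character token, a fixed n=4 emits 7 grams, while the range 1..10 emits 55.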

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468156#comment-16468156
 ] 

Robert Muir commented on LUCENE-7960:
-

The patch is a little confused about back compat (e.g. it breaks back compat 
in the factories by requiring parameters that were optional before, but keeps 
back compat in the token filters), so I'm not sure whether it's geared at the 
master branch or not. Sometimes it's easiest to make the patch with all the 
back compat, commit it to master and merge it back, then make a separate 
commit to just master to remove the cruft; maybe that's the way to go here.

There are some cosmetic style changes, such as moving attribute initialization 
into the ctor instead of inline, which differs from the style of all our other 
token filters. That makes the logic changes hard to review (I have not looked 
at those, just the APIs and docs).

As far as docs, I think there are easy wins. Lets take EdgeNGramTokenFilter 
just as an example.

The ctor with all the parameters doesn't need to document what the other ctors 
do: they can have their own documentation. It should only document its own 
behavior and parameters, as it does, so we can just remove its last line about 
that.

For the other ctors which are shortcuts/sugar, we can add a line such as this:
{code}
   *
   * Behaves the same as {@link #EdgeNGramTokenFilter(TokenStream, int, int, boolean)
   * EdgeNGramTokenFilter(input, minGram, maxGram, false)}
{code}

This makes it clear what the shortcut/sugar is really doing, with a clickable 
link, and it also helps in the deprecated case, if someone has to transition 
their code.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467968#comment-16467968
 ] 

Robert Muir commented on LUCENE-7960:
-

Then it would behave like you expect an n-gram filter to behave: min=max=4 or 
whatever. The two-int form is really crazy/expert and doesn't match anyone's 
expectations about what n-grams are. It's also mega trappy, as I mentioned 
above; it needs javadoc warnings.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467894#comment-16467894
 ] 

Robert Muir commented on LUCENE-7960:
-

Yes, I think we should deprecate. It helps people upgrade and shouldn't be too 
bad in this case.

If we currently have 1-arg (TokenStream) and 3-arg (TokenStream, int, int) 
constructors, and we want to end up at 2-arg (TokenStream, int) and 4-arg 
(TokenStream, int, int, boolean), then 7.x can temporarily have four 
constructors: the existing two are deprecated and forward to the new ones. 
Their javadoc can even explain what the forwarding does. master would just 
have the two new ones with no cruft.
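The forwarding pattern described above can be sketched on a hypothetical filter class (the TokenStream argument is omitted for brevity): the old constructor stays in 7.x, marked deprecated, and forwards to the new one with the old default.

```java
// Toy sketch of constructor forwarding for a deprecation cycle.
public class ForwardingCtorSketch {
  final int minGram, maxGram;
  final boolean preserveOriginal;

  /** New form with the explicit preserveOriginal flag. */
  ForwardingCtorSketch(int minGram, int maxGram, boolean preserveOriginal) {
    this.minGram = minGram;
    this.maxGram = maxGram;
    this.preserveOriginal = preserveOriginal;
  }

  /** @deprecated use {@code ForwardingCtorSketch(minGram, maxGram, false)} */
  @Deprecated
  ForwardingCtorSketch(int minGram, int maxGram) {
    this(minGram, maxGram, false); // old behavior preserved
  }
}
```

The deprecated ctor's javadoc documents exactly what it forwards to, so callers can transition mechanically.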


> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-05-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467407#comment-16467407
 ] 

Robert Muir commented on LUCENE-8273:
-

For the new TermExclusionFilterFactory:

{code}
public static final String EXCLUDED_TOKENS = "excludeFile";
{code}

KeywordMarkerFilterFactory currently uses a different parameter name: 
"protected". So does WordDelimiterFilterFactory (which I think was the one that 
inspired this JIRA issue). Maybe we want it to be consistent, to support 
migrating away from that stuff?

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

2018-05-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465221#comment-16465221
 ] 

Robert Muir commented on LUCENE-7960:
-

There is no need to have only one constructor: too many parameters for the 
simple use case.

I already explained my preference as to what they should be:
* NgramWhateverFilter(TokenStream, int)
* NgramWhateverFilter(TokenStream, int, int, boolean)

So remove the no-arg constructor, which means there is no need for any default 
min/max.
It is also important that the factory match this: whatever parameters are 
mandatory for the token filter must also be mandatory in the factory. I 
will insist on it.

> NGram filters -- preserve the original token when it is outside the min/max 
> size range
> --
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465152#comment-16465152
 ] 

Robert Muir commented on LUCENE-7960:
-

Again I want to re-emphasize that anything more complex than a single boolean 
"preserveOriginal" is too much. If someone wants to remove too-short or 
too-long terms they can use LengthFilter for that. There is no need to have 
such complex stuff in the ngram filters themselves.

Furthermore I still think we need to address the traps I mentioned above about 
these filters emitting too many tokens already before we then go and add an 
option to make them produce even more...

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (SOLR-11490) Add @since javadoc tags to the interesting Solr/Lucene classes

2018-05-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462592#comment-16462592
 ] 

Robert Muir commented on SOLR-11490:


I am against pre-3.1 or any other invalid versions in since tags. I'm gonna 
quote myself just to reiterate what I already said.

{quote}
Just like how you marked HMMChineseTokenizerFactory as 4.8.0, that's fine. But 
lineage-wise (look at svn for that) you'd see its been around since 2.9, it was 
just named something different (SmartChinese).
{quote}

> Add @since javadoc tags to the interesting Solr/Lucene classes
> --
>
> Key: SOLR-11490
> URL: https://issues.apache.org/jira/browse/SOLR-11490
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alexandre Rafalovitch
>Assignee: Alexandre Rafalovitch
>Priority: Minor
>
> As per the discussion on the dev list, it may be useful to add Javadoc since 
> tags to significant (or even all) Java files.
> For user-facing files (such as analyzers, URPs, stream evaluators, etc) it 
> would be useful when trying to identify whether a particular class only 
> comes later than the user's particular version.
> For other classes, it may be useful for historical reasons.






[jira] [Commented] (SOLR-11490) Add @since javadoc tags to the interesting Solr/Lucene classes

2018-05-03 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462549#comment-16462549
 ] 

Robert Muir commented on SOLR-11490:


{quote}
We had agreed that pre-3.1 classes will get no since tag.
{quote}

Where was this? I see consensus above to simply label these as "3.1".
 

 

> Add @since javadoc tags to the interesting Solr/Lucene classes
> --
>
> Key: SOLR-11490
> URL: https://issues.apache.org/jira/browse/SOLR-11490
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Alexandre Rafalovitch
>Assignee: Alexandre Rafalovitch
>Priority: Minor
>
> As per the discussion on the dev list, it may be useful to add Javadoc since 
> tags to significant (or even all) Java files.
> For user-facing files (such as analyzers, URPs, stream evaluators, etc) it 
> would be useful when trying to identifying whether a particular class only 
> comes later than user's particular version.
> For other classes, it may be useful for historical reasons.






[jira] [Commented] (LUCENE-7964) Remove Solr fieldType XML example from Lucene AnalysisFactories JavaDoc

2018-05-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461809#comment-16461809
 ] 

Robert Muir commented on LUCENE-7964:
-

I think this issue shouldn't try to explode the scope into documenting all 
possible parameters, autogenerating them, making them type-safe, or any other 
stuff.
All of that is really just blocking progress here when it needs to be handled 
separately, because we don't have any of it today.

What we have today is nonfunctional javadoc for java users, which is a real 
problem, e.g. 
https://mail-archives.apache.org/mod_mbox/lucene-java-user/201805.mbox/%3CCANdt40C_q8GX2_E9b%2B_qORZKojWANPfT%3DnzMuRg02WB3mbZe1w%40mail.gmail.com%3E
I think instead the lucene javadocs here really need to use examples that, 
well, work with lucene: e.g. reformulated with CustomAnalyzer.

> Remove Solr fieldType XML example from Lucene AnalysisFactories JavaDoc
> ---
>
> Key: LUCENE-7964
> URL: https://issues.apache.org/jira/browse/LUCENE-7964
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/javadocs
>Reporter: Jan Høydahl
>Priority: Trivial
> Fix For: 7.4, master (8.0)
>
>
> As proposed and discussed in this dev-list thread:
> https://lists.apache.org/thread.html/9add7e4a3ad28b307dc51532a609b423982922d734064f26f8104744@%3Cdev.lucene.apache.org%3E
> [~rcmuir] [~dsmiley] [~thetaphi]






[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461782#comment-16461782
 ] 

Robert Muir commented on LUCENE-7960:
-

Sorry, varargs are completely uncalled for here. Arguing for 250 booleans 
instead of just 1 boolean isn't going to work as a "negotiating" strategy to 
get back to 2. Please take my recommendations seriously.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461683#comment-16461683
 ] 

Robert Muir commented on LUCENE-7960:
-

my biggest concern is that these filters would then have two ctors:

* NGramTokenFilter(TokenStream)
* NGramTokenFilter(TokenStream, int, int, boolean, boolean)

The no-arg one starts looking more attractive to users at this point, and it's 
mega-trappy (n=1,2)!!! That's the ctor that should be deprecated :)

In general, I'll be honest: I don't like how trappy the APIs of these 
filters/tokenizers are because of defaults like that. I also think it's trappy that 
they take a min and a max at all, because that's really creating (max-min) indexed 
fields all unioned into one. There aren't even any warnings about this. 

I haven't reviewed what the booleans of the patch does, but I am concerned that 
the use case may just be "keep original" which could be one boolean, or perhaps 
done in a different way entirely (e.g. KeywordRepeatFilter or perhaps something 
like LUCENE-8273). So if it's acceptable to collapse it into one boolean that 
does that, I think that would be easier.
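A rough model of what a single "keep original" boolean would do, in plain Java with no Lucene types (the gram emission is simplified, and the exact preserve semantics shown are an assumption about the proposal, not the shipped API):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of ngram emission with a single "preserveOriginal" flag:
// a term whose length falls outside [minGram, maxGram] would normally be
// dropped or lose its original form; the flag keeps the original term too.
class NGramModel {
    static List<String> grams(String term, int minGram, int maxGram, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int i = 0; i + n <= term.length(); i++) {
                out.add(term.substring(i, i + n));
            }
        }
        // The one boolean under discussion: keep the original term when its
        // length is outside the gram size range, instead of losing it.
        if (preserveOriginal && (term.length() < minGram || term.length() > maxGram)) {
            out.add(term);
        }
        return out;
    }
}
```

With minGram=3, a two-letter term like "ab" produces no grams and silently disappears unless the flag is set, which is exactly the user complaint driving this issue.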

I feel like any defaults that our apis lead to (and when you have multiple 
ctors, then thats a default) should be something that will perform and scale 
well and work for the general case. For example n=4 has been shown to work well 
in many relevance experiments. At least we should make it easy for you to 
explicitly ask for something like that without passing many parameters.


> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms

2018-05-02 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461514#comment-16461514
 ] 

Robert Muir commented on LUCENE-7960:
-

The patch doesn't add up to me. The description of this issue claims that the 
default behavior wouldn't be changed, but then the patch does just the opposite 
and makes the new parameters mandatory. 5 arguments is too many here, that's 
not usable IMO.

> NGram filters -- add option to keep short terms
> ---
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Shawn Heisey
>Priority: Major
> Attachments: LUCENE-7960.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458569#comment-16458569
 ] 

Robert Muir commented on LUCENE-8273:
-

Sounds good. Yeah, I know the resource stuff/keyword marking is tricky; I looked 
at what the existing factory is doing and it's pretty crazy. 

It seems you need to make the ConditionalTokenFilterFactory always implement the 
ResourceLoaderAware stuff, because it's separately a bug that the current 
patch will "hide" the resource loader from anything inside the if? So I think it 
should implement the interface and pass the loader in its inform() method to 
stuff inside. Maybe this leads towards a solution to what you need for the 
conditional part, too.

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458507#comment-16458507
 ] 

Robert Muir commented on LUCENE-8273:
-

I'm also curious about common use cases where the condition is just matching a 
list of words, basically what the KeywordMarker factory provides today. Does 
the user have access to stuff like resource loaders to read from files / is it 
intuitive enough that they won't be reading the list of words in on every token, or 
making other mistakes? If we can make this simple, I think we can deprecate KeywordMarker 
and many other exception-list-type mechanisms hardcoded in all the filters, 
which would be a really nice cleanup. 

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458501#comment-16458501
 ] 

Robert Muir commented on LUCENE-8273:
-

I like the custom analyzer integration. Can we rename {{ifMatches}} just to 
{{if}}? This makes it more natural for the user to pair with the necessary 
{{endif}}, doesn't imply any regex matching, etc. I think it's useful to mention 
the necessary endif in the javadocs for both these methods; if users forget to 
call it they should get a compile error from build(), but it may not be 
obvious. I would also add to the {{ifTerm}} docs that it is just sugar, with a 
snippet of how to implement it with if + CharTermAttribute. This gives an 
example in case the user needs to work on some other attribute.
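The "ifTerm is just sugar" point can be illustrated with a much-reduced, Lucene-free model, where plain strings stand in for the token stream and CharTermAttribute (names here are invented for illustration):

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Reduced model of a conditional filter: the wrapped transform runs only
// when the condition holds for the current token; otherwise the token
// passes through unchanged.
class ConditionalModel {
    // General form: the condition sees the whole token, the way a
    // TokenStream condition can read whatever attributes it wants.
    static String applyIf(Predicate<String> condition, UnaryOperator<String> wrapped, String token) {
        return condition.test(token) ? wrapped.apply(token) : token;
    }

    // "ifTerm"-style sugar: a predicate on the term text is just the general
    // condition specialized to reading the term (CharTermAttribute in Lucene).
    static String applyIfTerm(Predicate<String> termPredicate, UnaryOperator<String> wrapped, String token) {
        return applyIf(termPredicate, wrapped, token);
    }
}
```

For example, uppercasing only hyphenated tokens: applyIfTerm(t -> t.contains("-"), String::toUpperCase, "wi-fi") yields "WI-FI", while "wifi" passes through untouched.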

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, 
> LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8142) Should codecs expose raw impacts?

2018-04-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456266#comment-16456266
 ] 

Robert Muir commented on LUCENE-8142:
-

+1

> Should codecs expose raw impacts?
> -
>
> Key: LUCENE-8142
> URL: https://issues.apache.org/jira/browse/LUCENE-8142
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8142.patch
>
>
> Follow-up of LUCENE-4198. Currently, call-sites of TermsEnum.impacts provide 
> a SimScorer so that the maximum score for the block can be computed. Should 
> ImpactsEnum instead return the (freq,norm) pairs and let callers deal with 
> max score computation?






[jira] [Commented] (LUCENE-8279) Improve CheckIndex on norms

2018-04-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456261#comment-16456261
 ] 

Robert Muir commented on LUCENE-8279:
-

+1

> Improve CheckIndex on norms
> ---
>
> Key: LUCENE-8279
> URL: https://issues.apache.org/jira/browse/LUCENE-8279
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8279.patch, LUCENE-8279.patch
>
>
> We should improve CheckIndex to make sure that terms and norms agree on which 
> documents have a value on an indexed field.






[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16455543#comment-16455543
 ] 

Robert Muir commented on LUCENE-8273:
-

Just imagining scenarios some more: it's probably useful if this thing can avoid 
corrupting graphs. By that I mean: conceptually the user has to understand that 
the filtering applies the condition based on the first token and that the 
filter gets whatever it pulls (based on the context it wants), and those are 
provided "graph-aligned" or something. I think it's just inherent in what you 
are trying to do and not specific to the implementation: it needs to have some 
restrictions to avoid trouble? So maybe this filter should also consider 
positionLength...

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.






[jira] [Commented] (LUCENE-8279) Improve CheckIndex on norms

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454244#comment-16454244
 ] 

Robert Muir commented on LUCENE-8279:
-

or even better maybe just move this check into the postings check so that it 
happens for each field without creating problematic memory usage. postings 
check already cross checks some stuff with fieldinfos...

> Improve CheckIndex on norms
> ---
>
> Key: LUCENE-8279
> URL: https://issues.apache.org/jira/browse/LUCENE-8279
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8279.patch
>
>
> We should improve CheckIndex to make sure that terms and norms agree on which 
> documents have a value on an indexed field.






[jira] [Commented] (LUCENE-8279) Improve CheckIndex on norms

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454241#comment-16454241
 ] 

Robert Muir commented on LUCENE-8279:
-

I am thinking of this one: 
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L1328

Maybe it could be moved to the TermIndexStatus or whatever so that the norms 
check could be moved to after the postings check and re-use it.

> Improve CheckIndex on norms
> ---
>
> Key: LUCENE-8279
> URL: https://issues.apache.org/jira/browse/LUCENE-8279
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8279.patch
>
>
> We should improve CheckIndex to make sure that terms and norms agree on which 
> documents have a value on an indexed field.






[jira] [Commented] (LUCENE-8279) Improve CheckIndex on norms

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454222#comment-16454222
 ] 

Robert Muir commented on LUCENE-8279:
-

The check is implemented as a "slow" check, but don't we already construct the 
same bitset to verify some postings list statistics such as docCount?
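The cross-check being discussed amounts to something like the following, with java.util.BitSet standing in for Lucene's internal bitsets and a toy int[][] array standing in for real postings (none of this is the actual CheckIndex API):

```java
import java.util.BitSet;

// Toy version of the cross-check: walk every posting list of a field once,
// mark each doc that has the field, then compare the resulting bitset's
// cardinality against the stored per-field doc count.
class DocCountCrossCheck {
    static BitSet docsWithField(int[][] postingLists, int maxDoc) {
        BitSet seen = new BitSet(maxDoc);
        for (int[] postings : postingLists) {
            for (int doc : postings) {
                seen.set(doc);
            }
        }
        return seen;
    }

    static boolean agreesWithDocCount(BitSet docsWithField, int storedDocCount) {
        return docsWithField.cardinality() == storedDocCount;
    }
}
```

Building the bitset once during the postings check and reusing it afterwards (e.g. for the norms check) is what avoids the duplicate "slow" pass.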

> Improve CheckIndex on norms
> ---
>
> Key: LUCENE-8279
> URL: https://issues.apache.org/jira/browse/LUCENE-8279
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-8279.patch
>
>
> We should improve CheckIndex to make sure that terms and norms agree on which 
> documents have a value on an indexed field.






Re: Friendly reminder: please run precommit

2018-04-26 Thread Robert Muir
everyone here is collaborating: it causes confusion and takes up other
people's time when you break the build. I would ask to just run
precommit before committing. you don't have to sit and watch it, you
can go work on something else while it runs.

On Thu, Apr 26, 2018 at 9:23 AM, Karl Wright <daddy...@gmail.com> wrote:
> :-)
>
> 25 minutes is an eternity these days, Robert.  This is especially true when
> others are collaborating with what you are doing, as was the case here.  The
> other approach would be to create a branch, but I've been avoiding that on
> git.
>
> "ant documentation-lint" is what I'm looking for, thanks.
>
> Karl
>
>
> On Thu, Apr 26, 2018 at 8:21 AM, Robert Muir <rcm...@gmail.com> wrote:
>>
>> I don't understand the turnaround issue, why do the commits need to be
>> rushed in?
>> There is patch validation recently hooked in to avoid keeping your
>> computer busy for 25 minutes.
>> If you are not changing third party dependencies or anything "heavy"
>> like that you should at least run "ant documentation-lint" from
>> lucene/
>>
>>
>> On Thu, Apr 26, 2018 at 8:02 AM, Karl Wright <daddy...@gmail.com> wrote:
>> > How long does precommit take you to run?  For me, it's a good 25
>> > minutes.
>> > That really impacts turnaround, which is why I'd love a precommit that
>> > looked only at certain things in the local package I'm dealing with.
>> >
>> > Karl
>> >
>> > On Thu, Apr 26, 2018 at 6:14 AM, Simon Willnauer
>> > <simon.willna...@gmail.com>
>> > wrote:
>> >>
>> >> Hey folks,
>> >>
>> >> I had to fix several glitches lately that are caught by running
>> >> precommit. It's a simple step please take the time running `ant clean
>> >> precommit` on top-level.
>> >>
>> >> Thanks,
>> >>
>> >> Simon
>> >>
>> >> -
>> >> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> >> For additional commands, e-mail: dev-h...@lucene.apache.org
>> >>
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>




Re: Friendly reminder: please run precommit

2018-04-26 Thread Robert Muir
I don't understand the turnaround issue, why do the commits need to be
rushed in?
There is patch validation recently hooked in to avoid keeping your
computer busy for 25 minutes.
If you are not changing third party dependencies or anything "heavy"
like that you should at least run "ant documentation-lint" from
lucene/


On Thu, Apr 26, 2018 at 8:02 AM, Karl Wright  wrote:
> How long does precommit take you to run?  For me, it's a good 25 minutes.
> That really impacts turnaround, which is why I'd love a precommit that
> looked only at certain things in the local package I'm dealing with.
>
> Karl
>
> On Thu, Apr 26, 2018 at 6:14 AM, Simon Willnauer 
> wrote:
>>
>> Hey folks,
>>
>> I had to fix several glitches lately that are caught by running
>> precommit. It's a simple step please take the time running `ant clean
>> precommit` on top-level.
>>
>> Thanks,
>>
>> Simon
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>




[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453852#comment-16453852
 ] 

Robert Muir commented on LUCENE-8273:
-

I feel like the OneTimeWrapper you have is close; it just needs to set the 
boolean on its "parent" after input.incrementToken(). At that point we know 
the subfilter is "done".

> Add a ConditionalTokenFilter
> 
>
> Key: LUCENE-8273
> URL: https://issues.apache.org/jira/browse/LUCENE-8273
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Alan Woodward
>Priority: Major
> Attachments: LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter 
> in such a way that it could optionally be bypassed based on the current state 
> of the TokenStream.  This could be used to, for example, only apply 
> WordDelimiterFilter to terms that contain hyphens.
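The idea above can be sketched in plain Java. This is not the Lucene TokenStream/AttributeSource API (where the hard part of this issue lives); it only models a filter as a function from one token to a list of output tokens, and the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Sketch only: a "conditional" wrapper passes each token through a delegate
// filter when a condition on the token holds, and emits it unchanged otherwise.
public class ConditionalFilterSketch {

  static List<String> applyConditionally(List<String> tokens,
                                         Predicate<String> condition,
                                         Function<String, List<String>> delegate) {
    List<String> out = new ArrayList<>();
    for (String token : tokens) {
      if (condition.test(token)) {
        out.addAll(delegate.apply(token));   // filtered path
      } else {
        out.add(token);                      // bypassed path
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Only hyphenated terms go through the "word delimiter"-like delegate.
    List<String> result = applyConditionally(
        Arrays.asList("fast", "wi-fi", "router"),
        t -> t.contains("-"),
        t -> Arrays.asList(t.split("-")));
    System.out.println(result);  // [fast, wi, fi, router]
  }
}
```

The thread below is about why the real thing is harder: a streaming TokenFilter pulls tokens via incrementToken() rather than mapping one token to a list, so the wrapper cannot easily tell when the delegate is "done" with its input.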



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453818#comment-16453818
 ] 

Robert Muir commented on LUCENE-8273:
-

I mean, in the worst case you could make it "correct" by using something like 
the StackWalker API, right? Then we can figure out how to make it more efficient.
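The StackWalker API mentioned here is real (Java 9+); how it would integrate with the token filter is the open question, so the following is only a minimal, hypothetical illustration of the kind of check it enables, namely reading the class that called you:

```java
// Sketch: use StackWalker to find out which class invoked the current method.
// A filter could, in principle, use this to detect whether incrementToken()
// was called from inside the conditional wrapper or from elsewhere.
public class StackWalkerSketch {

  static String callerClassName() {
    // RETAIN_CLASS_REFERENCE is required for getCallerClass() to work.
    return StackWalker.getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE)
        .getCallerClass().getName();
  }

  public static void main(String[] args) {
    // Called from main, so the caller class is this class itself.
    System.out.println(callerClassName());
  }
}
```

Walking the stack on every token would be costly, which is presumably why it is framed as a worst-case fallback to be optimized later.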




[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453814#comment-16453814
 ] 

Robert Muir commented on LUCENE-8273:
-

OK, I get it: the filter currently switches between two states with that 
boolean it has, so it's basically hardcoding that the passed filter will only 
call incrementToken() once.

Maybe it could do this differently, like inserting an "end-if" filter around 
the one passed in by the user, which would signal completion?




[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453806#comment-16453806
 ] 

Robert Muir commented on LUCENE-8273:
-

So you mean that the problem is not related to buffering state; it's more that 
it only expects one call to incrementToken() and does the wrong thing if it 
gets called additional times? Sorry, I'm slow here; I wanted to play with the 
patch last night but didn't have the time.




[jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter

2018-04-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452428#comment-16452428
 ] 

Robert Muir commented on LUCENE-8273:
-

{quote}
This latter one is a bit clunky, as this TokenFilter won't work with filters 
that consume more than one token at a time - eg ShingleFilter or 
SynonymGraphFilter. At the moment I have a blacklist, but there may be a better 
way of isolating that - preferably one that throws errors when you build the 
TokenStream. Speak up if you have any suggestions.
{quote}

I don't understand the problem yet, but I will try to fight with it. You mean 
any filter that uses captureState?




[jira] [Commented] (LUCENE-8275) Unwrap directory to check for FSDirectory

2018-04-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452315#comment-16452315
 ] 

Robert Muir commented on LUCENE-8275:
-

+1 glad it could be only one method.

> Unwrap directory to check for FSDirectory
> -
>
> Key: LUCENE-8275
> URL: https://issues.apache.org/jira/browse/LUCENE-8275
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 7.4, master (8.0)
>Reporter: Simon Willnauer
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: LUCENE-8275.patch, LUCENE-8275.patch
>
>
> IndexWriter checks in its ctor if the incoming directory is an
>  FSDirectory. If that is the case, it ensures that the directory retries
>  deleting its pending deletes, and if there are pending deletes it will
>  fail creating the writer. Yet this check didn't unwrap filter directories,
>  so in the case of MDW we never checked for pending deletes.
> There are also two places in FSDirectory that first removed the file
>  that was supposed to be created / renamed to from the pending deletes set,
>  and then tried to clean up pending deletes, which excluded the file. These
>  places now remove the file from the set after the pending deletes are
> checked.
>  
> This caused some test failures lately unfortunately very timing dependent:
>  
> {noformat}
> FAILED:  
> junit.framework.TestSuite.org.apache.lucene.search.TestSearcherManager
> Error Message:
> Captured an uncaught exception in thread: Thread[id=1567, name=Thread-1363, 
> state=RUNNABLE, group=TGRP-TestSearcherManager]
> Stack Trace:
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an 
> uncaught exception in thread: Thread[id=1567, name=Thread-1363, 
> state=RUNNABLE, group=TGRP-TestSearcherManager]
> Caused by: java.lang.RuntimeException: 
> java.nio.file.FileAlreadyExistsException: 
> /home/jenkins/workspace/Lucene-Solr-master-Linux/lucene/build/core/test/J1/temp/lucene.search.TestSearcherManager_BA998C838D219DA9-001/tempDir-001/_0.fdt
> at __randomizedtesting.SeedInfo.seed([BA998C838D219DA9]:0)
> at 
> org.apache.lucene.search.TestSearcherManager$8.run(TestSearcherManager.java:590)
> Caused by: java.nio.file.FileAlreadyExistsException: 
> /home/jenkins/workspace/Lucene-Solr-master-Linux/lucene/build/core/test/J1/temp/lucene.search.TestSearcherManager_BA998C838D219DA9-001/tempDir-001/_0.fdt
> at 
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:94)
> at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
> at 
> java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
> at 
> java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:215)
> at 
> java.base/java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
> at 
> org.apache.lucene.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:197)
> at 
> org.apache.lucene.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:197)
> at 
> org.apache.lucene.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:129)
> at 
> org.apache.lucene.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:129)
> at 
> org.apache.lucene.mockfile.HandleTrackingFS.newOutputStream(HandleTrackingFS.java:129)
> at 
> org.apache.lucene.mockfile.FilterFileSystemProvider.newOutputStream(FilterFileSystemProvider.java:197)
> at java.base/java.nio.file.Files.newOutputStream(Files.java:218)
> at 
> org.apache.lucene.store.FSDirectory$FSIndexOutput.(FSDirectory.java:413)
> at 
> org.apache.lucene.store.FSDirectory$FSIndexOutput.(FSDirectory.java:409)
> at 
> org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:253)
> at 
> org.apache.lucene.store.MockDirectoryWrapper.createOutput(MockDirectoryWrapper.java:665)
> at 
> org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:44)
> at 
> org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
> at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.(CompressingStoredFieldsWriter.java:116)
> at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:128)
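
The unwrapping described above can be sketched with stand-in types (the real classes live in org.apache.lucene.store, and the actual fix may differ in detail): follow the FilterDirectory delegate chain to the innermost Directory, so a wrapped FSDirectory, e.g. under MockDirectoryWrapper, is still recognized.

```java
// Stand-in types for the sketch; not the real Lucene classes.
class Directory {}
class FSDirectory extends Directory {}
class FilterDirectory extends Directory {
  private final Directory delegate;
  FilterDirectory(Directory delegate) { this.delegate = delegate; }
  Directory getDelegate() { return delegate; }
}

public class UnwrapSketch {
  // Follow the delegate chain to the innermost Directory before the
  // instanceof FSDirectory check, so filter wrappers don't hide it.
  static Directory unwrap(Directory dir) {
    while (dir instanceof FilterDirectory) {
      dir = ((FilterDirectory) dir).getDelegate();
    }
    return dir;
  }

  public static void main(String[] args) {
    Directory fs = new FSDirectory();
    Directory wrapped = new FilterDirectory(new FilterDirectory(fs));
    System.out.println(unwrap(wrapped) instanceof FSDirectory);  // true
  }
}
```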

[jira] [Commented] (LUCENE-8275) Unwrap directory to check for FSDirectory

2018-04-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452130#comment-16452130
 ] 

Robert Muir commented on LUCENE-8275:
-

Or even any use of FileSwitchDirectory on Windows at all would have this issue 
too, I think? There are probably other scenarios we can come up with.

