[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089201#comment-17089201
 ] 

ASF subversion and git services commented on SOLR-14365:


Commit 1e1ef7f074503771172218a28282494a9faa97e5 in lucene-solr's branch 
refs/heads/branch_8x from Shalin Shekhar Mangar
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1e1ef7f ]

SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with 
size equals to number of unique values (#1395)

(cherry picked from commit adbd714b37d794e9aa7615e61c431e42162c1d3c)
(cherry picked from commit 71d335ff688982cef10a648c914623c81ced)


> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-21 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089177#comment-17089177
 ] 

Shalin Shekhar Mangar commented on SOLR-14365:
--

I think this is ready to be cherry picked to branch_8x. I'll do that today 
unless there are any objections.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080492#comment-17080492
 ] 

ASF subversion and git services commented on SOLR-14365:


Commit 71d335ff688982cef10a648c914623c81ced in lucene-solr's branch 
refs/heads/master from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=71d335f ]

SOLR-14365: Automatically grow size of groupHeadValues


> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-10 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080401#comment-17080401
 ] 

Shalin Shekhar Mangar commented on SOLR-14365:
--

I just saw this test failure on master which seems related and is reproducible:

{code}
[junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestRandomCollapseQParserPlugin 
-Dtests.method=testRandomCollpaseWithSort -Dtests.seed=20C0F4D7CBA81876 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=lv 
-Dtests.timezone=America/St_Johns -Dtests.asserts=true 
-Dtests.file.encoding=ANSI_X3.4-1968
   [junit4] FAILURE 7.30s J4 | 
TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort <<<
   [junit4]> Throwable #1: java.lang.AssertionError: collapseKey too big -- 
need to grow array?
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([20C0F4D7CBA81876:257D871EE0002B85]:0)
   [junit4]>at 
org.apache.solr.search.CollapsingQParserPlugin$SortFieldsCompare.setGroupValues(CollapsingQParserPlugin.java:2702)
   [junit4]>at 
org.apache.solr.search.CollapsingQParserPlugin$IntSortSpecStrategy.collapse(CollapsingQParserPlugin.java:2544)
   [junit4]>at 
org.apache.solr.search.CollapsingQParserPlugin$IntFieldValueCollector.collect(CollapsingQParserPlugin.java:1223)
   [junit4]>at 
org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:254)
   [junit4]>at 
org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:205)
   [junit4]>at 
org.apache.lucene.search.BulkScorer.score(BulkScorer.java:39)
   [junit4]>at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:739)
   [junit4]>at 
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:526)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:202)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1651)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1469)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:584)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1487)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:399)
   [junit4]>at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328)
   [junit4]>at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:209)
   [junit4]>at 
org.apache.solr.core.SolrCore.execute(SolrCore.java:2565)
   [junit4]>at 
org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:227)
   [junit4]>at 
org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:207)
   [junit4]>at 
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1003)
   [junit4]>at 
org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:1018)
   [junit4]>at 
org.apache.solr.search.TestRandomCollapseQParserPlugin.testRandomCollpaseWithSort(TestRandomCollapseQParserPlugin.java:158)
   [junit4]>at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{code}

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is 

[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080339#comment-17080339
 ] 

ASF subversion and git services commented on SOLR-14365:


Commit adbd714b37d794e9aa7615e61c431e42162c1d3c in lucene-solr's branch 
refs/heads/master from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=adbd714 ]

SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with 
size equals to number of unique values (WIP) (#1395)



> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-10 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080323#comment-17080323
 ] 

ASF subversion and git services commented on SOLR-14365:


Commit 0e148071042056587bd934796bce3c27da81f922 in lucene-solr's branch 
refs/heads/jira/SOLR-14365 from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0e14807 ]

Merge branch 'master' into jira/SOLR-14365

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077084#comment-17077084
 ] 

ASF subversion and git services commented on SOLR-14365:


Commit a1be3dd08f506990b81e7aae08aa6a4ec8ec869f in lucene-solr's branch 
refs/heads/jira/SOLR-14365 from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=a1be3dd ]

SOLR-14365: Switch to dynamic map


> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073479#comment-17073479
 ] 

ASF subversion and git services commented on SOLR-14365:


Commit 0d82a9f5cdac67c9de92c979e016c1c6b1f8dcf4 in lucene-solr's branch 
refs/heads/jira/SOLR-14365 from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=0d82a9f ]

SOLR-14365: CollapsingQParser - Avoiding always allocate int[] and float[] with 
size equals to number of unique values (WIP)


> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-04-01 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073153#comment-17073153
 ] 

David Smiley commented on SOLR-14365:
-

Can you please do a PR instead as it's more conducive to a code review?  If it 
were a trivial patch, I wouldn't have suggested it.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-03-31 Thread Cao Manh Dat (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071619#comment-17071619
 ] 

Cao Manh Dat commented on SOLR-14365:
-

[~jbernste] [~shalin] please take a look at the patch.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-03-31 Thread Cao Manh Dat (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071610#comment-17071610
 ] 

Cao Manh Dat commented on SOLR-14365:
-

Attached a WIP patch that includes

* seperate out the logic of creating {{map}}s (backed by {{array}} or 
{{hashMap}}) and how collapse are using it. So we just need to create 
{{mapFactory}} correspond to a method.
* WIP benchmark test, to ensure that we did achieve something in case of 
collapsing on partial of the result.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-14365.patch
>
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-03-28 Thread Shalin Shekhar Mangar (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069285#comment-17069285
 ] 

Shalin Shekhar Mangar commented on SOLR-14365:
--

I think we should add another method and make it configurable.

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-03-27 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069066#comment-17069066
 ] 

David Smiley commented on SOLR-14365:
-

FWIW I suggest the name "auto" more so than "smart".

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14365) CollapsingQParser - Avoiding always allocate int[] and float[] with size equals to number of unique values

2020-03-26 Thread Cao Manh Dat (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068280#comment-17068280
 ] 

Cao Manh Dat commented on SOLR-14365:
-

Hi [~jbernste] [~shalin] should we adding another method or just implicitly 
change the default implementation?

> CollapsingQParser - Avoiding always allocate int[] and float[] with size 
> equals to number of unique values
> --
>
> Key: SOLR-14365
> URL: https://issues.apache.org/jira/browse/SOLR-14365
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 8.4.1
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
>
> Since Collapsing is a PostFilter, documents reach Collapsing must match with 
> all filters and queries, so the number of documents Collapsing need to 
> collect/compute score is a small fraction of the total number documents in 
> the index. So why do we need to always consume the memory (for int[] and 
> float[] array) for all unique values of the collapsed field? If the number of 
> unique values of the collapsed field found in the documents that match 
> queries and filters is 300 then we only need int[] and float[] array with 
> size of 300 and not 1.2 million in size. However, we don't know which value 
> of the collapsed field will show up in the results so we cannot use a smaller 
> array.
> The easy fix for this problem is using as much as we need by using IntIntMap 
> and IntFloatMap that hold primitives and are much more space efficient than 
> the Java HashMap. These maps can be slower (10x or 20x) than plain int[] and 
> float[] if matched documents is large (almost all documents matched queries 
> and other filters). But our belief is that does not happen that frequently 
> (how frequently do we run collapsing on the entire index?).
> For this issue I propose adding 2 methods for collapsing which is
> * array : which is current implementation
> * hash : which is new approach and will be default method
> later we can add another method {{smart}} which is automatically pick method 
> based on comparision between {{number of docs matched queries and filters}} 
> and {{number of unique values of the field}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org