[jira] [Comment Edited] (SOLR-10123) Analytics Component 2.0
[ https://issues.apache.org/jira/browse/SOLR-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068518#comment-16068518 ]

Houston Putman edited comment on SOLR-10123 at 6/29/17 6:18 PM:

Okay, so I have updated the cloud and non-cloud schemas to add the randomized numeric fields. However, the randomized doc-values cannot be used, since docValues are required for almost all Analytics Component functionality.

Almost all tests pass now. However, there is a difference between SortedSetDocValues (TrieField) and SortedNumericDocValues (PointField) that might make passing all of them impossible. SortedSetDocValues only stores the unique set of values for a multi-valued field, whereas SortedNumericDocValues can store the same value multiple times for a field on the same document. Therefore analytics results can vary between the two.

Imagine you have the following document:
{code}
{
  id="1",
  multi_valued_int_field=[1,1,2,2,3],
  float_field=3
}
{code}
and are executing a facet over multi_valued_int_field while calculating the sum of float_field. That is, for every unique value in multi_valued_int_field, calculate the sum of float_field.

If multi_valued_int_field is of type IntPointField, then the following results appear:
||Facet Value||Calculation||Result||Reason||
|1|3 + 3|6|value 1 appears 2 times in the multivalued field so 2 instances of 3 are summed|
|2|3 + 3|6|value 2 appears 2 times in the multivalued field so 2 instances of 3 are summed|
|3|3|3|value 3 appears 1 time in the multivalued field so 3 is the result|

If multi_valued_int_field is of type TrieIntField, then the following results appear:
||Facet Value||Calculation||Result||Reason||
|1|3|3|value 1 appears 1 time in the multivalued field so 3 is the result|
|2|3|3|value 2 appears 1 time in the multivalued field so 3 is the result|
|3|3|3|value 3 appears 1 time in the multivalued field so 3 is the result|

The difference here lies in how IntPointField and TrieIntField values are stored: IntPointField does not deduplicate the values in the array, while TrieIntField does.

The same thing would occur when a multi-valued numeric field is used in an expression, but that is not included in the unit tests.

was (Author: houstonputman):
Okay, so I have updated the cloud and non-cloud schemas to add the randomized numeric fields. However, the randomized doc-values cannot be used, since docValues are required for almost all Analytics Component functionality.

Almost all tests pass now. However, there is a difference between SortedSetDocValues (TrieField) and SortedNumericDocValues (PointField) that might make passing all of them impossible. SortedSetDocValues only stores the unique set of values for a multi-valued field, whereas SortedNumericDocValues can store the same value multiple times for a field on the same document. Therefore analytics results can vary between the two.

For example, if you were faceting on {{multi_valued_int_field}} and calculating {{sum(float_field)}} on just the following document:
{{Document = ( id="1", multi_valued_int_field=\[1,1,2,2,3\], float_field=3 )}}

If {{multi_valued_int_field}} was an {{IntPointField}}, then the results of the facet would be ( {{facet_value : facet_results, ...}} ):
{{1 : ( sum(float_field) = 6 ) , 2 : ( sum(float_field) = 6 ) , 3 : ( sum(float_field) = 3 )}}

If {{multi_valued_int_field}} was a {{TrieIntField}}, then the results of the facet would be ( {{facet_value : facet_results, ...}} ):
{{1 : ( sum(float_field) = 3 ) , 2 : ( sum(float_field) = 3 ) , 3 : ( sum(float_field) = 3 )}}

This isn't included in the unit tests, but the same thing would occur when a multi-valued numeric field is used in an expression. The results could be different.

> Analytics Component 2.0
> ---
>
> Key: SOLR-10123
> URL: https://issues.apache.org/jira/browse/SOLR-10123
> Project: Solr
> Issue Type: New Feature
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Houston Putman
> Labels: features
> Attachments: SOLR-10123.patch, SOLR-10123.patch, SOLR-10123.patch
>
>
> A completely redesigned Analytics Component, introducing the following
> features:
> * Support for distributed collections
> * New JSON request language, and response format that fits JSON better.
> * Faceting over mapping functions in addition to fields (Value Faceting)
> * PivotFaceting with ValueFacets
> * More advanced facet sorting
> * Support for PointField types
> * Expressions over multi-valued fields
> * New types of mapping functions
> ** Logical
> ** Conditional
> ** Comparison
> * Concurrent request execution
> * Custom user functions, defined within the request
> Fully backwards compatible with the original Analytics Component with the
> following exceptions:
> * All fields used must have doc-values enabled
>
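
A minimal sketch of the facet arithmetic described in the comment above, written in plain Python rather than Solr code. The function name `facet_sum` and its `dedupe` flag are hypothetical, introduced only for illustration: `dedupe=True` models storage that keeps just the unique values per document (as the comment describes for SortedSetDocValues), while `dedupe=False` models storage that keeps repeated values (as described for SortedNumericDocValues).

```python
from collections import defaultdict


def facet_sum(docs, facet_field, value_field, dedupe):
    """Sum value_field once per facet value stored on each document.

    dedupe=True  -> each distinct facet value counts once per document.
    dedupe=False -> a facet value repeated N times counts N times.
    """
    totals = defaultdict(float)
    for doc in docs:
        values = doc[facet_field]
        if dedupe:
            # Model set-like storage: keep only the unique values.
            values = sorted(set(values))
        for facet_value in values:
            totals[facet_value] += doc[value_field]
    return dict(totals)


# The single example document from the comment.
docs = [
    {"id": "1", "multi_valued_int_field": [1, 1, 2, 2, 3], "float_field": 3.0},
]

point_like = facet_sum(docs, "multi_valued_int_field", "float_field", dedupe=False)
trie_like = facet_sum(docs, "multi_valued_int_field", "float_field", dedupe=True)

print(point_like)  # {1: 6.0, 2: 6.0, 3: 3.0}
print(trie_like)   # {1: 3.0, 2: 3.0, 3: 3.0}
```

The two printed dictionaries correspond to the IntPointField and TrieIntField result tables respectively; nothing here touches the real Lucene or Solr APIs.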
[jira] [Comment Edited] (SOLR-10123) Analytics Component 2.0
[ https://issues.apache.org/jira/browse/SOLR-10123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068518#comment-16068518 ]

Houston Putman edited comment on SOLR-10123 at 6/29/17 5:35 PM:

Okay, so I have updated the cloud and non-cloud schemas to add the randomized numeric fields. However, the randomized doc-values cannot be used, since docValues are required for almost all Analytics Component functionality.

Almost all tests pass now. However, there is a difference between SortedSetDocValues (TrieField) and SortedNumericDocValues (PointField) that might make passing all of them impossible. SortedSetDocValues only stores the unique set of values for a multi-valued field, whereas SortedNumericDocValues can store the same value multiple times for a field on the same document. Therefore analytics results can vary between the two.

For example, if you were faceting on {{multi_valued_int_field}} and calculating {{sum(float_field)}} on just the following document:
{{Document = ( id="1", multi_valued_int_field=\[1,1,2,2,3\], float_field=3 )}}

If {{multi_valued_int_field}} was an {{IntPointField}}, then the results of the facet would be ( {{facet_value : facet_results, ...}} ):
{{1 : ( sum(float_field) = 6 ) , 2 : ( sum(float_field) = 6 ) , 3 : ( sum(float_field) = 3 )}}

If {{multi_valued_int_field}} was a {{TrieIntField}}, then the results of the facet would be ( {{facet_value : facet_results, ...}} ):
{{1 : ( sum(float_field) = 3 ) , 2 : ( sum(float_field) = 3 ) , 3 : ( sum(float_field) = 3 )}}

This isn't included in the unit tests, but the same thing would occur when a multi-valued numeric field is used in an expression. The results could be different.

was (Author: houstonputman):
Okay, so I have updated the cloud and non-cloud schemas to add the randomized numeric fields. However, the randomized doc-values cannot be used, since docValues are required for almost all Analytics Component functionality.

Almost all tests pass now. However, there is a difference between SortedSetDocValues (TrieField) and SortedNumericDocValues (PointField) that might make passing all of them impossible. SortedSetDocValues only stores the unique set of values for a multi-valued field, whereas SortedNumericDocValues can store the same value multiple times for a field on the same document. Therefore analytics results can vary between the two.

For example, if you were faceting on {{multi_valued_int_field}} and calculating {{sum(float_field)}} on just the following document:
{{Document = ( id="1", multi_valued_int_field=\[1,1,2,2,3\], float_field=3 )}}

If {{multi_valued_int_field}} was an {{IntPointField}}, then the results of the facet would be:
{{1 : ( sum(float_field) = 6 ) , 2 : ( sum(float_field) = 6 ) , 3 : ( sum(float_field) = 3 )}}

If {{multi_valued_int_field}} was a {{TrieIntField}}, then the results of the facet would be:
{{1 : ( sum(float_field) = 3 ) , 2 : ( sum(float_field) = 3 ) , 3 : ( sum(float_field) = 3 )}}

This isn't included in the unit tests, but the same thing would occur when a multi-valued numeric field is used in an expression. The results could be different.

> Analytics Component 2.0
> ---
>
> Key: SOLR-10123
> URL: https://issues.apache.org/jira/browse/SOLR-10123
> Project: Solr
> Issue Type: New Feature
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Houston Putman
> Labels: features
> Attachments: SOLR-10123.patch, SOLR-10123.patch, SOLR-10123.patch
>
>
> A completely redesigned Analytics Component, introducing the following
> features:
> * Support for distributed collections
> * New JSON request language, and response format that fits JSON better.
> * Faceting over mapping functions in addition to fields (Value Faceting)
> * PivotFaceting with ValueFacets
> * More advanced facet sorting
> * Support for PointField types
> * Expressions over multi-valued fields
> * New types of mapping functions
> ** Logical
> ** Conditional
> ** Comparison
> * Concurrent request execution
> * Custom user functions, defined within the request
> Fully backwards compatible with the original Analytics Component with the
> following exceptions:
> * All fields used must have doc-values enabled
> * Expression results can no longer be used when defining Range and Query
> facets
> * The reverse(string) mapping function is no longer a native function

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org