Re[4]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
>> You could add a "level2_comment_id" field to the level 2 comments and
>> their children, and then use unique() on that.

OK, I see, I missed the children... Thank you for pointing that out.

I have introduced that "unique sub-branch identifying" field and propagated it down the sub-branch (the data is here: https://github.com/alisa-ipn/solr_nesting/blob/master/data/example-data-solr-for-faceting.json). I also changed the corresponding part of the post.

And it actually works.

Yet it takes a lot of effort to make the JSON Facet API handle faceting by intermediate levels. Making those "unique sub-branch identifying" fields appear dynamically, the same way the "_root_" field does, would make Solr friendlier to use for nested data like email chains and social media data...

Thanks,
Alisa

> Friday, 22 April 2016, 13:47 -04:00, from Yonik Seeley:
>
> Something is wrong if you are getting 0 counts.
> Let's try taking it piece by piece:
>
> Step 1: q=path:2.blog-posts.comments
> This finds level 2 documents.
>
> Step 2: domain: { blockChildren : "path:2.blog-posts.comments" }
> This first maps to all of the children (level 3 and level 4).
>
> Step 3: q:"path:*comments*keywords"
> This selects a subset of level 3 and level 4 documents with keywords.
> (Note, in the future this should be doable as an additional filter in
> the domain spec, w/o an additional sub-facet level.)
>
> Step 4:
> Facet on the text field of those level 3 and level 4 keyword docs. For
> each bucket, also find the unique number of values in the
> "2.blog-posts.comments-id" field on those documents.
>
> Without seeing what you indexed, my guess is that the issue is that
> the "2.blog-posts.comments-id" field does not actually exist on those
> level 3 and level 4 docs being faceted. The JSON Facet API doesn't
> propagate field values up/down the nested stack yet. That's what
> https://issues.apache.org/jira/browse/SOLR-8998 is mostly about.
>
> -Yonik
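For readers following along, the propagation step described in this message can be sketched in a few lines of Python. This is an illustrative sketch only, not the script used to produce the linked data set. It assumes that nested documents are plain dicts with their children under `_childDocuments_`, and it reuses the field name `level2_comment_id` suggested earlier in the thread. The function name `propagate_branch_id`, the sample ids, and the level 3 and level 4 `path` values are hypothetical.

```python
from typing import Any, Dict, Optional

Doc = Dict[str, Any]


def propagate_branch_id(doc: Doc, branch_path: str, field: str,
                        inherited: Optional[str] = None) -> Doc:
    """Return a copy of `doc` in which every document whose `path` equals
    `branch_path`, and every descendant of such a document, carries
    `field` set to the id of that branch root."""
    out = {k: v for k, v in doc.items() if k != "_childDocuments_"}
    if doc.get("path") == branch_path:
        inherited = doc["id"]      # this document starts a new sub-branch
    if inherited is not None:
        out[field] = inherited     # stamp the branch root id on this document
    children = doc.get("_childDocuments_", [])
    if children:
        out["_childDocuments_"] = [
            propagate_branch_id(child, branch_path, field, inherited)
            for child in children
        ]
    return out


# Hypothetical toy post: one comment with a keyword, plus a reply
# that has a keyword of its own.
post = {
    "id": "10735", "path": "1.blog-posts",
    "_childDocuments_": [{
        "id": "10735-23004", "path": "2.blog-posts.comments",
        "_childDocuments_": [
            {"id": "k1", "path": "3.blog-posts.comments.keywords",
             "text": "Solr"},
            {"id": "r1", "path": "3.blog-posts.comments.replies",
             "_childDocuments_": [
                 {"id": "k2",
                  "path": "4.blog-posts.comments.replies.keywords",
                  "text": "Solr"}]},
        ],
    }],
}

stamped = propagate_branch_id(post, "2.blog-posts.comments",
                              "level2_comment_id")
```

The post itself is left without the field, while the comment, its keyword, its reply, and the reply's keyword all receive the comment's id, which is what unique() needs in order to count by level 2 comments.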
Re: Re[2]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
On Fri, Apr 22, 2016 at 12:26 PM, Alisa Z. wrote:
> Hi Yonik,
>
> Thanks a lot for your response.
>
> I have discussed this with Mikhail Khludnev already and tried this
> suggestion. Here's what I've got:
>
> sentiment: positive
> author: Bob
> text: Great post about Solr
> 2.blog-posts.comments-id: 10735-23004   // this is a new field; the field
>                                         // name is different on each level
>                                         // for each type, values are unique
> date: 2015-04-10T11:30:00Z
> path: 2.blog-posts.comments
> id: 10735-23004
>
> Query:
>
> curl http://localhost:8985/solr/solr_nesting_unique/query -d 'q=path:2.blog-posts.comments&rows=0&
> json.facet={
>   filter_by_child_type: {
>     type: query,
>     q: "path:*comments*keywords",
>     domain: { blockChildren: "path:2.blog-posts.comments" },
>     facet: {
>       top_entity_text: {
>         type: terms,
>         field: text,
>         limit: 10,
>         sort: "counts_by_comments desc",
>         facet: {
>           counts_by_comments: "unique(2.blog-posts.comments-id)"  // changed
>         }
>       }
>     }
>   }
> }'

Something is wrong if you are getting 0 counts.
Let's try taking it piece by piece:

Step 1: q=path:2.blog-posts.comments
This finds level 2 documents.

Step 2: domain: { blockChildren : "path:2.blog-posts.comments" }
This first maps to all of the children (level 3 and level 4).

Step 3: q:"path:*comments*keywords"
This selects a subset of level 3 and level 4 documents with keywords.
(Note, in the future this should be doable as an additional filter in the domain spec, w/o an additional sub-facet level.)

Step 4:
Facet on the text field of those level 3 and level 4 keyword docs. For each bucket, also find the unique number of values in the "2.blog-posts.comments-id" field on those documents.

Without seeing what you indexed, my guess is that the issue is that the "2.blog-posts.comments-id" field does not actually exist on those level 3 and level 4 docs being faceted. The JSON Facet API doesn't propagate field values up/down the nested stack yet. That's what https://issues.apache.org/jira/browse/SOLR-8998 is mostly about.

-Yonik

> Response:
>
> "response":{"numFound":3,"start":0,"docs":[]
> },
> "facets":{
>   "count":3,
>   "filter_by_child_type":{
>     "count":9,
>     "top_entity_text":{
>       "buckets":[{
>           "val":"Elasticsearch",
>           "count":2,
>           "counts_by_comments":0},
>         {
>           "val":"Solr",
>           "count":5,
>           "counts_by_comments":0},
>         {
>           "val":"Solr 5.5",
>           "count":1,
>           "counts_by_comments":0},
>         {
>           "val":"feature",
>           "count":1,
>           "counts_by_comments":0}]
>
> So unless I messed something up... or the field name does not look
> "canonical" (but it was fast to generate, and it is accepted in a normal
> query: http://localhost:8985/solr/solr_nesting_unique/query?q=2.blog-posts.body-id:* )
>
> So I think that it's just a JSON Facet API limitation...
>
> Best,
> --Alisa
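The four steps in this reply map onto the pieces of the request one to one. The following Python sketch builds the same request as a dict, so that the braces are balanced by construction, and prepares (without sending) a form-encoded POST equivalent to the curl call. It is a sketch under stated assumptions: the URL, query strings and facet names are the ones quoted in the thread, while the helper names `build_facet_params` and `build_request` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_facet_params(id_field: str) -> dict:
    """Build the request parameters for the nested facet from the thread."""
    facet = {
        "filter_by_child_type": {
            "type": "query",
            # Step 3: keep only the keyword documents among the children.
            "q": "path:*comments*keywords",
            # Step 2: map the level 2 comments to all of their children.
            "domain": {"blockChildren": "path:2.blog-posts.comments"},
            "facet": {
                "top_entity_text": {
                    "type": "terms",
                    "field": "text",
                    "limit": 10,
                    "sort": "counts_by_comments desc",
                    # Step 4: per bucket, count distinct values of id_field.
                    "facet": {"counts_by_comments": f"unique({id_field})"},
                }
            },
        }
    }
    return {
        "q": "path:2.blog-posts.comments",  # Step 1: find level 2 documents.
        "rows": "0",
        "json.facet": json.dumps(facet),
    }


def build_request(url: str, id_field: str) -> Request:
    """Prepare a POST request; pass it to urllib.request.urlopen to send it."""
    body = urlencode(build_facet_params(id_field)).encode("utf-8")
    return Request(url, data=body)


params = build_facet_params("2.blog-posts.comments-id")
request = build_request(
    "http://localhost:8985/solr/solr_nesting_unique/query",
    "2.blog-posts.comments-id",
)
```

The unique() aggregation in step 4 can only count values that are present on the documents in each bucket, so the id field has to exist on the level 3 and level 4 keyword documents themselves.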
Re[2]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
Hi Yonik,

Thanks a lot for your response.

I have discussed this with Mikhail Khludnev already and tried this suggestion. Here's what I've got:

sentiment: positive
author: Bob
text: Great post about Solr
2.blog-posts.comments-id: 10735-23004   // this is a new field; the field name is different on each level for each type, values are unique
date: 2015-04-10T11:30:00Z
path: 2.blog-posts.comments
id: 10735-23004

Query:

curl http://localhost:8985/solr/solr_nesting_unique/query -d 'q=path:2.blog-posts.comments&rows=0&
json.facet={
  filter_by_child_type: {
    type: query,
    q: "path:*comments*keywords",
    domain: { blockChildren: "path:2.blog-posts.comments" },
    facet: {
      top_entity_text: {
        type: terms,
        field: text,
        limit: 10,
        sort: "counts_by_comments desc",
        facet: {
          counts_by_comments: "unique(2.blog-posts.comments-id)"  // changed
        }
      }
    }
  }
}'

Response:

"response":{"numFound":3,"start":0,"docs":[]
},
"facets":{
  "count":3,
  "filter_by_child_type":{
    "count":9,
    "top_entity_text":{
      "buckets":[{
          "val":"Elasticsearch",
          "count":2,
          "counts_by_comments":0},
        {
          "val":"Solr",
          "count":5,
          "counts_by_comments":0},
        {
          "val":"Solr 5.5",
          "count":1,
          "counts_by_comments":0},
        {
          "val":"feature",
          "count":1,
          "counts_by_comments":0}]

So unless I messed something up... or the field name does not look "canonical" (but it was fast to generate, and it is accepted in a normal query: http://localhost:8985/solr/solr_nesting_unique/query?q=2.blog-posts.body-id:* )

So I think that it's just a JSON Facet API limitation...

Best,
--Alisa

> Friday, 22 April 2016, 9:55 -04:00, from Yonik Seeley:
>
> Hi Alisa,
> This was a bit too hard for me to grok on a first pass... then I saw
> your related blog post, which includes the actual sample data and makes
> it clearer.
>
> Yeah, this type of thing isn't currently directly supported, but
> SOLR-8998 should address that.
> You can currently hack around it (for simple counts) using unique(),
> as you've discovered, but you need a unique ID at the right level to
> get the right count.
>
> _root_ is unique for blog posts, which is why you get numbers of
> posts (as opposed to numbers of level-2 comments).
Re: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
Hi Alisa,

This was a bit too hard for me to grok on a first pass... then I saw your related blog post, which includes the actual sample data and makes it clearer.

More comments inline:

On Wed, Apr 20, 2016 at 2:29 PM, Alisa Z. wrote:
> And I want to know the distribution of all readers' keywords (levels 3 and 4)
> by comments (level 2).
> In JSON Facet API I tried this:
>
> curl http://localhost:8983/solr/my_index/query -d 'q=path:2.blog-posts.comments&rows=0&
> json.facet={
>   filter_by_child_type: {
>     type: query,
>     q: "path:*comments*keywords",
>     domain: { blockChildren: "path:2.blog-posts.comments" },
>     facet: {
>       top_keywords: {
>         type: terms,
>         field: text,
>         sort: "counts_by_comments desc",
>         facet: {
>           counts_by_comments: "unique(_root_)"  // I suspect it should be a different field, not _root_, but what would it be for an intermediate document?
>         }
>       }
>     }
>   }
> }'
>
> Which gives me the wrong results: it aggregates by posts, not by comments
> (it's a toy data set, so I know that the correct answer for "Solr" is 3 when
> faceted by comments)

Yeah, this type of thing isn't currently directly supported, but SOLR-8998 should address that.
You can currently hack around it (for simple counts) using unique(), as you've discovered, but you need a unique ID at the right level to get the right count.

_root_ is unique for blog posts, which is why you get numbers of posts (as opposed to numbers of level-2 comments).
You could add a "level2_comment_id" field to the level 2 comments and their children, and then use unique() on that.

-Yonik
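The difference between counting by `_root_` and counting by a level 2 id can be reproduced on paper. The Python sketch below is a toy simulation, not Solr code: the nine keyword documents are made up so that they match the counts reported in the thread ("Solr" appears in 5 keyword documents, under 3 comments, in 2 posts), and the ids `p1`, `p2`, `c1`, `c2`, `c3` as well as the helper `unique_per_bucket` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Nine made-up keyword documents: comments c1 and c2 belong to post p1,
# comment c3 belongs to post p2.
keyword_docs = [
    {"text": "Solr", "_root_": "p1", "level2_comment_id": "c1"},
    {"text": "Solr", "_root_": "p1", "level2_comment_id": "c1"},
    {"text": "Solr", "_root_": "p1", "level2_comment_id": "c2"},
    {"text": "Solr", "_root_": "p2", "level2_comment_id": "c3"},
    {"text": "Solr", "_root_": "p2", "level2_comment_id": "c3"},
    {"text": "Elasticsearch", "_root_": "p1", "level2_comment_id": "c1"},
    {"text": "Elasticsearch", "_root_": "p2", "level2_comment_id": "c3"},
    {"text": "Solr 5.5", "_root_": "p1", "level2_comment_id": "c2"},
    {"text": "feature", "_root_": "p2", "level2_comment_id": "c3"},
]


def unique_per_bucket(docs: Iterable[Mapping[str, str]],
                      bucket_field: str, id_field: str) -> Dict[str, int]:
    """For each value of bucket_field, count the distinct values of id_field,
    which is what a unique(id_field) sub-facet under a terms facet reports."""
    seen = defaultdict(set)
    for doc in docs:
        seen[doc[bucket_field]].add(doc[id_field])
    return {bucket: len(ids) for bucket, ids in seen.items()}


by_post = unique_per_bucket(keyword_docs, "text", "_root_")
by_comment = unique_per_bucket(keyword_docs, "text", "level2_comment_id")
```

With this data, `by_post["Solr"]` is 2 and `by_comment["Solr"]` is 3, the same pair of numbers discussed in the thread. The id field only helps if it is present on the keyword documents being faceted.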
Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
Hi all,

I have been stretching some of Solr's capabilities for handling nested documents and I've come up with the following issue...

Let's say I have the following structure:

{
  "blog-posts":{                //level 1
    "leaf-fields":[
      "date",
      "author"],
    "title":{                   //level 2
      "leaf-fields":[ "text"],
      "keywords":{              //level 3
        "leaf-fields":[
          "text",
          "type"]
      }
    },
    "body":{                    //level 2
      "leaf-fields":[ "text"],
      "keywords":{              //level 3
        "leaf-fields":[
          "text",
          "type"]
      }
    },
    "comments":{                //level 2
      "leaf-fields":[
        "date",
        "author",
        "text",
        "sentiment"
      ],
      "keywords":{              //level 3
        "leaf-fields":[
          "text",
          "type"]
      },
      "replies":{               //level 3
        "leaf-fields":[
          "date",
          "author",
          "text",
          "sentiment"],
        "keywords":{            //level 4
          "leaf-fields":[
            "text",
            "type"]
        }
      }
    }
  }
}

And I want to know the distribution of all readers' keywords (levels 3 and 4) by comments (level 2).
In JSON Facet API I tried this:

curl http://localhost:8983/solr/my_index/query -d 'q=path:2.blog-posts.comments&rows=0&
json.facet={
  filter_by_child_type: {
    type: query,
    q: "path:*comments*keywords",
    domain: { blockChildren: "path:2.blog-posts.comments" },
    facet: {
      top_keywords: {
        type: terms,
        field: text,
        sort: "counts_by_comments desc",
        facet: {
          counts_by_comments: "unique(_root_)"  // I suspect it should be a different field, not _root_, but what would it be for an intermediate document?
        }
      }
    }
  }
}'

Which gives me the wrong results: it aggregates by posts, not by comments (it's a toy data set, so I know that the correct answer for "Solr" is 3 when faceted by comments)

{
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facets":{
    "count":3,
    "filter_by_child_type":{
      "count":9,
      "top_keywords":{
        "buckets":[{
            "val":"Elasticsearch",
            "count":2,
            "counts_by_comments":2},
          {
            "val":"Solr",
            "count":5,
            "counts_by_comments":2},    //here the count by "comments" should be 3
          {
            "val":"Solr 5.5",
            "count":1,
            "counts_by_comments":1},
          {
            "val":"feature",
            "count":1,
            "counts_by_comments":1}]

Am I writing the query wrong?

By the way, Block Join Faceting works fine for this:

bjqfacet?q={!parent%20which=path:2.blog-posts.comments}path:*.comments*keywords&rows=0&facet=true&child.facet.field=text&wt=json&indent=true

{
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "text":[
        "Elasticsearch",2,
        "Solr",3,               //correct result
        "Solr 5.5",1,
        "feature",1]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

But we've already discussed that it returns too much stuff: there is no way to put limits or order by counts :( That's why I want to see whether it's possible to get this working with the JSON Facet API directly.

Thank you in advance!

--
Alisa Zhila
Re: block join rollups
Hi Yonik,

Well, no one has replied to this yet, so I thought I'd chime in with some of the use cases that I am working with. Please note that I am lagging a bit behind the last few releases, so I haven't had time to experiment with Solr 5.3+. I am sure that some of this is included in there already, and I am very excited to play around with the new streaming API, JSON facets and the SQL interface when I have a bit more time.

I am indexing click stream data into Solr. Each set of records represents a user's unique visit to our website. The records all share a common session id, as well as several session attributes, such as IP and user attributes if the user logs in. Each record represents an individual action, such as a search, a product view or a visit to a particular page. All attributes and data elements of each request are stored with each record; additionally, session attributes get copied down to each event item. The current goal of this system is to give less tech-savvy users easy access to this data, in a way that lets them explore it and drill down on particular elements; we are using Banana for this.

Currently, I have to copy a lot of session fields to each event so I can filter on them, for example, to show all searches for users associated with organization X. This is super redundant and I am really looking for a better way. It would be great if I could make parent document fields appear as if they were a part of child documents.

Additionally, I am counting various events for each session during processing. For example, I count the number of searches, product views, add-to-carts, etc. This information is also indexed in each record. This allows me to pull up specific events (like product views) where the number of searches in a given session is greater than X. However, again, indexing this information for each event creates a lot of redundancy.

Finally, a slightly different use case involves running functions on items in a group (even if they aren't a part of the result set) and returning the result as a part of the document. It is almost like a dynamically generated document, based on aggregations from child documents. This is currently somewhat available, but I can't include it in sort. For example, I am grouping items on a field, and I want to get the minimum value of a field per group and sort the result (of groups) on that calculated value.

I am not sure if this helps you at all, but I wanted to share some of my pain points; hope it helps.

On Sun, Apr 17, 2016 at 6:50 PM, Yonik Seeley wrote:
> Hey folks, we're at the point of figuring out the API for block join
> child rollups for the JSON Facet API.
> We already have simple block join faceting:
> http://yonik.com/solr-nested-objects/
> So now we need an API to carry over more information from children to
> parents (say rolling up average rating of all the reviews to the
> corresponding parent book objects).
>
> I've gathered some of my notes/thoughts on the API here:
> https://issues.apache.org/jira/browse/SOLR-8998
>
> Feedback welcome, and we can discuss here in this thread rather than
> cluttering the JIRA.
>
> -Yonik
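To make the last use case concrete, here is what "minimum value of a field per group, then sort the groups on it" means when done on the client side. This is a plain Python sketch of the desired result, not a Solr feature or API. The field names `brand` and `price`, the sample items and the helper `groups_sorted_by_min` are all hypothetical.

```python
from typing import Any, Dict, Iterable, List, Mapping, Tuple


def groups_sorted_by_min(items: Iterable[Mapping[str, Any]],
                         group_field: str,
                         value_field: str) -> List[Tuple[Any, Any]]:
    """Group items on group_field, take the minimum of value_field in each
    group, and return (group, minimum) pairs in ascending order of minimum."""
    mins: Dict[Any, Any] = {}
    for item in items:
        key = item[group_field]
        value = item[value_field]
        if key not in mins or value < mins[key]:
            mins[key] = value
    return sorted(mins.items(), key=lambda pair: pair[1])


# Hypothetical items, grouped on "brand" and ranked by cheapest "price".
items = [
    {"brand": "A", "price": 30.0},
    {"brand": "B", "price": 12.5},
    {"brand": "A", "price": 9.0},
    {"brand": "C", "price": 20.0},
]

ranked = groups_sorted_by_min(items, "brand", "price")
```

Doing this on the client requires fetching every item of every group, which is the cost the message above is hoping to avoid by having the aggregation and the sort happen inside Solr.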
block join rollups
Hey folks, we're at the point of figuring out the API for block join child rollups for the JSON Facet API.

We already have simple block join faceting:
http://yonik.com/solr-nested-objects/

So now we need an API to carry over more information from children to parents (say rolling up average rating of all the reviews to the corresponding parent book objects).

I've gathered some of my notes/thoughts on the API here:
https://issues.apache.org/jira/browse/SOLR-8998

Feedback welcome, and we can discuss here in this thread rather than cluttering the JIRA.

-Yonik
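As a reference point for the discussion, the rollup in the book and review example amounts to the following computation. This Python sketch only illustrates the intended result on in-memory data; it is not the API proposed in SOLR-8998. The data layout (reviews nested under a `reviews` key), the output field `avg_rating` and the helper `rollup_average` are assumptions made for the example.

```python
from statistics import mean
from typing import Any, Dict, List


def rollup_average(parents: List[Dict[str, Any]], child_key: str,
                   value_field: str, out_field: str) -> List[Dict[str, Any]]:
    """Return copies of the parents with out_field set to the average of
    value_field over their children, or None when there is nothing to average."""
    rolled = []
    for parent in parents:
        values = [child[value_field]
                  for child in parent.get(child_key, [])
                  if value_field in child]
        enriched = dict(parent)  # shallow copy; the input is left untouched
        enriched[out_field] = mean(values) if values else None
        rolled.append(enriched)
    return rolled


# Hypothetical books with nested reviews.
books = [
    {"id": "book1", "reviews": [{"rating": 5}, {"rating": 3}]},
    {"id": "book2", "reviews": [{"rating": 4}]},
    {"id": "book3", "reviews": []},
]

rolled = rollup_average(books, "reviews", "rating", "avg_rating")
```

A parent without children gets `None` here rather than 0, so that "no reviews" stays distinguishable from "poorly reviewed"; how an eventual Solr API should treat that case is one of the things open for discussion.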