Re[4]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
>>You could add a "level2_comment_id" field to the level 2 comments and
>>their children, and then use unique() on that.

OK, I see, I missed the children... Thank you for pointing that out.

I have introduced that "unique sub-branch identifying" field and propagated it down the sub-branch (the data is here: https://github.com/alisa-ipn/solr_nesting/blob/master/data/example-data-solr-for-faceting.json). I have also changed the corresponding part of the post. And it actually works.

Yet it takes a lot of effort to make the JSON Facet API handle faceting by intermediate levels. Making those "unique sub-branch identifying" fields appear dynamically, the same way the "_root_" field does, would make Solr friendlier to use for nested data like email chains and social media data...

Thanks,
Alisa

>Friday, 22 April 2016, 13:47 -04:00 from Yonik Seeley:
>
>On Fri, Apr 22, 2016 at 12:26 PM, Alisa Z. < prol...@mail.ru > wrote:
>> Hi Yonik,
>>
>> Thanks a lot for your response.
>>
>> I have discussed this with Mikhail Khludnev already and tried this
>> suggestion. Here's what I've got:
>>
>> sentiment: positive
>> author: Bob
>> text: Great post about Solr
>> 2.blog-posts.comments-id: 10735-23004 //this is a new field, field name is different on each level for each type, values are unique
>> date: 2015-04-10T11:30:00Z
>> path: 2.blog-posts.comments
>> id: 10735-23004
>>
>> Query:
>> curl http://localhost:8985/solr/solr_nesting_unique/query -d
>> 'q=path:2.blog-posts.comments&rows=0&
>> json.facet={
>>   filter_by_child_type :{
>>     type:query,
>>     q:"path:*comments*keywords",
>>     domain: { blockChildren : "path:2.blog-posts.comments" },
>>     facet:{
>>       top_entity_text : {
>>         type: terms,
>>         field: text,
>>         limit: 10,
>>         sort: "counts_by_comments desc",
>>         facet: {
>>           counts_by_comments: "unique (2.blog-posts.comments-id )" // changed
>> }'
>
>
>Something is wrong if you are getting 0 counts.
>Let's try taking it piece-by-piece:
>
>Step 1: q=path:2.blog-posts.comments
>This finds level 2 documents
>
>Step 2: domain: { blockChildren : "path:2.blog-posts.comments" }
>This first maps to all of the children (level 3 and level 4)
>
>Step 3: q:"path:*comments*keywords"
>This selects a subset of level 3 and level 4 documents with keywords
>(Note, in the future this should be doable as an additional filter in
>the domain spec, w/o an additional sub-facet level)
>
>Step 4:
>Facet on the text field of those level 3 and level 4 keyword docs. For
>each bucket, also find the unique number of values in the
>"2.blog-posts.comments-id" field on those documents.
>
>Without seeing what you indexed, my guess is that the issue is that
>the "2.blog-posts.comments-id" field does not actually exist on those
>level 3 and level 4 docs being faceted. The JSON Facet API doesn't
>propagate field values up/down the nested stack yet. That's what
>https://issues.apache.org/jira/browse/SOLR-8998 is mostly about.
>
>-Yonik
>
>
>>
>> Response:
>>
>> "response":{"numFound":3,"start":0,"docs":[]
>> },
>> "facets":{
>>   "count":3,
>>   "filter_by_child_type":{
>>     "count":9,
>>     "top_entity_text":{
>>       "buckets":[{
>>         "val":"Elasticsearch",
>>         "count":2,
>>         "counts_by_comments":0},
>>       {
>>         "val":"Solr",
>>         "count":5,
>>         "counts_by_comments":0},
>>       {
>>         "val":"Solr 5.5",
>>         "count":1,
>>         "counts_by_comments":0},
>>       {
>>         "val":"feature",
>>         "count":1,
>>         "counts_by_comments":0}]
>>
>> So unless I messed something up... or the field name does not look
>> "canonical" (but it was fast to generate and it is accepted in a normal
>> query
>> http://localhost:8985/solr/solr_nesting_unique/query?q=2.blog-posts.body-id
>> :* )
>>
>> So I think that
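The propagation of a "unique sub-branch identifying" field discussed in this message can be done on the client before indexing. The following is only a minimal sketch under stated assumptions: it assumes the documents are nested with Solr's `_childDocuments_` JSON key, that every document carries `id` and `path` fields as in the example above, and the helper name `tag_subbranch`, the sample ids other than 10735-23004, and the paths `1.blog-posts` and `3.blog-posts.comments.keywords` are hypothetical.

```python
def tag_subbranch(doc, level_path, id_field, inherited=None):
    """Copy the id of each document whose `path` equals `level_path`
    into `id_field` on that document and on all of its descendants."""
    if doc.get("path") == level_path:
        inherited = doc["id"]          # a new sub-branch starts here
    if inherited is not None:
        doc[id_field] = inherited      # propagate down the sub-branch
    for child in doc.get("_childDocuments_", []):
        tag_subbranch(child, level_path, id_field, inherited)
    return doc


# Hypothetical toy post with one comment and one keyword under it.
post = {
    "id": "10735",
    "path": "1.blog-posts",
    "_childDocuments_": [{
        "id": "10735-23004",
        "path": "2.blog-posts.comments",
        "_childDocuments_": [
            {"id": "10735-23004-1",
             "path": "3.blog-posts.comments.keywords",
             "text": "Solr"},
        ],
    }],
}

tag_subbranch(post, "2.blog-posts.comments", "2.blog-posts.comments-id")
```

After this call the comment and its keyword child both carry `2.blog-posts.comments-id`, while the top-level post does not, so `unique(2.blog-posts.comments-id)` has a value to count on the level 3 and level 4 documents.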
Re[2]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
Hi Yonik, Thanks a lot for your response. I have discussed this with Mikhail Khludnev already and tried this suggestion. Here's what I've got: sentiment: positive author: Bob text: Great post about Solr 2.blog-posts.comments-id: 10735-23004 //this is a new field, field name is different on each level for each type, values are unique date: 2015-04-10T11:30:00Z path: 2.blog-posts.comments id: 10735-23004 Query: curl http://localhost:8985/solr/solr_nesting_unique/query -d 'q=path:2.blog-posts.comments&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:2.blog-posts.comments" }, facet:{ top_entity_text : { type: terms, field: text, limit: 10, sort: "counts_by_comments desc", facet: { counts_by_comments: "unique (2.blog-posts.comments-id )" // changed }' Response: "response":{"numFound":3,"start":0,"docs":[] }, "facets":{ "count":3, "filter_by_child_type":{ "count":9, "top_entity_text":{ "buckets":[{ "val":"Elasticsearch", "count":2, "counts_by_comments":0}, { "val":"Solr", "count":5, "counts_by_comments":0}, { "val":"Solr 5.5", "count":1, "counts_by_comments":0}, { "val":"feature", "count":1, "counts_by_comments":0}] So unless I messed something up... or the field name does not look "canonical" (but it was fast to generate and it is accepted in a normal query http://localhost:8985/solr/solr_nesting_unique/query?q=2.blog-posts.body-id :* ) So I think that it's just a JSON facet API limitation... Best, --Alisa >Пятница, 22 апреля 2016, 9:55 -04:00 от Yonik Seeley : > >Hi Alisa, >This was a bit too hard for me to grok on a first pass... then I saw >your related blog post which includes the actual sample data and makes >it more clear. > > More comments inline: > >On Wed, Apr 20, 2016 at 2:29 PM, Alisa Z. < prol...@mail.ru > wrote: >> Hi all, >> >> I have been stretching some SOLR's capabilities for nested documents >> handling and I've come up with the following issue... 
>> >> Let's say I have the following structure: >> >> { >> "blog-posts":{ //level 1 >> "leaf-fields":[ >> "date", >> "author"], >> "title":{ //level 2 >> "leaf-fields":[ "text"], >> "keywords":{//level 3 >> "leaf-fields":[ >> "text", >> "type"] >> } >> }, >> "body":{//level 2 >> "leaf-fields":[ "text"], >> "keywords":{//level 3 >> "leaf-fields":[ >> "text", >> "type"] >> } >> }, >> "comments":{//level 2 >> "leaf-fields":[ >> "date", >> "author", >> "text", >> "sentiment" >> ], >> "keywords":{//level 3 >> "leaf-fields":[ >> "text", >> "type"] >> }, >> "replies":{ //level 3 >> "leaf-fields":[ >> "date", >> "author", >> "text", >> "sentiment"], >> "keywords":{//level 4 >> "leaf-fields":[ >> "text", >> "type"] >> } >> >> >> And I want to know the distribution of all readers' keywords (levels 3 and >> 4) by comments (level 2). >> In JSON Facet API I tried this: >> >>
Re[2]: how to restrict phrase to appear in same child document
I'm afraid that if the queries are given in such a loose natural language form, the only way to handle it is to introduce a natural language processing stage that would form the right query (which is actually a working strategy, IBM does so).

If your document structure is fixed (i.e., you know the types of nested documents and exactly which fields they contain), you can try to introduce some basic NLP that will detect the entities or nouns, e.g., "driver" and "car" (try AlchemyLanguage API http://www.alchemyapi.com/products/demo/alchemylanguage for this), and you will also need a syntactic parser to connect black+driver and white+mercedes correctly.

>Wednesday, 20 April 2016, 15:31 -04:00 from Yangrui Guo:
>
>Hi thanks for answering. My problem is that users do not distinguish what
>the color belongs to in the query. For example, "which black driver
>has a white mercedes", it is difficult to distinguish which color belongs
>to which field, because there can be thousands of car brands and
>professions. Is there any way to achieve the feature I stated
>before?
>
>On Wednesday, April 20, 2016, Alisa Z. < prol...@mail.ru > wrote:
>
>> Yangrui,
>>
>> First, have you indexed your documents with proper nested document
>> structure [
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments
>> ]?
>> From the piece of data you showed, it seems that you just put it right as
>> it is and it all got flattened.
>>
>> Then, you'll probably want to introduce a distinguishing
>> "type"/"category"/"path" field into your data, so it would look like this:
>>
>> {
>> type:top
>> id:
>> {
>> type:car_color
>> car:
>> color:
>> }
>> {
>> type:driver_color
>> driver:
>> color:
>> }
>> }
>>
>>
>> >Wed, 20 Apr 2016 -3:28:33 -0400 from Yangrui Guo < guoyang...@gmail.com >:
>> >
>> >hello
>> >
>> >I have a nested document type in my index. Here's the structure of my
>> >document:
>> >
>> >{
>> >id:
>> >{
>> >car:
>> >color:
>> >}
>> >{
>> >driver:
>> >color:
>> >}
>> >}
>> >
>> >However, when I use the query q={!parent
>> >which="content_type:parent"}+(black AND driver)&fq={!parent
>> >which="content_type:parent"}+(white AND mercedes), the result also
>> >contained white driver with black mercedes. I know I can put fields before
>> >terms but it is not always easy to do this. Users might just enter one
>> >string. How can I modify my query to require that the terms between two
>> >parentheses must appear in the same child document, or boost those that
>> >meet the criteria? Thanks
>>
>>
Re: pivoting with json facet api
Hi Yangrui,

I have summarized some experiments on Solr's nesting capabilities. They do not cover pivoting precisely; they are more about faceting up to parents and down to children with some statistics. Maybe you could find an idea there:
https://medium.com/@alisazhila/solr-s-nesting-on-solr-s-capabilities-to-handle-deeply-nested-document-structures-50eeaaa4347a#.dbxdv3zdp

Please let me know in the comments whether it was useful. You could also describe your problem in a bit more detail if you don't find the answer.

Cheers,
Alisa

>Thursday, 21 April 2016, 1:01 -04:00 from Yangrui Guo:
>
>Hi
>
>I am trying to facet results on my nested documents. The Solr documentation
>did not say much on how to pivot with the JSON API with nested documents.
>Could someone show me some examples? Thanks very much.
>
>Yangrui
Re[2]: Traversal of documents through network
Well, it took me 7 milliseconds to index a 100MB dataset on a local Solr. So, if indexing time scales linearly, you could assume that 1 GB would take about 70 ms = 0.07 s, which is still pretty fast. Yet dealing with network delays is a separate issue. 100 Wikipedia-article-size documents shouldn't be a big problem.

>Thursday, 21 April 2016, 0:57 -04:00 from vidya:
>
>ok. I understand that. So, you would say documents traverse through the network.
>If I specify some 100 docs to be displayed on my first page, will it affect
>performance? While docs get traversed, will there be any high volume
>traffic that affects the performance of the application?
>
>
>And what's the time Solr takes to index 1GB of data in general?
>
>
>Thanks
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Traversal-of-documents-through-network-tp4271555p4271743.html
>Sent from the Solr - User mailing list archive at Nabble.com.
Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)
Hi all, I have been stretching some SOLR's capabilities for nested documents handling and I've come up with the following issue... Let's say I have the following structure: { "blog-posts":{ //level 1 "leaf-fields":[ "date", "author"], "title":{ //level 2 "leaf-fields":[ "text"], "keywords":{ //level 3 "leaf-fields":[ "text", "type"] } }, "body":{ //level 2 "leaf-fields":[ "text"], "keywords":{ //level 3 "leaf-fields":[ "text", "type"] } }, "comments":{ //level 2 "leaf-fields":[ "date", "author", "text", "sentiment" ], "keywords":{ //level 3 "leaf-fields":[ "text", "type"] }, "replies":{ //level 3 "leaf-fields":[ "date", "author", "text", "sentiment"], "keywords":{ //level 4 "leaf-fields":[ "text", "type"] } And I want to know the distribution of all readers' keywords (levels 3 and 4) by comments (level 2). In JSON Facet API I tried this: curl http://localhost:8983/solr/my_index/query -d 'q=path:2.blog-posts.comments&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:2.blog-posts.comments" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_comments desc", facet: { counts_by_comments: "unique(_root_)" // I suspect in should be a different field, not _root_, but would it be for an intermediate document? }' Which gives me the wrong results, it aggregates by posts, not by comments (it's a toy data set, so I know that the correct answer for "Solr" is 3 when faceted by for comments) { "response":{"numFound":3,"start":0,"docs":[] }, "facets":{ "count":3, "filter_by_child_type":{ "count":9, "top_keywords":{ "buckets":[{ "val":"Elasticsearch", "count":2, "counts_by_comments":2}, { "val":"Solr", "count":5, "counts_by_comments":2}, //here the count by "comments" should be 3 { "val":"Solr 5.5", "count":1, "counts_by_comments":1}, { "val":"feature", "count":1, "counts_by_comments":1}] Am I writing the query wrong? 
By the way, Block Join Faceting works fine for this:

bjqfacet?q={!parent%20which=path:2.blog-posts.comments}path:*.comments*keywords&rows=0&facet=true&child.facet.field=text&wt=json&indent=true

{
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "text":[
        "Elasticsearch",2,
        "Solr",3, //correct result
        "Solr 5.5",1,
        "feature",1]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

But we've already discussed that it returns too much stuff: there is no way to put limits or order by counts :( That's why I want to see whether it's possible to get the JSON Facet API to do this properly.

Thank you in advance!
--
Alisa Zhila
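Since the BlockJoin facet output above cannot be limited or ordered on the server side, one possible stopgap is to do it on the client. This is only a sketch of a client-side workaround, not a Solr feature; the helper name `top_terms` is hypothetical, and the full facet list still travels over the network before it is trimmed.

```python
def top_terms(flat_counts, limit=10):
    """Turn Solr's flat [term, count, term, count, ...] list into the
    `limit` most frequent (term, count) pairs, highest count first."""
    pairs = list(zip(flat_counts[0::2], flat_counts[1::2]))
    # sorted() is stable, so ties keep the order Solr returned them in.
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:limit]


# The "text" facet from the bjqfacet response quoted above.
facet_text = ["Elasticsearch", 2, "Solr", 3, "Solr 5.5", 1, "feature", 1]
print(top_terms(facet_text, limit=2))  # [('Solr', 3), ('Elasticsearch', 2)]
```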
Re: how to restrict phrase to appear in same child document
Yangrui,

First, have you indexed your documents with proper nested document structure [https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments]? From the piece of data you showed, it seems that you just put it right as it is and it all got flattened.

Then, you'll probably want to introduce a distinguishing "type"/"category"/"path" field into your data, so it would look like this:

{
type:top
id:
{
type:car_color
car:
color:
}
{
type:driver_color
driver:
color:
}
}

>Wed, 20 Apr 2016 -3:28:33 -0400 from Yangrui Guo:
>
>hello
>
>I have a nested document type in my index. Here's the structure of my
>document:
>
>{
>id:
>{
>car:
>color:
>}
>{
>driver:
>color:
>}
>}
>
>However, when I use the query q={!parent
>which="content_type:parent"}+(black AND driver)&fq={!parent
>which="content_type:parent"}+(white AND mercedes), the result also
>contained white driver with black mercedes. I know I can put fields before
>terms but it is not always easy to do this. Users might just enter one
>string. How can I modify my query to require that the terms between two
>parentheses must appear in the same child document, or boost those that meet
>the criteria? Thanks
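To make the suggested structure concrete, here is a hedged sketch of how such a document and a matching query could be assembled. It assumes the children are nested under Solr's `_childDocuments_` JSON key and that the `type` field from the message above is used as the parent filter; the helper names and the child ids are hypothetical, and the query only works when the caller already knows which color goes with which entity.

```python
import json


def nested_car_doc(doc_id, car, car_color, driver, driver_color):
    """Build one top-level document with two typed child documents."""
    return {
        "id": doc_id,
        "type": "top",
        "_childDocuments_": [
            {"id": f"{doc_id}-car", "type": "car_color",
             "car": car, "color": car_color},
            {"id": f"{doc_id}-driver", "type": "driver_color",
             "driver": driver, "color": driver_color},
        ],
    }


def same_child_params(driver_color, car, car_color):
    """Each block join clause is restricted to one child type, so a color
    can only match inside the child document it belongs to."""
    parent = '{!parent which="type:top"}'
    return {
        "q": f"{parent}+type:driver_color +color:{driver_color}",
        "fq": f"{parent}+type:car_color +car:{car} +color:{car_color}",
    }


doc = nested_car_doc("1", "mercedes", "white", "bob", "black")
print(json.dumps([doc], indent=2))
print(same_child_params("black", "mercedes", "white"))
```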
Re: Traversal of documents through network
Vidya,

No, not all of those 500 result docs will be brought to your client (browser, etc.). Only as many documents as fit into the first "search result page" will be brought. There is a notion of "pagination" in Solr (as in most search engines). The counts of occurrence might be approximate, and in any case you will be shown only as many documents as your "search result page" size specifies. By default, the page size is 10 documents, so although you might see something like "response":{"numFound":27,"start":0,"docs"}, only the top 10 documents will be returned.

In Solr, the "page" size is controlled with the "start" and "rows" parameters (see https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results), so if you want fewer results to be brought at a time, you can specify your query like this:

q="word"&rows=5

That will show you only the top 5 results, and only they will "traverse the network" (that is, be brought from the Solr server to your browser or other client). If you want to look at another page, you specify

q="word"&rows=5&start=5

This is the 2nd page of the results.

Hope it helps.
--Alisa

>Wednesday, 20 April 2016, 10:01 -04:00 from vidya:
>
>Hi
>
>When I queried a word in Solr, documents having that keyword are displayed in
>500 documents, let's say. Will all those documents traverse through the network?
>Or how does it happen?
>
>Please help me on this.
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Traversal-of-documents-through-network-tp4271555.html
>Sent from the Solr - User mailing list archive at Nabble.com.
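The relation between a page number and the "start" and "rows" parameters described above can be sketched as follows. The helper name `page_params` and the 1-based page numbering are assumptions made for illustration only.

```python
from urllib.parse import urlencode


def page_params(query, page, page_size=10):
    """Return Solr query parameters for a 1-based result page."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {"q": query, "rows": page_size, "start": (page - 1) * page_size}


first = page_params("word", page=1, page_size=5)
second = page_params("word", page=2, page_size=5)
print(urlencode(first))   # q=word&rows=5&start=0
print(urlencode(second))  # q=word&rows=5&start=5
```

Only the documents of the requested page are returned to the client; "numFound" still reports the total number of matches.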
Re[2]: [possible bug]: [child] - ChildDocTransformerFactory returns top level documents nested under middle level documents when queried for the middle level ones
Thanks, Anshum! This definitely brings the result I wanted.

It is just that the description in the ChildDocTransformerFactory docs ("This transformer returns all descendants of each parent document in a flat list nested inside the parent document.") is a bit misleading...

One should never stop experimenting :)

>Wednesday, 30 March 2016, 15:19 -04:00 from Anshum Gupta:
>
>I'm not the best person to comment on this so perhaps someone could chime
>in as well, but can you try using a wildcard for your childFilter?
>Something like: childFilter=type_s:doc.enriched.text.*
>
>You could also possibly enrich the document with depth information and use
>that for filtering out.
>
>On Wed, Mar 30, 2016 at 11:34 AM, Alisa Z. < prol...@mail.ru > wrote:
>
>> I think I am observing an unexpected behavior of
>> ChildDocTransformerFactory.
>>
>> The query is like this:
>>
>> /select?q={!parent which="type_s:doc.enriched.text"}type_s:doc.enriched.text.entities +text_t:pjm +type_t:Company
>> +relevance_tf:[0.7%20TO%20*]&fl=*,[child
>> parentFilter=type_s:doc.enriched.text limit=1000]
>>
>> The levels of hierarchy are shown in the type_s field. So I am querying
>> on some descendants and returning some ancestors that are somewhere in the
>> middle of the hierarchy. I also want to get all the nested documents
>> below that middle level.
>>
>> Here is the result:
>>
>> doc.enriched.text // this is the level I wanted to get to and then go down from it
>> ...
>> 13565
>>
>> doc.enriched // This is a document from 1 level up, the parent of the
>> // current type_s : doc.enriched.text document -- why is it here?
>> 22024
>>
>> doc.original // This is an "uncle"
>> 26698
>>
>> doc // and this a grandparent!!!
>>
>> And so on, bringing the whole tree up and down all under my middle-level
>> document.
>> I really hope this is not the expected behavior.
>>
>> I appreciate your help in advance.
>>
>> --
>> Alisa Zhila
>
>
>--
>Anshum Gupta
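For reference, the "fl" value with the wildcard childFilter suggested in this thread can be assembled like this. It is a minimal sketch: the helper name `child_transformer` is hypothetical, and it only builds the string from the parentFilter, childFilter and limit values that appear in the messages above.

```python
def child_transformer(parent_filter, child_filter=None, limit=None):
    """Build a [child ...] document transformer clause for the fl parameter."""
    parts = [f"parentFilter={parent_filter}"]
    if child_filter is not None:
        parts.append(f"childFilter={child_filter}")
    if limit is not None:
        parts.append(f"limit={limit}")
    return "[child " + " ".join(parts) + "]"


fl = "*," + child_transformer(
    "type_s:doc.enriched.text",
    child_filter="type_s:doc.enriched.text.*",
    limit=1000,
)
print(fl)
```

The childFilter restricts the returned nested documents to those whose type_s starts with the middle-level prefix, which is what keeps the parent, "uncle" and grandparent documents out of the result.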
[possible bug]: [child] - ChildDocTransformerFactory returns top level documents nested under middle level documents when queried for the middle level ones
I think I am observing an unexpected behavior of ChildDocTransformerFactory.

The query is like this:

/select?q={!parent which="type_s:doc.enriched.text"}type_s:doc.enriched.text.entities +text_t:pjm +type_t:Company +relevance_tf:[0.7%20TO%20*]&fl=*,[child parentFilter=type_s:doc.enriched.text limit=1000]

The levels of hierarchy are shown in the type_s field. So I am querying on some descendants and returning some ancestors that are somewhere in the middle of the hierarchy. I also want to get all the nested documents below that middle level.

Here is the result:

doc.enriched.text // this is the level I wanted to get to and then go down from it
...
13565

doc.enriched // This is a document from 1 level up, the parent of the
// current type_s : doc.enriched.text document -- why is it here?
22024

doc.original // This is an "uncle"
26698

doc // and this a grandparent!!!

And so on, bringing the whole tree up and down all under my middle-level document. I really hope this is not the expected behavior.

I appreciate your help in advance.

--
Alisa Zhila
Re[5]: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)
Alright, based on https://issues.apache.org/jira/browse/SOLR-5743 I can assume that limit and mincount for the BlockJoin part stay an open issue for some time ... Therefore, the answer is no as of Solr 5.5.0. Thanks to Mikhail Khludnev for working on the subject. >Вторник, 29 марта 2016, 14:38 -04:00 от Alisa Z. : > >Mikhail, > >I totally see the point: the corresponding wiki page ( >https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting ) does not >mention it and says it's an experimental feature. > >Is it correct that no additional options ( limit, mincount, etc.) can be set >anyhow? > >Or more specifically, is there any work-around to control the output of the >query at hand (maybe anything beyond faceting options): > >/bjqfacet?q={!parent%20which=type_s:doc}type_s:doc.enriched.text.keywords&child.facet.field=text_t&rows=0&fq={!parent%20which=type_s:doc}type_s:doc.userData%20%2BSubject_t:california&wt=json&indent=true >> >> >> >>RETURNS: >> >> >> >>{ >> >> "responseHeader":{ >> >> "status":0, >> >> "QTime":1}, >> >> "response":{"numFound":19,"start":0,"docs":[] >> >> }, >> >> "facet_counts":[ >> >> "facet_fields",[ >> >> "text_t",[ >> >> "128x",1, >> >> "18xx",1, >> >> "1x",1, >> >> "2",2, >> >> "30",1, >> >> "60",1, >> >> "78xx",1, >> >> "82xx",1, >> >> "ab",2, >> >> "access",5, >> >> "account",1, >> >> "accounts",1, >> >>... >> >>"california",13, >> >>... >> >>"enron",9, >> >>... >> >>]]]} >> >> > > >>Вторник, 29 марта 2016, 13:40 -04:00 от Mikhail Khludnev < >>mkhlud...@griddynamics.com >: >> >>Alisa, >> >>There is no such thing as child.facet.limit, etc >> >>On Tue, Mar 29, 2016 at 6:27 PM, Alisa Z. 
< prol...@mail.ru > wrote: >> >>> So the first issue eventually solved by adding facet: {top_terms_by_doc: >>> "unique(_root_)"} AND sorting the outer facet buckets by this faceting: >>> >>> curl http://localhost:8985/solr/enron_path_w_ts/query -d >>> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0& >>> json.facet={ >>> filter_by_child_type :{ >>> type:query, >>> q:"type_s:doc.enriched.text.keywords", >>> domain: { blockChildren : "type_s:doc" }, >>> facet:{ >>> top_keywords_text : { >>> type: terms, >>> field: text_t, >>> limit: 10, >>> sort: "top_terms_by_doc desc", >>> facet: { >>>top_terms_by_doc: "unique(_root_)" >>> } >>> } >>> } >>> } >>> }' >>> >>> >>> The BlockJoin Faceting part is still open: I've tried all conventional >>> faceting parameters: facet.limit, child.facet.limit, f.text_t.facet.limit >>> ... nothing worked :( >>> >>> >>> >Понедельник, 28 марта 2016, 17:20 -04:00 от Alisa Z. < prol...@mail.ru >: >>> > >>> >Ok, so for the 1st question, I think I'm getting closer: adding facet: >>> {top_terms_by_doc: "unique(_root_)"} as indicated in >>> http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev returns >>> correct counts. However, sorting is done by the upper faceting not by the >>> unique(_root_): >>> > >>> > >>> >curl http://localhost:8985/solr/my_collection /query -d >>> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0& >>> >json.facet={ >>> > filter_by_child_type :{ >>> >type:query, >>> >q:"type_s:doc.enriched.text.keywords", >>> >domain: { blockChildren : "type_s:doc" }, >>> >facet:{ >>> > top_keywords_text : { >>> >type: terms, >>> >field: text_t, >>> >limit: 10, >>> >fac
Re[4]: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)
Mikhail, I totally see the point: the corresponding wiki page ( https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting ) does not mention it and says it's an experimental feature. Is it correct that no additional options ( limit, mincount, etc.) can be set anyhow? Or more specifically, is there any work-around to control the output of the query at hand (maybe anything beyond faceting options): /bjqfacet?q={!parent%20which=type_s:doc}type_s:doc.enriched.text.keywords&child.facet.field=text_t&rows=0&fq={!parent%20which=type_s:doc}type_s:doc.userData%20%2BSubject_t:california&wt=json&indent=true > >> > >>RETURNS: > >> > >>{ > >> "responseHeader":{ > >> "status":0, > >> "QTime":1}, > >> "response":{"numFound":19,"start":0,"docs":[] > >> }, > >> "facet_counts":[ > >> "facet_fields",[ > >> "text_t",[ > >> "128x",1, > >> "18xx",1, > >> "1x",1, > >> "2",2, > >> "30",1, > >> "60",1, > >> "78xx",1, > >> "82xx",1, > >> "ab",2, > >> "access",5, > >> "account",1, > >> "accounts",1, > >>... > >>"california",13, > >>... > >>"enron",9, > >>... > >>]]]} > >> >Вторник, 29 марта 2016, 13:40 -04:00 от Mikhail Khludnev >: > >Alisa, > >There is no such thing as child.facet.limit, etc > >On Tue, Mar 29, 2016 at 6:27 PM, Alisa Z. 
< prol...@mail.ru > wrote: > >> So the first issue eventually solved by adding facet: {top_terms_by_doc: >> "unique(_root_)"} AND sorting the outer facet buckets by this faceting: >> >> curl http://localhost:8985/solr/enron_path_w_ts/query -d >> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0& >> json.facet={ >> filter_by_child_type :{ >> type:query, >> q:"type_s:doc.enriched.text.keywords", >> domain: { blockChildren : "type_s:doc" }, >> facet:{ >> top_keywords_text : { >> type: terms, >> field: text_t, >> limit: 10, >> sort: "top_terms_by_doc desc", >> facet: { >>top_terms_by_doc: "unique(_root_)" >> } >> } >> } >> } >> }' >> >> >> The BlockJoin Faceting part is still open: I've tried all conventional >> faceting parameters: facet.limit, child.facet.limit, f.text_t.facet.limit >> ... nothing worked :( >> >> >> >Понедельник, 28 марта 2016, 17:20 -04:00 от Alisa Z. < prol...@mail.ru >: >> > >> >Ok, so for the 1st question, I think I'm getting closer: adding facet: >> {top_terms_by_doc: "unique(_root_)"} as indicated in >> http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev returns >> correct counts. 
However, sorting is done by the upper faceting not by the >> unique(_root_): >> > >> > >> >curl http://localhost:8985/solr/my_collection /query -d >> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0& >> >json.facet={ >> > filter_by_child_type :{ >> >type:query, >> >q:"type_s:doc.enriched.text.keywords", >> >domain: { blockChildren : "type_s:doc" }, >> >facet:{ >> > top_keywords_text : { >> >type: terms, >> >field: text_t, >> >limit: 10, >> >facet: { >> > top_terms_by_doc: "unique(_root_)" >> > } >> > } >> >} >> > } >> >}' >> > >> >RETURNS >> > >> >{ >> > "responseHeader":{ >> >"status":0, >> >"QTime":25, >> >"params":{ >> > "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData >> +Subject_t:california", >> > "json.facet":"{\n filter_by_child_type :{\ntype:query,\n >> q:\"type_s:doc.enriched.text.keywords\",\ndomain: { blockChildren : >> \"type_s:doc\" },\nfacet:{\n top_keywords_text : {\ntype
Re[2]: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)
So the first issue was eventually solved by adding facet: {top_terms_by_doc: "unique(_root_)"} AND sorting the outer facet buckets by this sub-facet:

curl http://localhost:8985/solr/enron_path_w_ts/query -d
'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
json.facet={
  filter_by_child_type :{
    type:query,
    q:"type_s:doc.enriched.text.keywords",
    domain: { blockChildren : "type_s:doc" },
    facet:{
      top_keywords_text : {
        type: terms,
        field: text_t,
        limit: 10,
        sort: "top_terms_by_doc desc",
        facet: {
          top_terms_by_doc: "unique(_root_)"
        }
      }
    }
  }
}'

The BlockJoin Faceting part is still open: I've tried all the conventional faceting parameters: facet.limit, child.facet.limit, f.text_t.facet.limit ... nothing worked :(

>Monday, 28 March 2016, 17:20 -04:00 from Alisa Z.:
>
>Ok, so for the 1st question, I think I'm getting closer: adding facet:
>{top_terms_by_doc: "unique(_root_)"} as indicated in
>http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev returns correct
>counts.
However, sorting is done by the upper faceting not by the >unique(_root_): > > >curl http://localhost:8985/solr/my_collection /query -d >'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0& >json.facet={ > filter_by_child_type :{ > type:query, > q:"type_s:doc.enriched.text.keywords", > domain: { blockChildren : "type_s:doc" }, > facet:{ > top_keywords_text : { > type: terms, > field: text_t, > limit: 10, > facet: { > top_terms_by_doc: "unique(_root_)" > } > } > } > } >}' > >RETURNS > >{ > "responseHeader":{ > "status":0, > "QTime":25, > "params":{ > "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData >+Subject_t:california", > "json.facet":"{\n filter_by_child_type :{\n type:query,\n >q:\"type_s:doc.enriched.text.keywords\",\n domain: { blockChildren : >\"type_s:doc\" },\n facet:{\n top_keywords_text : {\n type: >terms,\n field: text_t,\n limit: 10,\n facet: {\n > top_terms_by_doc: \"unique(_root_)\"\n }\n }\n }\n }\n}", > "rows":"0"}}, > "response":{"numFound":19,"start":0,"docs":[] > }, > "facets":{ > "count":19, > "filter_by_child_type":{ > "count":686, > "top_keywords_text":{ > "buckets":[{ > "val":"enron", > "count":57, > "top_terms_by_doc":9}, > { > "val":"california", > "count":22, > "top_terms_by_doc":13}, > { > "val":"power", > "count":21, > "top_terms_by_doc":7}, > { > "val":"rate", > "count":15, > "top_terms_by_doc":5}, > { > "val":"plan", > "count":13, > "top_terms_by_doc":3}, > { > "val":"hou", > "count":12, > "top_terms_by_doc":5}, > { > "val":"energy", > "count":11, > "top_terms_by_doc":5}, > { > "val":"na", > "count":11, > "top_terms_by_doc":5}, > { > "val":"mckinsey", > "count":10, > "top_terms_by_doc":1}, > { > "val":"socal", > "count":10, > "top_terms_by_doc":4}] > >Nice, but I want them to be ordered by "top_terms_by_doc" frequencies, not by >the "count" frequencies. >Any suggestions? > >Thanks, >Alisa > > > > > >>Понедельник, 28 марта 2016, 15:39 -04:00 от Alisa Z. 
< prol...@mail.ru >: >> >>Hi all, >> >>I am trying to perform faceting of parent docs by nested document fields.
Re: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)
Ok, so for the 1st question, I think I'm getting closer: adding facet: {top_terms_by_doc: "unique(_root_)"}, as indicated in http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev , returns correct counts. However, the buckets are still sorted by the count of the enclosing terms facet, not by unique(_root_):

curl http://localhost:8985/solr/my_collection/query -d
'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
json.facet={
  filter_by_child_type :{
    type:query,
    q:"type_s:doc.enriched.text.keywords",
    domain: { blockChildren : "type_s:doc" },
    facet:{
      top_keywords_text : {
        type: terms,
        field: text_t,
        limit: 10,
        facet: {
          top_terms_by_doc: "unique(_root_)"
        }
      }
    }
  }
}'

RETURNS

{
  "responseHeader":{
    "status":0,
    "QTime":25,
    "params":{
      "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData +Subject_t:california",
      "json.facet":"{\n filter_by_child_type :{\n type:query,\n q:\"type_s:doc.enriched.text.keywords\",\n domain: { blockChildren : \"type_s:doc\" },\n facet:{\n top_keywords_text : {\n type: terms,\n field: text_t,\n limit: 10,\n facet: {\n top_terms_by_doc: \"unique(_root_)\"\n }\n }\n }\n }\n}",
      "rows":"0"}},
  "response":{"numFound":19,"start":0,"docs":[]
  },
  "facets":{
    "count":19,
    "filter_by_child_type":{
      "count":686,
      "top_keywords_text":{
        "buckets":[{
            "val":"enron",
            "count":57,
            "top_terms_by_doc":9},
          {
            "val":"california",
            "count":22,
            "top_terms_by_doc":13},
          {
            "val":"power",
            "count":21,
            "top_terms_by_doc":7},
          {
            "val":"rate",
            "count":15,
            "top_terms_by_doc":5},
          {
            "val":"plan",
            "count":13,
            "top_terms_by_doc":3},
          {
            "val":"hou",
            "count":12,
            "top_terms_by_doc":5},
          {
            "val":"energy",
            "count":11,
            "top_terms_by_doc":5},
          {
            "val":"na",
            "count":11,
            "top_terms_by_doc":5},
          {
            "val":"mckinsey",
            "count":10,
            "top_terms_by_doc":1},
          {
            "val":"socal",
            "count":10,
            "top_terms_by_doc":4}]

Nice, but I want the buckets to be ordered by the "top_terms_by_doc" values, not by the "count" values. Any suggestions?

Thanks,
Alisa

> Monday, 28 March 2016, 15:39 -04:00, from Alisa Z.:
>
> Hi all,
>
> I am trying to perform faceting of parent docs by nested document fields. I've
> tried 2 approaches as in subject, yet in the first the results are not quite
> correct and in the 2nd I cannot get the query right. So I need help on either
> of them, and any explanation, documentation or blogs on the behavior are much
> appreciated.
>
> Verbally the query is as follows: "Find top 10 keywords for all documents with
> "california" in email subject line"
>
> Here is the query with responses:
>
> == JSON Facet API ==
>
> curl http://localhost:8985/solr/my_collection/query -d
> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
> json.facet={
>   filter_by_child_type :{
>     type:query,
>     q:"type_s:doc.enriched.text.keywords",
>     domain: { blockChildren : "type_s:doc" },
>     facet:{
>       top_keywords_text : {
>         type: terms,
>         field: text_t,
>         limit: 10
>       }
>     }
>   }
> }'
>
> RETURNS:
>
> {
>   "responseHeader":{
>     "status":0,
>     "QTime":134,
>     "params":{
>       "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData +Subject_t:california",
>       "json.facet":"{\n filter_by_child_type :{\n type:query,\n q:\"type_s:doc.enriched.text.keywo
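A possible direction for the sorting question, sketched under assumptions: the JSON Facet API lets a terms facet take a `sort` that names one of its own sub-facet statistics, so sorting by the `unique(_root_)` stat should only need a `sort: "top_terms_by_doc desc"` entry next to `limit`. The Python below only builds the request body; the helper names `build_facet` and `build_body` are made up for illustration, and nothing here is sent to a server.

```python
import json
from urllib.parse import urlencode


def build_facet(stat_name="top_terms_by_doc"):
    """Build the json.facet structure, sorting buckets by the unique(_root_) stat."""
    return {
        "filter_by_child_type": {
            "type": "query",
            "q": "type_s:doc.enriched.text.keywords",
            "domain": {"blockChildren": "type_s:doc"},
            "facet": {
                "top_keywords_text": {
                    "type": "terms",
                    "field": "text_t",
                    "limit": 10,
                    # Sort by the sub-facet statistic instead of the bucket count.
                    "sort": f"{stat_name} desc",
                    "facet": {stat_name: "unique(_root_)"},
                }
            },
        }
    }


def build_body():
    """Return the form-encoded POST body for the /query handler."""
    params = {
        "q": '{!parent which="type_s:doc"}type_s:doc.userData +Subject_t:california',
        "rows": "0",
        "json.facet": json.dumps(build_facet()),
    }
    return urlencode(params)


if __name__ == "__main__":
    print(build_body())
```

The resulting string can be passed to curl with `-d`, or posted from any HTTP client, against the same `/query` endpoint used above.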
[nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)
Hi all,

I am trying to perform faceting of parent docs by nested document fields. I've tried 2 approaches as in subject, yet in the first the results are not quite correct and in the 2nd I cannot get the query right. So I need help on either of them, and any explanation, documentation or blogs on the behavior are much appreciated.

Verbally the query is as follows: "Find top 10 keywords for all documents with "california" in email subject line"

Here is the query with responses:

== JSON Facet API ==

curl http://localhost:8985/solr/my_collection/query -d
'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
json.facet={
  filter_by_child_type :{
    type:query,
    q:"type_s:doc.enriched.text.keywords",
    domain: { blockChildren : "type_s:doc" },
    facet:{
      top_keywords_text : {
        type: terms,
        field: text_t,
        limit: 10
      }
    }
  }
}'

RETURNS:

{
  "responseHeader":{
    "status":0,
    "QTime":134,
    "params":{
      "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData +Subject_t:california",
      "json.facet":"{\n filter_by_child_type :{\n type:query,\n q:\"type_s:doc.enriched.text.keywords\",\n domain: { blockChildren : \"type_s:doc\" },\n facet:{\n top_keywords_text : {\n type: terms,\n field: text_t,\n limit: 10\n }\n }\n }\n}",
      "rows":"0"}},
  "response":{"numFound":19,"start":0,"docs":[]
  },
  "facets":{
    "count":19,
    "filter_by_child_type":{
      "count":686,
      "top_keywords_text":{
        "buckets":[{
            "val":"enron",
            "count":57},
          {
            "val":"california",
            "count":22},
          {
            "val":"power",
            "count":21},
          {
            "val":"rate",
            "count":15},
          {
            "val":"plan",
            "count":13},
          {
            "val":"hou",
            "count":12},
          {
            "val":"energy",
            "count":11},
          {
            "val":"na",
            "count":11},
          {
            "val":"mckinsey",
            "count":10},
          {
            "val":"socal",
            "count":10}]

QUESTION: where do the counts greater than 19 (the total number of top-level documents returned by the query) come from? How should I adjust the query to facet only on the top-level documents (so that no count is greater than 19)?
== BlockJoin Faceting ==

Following the example on https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting , I've tried this:

/bjqfacet?q={!parent%20which=type_s:doc}type_s:doc.enriched.text.keywords&child.facet.field=text_t&child.facet.limit=10&child.facet.mincount=5&rows=0&fq={!parent%20which=type_s:doc}type_s:doc.userData%20%2BSubject_t:california&wt=json&indent=true

RETURNS:

{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "response":{"numFound":19,"start":0,"docs":[]
  },
  "facet_counts":[
    "facet_fields",[
      "text_t",[
        "128x",1,
        "18xx",1,
        "1x",1,
        "2",2,
        "30",1,
        "60",1,
        "78xx",1,
        "82xx",1,
        "ab",2,
        "access",5,
        "account",1,
        "accounts",1,
        ...
        "california",13,
        ...
        "enron",9,
        ...
        ]]]}

QUESTION: This looks very close to what I want, yet why are child.facet.limit=10&child.facet.mincount=5 ignored? How to get the top 10 most frequent?

Thank you for your help in advance!

--
Alisa Zhila
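While the server-side question stays open, a client-side workaround is straightforward. This is a minimal sketch, not a statement about what the BlockJoin facet component supports: it takes the flat `[term, count, term, count, ...]` list shown in the response above and applies the mincount and limit itself. The function name `top_terms` and the sample list are illustrative only; the counts in the sample are copied from the response fragments quoted above.

```python
def top_terms(flat_counts, limit=10, mincount=5):
    """Return the `limit` most frequent (term, count) pairs with count >= mincount.

    `flat_counts` is the alternating [term, count, term, count, ...] list that
    Solr returns for a facet field in the default JSON layout.
    """
    pairs = zip(flat_counts[0::2], flat_counts[1::2])
    kept = [(term, count) for term, count in pairs if count >= mincount]
    # Highest count first; ties are broken alphabetically for stable output.
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    return kept[:limit]


if __name__ == "__main__":
    sample = ["128x", 1, "ab", 2, "access", 5, "california", 13, "enron", 9]
    for term, count in top_terms(sample):
        print(term, count)
```

The trade-off is that the whole term list crosses the wire before it is trimmed, which may matter for fields with many distinct values.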
Re[2]: Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?
Oh, I apologize... When I ran it the first time, I must have tried putting it in a different collection. Now that I saw it and put it into the correct collection (where the schema is adjusted properly), it worked! Thanks, that was the solution.

> Monday, 28 March 2016, 14:44 -04:00, from Erik Hatcher:
>
> Alisa - sorry for not seeing this sooner, but I think Yonik is right... try
> adding `-format solr` to the command-line of bin/post.
>
> Solr 5.5 is where the change occurred to use a different end-point for JSON.
>
> --
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>> On Mar 28, 2016, at 2:04 PM, Alisa Z. < prol...@mail.ru > wrote:
>> @Yonik, thank you for your response.
>>
>> I think that the issue is of a different kind because my upload used to work
>> well on Solr 5.3.1 and does not want to work on Solr 5.5.0 because of some
>> changes in dynamic schema recognition. So maybe you could advise on
>> reconsidering the data model that I am using.
>>
>> I have the type_s field serving as an indicator of different types of
>> parents and children. However, in my data model, siblings at one level could
>> be of different type/category, e.g.:
>>
>> type_s: PARENT
>>   type_s: child_type1
>>     type_s: grandchild_type4
>>     type_s: grandchild_type5
>>   type_s: child_type2
>>     type_s: grandchild_type6
>>     type_s: grandchild_type4
>>   type_s: child_type3
>>     type_s: grandchild_type7
>>     type_s: grandchild_type5
>>
>> So the hierarchy-distinguishing field type_s can have different values at
>> different levels of the hierarchy because the nodes could be of different
>> types.
>>
>> Further, in Solr 5.3.1
>> solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
>> doesn't produce any error and I can run BlockJoin queries using the type_s
>> field to indicate the nodes.
>>
>> However, in Solr 5.5.0, when I try to upload the same data in the same format
>> (which was consumed perfectly in Solr 5.3.1):
>> solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
>> I get the following error:
>> "msg":"ERROR: [parent=id1] multiple values encountered for non multiValued
>> field _childDocuments_._childDocuments_.type_s: [grandchild_type4,
>> grandchild_type5]" .
>>
>> So now I feel that I should either have 2 types of fields for hierarchy
>> description (one for hierarchy level specification and another for type of
>> node specification), or make all single-valued fields multi-valued in
>> descendants. However, I am not sure whether the 2nd option will uniquely
>> specify a document.
>>
>> Can anybody advise on the data modelling/schema approach for successfully
>> navigating a hierarchical data structure?
>> I will be trying to adapt the approach outlined in "The Many Facets of
>> Apache Solr" to my data. Yet I would like to hear any other practical tips
>> for hierarchical data on Solr 5.5.
>>
>> Thank you in advance.
>> --Alisa
>>
>>> Sat, 26 Mar 2016 -4:48:00 -0400, from Yonik Seeley < ysee...@gmail.com >:
>>>
>>> Found the JIRA: https://issues.apache.org/jira/browse/SOLR-7042
>>> It looks like you can try adding
>>>   -format solr
>>> to your bin/post command line to get back to normal "solr JSON"
>>>
>>> -Yonik
>>>
>>> On Fri, Mar 25, 2016 at 8:43 PM, Yonik Seeley < ysee...@gmail.com > wrote:
>>>> On Fri, Mar 25, 2016 at 6:19 PM, Alisa Z. < prol...@mail.ru > wrote:
>>>>> Hi all,
>>>>> It is partially a question, partially a discussion.
>>>>> I am working with documents with deep levels of nesting. The documents are
>>>>> in a single JSON file (see a sample below).
>>>>>
>>>>> When I was on Solr 5.3.1,
>>>>> solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
>>>>
>>>> I think I recall seeing a JIRA go by that changed the URL that
>>>> bin/post hits from /update/json to /update/json/docs.
>>>> I know the latter does more processing and handles "custom" JSON, but
>>>> I don't know the details. That would be my guess about what changed
>>>> and what's messing you up.
>>>>
>>>> You could try using curl directly to /update/json and see if that works
>>>> better.
>>>>
>>>> -Yonik
Re[2]: Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?
@Yonik, thank you for your response.

I think that the issue is of a different kind because my upload used to work well on Solr 5.3.1 and does not want to work on Solr 5.5.0 because of some changes in dynamic schema recognition. So maybe you could advise on reconsidering the data model that I am using.

I have the type_s field serving as an indicator of different types of parents and children. However, in my data model, siblings at one level could be of different type/category, e.g.:

type_s: PARENT
  type_s: child_type1
    type_s: grandchild_type4
    type_s: grandchild_type5
  type_s: child_type2
    type_s: grandchild_type6
    type_s: grandchild_type4
  type_s: child_type3
    type_s: grandchild_type7
    type_s: grandchild_type5

So the hierarchy-distinguishing field type_s can have different values at different levels of the hierarchy because the nodes could be of different types.

Further, in Solr 5.3.1
solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
doesn't produce any error and I can run BlockJoin queries using the type_s field to indicate the nodes.

However, in Solr 5.5.0, when I try to upload the same data in the same format (which was consumed perfectly in Solr 5.3.1):
solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
I get the following error:
"msg":"ERROR: [parent=id1] multiple values encountered for non multiValued field _childDocuments_._childDocuments_.type_s: [grandchild_type4, grandchild_type5]" .

So now I feel that I should either have 2 types of fields for hierarchy description (one for hierarchy level specification and another for type of node specification), or make all single-valued fields multi-valued in descendants. However, I am not sure whether the 2nd option will uniquely specify a document.

Can anybody advise on the data modelling/schema approach for successfully navigating a hierarchical data structure?
I will be trying to adapt the approach outlined in "The Many Facets of Apache Solr" to my data. Yet I would like to hear any other practical tips for hierarchical data on Solr 5.5.

Thank you in advance.
--Alisa

> Sat, 26 Mar 2016 -4:48:00 -0400, from Yonik Seeley:
>
> Found the JIRA: https://issues.apache.org/jira/browse/SOLR-7042
> It looks like you can try adding
>   -format solr
> to your bin/post command line to get back to normal "solr JSON"
>
> -Yonik
>
> On Fri, Mar 25, 2016 at 8:43 PM, Yonik Seeley < ysee...@gmail.com > wrote:
>> On Fri, Mar 25, 2016 at 6:19 PM, Alisa Z. < prol...@mail.ru > wrote:
>>> Hi all,
>>> It is partially a question, partially a discussion.
>>> I am working with documents with deep levels of nesting. The documents are
>>> in a single JSON file (see a sample below).
>>>
>>> When I was on Solr 5.3.1,
>>> solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
>>
>> I think I recall seeing a JIRA go by that changed the URL that
>> bin/post hits from /update/json to /update/json/docs.
>> I know the latter does more processing and handles "custom" JSON, but
>> I don't know the details. That would be my guess about what changed
>> and what's messing you up.
>>
>> You could try using curl directly to /update/json and see if that works
>> better.
>>
>> -Yonik
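Yonik's suggestion to bypass bin/post and send the file straight to /update/json can be sketched in Python as well as with curl. This is a minimal sketch under assumptions: the base URL, the collection name and the `commit=true` parameter are taken from, or modelled on, the commands in this thread, and `build_update_request` is a hypothetical helper name. It only constructs the request; the actual send is left as a comment so the snippet runs without a Solr instance.

```python
import json
from urllib.request import Request


def build_update_request(docs, base="http://localhost:8983/solr",
                         collection="my_collection"):
    """Build a POST to /update/json, the handler that expects Solr-native JSON
    (including nested _childDocuments_), rather than /update/json/docs."""
    url = f"{base}/{collection}/update/json?commit=true"
    payload = json.dumps(docs).encode("utf-8")
    return Request(url, data=payload,
                   headers={"Content-Type": "application/json"},
                   method="POST")


if __name__ == "__main__":
    docs = [{"id": "id1", "type_s": "PARENT",
             "_childDocuments_": [{"id": "id1-c1", "type_s": "child_type1"}]}]
    req = build_update_request(docs)
    print(req.get_method(), req.full_url)
    # To actually send it:
    #   from urllib.request import urlopen
    #   print(urlopen(req).read().decode("utf-8"))
```

With bin/post itself, the `-format solr` flag mentioned above is the shorter route; the sketch is mainly useful when indexing from a script.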
Re: Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?
Further experiments:

-- updated the schema to account for multiple values:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
    "name":"*type_s",
    "type":"string",
    "indexed":true,
    "multiValued":true }
}' http://localhost:8985/solr/my_collection/schema

-- Re-ran indexing:

solr-5.5.0$ bin/post -c my_collection ../../data/data-solr.json -p 8985
java -classpath /Users//solr-5.5.0/dist/solr-core-5.5.0.jar -Dauto=yes -Dport=8985 -Dc=enron_path_w_ts -Ddata=files org.apache.solr.util.SimplePostTool ../../data/data-solr.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8985/solr/my_collection/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file data-solr-path-w-ts-suffix.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8985/solr/my_collection/update/json/docs
SimplePostTool: WARNING: Response: {"responseHeader":{"status":400,"QTime":12},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"ERROR: [doc=AVNzOoBsX6g-H6sC3dgo] multiple values encountered for non multiValued field _childDocuments_._childDocuments_._childDocuments_.relevance_tf: [0.918377, 0.737646, 0.700964, 0.659539, 0.657294, 0.62809, 0.612241, 0.609963, 0.873428, 0.764, 0.763825, 0.552016, 0.472819, 0.30331, 0.292935, 0.285799, 0.278851, 0.936158, 0.790093, 0.722639, 0.649841, 0.576905, 0.570454, 0.445547, 0.429439, 0.410347, 0.391091, 0.293075, 0.253883, 0.252494, 0.250084, 0.242866, 0.24142, 0.239883, 0.239827, 0.239563, 0.239507, 0.238434, 0.238193, 0.237804, 0.237769, 0.237022, 0.236955, 0.2364, 0.236164, 0.236129, 0.236025, 0.235973]","code":400}}
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8985/solr/my_collection/update/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8985/solr/my_collection/update...
Time spent: 0:00:05.137

So now it dumps all the values of relevance_tf into one array, disregarding the type of the nested document they actually belonged to... It really does not seem to handle hierarchies with branches of different types properly. :(

--
Alisa

> Friday, 25 March 2016, 18:19 -04:00, from Alisa Z.:
>
> Hi all,
> It is partially a question, partially a discussion.
> I am working with documents with deep levels of nesting. The documents are in
> a single JSON file (see a sample below).
>
> When I was on Solr 5.3.1,
> solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
> caused no problems.
>
> Now, I am trying to run just the same on Solr-5.5.0:
>
> solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
> java -classpath /Users//solr-5.5.0/dist/solr-core-5.5.0.jar
> -Dauto=yes -Dc=enron_path_w_ts -Ddata=files
> org.apache.solr.util.SimplePostTool ../data/data-solr.json
> SimplePostTool version 5.0.0
> Posting files to [base] url http://localhost:8983/solr/my_collection/update...
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> POSTing file data-solr.json (application/json) to [base]/json/docs
> SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url:
> http://localhost:8983/solr/my_collection/update/json/docs
> SimplePostTool: WARNING: Response:
> {"responseHeader":{"status":400,"QTime":5},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"ERROR:
> [doc=AVNzOoBsX6g-H6sC3dgo] multiple values encountered for non multiValued
> field _childDocuments_._childDocuments_.type_s: [doc.userData.parts,
> doc.enriched.text]","code":400}}
> SimplePostTool: WARNING: IOException while reading response:
> java.io.IOException: Server returned HTTP response code: 400 for URL:
> http://localhost:8983/solr/my_collection/json/docs
> 1 files indexed.
> COMMITting Solr index changes to
> http://localhost:8983/solr/my_collection/update...
> Time spent: 0:00:05.078
>
> So obviously I don't get my collection uploaded and indexed properly anymore.
>
> The question is:
> - What to do?
Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?
Hi all,

It is partially a question, partially a discussion.
I am working with documents with deep levels of nesting. The documents are in a single JSON file (see a sample below).

When I was on Solr 5.3.1,
solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
caused no problems.

Now, I am trying to run just the same on Solr-5.5.0:

solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
java -classpath /Users//solr-5.5.0/dist/solr-core-5.5.0.jar -Dauto=yes -Dc=enron_path_w_ts -Ddata=files org.apache.solr.util.SimplePostTool ../data/data-solr.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/my_collection/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file data-solr.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: http://localhost:8983/solr/my_collection/update/json/docs
SimplePostTool: WARNING: Response: {"responseHeader":{"status":400,"QTime":5},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"ERROR: [doc=AVNzOoBsX6g-H6sC3dgo] multiple values encountered for non multiValued field _childDocuments_._childDocuments_.type_s: [doc.userData.parts, doc.enriched.text]","code":400}}
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/my_collection/json/docs
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/my_collection/update...
Time spent: 0:00:05.078

So obviously I don't get my collection uploaded and indexed properly anymore.

The question is:
- What to do?

The discussion is:
- Is it proper behavior? It used to be smooth on Solr 5.3.1: I did not need to know exactly how many levels of nesting I have, or to specify whether the _childDocuments_ were of the same type or not.

A partial sample follows:

[
  {
    "type_s": "doc",
    "_childDocuments_": [
      {
        "type_s": "doc.userData",
        "id": "AVNzOoBsX6g-H6sC3dgo-userData-23461",
        "Mime-Version_t": "1.0",
        "_childDocuments_": [
          {
            "type_s": "doc.userData.parts",
            "content_t": "- SOMETEXT",
            "id": "AVNzOoBsX6g-H6sC3dgo-parts-15557",
            "contentType_t": "text/plain"
          }
        ],
        "Content-Transfer-Encoding_t": "7bit"
      },
      {
        "type_s": "doc.enriched",
        "_childDocuments_": [
          {
            "type_s": "doc.enriched.text",
            "language_t": "english",
            "_childDocuments_": [
              {
                "type_s": "doc.enriched.text.docSentiment",
                "id": "AVNzOoBsX6g-H6sC3dgo-docSentiment-17692",
                "type_t": "positive"
              },
              {
                "type_s": "doc.enriched.text.taxonomy",
                "label_t": "/business",
                "id": "AVNzOoBsX6g-H6sC3dgo-taxonomy-12728"
              },
              {
                "type_s": "doc.enriched.text.concepts",
                "id": "AVNzOoBsX6g-H6sC3dgo-concepts-98530",
                "text_t": "Stephen",
                "_childDocuments_": [
                  {
                    "type_s": "doc.enriched.text.concepts.knowledgeGraph",
                    "id": "AVNzOoBsX6g-H6sC3dgo-knowledgeGraph-20811",
                    "typeHierarchy_t": "/people/children/stephen"
                  }
                ]
              },
              {
                "type_s": "doc.enriched.text.concepts",
                "id": "AVNzOoBsX6g-H6sC3dgo-concepts-12396",
                "text_t": "Thought",
                "_childDocuments_": [
                  {
                    "type_s": "doc.enriched.text.concepts.knowledgeGraph",
                    "id": "AVNzOoBsX6g-H6sC3dgo-knowledgeGraph-20316",
                    "typeHierarchy_t": "/people/ideas/thought"
                  }
                ]
              },
Re[2]: [nesting] Any way to return the whole hierarchical structure when doing Block Join queries?
Mikhail,

Thank you for the answer. I'd be happy to contribute tons of test cases on nested structures and their querying and faceting... I am working on a case of moving very nested data structures to Solr (the other option is ES...), but so far Solr seems to be quite behind... It's great to see that it is moving in that direction though.

I am happy to provide the use-cases (which are actually outside eCommerce) and publicly available test-cases.

Is it correct that the patch will appear in a release version no sooner than Solr 6.0 or even later?

Thanks,
Alisa

> Thursday, 24 March 2016, 15:52 -04:00, from Mikhail Khludnev:
>
> I think you can already kick the tires and contribute a test case into
> https://issues.apache.org/jira/browse/SOLR-8208 that's already reachable
> there I believe, but I am still working on the core design.
>
> On Thu, Mar 24, 2016 at 10:02 PM, Alisa Z. < prol...@mail.ru > wrote:
>
>> Hi all,
>>
>> I apologize for duplicating my previous message:
>> Solr 5.3: anything similar to ChildDocTransformerFactory that does not
>> flatten the hierarchical structure?
>>
>> However, it is still an open and interesting question:
>>
>> Following the example from https://dzone.com/articles/using-solr-49-new
>> , let's say we are given a multiple-level nested structure:
>>
>> 1  I am the parent  PARENT
>>     1.1  I am the 1st child  CHILD
>>     1.2  I am the 2nd child  CHILD
>>         1.2.1  I am a grandchildren  GRANDCHILD
>>
>> Querying
>> q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child
>> parentFilter=cat:PARENT]
>>
>> will return a flattened structure, where cat:CHILD and cat:GRANDCHILD
>> documents end up on the same level:
>>
>> 1  I am the parent  PARENT
>>     1.1  I am the 1st child  CHILD
>>     1.2  I am the 2nd child  CHILD
>>     1.2.1  I am a grandchildren  GRANDCHILD
>>
>> Indeed, the Javadocs for ChildDocTransformerFactory say: "This
>> transformer returns all descendants of each parent document in a flat list
>> nested inside the parent document".
>>
>> Yet is there any way to preserve the hierarchy in the response? I really
>> need to find a way to preserve the structure in the response.
>>
>> Thank you in advance!
>>
>> --
>> Alisa Zhila
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> < http://www.griddynamics.com >
> < mkhlud...@griddynamics.com >
[nesting] Any way to return the whole hierarchical structure when doing Block Join queries?
Hi all,

I apologize for duplicating my previous message:
Solr 5.3: anything similar to ChildDocTransformerFactory that does not flatten the hierarchical structure?

However, it is still an open and interesting question:

Following the example from https://dzone.com/articles/using-solr-49-new , let's say we are given a multiple-level nested structure:

1  I am the parent  PARENT
    1.1  I am the 1st child  CHILD
    1.2  I am the 2nd child  CHILD
        1.2.1  I am a grandchildren  GRANDCHILD

Querying
q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child parentFilter=cat:PARENT]

will return a flattened structure, where cat:CHILD and cat:GRANDCHILD documents end up on the same level:

1  I am the parent  PARENT
    1.1  I am the 1st child  CHILD
    1.2  I am the 2nd child  CHILD
    1.2.1  I am a grandchildren  GRANDCHILD

Indeed, the Javadocs for ChildDocTransformerFactory say: "This transformer returns all descendants of each parent document in a flat list nested inside the parent document".

Yet is there any way to preserve the hierarchy in the response? I really need to find a way to preserve the structure in the response.

Thank you in advance!

--
Alisa Zhila
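Until the transformer can return nested children, the hierarchy can be rebuilt on the client if each document carries enough information to locate its parent. The sketch below is one way to do that under a strong assumption that holds only for the toy example above: ids are dotted, so "1.2.1" is a child of "1.2". With real data a stored parent-id or path field would play the same role. The function name `nest_by_dotted_id` is hypothetical.

```python
def nest_by_dotted_id(parent, flat_children):
    """Rebuild a tree from a parent doc and its flat list of descendants.

    Assumes (for illustration only) that ids are dotted paths, so the parent
    of "1.2.1" is "1.2". Raises KeyError if an ancestor is missing.
    """
    root = dict(parent, _childDocuments_=[])
    by_id = {root["id"]: root}
    # Shallow ids first, so every parent is registered before its children.
    for doc in sorted(flat_children, key=lambda d: d["id"].count(".")):
        node = dict(doc, _childDocuments_=[])
        by_id[node["id"]] = node
        parent_id = node["id"].rsplit(".", 1)[0]
        by_id[parent_id]["_childDocuments_"].append(node)
    return root


if __name__ == "__main__":
    parent = {"id": "1", "name": "I am the parent", "cat": "PARENT"}
    flat = [
        {"id": "1.1", "name": "I am the 1st child", "cat": "CHILD"},
        {"id": "1.2", "name": "I am the 2nd child", "cat": "CHILD"},
        {"id": "1.2.1", "name": "I am a grandchildren", "cat": "GRANDCHILD"},
    ]
    tree = nest_by_dotted_id(parent, flat)
    print([c["id"] for c in tree["_childDocuments_"]])
```

The input documents are copied rather than mutated, so the flat response can still be used as returned.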
Solr 5.3: anything similar to ChildDocTransformerFactory that does not flatten the hierarchical structure?
Hi all,

Following the example from https://dzone.com/articles/using-solr-49-new , let's say we are given a multiple-level nested structure:

1  I am the parent  PARENT
    1.1  I am the 1st child  CHILD
    1.2  I am the 2nd child  CHILD
        1.2.1  I am a grandchildren  GRANDCHILD

Querying
q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child parentFilter=cat:PARENT]

will return a flattened structure, where cat:CHILD and cat:GRANDCHILD documents end up on the same level:

1  I am the parent  PARENT
    1.1  I am the 1st child  CHILD
    1.2  I am the 2nd child  CHILD
    1.2.1  I am a grandchildren  GRANDCHILD

Indeed, the Javadocs for ChildDocTransformerFactory say: "This transformer returns all descendants of each parent document in a flat list nested inside the parent document".

Yet is there any way to preserve the hierarchy in the response? I really need to find a way to preserve the structure in the response.

Thank you in advance!

--
Alisa Zhila
date range faceting on the whole dataset
Hello,

Is it possible to perform date range faceting on the whole dataset without indicating facet.range.start and facet.range.end? What if I have no clue about when my data starts and when it ends (it might be some point in the future)?

A sample query:

http://localhost:8983/solr/enron-path/select?q=*:*&rows=0&facet=true&facet.range=date_tdt&f.date_tdt.facet.range.start=NOW-20YEAR&f.date_tdt.facet.range.end=NOW-14YEARS&f.date_tdt.facet.range.gap=%2B1DAY&debugQuery=true

However, in this case I found the range.start and range.end points empirically, and there are still a lot of "blank" periods. Given that I actually need to step by day, how can I avoid unnecessary calculation on dates that are outside my data set?

Thanks,

--
Alisa Zhila
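One common pattern for this, sketched here under assumptions, is a two-step request: first ask the Stats component for the field's min and max, then feed those into facet.range.start and facet.range.end, with facet.mincount=1 to drop empty days from the response. The helper names below are hypothetical, the response path `stats.stats_fields.<field>.min` is quoted from memory of the Stats component output and should be checked against a real response, and the sketch only builds parameter dictionaries rather than calling Solr.

```python
from urllib.parse import urlencode


def stats_params(field="date_tdt"):
    """Step 1: ask the Stats component for min/max of the date field."""
    return {"q": "*:*", "rows": "0", "stats": "true", "stats.field": field}


def range_params(min_date, max_date, field="date_tdt", gap="+1DAY"):
    """Step 2: range-facet between the observed bounds, rounded to whole days."""
    return {
        "q": "*:*",
        "rows": "0",
        "facet": "true",
        "facet.range": field,
        f"f.{field}.facet.range.start": f"{min_date}/DAY",
        f"f.{field}.facet.range.end": f"{max_date}/DAY+1DAY",
        f"f.{field}.facet.range.gap": gap,
        # Assumption: skips buckets with no documents in the output.
        "facet.mincount": "1",
    }


if __name__ == "__main__":
    base = "http://localhost:8983/solr/enron-path/select"
    print(base + "?" + urlencode(stats_params()))
    # min/max would come from the first response, e.g.
    # response["stats"]["stats_fields"]["date_tdt"]["min"]
    print(base + "?" + urlencode(range_params("2001-01-01T08:30:00Z",
                                               "2002-06-30T10:00:00Z")))
```

The dates in the usage example are placeholders, not values from the data set discussed here.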
Re[2]: [nested] how to specify a path for multiple nesting?
Thanks, Mikhail.

I eventually added a distinguishing field "path" and queried unambiguously.

> Thursday, 17 March 2016, 9:46 -04:00, from Mikhail Khludnev:
>
> Hello,
>
> Please find inline
>
> On Wed, Mar 16, 2016 at 10:10 PM, Alisa Z. < prol...@mail.ru > wrote:
>> Hi all,
>> I have a deeply multi-level data structure (up to 6-7 levels deep) where, due
>> to the nature of the data, some nested documents can have the same type names
>> at various levels. How do I form a proper query on a nested field that would
>> contain "a path" that defines that field?
>>
>> I'll clarify with an example:
>>
>> Reduced dataset:
>>
>> [
>>   {
>>     id : book1,
>>     type_s:book,
>>     title_t : "The Way of Kings",
>>     author_s : "Brandon Sanderson",
>>     _childDocuments_ : [
>>       {
>>         id: book1_c1,
>>         type_s:body,
>>         text_t:"body text of the book... ",
>>         _childDocuments_:[
>>           {id: book2_c1_e1,
>>            type_s:"keywords",
>>            text_t:["The Matrix", "Neo", "character", "somebody", ...]}
>>         ]
>>       },
>>       { id: book1_c2,
>>         type_s:title,
>>         text_t:"This book was too long.",
>>         _childDocuments_:[
>>           {id: book2_c1_e1,
>>            type_s:"keywords",
>>            text_t:["The Matrix", "Neo"]}
>>         ]
>>       }
>>     ]
>>   },
>>   ...
>> ]
>>
>> So there are different paths to the text_t field:
>> * book.body.keywords.text_t
>> * book.title.keywords.text_t
>> I need to write a query that returns, say, all books which have the keyword
>> "Neo" in their title (not body).
>> I tried:
>>
>> (1) q={!parent which=type_s:book}type_s:keywords AND text_t:Neo
>> which is obviously incorrect (returns both books whose body keywords and
>> title keywords contain Neo):
>>
>> (2) q={!parent which=type_s:book}type_s:body^=0{!parent
>> which=type_s:body}type_s:keywords AND text_t:Neo
>
> I'd say this might work, however I prefer to use v=$foo to break the query
> up unambiguously. See also
> https://lucidworks.com/blog/2011/12/28/why-not-and-or-and-not/ but make sure
> that + is encoded as %2B in the url.
>
> q={!parent which=type_s:book v=$titles}&titles=+type_s:title^=0 +{!parent
> which='type_s:(body title book)' v=$keywords}&keywords=+type_s:keywords^=0
> +text_t:Neo
>
> Specifying all sibling scope discriminators is the black magic of block join
> (if it ever works). Please get back with the parsed query (from debugQuery=true)
> and the actual/expected result. Anyway, explicitly resolving scopes
> (type_s:body_keywords, type_s:title_keywords) might be much more maintainable.
>
>> which does not return correct results (and I am not quite sure what it
>> really does, I just saw it in another thread of this mailing list)
>>
>> Can you help me to understand whether it is possible?
>> Or do I have to give unique types to documents at different levels of
>> nesting (e.g., type_s:body_keywords & type_s:title_keywords)? I am trying to
>> avoid that; finding a way to specify a path would be much preferable.
>>
>> Thank you in advance and looking forward to hearing from you
>> --
>> Alisa Zhila
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
[nested] how to specify a path for multiple nesting?
Hi all,

I have a deeply multi-level data structure (up to 6-7 levels deep) where, due to the nature of the data, some nested documents can have the same type names at various levels. How do I form a proper query on a nested field that would contain "a path" that defines that field?

I'll clarify with an example:

Reduced dataset:

[
  {
    id : book1,
    type_s:book,
    title_t : "The Way of Kings",
    author_s : "Brandon Sanderson",
    _childDocuments_ : [
      {
        id: book1_c1,
        type_s:body,
        text_t:"body text of the book... ",
        _childDocuments_:[
          {id: book2_c1_e1,
           type_s:"keywords",
           text_t:["The Matrix", "Neo", "character", "somebody", ...]}
        ]
      },
      { id: book1_c2,
        type_s:title,
        text_t:"This book was too long.",
        _childDocuments_:[
          {id: book2_c1_e1,
           type_s:"keywords",
           text_t:["The Matrix", "Neo"]}
        ]
      }
    ]
  },
  ...
]

So there are different paths to the text_t field:
* book.body.keywords.text_t
* book.title.keywords.text_t

I need to write a query that returns, say, all books which have the keyword "Neo" in their title (not body).
I tried:

(1) q={!parent which=type_s:book}type_s:keywords AND text_t:Neo
which is obviously incorrect (returns both books whose body keywords and title keywords contain Neo):

(2) q={!parent which=type_s:book}type_s:body^=0{!parent which=type_s:body}type_s:keywords AND text_t:Neo
which does not return correct results (and I am not quite sure what it really does, I just saw it in another thread of this mailing list)

Can you help me to understand whether it is possible?
Or do I have to give unique types to documents at different levels of nesting (e.g., type_s:body_keywords & type_s:title_keywords)? I am trying to avoid that; finding a way to specify a path would be much preferable.

Thank you in advance and looking forward to hearing from you
--
Alisa Zhila
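The follow-up in this thread resolves the ambiguity by adding a distinguishing "path" field to every nested document. A minimal sketch of that preprocessing step, run before indexing, could look like the following. The field name `path_s`, the helper name `add_paths` and the dot-joined format are assumptions for illustration; the thread does not show the exact field layout that was used.

```python
def add_paths(doc, parent_path="", field="path_s"):
    """Recursively stamp each nested doc with the dot-joined chain of type_s
    values from the root down to it, e.g. "book.title.keywords"."""
    path = f"{parent_path}.{doc['type_s']}" if parent_path else doc["type_s"]
    doc[field] = path
    for child in doc.get("_childDocuments_", []):
        add_paths(child, path, field)
    return doc


def title_keyword_query(term, field="path_s"):
    """Block join query string for books whose *title* keywords contain `term`."""
    return (f"{{!parent which={field}:book}}"
            f"+{field}:book.title.keywords +text_t:{term}")


if __name__ == "__main__":
    book = {
        "id": "book1", "type_s": "book",
        "_childDocuments_": [
            {"id": "book1_c1", "type_s": "body", "_childDocuments_": [
                {"id": "book1_c1_e1", "type_s": "keywords",
                 "text_t": ["The Matrix", "Neo"]}]},
            {"id": "book1_c2", "type_s": "title", "_childDocuments_": [
                {"id": "book1_c2_e1", "type_s": "keywords",
                 "text_t": ["The Matrix", "Neo"]}]},
        ],
    }
    add_paths(book)
    print(book["_childDocuments_"][1]["_childDocuments_"][0]["path_s"])
    print(title_keyword_query("Neo"))
```

Because each path value is unique to one branch, the child clause no longer matches keywords documents under body, which was the problem with query (1). Remember to encode `+` as `%2B` when the query goes into a URL.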