Re[4]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)

2016-05-02 Thread Alisa Z .
 >>You could add a "level2_comment_id" field to the level 2 comments and
>>their children, and then use unique() on that.

OK, I see, I missed the children... Thank you for pointing that out. 

I have introduced that "unique sub-branch identifying" field and propagated it 
down the subbranch (the data is here: 
https://github.com/alisa-ipn/solr_nesting/blob/master/data/example-data-solr-for-faceting.json).
 Also changed the corresponding part of the post. 

And it actually works. Yet it takes a lot of effort to make the JSON Facet API 
facet by intermediate levels.  

Having those "unique sub-branch identifying" fields appear automatically, the 
same way the "_root_" field does, would make Solr friendlier for nested data 
like email chains and social media data... 
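For readers who want to reproduce the "propagate it down the subbranch" step, here is a minimal Python sketch of one way it could be done before indexing. It assumes the documents are nested with Solr's "_childDocuments_" JSON key and carry "path" and "id" fields as in the example above; the function name, the child id "10735-23004-1" and the child path value are hypothetical and only serve the illustration.

```python
from typing import Any, Dict, Optional

CHILDREN_KEY = "_childDocuments_"  # JSON key Solr uses for nested child documents


def propagate_branch_id(doc: Dict[str, Any],
                        branch_path: str,
                        id_field: str,
                        inherited: Optional[str] = None) -> None:
    """Stamp each `branch_path` document and all of its descendants
    with the id of that branch document (modifies `doc` in place)."""
    if doc.get("path") == branch_path:
        inherited = doc["id"]          # a new sub-branch starts here
    if inherited is not None:
        doc[id_field] = inherited      # same value for the whole sub-branch
    for child in doc.get(CHILDREN_KEY, []):
        propagate_branch_id(child, branch_path, id_field, inherited)


# Hypothetical toy post with one comment and one keyword under it.
post = {
    "id": "10735", "path": "1.blog-posts",
    CHILDREN_KEY: [
        {"id": "10735-23004", "path": "2.blog-posts.comments",
         CHILDREN_KEY: [
             {"id": "10735-23004-1",
              "path": "3.blog-posts.comments.keywords",
              "text": "Solr"},
         ]},
    ],
}
propagate_branch_id(post, "2.blog-posts.comments", "2.blog-posts.comments-id")
```

After the call, the comment and its keyword child both carry "2.blog-posts.comments-id": "10735-23004", while the top-level post is left untouched, so unique() on that field can count level 2 comments from the level 3 and level 4 documents.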

Thanks,
Alisa 

>Friday, 22 April 2016, 13:47 -04:00 from Yonik Seeley :
>
>On Fri, Apr 22, 2016 at 12:26 PM, Alisa Z. < prol...@mail.ru > wrote:
>>  Hi Yonik,
>>
>> Thanks a lot for your response.
>>
>> I have discussed this with Mikhail Khludnev already and tried this 
>> suggestion. Here's what I've got:
>>
>>
>>
>> sentiment: positive
>> author: Bob
>> text: Great post about Solr
>> 2.blog-posts.comments-id: 10735-23004   //this is a 
>> new field, field name is different on each level for each type, values are 
>> unique
>> date: 2015-04-10T11:30:00Z
>> path: 2.blog-posts.comments
>> id: 10735-23004
>> Query:
>> curl http://localhost:8985/solr/solr_nesting_unique/query -d 
>> 'q=path:2.blog-posts.comments&rows=0&
>> json.facet={
>>   filter_by_child_type :{
>> type:query,
>> q:"path:*comments*keywords",
>> domain: { blockChildren : "path:2.blog-posts.comments" },
>> facet:{
>>   top_entity_text : {
>> type: terms,
>> field: text,
>> limit: 10,
>> sort: "counts_by_comments desc",
>> facet: {
>>counts_by_comments: "unique (2.blog-posts.comments-id )"  
>>   // changed
>>  }'
>
>
>Something is wrong if you are getting 0 counts.
>Let's try taking it piece-by-piece:
>
>Step 1:  q=path:2.blog-posts.comments
>This finds level 2 documents
>
>Step 2:  domain: { blockChildren : "path:2.blog-posts.comments" }
>This first maps to all of the children (level 3 and level 4)
>
>Step 3:  q:"path:*comments*keywords"
>This selects a subset of level3 and level4 documents with keywords
>(Note, in the future this should be doable as an additional filter in
>the domain spec, w/o an additional sub-facet level)
>
>Step 4:
>Facet on the text field of those level3 and level4 keyword docs. For
>each bucket, also find the unique number of values in the
>"2.blog-posts.comments-id" field on those documents.
>
>Without seeing what you indexed, my guess is that the issue is that
>the "2.blog-posts.comments-id" field does not actually exist on those
>level3 and level4 docs being faceted.  The JSON Facet API doesn't
>propagate field values up/down the nested stack yet.  That's what
>https://issues.apache.org/jira/browse/SOLR-8998 is mostly about.
>
>-Yonik
>
>
>>
>> Response:
>>
>> "response":{"numFound":3,"start":0,"docs":[]
>>   },
>>   "facets":{
>> "count":3,
>> "filter_by_child_type":{
>>   "count":9,
>>   "top_entity_text":{
>> "buckets":[{
>> "val":"Elasticsearch",
>> "count":2,
>> "counts_by_comments":0},
>>   {
>> "val":"Solr",
>> "count":5,
>> "counts_by_comments":0},
>>   {
>> "val":"Solr 5.5",
>> "count":1,
>> "counts_by_comments":0},
>>   {
>> "val":"feature",
>> "count":1,
>> "counts_by_comments":0}]
>>
>> So unless I messed something up... or the field name does not look 
>> "canonical" (but it was fast to generate and  it is accepted in a normal 
>> query
>>  http://localhost:8985/solr/solr_nesting_unique/query?q=2.blog-posts.body-id 
>> :* )
>>
>> So I think that

Re[2]: Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)

2016-04-22 Thread Alisa Z .
 Hi Yonik, 

Thanks a lot for your response.  

I have discussed this with Mikhail Khludnev already and tried this suggestion. 
Here's what I've got:  



sentiment: positive
author: Bob
text: Great post about Solr
2.blog-posts.comments-id: 10735-23004       //this is a new 
field, field name is different on each level for each type, values are unique
date: 2015-04-10T11:30:00Z
path: 2.blog-posts.comments
id: 10735-23004
Query:
curl http://localhost:8985/solr/solr_nesting_unique/query -d \
'q=path:2.blog-posts.comments&rows=0&
json.facet={
  filter_by_child_type : {
    type: query,
    q: "path:*comments*keywords",
    domain: { blockChildren : "path:2.blog-posts.comments" },
    facet: {
      top_entity_text : {
        type: terms,
        field: text,
        limit: 10,
        sort: "counts_by_comments desc",
        facet: {
          counts_by_comments: "unique(2.blog-posts.comments-id)"    // changed
        }
      }
    }
  }
}'


Response:

"response":{"numFound":3,"start":0,"docs":[]
  },
  "facets":{
    "count":3,
    "filter_by_child_type":{
  "count":9,
  "top_entity_text":{
    "buckets":[{
    "val":"Elasticsearch",
    "count":2,
    "counts_by_comments":0},
  {
    "val":"Solr",
    "count":5,
    "counts_by_comments":0},
  {
    "val":"Solr 5.5",
    "count":1,
    "counts_by_comments":0},
  {
    "val":"feature",
    "count":1,
    "counts_by_comments":0}]

So unless I messed something up... or the field name does not look "canonical" 
(but it was fast to generate, and it is accepted in a normal query: 
http://localhost:8985/solr/solr_nesting_unique/query?q=2.blog-posts.body-id:*) 

So I think that it's just a JSON Facet API limitation...  
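One way zero counts like these can arise, following the explanation Yonik gives elsewhere in this thread, is that the field passed to unique() is simply absent from the level 3 and level 4 documents being faceted. The Python sketch below is only a toy stand-in for a terms facet with a unique() sub-facet, not Solr code; the function name, the sample documents and their path values are hypothetical.

```python
from collections import defaultdict


def terms_with_unique(docs, terms_field, unique_field):
    """Toy model: bucket docs by `terms_field`, then count the distinct
    values of `unique_field` seen on the docs of each bucket."""
    buckets = defaultdict(lambda: {"count": 0, "values": set()})
    for doc in docs:
        if terms_field not in doc:
            continue
        bucket = buckets[doc[terms_field]]
        bucket["count"] += 1
        if unique_field in doc:        # docs without the field add nothing
            bucket["values"].add(doc[unique_field])
    return {term: {"count": b["count"], "unique": len(b["values"])}
            for term, b in buckets.items()}


FIELD = "2.blog-posts.comments-id"

# Hypothetical keyword documents that do NOT carry the comment id field.
keyword_docs = [
    {"text": "Solr", "path": "3.blog-posts.comments.keywords"},
    {"text": "Solr", "path": "4.blog-posts.comments.replies.keywords"},
]
without_field = terms_with_unique(keyword_docs, "text", FIELD)

# The same documents once the comment id has been copied onto them.
stamped_docs = [{**doc, FIELD: "10735-23004"} for doc in keyword_docs]
with_field = terms_with_unique(stamped_docs, "text", FIELD)
```

In this toy model the bucket count stays at 2 in both cases, while the unique count is 0 when the field lives only on the level 2 comment and 1 once it is present on the keyword documents themselves.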

Best,
--Alisa 


>Friday, 22 April 2016, 9:55 -04:00 from Yonik Seeley :
>
>Hi Alisa,
>This was a bit too hard for me to grok on a first pass... then I saw
>your related blog post which includes the actual sample data and makes
>it more clear.
>
> More comments inline:
>
>On Wed, Apr 20, 2016 at 2:29 PM, Alisa Z. < prol...@mail.ru > wrote:
>>  Hi all,
>>
>> I have been stretching some of Solr's capabilities for handling nested 
>> documents and I've come across the following issue...
>>
>> Let's say I have the following structure:
>>
>> {
>> "blog-posts":{  //level 1
>> "leaf-fields":[
>> "date",
>> "author"],
>> "title":{   //level 2
>> "leaf-fields":[ "text"],
>> "keywords":{//level 3
>> "leaf-fields":[
>> "text",
>> "type"]
>> }
>> },
>> "body":{//level 2
>> "leaf-fields":[ "text"],
>> "keywords":{//level 3
>> "leaf-fields":[
>> "text",
>> "type"]
>> }
>> },
>> "comments":{//level 2
>> "leaf-fields":[
>> "date",
>> "author",
>> "text",
>> "sentiment"
>> ],
>> "keywords":{//level 3
>> "leaf-fields":[
>> "text",
>> "type"]
>> },
>> "replies":{ //level 3
>> "leaf-fields":[
>> "date",
>> "author",
>> "text",
>> "sentiment"],
>> "keywords":{//level 4
>> "leaf-fields":[
>> "text",
>> "type"]
>> }
>>
>>
>> And I want to know the distribution of all readers' keywords (levels 3 and 
>> 4) by comments (level 2).
>> In JSON Facet API I tried this:
>>
>>

Re[2]: how to restrict phrase to appear in same child document

2016-04-21 Thread Alisa Z .
 I'm afraid that if the queries are given in such a loose natural-language 
form, the only way to handle them is to introduce a natural language 
processing stage that would form the right query (which is actually a working 
strategy; IBM does so). 

If your document structure is fixed (i.e., you know the types of nested 
documents and exactly what fields they contain), you can try to introduce some 
basic NLP that will detect the entities or nouns, e.g., "driver" and "car" (try 
the AlchemyLanguage API  http://www.alchemyapi.com/products/demo/alchemylanguage 
for this), and you will also need a syntactic parser to connect black+driver 
and white+mercedes correctly.  



>Wednesday, 20 April 2016, 15:31 -04:00 from Yangrui Guo :
>
>Hi, thanks for answering. My problem is that users do not indicate what
>each color belongs to in the query. For example, in "which black driver
>has a white mercedes", it is difficult to tell which color belongs
>to which field, because there can be thousands of car brands and
>professions. Is there any way to achieve the feature I described
>before?
>
>On Wednesday, April 20, 2016, Alisa Z. < prol...@mail.ru > wrote:
>
>>  Yangrui,
>>
>> First, have you indexed your documents with proper nested document
>> structure [
>>  
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments
>>  ]?
>> From the piece of data you showed, it seems that you just indexed it as
>> it is and it all got flattened.
>>
>> Then, you'll probably want to introduce a distinguishing
>> "type"/"category"/"path" field into your data, so it would look like this:
>>
>> {
>> type:top
>> id:
>> {
>> type:car_color
>> car:
>> color:
>> }
>> {
>>   type:driver_color
>> driver:
>> color:
>> }
>> }
>>
>>
>> >Wed, 20 Apr 2016 -3:28:33 -0400 from Yangrui Guo < guoyang...@gmail.com
>> >:
>> >
>> >hello
>> >
>> >I have a nested document type in my index. Here's the structure of my
>> >document:
>> >
>> >{
>> >id:
>> >{
>> >car:
>> >color:
>> >}
>> >{
>> >driver:
>> >color:
>> >}
>> >}
>> >
>> >However, when I use the query q={!parent
>> >which="content_type:parent"}+(black AND driver)&fq={!parent
>> >which="content_type:parent"}+(white AND mercedes), the result also
>> >contained white driver with black mercedes. I know I can put fields before
>> >terms but it is not always easy to do this. Users might just enter one
>> >string. How can I modify my query to require that the terms between two
>> >parentheses must appear in the same child document, or boost those meet
>> the
>> >criteria? Thanks
>>
>>



Re: pivoting with json facet api

2016-04-21 Thread Alisa Z .
 Hi Yangrui, 

I have summarized some experiments with Solr's nesting capabilities (however, 
they do not cover pivoting exactly, but rather faceting up to parents and down 
to children with some statistics), so maybe you could find an idea there: 

https://medium.com/@alisazhila/solr-s-nesting-on-solr-s-capabilities-to-handle-deeply-nested-document-structures-50eeaaa4347a#.dbxdv3zdp
   

Please let me know in the comments whether it was useful. You could also 
describe your problem in a bit more detail if you don't find the answer. 

Cheers,
Alisa 



>Thursday, 21 April 2016, 1:01 -04:00 from Yangrui Guo :
>
>Hi
>
>I am trying to facet results on my nested documents. The Solr documentation does
>not say much about how to pivot with the JSON API on nested documents. Could
>someone show me some examples? Thanks very much.
>
>Yangrui



Re[2]: Traversal of documents through network

2016-04-21 Thread Alisa Z .
 Well, it took me 7 milliseconds to index a 100MB dataset on a local Solr. So 
you could assume that for 1 GB it would take about 70 ms = 0.07 s, which is 
still pretty fast. 
Yet dealing with network delays is a separate issue.  

100 Wikipedia-article-sized documents shouldn't be a big problem. 


>Thursday, 21 April 2016, 0:57 -04:00 from vidya :
>
>OK, I understand that. So, you would say documents traverse the network.
>If I specify some 100 docs to be displayed on my first page, will it affect
>performance? While docs get traversed, will there be any high-volume
>traffic that affects the performance of the application?
>
>
>And what's the time Solr takes to index 1GB of data in general?
>
>
>Thanks
>
>
>
>--
>View this message in context:  
>http://lucene.472066.n3.nabble.com/Traversal-of-documents-through-network-tp4271555p4271743.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Block Join faceting on intermediate levels with JSON Facet API (might be related to block join rollups & SOLR-8998)

2016-04-20 Thread Alisa Z .
 Hi all, 

I have been stretching some of Solr's capabilities for handling nested 
documents and I've come across the following issue...

Let's say I have the following structure:

{
  "blog-posts":{                  //level 1
    "leaf-fields":["date", "author"],
    "title":{                     //level 2
      "leaf-fields":["text"],
      "keywords":{                //level 3
        "leaf-fields":["text", "type"]
      }
    },
    "body":{                      //level 2
      "leaf-fields":["text"],
      "keywords":{                //level 3
        "leaf-fields":["text", "type"]
      }
    },
    "comments":{                  //level 2
      "leaf-fields":["date", "author", "text", "sentiment"],
      "keywords":{                //level 3
        "leaf-fields":["text", "type"]
      },
      "replies":{                 //level 3
        "leaf-fields":["date", "author", "text", "sentiment"],
        "keywords":{              //level 4
          "leaf-fields":["text", "type"]
        }
      }
    }
  }
}
  
  
And I want to know the distribution of all readers' keywords (levels 3 and 4) 
by comments (level 2).  
In JSON Facet API I tried this: 

curl http://localhost:8983/solr/my_index/query -d \
'q=path:2.blog-posts.comments&rows=0&
json.facet={
  filter_by_child_type : {
    type: query,
    q: "path:*comments*keywords",
    domain: { blockChildren : "path:2.blog-posts.comments" },
    facet: {
      top_keywords : {
        type: terms,
        field: text,
        sort: "counts_by_comments desc",
        facet: {
          // I suspect it should be a different field, not _root_,
          // but what would it be for an intermediate document?
          counts_by_comments: "unique(_root_)"
        }
      }
    }
  }
}'

Which gives me the wrong results: it aggregates by posts, not by comments (it's 
a toy data set, so I know that the correct answer for "Solr" is 3 when faceted 
by comments).

{
"response":{"numFound":3,"start":0,"docs":[]
  },
  "facets":{
    "count":3,
    "filter_by_child_type":{
  "count":9,
  "top_keywords":{
    "buckets":[{
    "val":"Elasticsearch",
    "count":2,
    "counts_by_comments":2},
  {
    "val":"Solr",
    "count":5,
    "counts_by_comments":2},   //here the count by 
"comments" should be 3 
  {
    "val":"Solr 5.5",
    "count":1,
    "counts_by_comments":1},
  {
    "val":"feature",
    "count":1,
    "counts_by_comments":1}]


Am I writing the query wrong? 
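To illustrate the difference between counting by "_root_" and counting by an intermediate level, here is a small Python sketch on made-up data. It is only a toy model of what unique() does inside each terms bucket, not Solr code; the "comment_id" field, the document ids and the helper name are hypothetical, and the five keyword documents are invented so that they mirror the counts above (5 keyword docs, 2 posts, 3 comments).

```python
def unique_per_term(docs, terms_field, unique_field):
    """Toy model of a terms facet with a unique(unique_field) sub-facet:
    for every term, count the distinct values of `unique_field`."""
    values = {}
    for doc in docs:
        values.setdefault(doc[terms_field], set()).add(doc[unique_field])
    return {term: len(seen) for term, seen in values.items()}


# Hypothetical keyword documents (levels 3 and 4). "_root_" is the id of the
# top-level post; "comment_id" is an assumed field naming the level 2 comment.
keyword_docs = [
    {"text": "Solr", "_root_": "post-A", "comment_id": "A-1"},
    {"text": "Solr", "_root_": "post-A", "comment_id": "A-1"},  # reply keyword
    {"text": "Solr", "_root_": "post-A", "comment_id": "A-2"},
    {"text": "Solr", "_root_": "post-B", "comment_id": "B-1"},
    {"text": "Solr", "_root_": "post-B", "comment_id": "B-1"},  # reply keyword
]

by_post = unique_per_term(keyword_docs, "text", "_root_")
by_comment = unique_per_term(keyword_docs, "text", "comment_id")
```

Here by_post is {"Solr": 2} and by_comment is {"Solr": 3}: "_root_" only ever identifies the top-level post, so counting by comments needs a field that identifies the level 2 document and is present on its descendants.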


By the way, Block Join Faceting works fine for this: 
bjqfacet?q={!parent%20which=path:2.blog-posts.comments}path:*.comments*keywords&rows=0&facet=true&child.facet.field=text&wt=json&indent=true

{
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
  "text":[
    "Elasticsearch",2,
    "Solr",3,  //correct result 
    "Solr 5.5",1,
    "feature",1]},
    "facet_dates":{},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

But we've already discussed that it returns too much stuff: there is no way to 
set limits or order by counts :(  That's why I want to see whether it's 
possible to get the same result straight from the JSON Facet API. 

Thank you in advance!

-- 
Alisa Zhila

Re: how to restrict phrase to appear in same child document

2016-04-20 Thread Alisa Z .
 Yangrui, 

First, have you indexed your documents with proper nested document structure 
[https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments]?
 From the piece of data you showed, it seems that you just indexed it as it 
is and it all got flattened. 

Then, you'll probably want to introduce a distinguishing 
"type"/"category"/"path" field into your data, so it would look like this: 

{
type:top
id:
{
type:car_color
car:
color:
}
{
  type:driver_color
driver:
color:
}
}
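As a rough illustration of the structure above, the following Python sketch builds one such record in Solr's nested JSON form and a block join query that pins a color to one child type. The "_childDocuments_" key and the {!parent which=...} parser are regular Solr features; the helper names, the child id scheme and the sample values are assumptions made for this example only.

```python
import json


def make_record(record_id, car, car_color, driver, driver_color):
    """Build a top-level document with two typed child documents."""
    return {
        "id": record_id,
        "type": "top",
        "_childDocuments_": [
            # Child ids are hypothetical; Solr only needs them to be unique.
            {"id": f"{record_id}-car", "type": "car_color",
             "car": car, "color": car_color},
            {"id": f"{record_id}-driver", "type": "driver_color",
             "driver": driver, "color": driver_color},
        ],
    }


def child_color_query(child_type, color):
    """Match parents that have a child of `child_type` with this color.
    Both clauses are evaluated against the same child document."""
    return f'{{!parent which="type:top"}}+type:{child_type} +color:{color}'


record = make_record("1", "mercedes", "white", "teacher", "black")
payload = json.dumps([record])           # body for a JSON update request
q = child_color_query("driver_color", "black")
fq = child_color_query("car_color", "white")
```

With documents shaped like this, q and fq each constrain a single child document, so a white driver with a black car would not match both.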


>Wed, 20 Apr 2016 -3:28:33 -0400 from Yangrui Guo :
>
>hello
>
>I have a nested document type in my index. Here's the structure of my
>document:
>
>{
>id:
>{
>car:
>color:
>}
>{
>driver:
>color:
>}
>}
>
>However, when I use the query q={!parent
>which="content_type:parent"}+(black AND driver)&fq={!parent
>which="content_type:parent"}+(white AND mercedes), the result also
>contained white driver with black mercedes. I know I can put fields before
>terms but it is not always easy to do this. Users might just enter one
>string. How can I modify my query to require that the terms between two
>parentheses must appear in the same child document, or boost those meet the
>criteria? Thanks



Re: Traversal of documents through network

2016-04-20 Thread Alisa Z .
 Vidya, 

No, not all of those 500 result docs will be brought to your client (browser, 
etc.)   Only as many documents as fit into the 1st "search result page" will be 
brought.

There is a notion of "pagination" in Solr (as well as in most search engines). 
The occurrence counts might be approximate, and in any case you will be shown 
only as many documents as your "search result page" size specifies. By 
default, the page size is 10 documents, so although you might see something 
like "response":{"numFound":27,"start":0,"docs"}, only the top 10 documents 
will be returned. 

In Solr, the "page" size is controlled with the "start" and "rows" parameters 
(see https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results), 
so if you want fewer results to be brought at a time, you can specify your 
query like this: 
q="word"&rows=5  - that will show you only the top 5 results, and only they 
will "traverse the network" (that is, be brought from the Solr server to your 
browser or other client).

If you want to look at another page, you specify 
q="word"&rows=5&start=5 - this is the 2nd page of the results 
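The arithmetic behind paging can be sketched in a few lines of Python. This is a minimal illustration that uses Solr's "start" and "rows" request parameters; the helper names and the base URL "http://localhost:8983/solr/my_index" are assumptions for the example.

```python
from urllib.parse import urlencode


def page_params(query, page, page_size=10):
    """Translate a 1-based page number into Solr's start/rows parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"q": query, "rows": page_size, "start": (page - 1) * page_size}


def select_url(base_url, query, page, page_size=10):
    """Build a /select URL that asks only for the requested page."""
    return f"{base_url}/select?{urlencode(page_params(query, page, page_size))}"


# Hypothetical core URL; the second page of 5 results starts at offset 5.
url = select_url("http://localhost:8983/solr/my_index", "word", page=2, page_size=5)
```

Only the documents in the requested window are sent to the client, whatever the total number of matches is.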


Hope it helps.

--Alisa 


>Wednesday, 20 April 2016, 10:01 -04:00 from vidya :
>
>Hi
>
>When I query a word in Solr, let's say 500 documents contain that keyword.
>Will all those documents traverse the network?
>Or how does it happen?
>
>Please help me on this.
>
>
>
>--
>View this message in context:  
>http://lucene.472066.n3.nabble.com/Traversal-of-documents-through-network-tp4271555.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Re[2]: [possible bug]: [child] - ChildDocTransformerFactory returns top level documents nested under middle level documents when queried for the middle level ones

2016-03-31 Thread Alisa Z .
 Thanks, Anshum! 

This definitely brings the result I wanted. 

It is just that the description in the ChildDocTransformerFactory docs ("This 
transformer returns all descendants of each parent document in a flat list 
nested inside the parent document.") is a bit misleading... 

One should never stop experimenting :) 
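For reference, the field list with the wildcard childFilter that Anshum suggested could be assembled roughly as in the Python sketch below. It assumes the [child] transformer's parentFilter, childFilter and limit parameters; the helper name is hypothetical.

```python
from typing import Optional


def child_transformer(parent_filter: str,
                      child_filter: Optional[str] = None,
                      limit: Optional[int] = None) -> str:
    """Render a [child ...] doc transformer for the fl parameter."""
    parts = [f"parentFilter={parent_filter}"]
    if child_filter is not None:
        parts.append(f"childFilter={child_filter}")  # restrict returned children
    if limit is not None:
        parts.append(f"limit={limit}")
    return "[child " + " ".join(parts) + "]"


# Keep only documents below the middle level, using the wildcard from the reply.
fl = "*," + child_transformer("type_s:doc.enriched.text",
                              child_filter="type_s:doc.enriched.text.*",
                              limit=1000)
```

The resulting value still has to be URL-encoded when it is placed in a request.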


>Wednesday, 30 March 2016, 15:19 -04:00 from Anshum Gupta :
>
>I'm not the best person to comment on this so perhaps someone could chime
>in as well, but can you try using a wildcard for your childFilter?
>Something like: childFilter=type_s:doc.enriched.text.*
>
>You could also possibly enrich the document with depth information and use
>that for filtering out.
>
>On Wed, Mar 30, 2016 at 11:34 AM, Alisa Z. < prol...@mail.ru > wrote:
>
>>  I think I am observing an unexpected behavior of
>> ChildDocTransformerFactory.
>>
>> The query is like this:
>>
>> /select?q={!parent which="type_s:doc.enriched.text"}type_s:doc.enriched.text.entities
>> +text_t:pjm +type_t:Company
>> +relevance_tf:[0.7%20TO%20*]&fl=*,[child
>> parentFilter=type_s:doc.enriched.text limit=1000]
>>
>> The levels of hierarchy are shown in the  type_s field.  So I am querying
>> on some descendants and returning some ancestors that are somewhere in the
>> middle of the hierarchy. I also want to get all the nested documents
>> below  that middle level.
>>
>> Here is the result:
>>
>> 
>> 
>>
>>  doc.enriched.text// this is the level
>> I wanted to get to and then go down from it
>>  ... 
>>  13565 
>> 
>>  doc.enriched   // This is a document
>> from 1 level up, the parent of the
>>// current  type_s :
>> doc.enriched.text document -- why is it here?
>>  22024 
>> 
>> 
>>  doc.original   // This is an "uncle"
>>  26698 
>> 
>> 
>>  doc// and this a
>> grandparent!!!
>>
>>
>> 
>>
>> And so on, bringing the whole tree up and down all under my middle-level
>> document.
>> I really hope this is not the expected behavior.
>>
>> I appreciate your help in advance.
>>
>> --
>> Alisa Zhila
>
>
>
>
>-- 
>Anshum Gupta



[possible bug]: [child] - ChildDocTransformerFactory returns top level documents nested under middle level documents when queried for the middle level ones

2016-03-30 Thread Alisa Z .
 I think I am observing an unexpected behavior of ChildDocTransformerFactory. 

The query is like this: 

/select?q={!parent which="type_s:doc.enriched.text"}type_s:doc.enriched.text.entities 
+text_t:pjm +type_t:Company 
+relevance_tf:[0.7%20TO%20*]&fl=*,[child parentFilter=type_s:doc.enriched.text 
limit=1000]

The levels of hierarchy are shown in the  type_s field.  So I am querying on 
some descendants and returning some ancestors that are somewhere in the middle 
of the hierarchy. I also want to get all the nested documents  below  that 
middle level. 

Here is the result:




 doc.enriched.text    // this is the level I 
wanted to get to and then go down from it 
 ... 
 13565 

 doc.enriched   // This is a document from 1 
level up, the parent of the   
   // current  type_s : 
doc.enriched.text document -- why is it here?   
 22024 


 doc.original   // This is an "uncle"
 26698 


 doc    // and this a grandparent!!! 
 
   
   


And so on, bringing the whole tree up and down all under my middle-level 
document.  
I really hope this is not the expected behavior.

I appreciate your help in advance. 

-- 
Alisa Zhila

Re[5]: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)

2016-03-29 Thread Alisa Z .
 Alright, based on  https://issues.apache.org/jira/browse/SOLR-5743 I can 
assume that limit and mincount for the BlockJoin part stay an open issue for 
some time ...  
Therefore, the answer is no as of Solr 5.5.0. 

Thanks to Mikhail Khludnev for working on the subject. 

>Tuesday, 29 March 2016, 14:38 -04:00 from Alisa Z. :
>
>Mikhail, 
>
>I totally see the point: the corresponding wiki page (  
>https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting ) does not 
>mention it and says it's an experimental feature. 
>
>Is it correct that no additional options ( limit, mincount, etc.) can  be set 
>anyhow?  
>
>Or more specifically, is there any work-around to control the output of the 
>query at hand (maybe anything beyond faceting options): 
>
>/bjqfacet?q={!parent%20which=type_s:doc}type_s:doc.enriched.text.keywords&child.facet.field=text_t&rows=0&fq={!parent%20which=type_s:doc}type_s:doc.userData%20%2BSubject_t:california&wt=json&indent=true
>> >>
>> >>RETURNS:
>> >>
>> >>{
>> >> "responseHeader":{
>> >> "status":0,
>> >> "QTime":1},
>> >> "response":{"numFound":19,"start":0,"docs":[]
>> >> },
>> >> "facet_counts":[
>> >> "facet_fields",[
>> >> "text_t",[
>> >> "128x",1,
>> >> "18xx",1,
>> >> "1x",1,
>> >> "2",2,
>> >> "30",1,
>> >> "60",1,
>> >> "78xx",1,
>> >> "82xx",1,
>> >> "ab",2,
>> >> "access",5,
>> >> "account",1,
>> >> "accounts",1,
>> >>...
>> >>"california",13,
>> >>...
>> >>"enron",9,
>> >>...
>> >>]]]}
>> >>  
>
>
>>Tuesday, 29 March 2016, 13:40 -04:00 from Mikhail Khludnev < 
>>mkhlud...@griddynamics.com >:
>>
>>Alisa,
>>
>>There is no such thing as child.facet.limit, etc
>>
>>On Tue, Mar 29, 2016 at 6:27 PM, Alisa Z. <  prol...@mail.ru > wrote:
>>
>>>  So the first issue eventually solved by adding facet: {top_terms_by_doc:
>>> "unique(_root_)"} AND sorting the outer facet buckets by this faceting:
>>>
>>> curl http://localhost:8985/solr/enron_path_w_ts/query -d
>>> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
>>> json.facet={
>>>   filter_by_child_type :{
>>> type:query,
>>> q:"type_s:doc.enriched.text.keywords",
>>> domain: { blockChildren : "type_s:doc" },
>>> facet:{
>>>   top_keywords_text : {
>>> type: terms,
>>> field: text_t,
>>> limit: 10,
>>> sort: "top_terms_by_doc desc",
>>>  facet: {
>>>top_terms_by_doc: "unique(_root_)"
>>>  }
>>>   }
>>> }
>>>   }
>>> }'
>>>
>>>
>>> The  BlockJoin Faceting  part is still open:  I've tried all conventional
>>> faceting parameters:  facet.limit, child.facet.limit, f.text_t.facet.limit
>>> ... nothing worked :(
>>>
>>>
>>> >Monday, 28 March 2016, 17:20 -04:00 from Alisa Z. <  prol...@mail.ru >:
>>> >
>>> >Ok, so for the 1st question, I think I'm getting closer:  adding  facet:
>>> {top_terms_by_doc: "unique(_root_)"}  as indicated in
>>>  http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev returns
>>> correct counts. However, sorting is done by the upper faceting not by the
>>> unique(_root_):
>>> >
>>> >
>>> >curl  http://localhost:8985/solr/my_collection/query -d
>>> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
>>> >json.facet={
>>> >  filter_by_child_type :{
>>> >type:query,
>>> >q:"type_s:doc.enriched.text.keywords",
>>> >domain: { blockChildren : "type_s:doc" },
>>> >facet:{
>>> >  top_keywords_text : {
>>> >type: terms,
>>> >field: text_t,
>>> >limit: 10,
>>> >fac

Re[4]: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)

2016-03-29 Thread Alisa Z .
 Mikhail, 

I totally see the point: the corresponding wiki page ( 
https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting ) does not 
mention it and says it's an experimental feature. 

Is it correct that no additional options (limit, mincount, etc.) can be set 
in any way?  

Or more specifically, is there any work-around to control the output of the 
query at hand (maybe anything beyond faceting options): 

/bjqfacet?q={!parent%20which=type_s:doc}type_s:doc.enriched.text.keywords&child.facet.field=text_t&rows=0&fq={!parent%20which=type_s:doc}type_s:doc.userData%20%2BSubject_t:california&wt=json&indent=true
> >>
> >>RETURNS:
> >>
> >>{
> >> "responseHeader":{
> >> "status":0,
> >> "QTime":1},
> >> "response":{"numFound":19,"start":0,"docs":[]
> >> },
> >> "facet_counts":[
> >> "facet_fields",[
> >> "text_t",[
> >> "128x",1,
> >> "18xx",1,
> >> "1x",1,
> >> "2",2,
> >> "30",1,
> >> "60",1,
> >> "78xx",1,
> >> "82xx",1,
> >> "ab",2,
> >> "access",5,
> >> "account",1,
> >> "accounts",1,
> >>...
> >>"california",13,
> >>...
> >>"enron",9,
> >>...
> >>]]]}
> >>  


>Tuesday, 29 March 2016, 13:40 -04:00 from Mikhail Khludnev 
>:
>
>Alisa,
>
>There is no such thing as child.facet.limit, etc
>
>On Tue, Mar 29, 2016 at 6:27 PM, Alisa Z. < prol...@mail.ru > wrote:
>
>>  So the first issue eventually solved by adding facet: {top_terms_by_doc:
>> "unique(_root_)"} AND sorting the outer facet buckets by this faceting:
>>
>> curl http://localhost:8985/solr/enron_path_w_ts/query -d
>> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
>> json.facet={
>>   filter_by_child_type :{
>> type:query,
>> q:"type_s:doc.enriched.text.keywords",
>> domain: { blockChildren : "type_s:doc" },
>> facet:{
>>   top_keywords_text : {
>> type: terms,
>> field: text_t,
>> limit: 10,
>> sort: "top_terms_by_doc desc",
>>  facet: {
>>top_terms_by_doc: "unique(_root_)"
>>  }
>>   }
>> }
>>   }
>> }'
>>
>>
>> The  BlockJoin Faceting  part is still open:  I've tried all conventional
>> faceting parameters:  facet.limit, child.facet.limit, f.text_t.facet.limit
>> ... nothing worked :(
>>
>>
>> >Monday, 28 March 2016, 17:20 -04:00 from Alisa Z. < prol...@mail.ru >:
>> >
>> >Ok, so for the 1st question, I think I'm getting closer:  adding  facet:
>> {top_terms_by_doc: "unique(_root_)"}  as indicated in
>>  http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev returns
>> correct counts. However, sorting is done by the upper faceting not by the
>> unique(_root_):
>> >
>> >
>> >curl  http://localhost:8985/solr/my_collection/query -d
>> 'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
>> >json.facet={
>> >  filter_by_child_type :{
>> >type:query,
>> >q:"type_s:doc.enriched.text.keywords",
>> >domain: { blockChildren : "type_s:doc" },
>> >facet:{
>> >  top_keywords_text : {
>> >type: terms,
>> >field: text_t,
>> >limit: 10,
>> >facet: {
>> >   top_terms_by_doc: "unique(_root_)"
>> > }
>> >  }
>> >}
>> >  }
>> >}'
>> >
>> >RETURNS
>> >
>> >{
>> >  "responseHeader":{
>> >"status":0,
>> >"QTime":25,
>> >"params":{
>> >  "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData
>> +Subject_t:california",
>> >  "json.facet":"{\n  filter_by_child_type :{\ntype:query,\n
>> q:\"type_s:doc.enriched.text.keywords\",\ndomain: { blockChildren :
>> \"type_s:doc\" },\nfacet:{\n  top_keywords_text : {\ntype

Re[2]: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)

2016-03-29 Thread Alisa Z .
 So the first issue was eventually solved by adding facet: {top_terms_by_doc: 
"unique(_root_)"} AND sorting the outer facet buckets by this sub-facet:  

curl http://localhost:8985/solr/enron_path_w_ts/query -d \
'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
json.facet={
  filter_by_child_type : {
    type: query,
    q: "type_s:doc.enriched.text.keywords",
    domain: { blockChildren : "type_s:doc" },
    facet: {
      top_keywords_text : {
        type: terms,
        field: text_t,
        limit: 10,
        sort: "top_terms_by_doc desc",
        facet: {
          top_terms_by_doc: "unique(_root_)"
        }
      }
    }
  }
}'
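When the request is generated from code rather than typed by hand, the same facet definition can be built as a plain dictionary and serialized, which avoids unbalanced braces. The Python sketch below only assembles the form parameters and does not send them; the function name is hypothetical, while the facet keys and query strings are the ones used in the curl call above.

```python
import json


def keywords_by_parent_facet(child_filter, parent_filter, field, limit=10):
    """Terms facet over child docs, sorted by the number of distinct roots."""
    return {
        "filter_by_child_type": {
            "type": "query",
            "q": child_filter,
            "domain": {"blockChildren": parent_filter},
            "facet": {
                "top_keywords_text": {
                    "type": "terms",
                    "field": field,
                    "limit": limit,
                    # Sort buckets by the sub-facet, not by raw child count.
                    "sort": "top_terms_by_doc desc",
                    "facet": {"top_terms_by_doc": "unique(_root_)"},
                }
            },
        }
    }


facet = keywords_by_parent_facet(
    "type_s:doc.enriched.text.keywords", "type_s:doc", "text_t")

# Form parameters for a POST to the /query handler (sending is left out).
params = {
    "q": '{!parent which="type_s:doc"}type_s:doc.userData +Subject_t:california',
    "rows": 0,
    "json.facet": json.dumps(facet),
}
```

The "sort" value has to name a sub-facet defined in the same bucket, here top_terms_by_doc, for the ordering by unique(_root_) to apply.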


The  BlockJoin Faceting  part is still open:  I've tried all conventional 
faceting parameters:  facet.limit, child.facet.limit, f.text_t.facet.limit ... 
nothing worked :( 


>Monday, 28 March 2016, 17:20 -04:00 from Alisa Z. :
>
>Ok, so for the 1st question, I think I'm getting closer:  adding  facet: 
>{top_terms_by_doc: "unique(_root_)"}  as indicated in  
>http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev returns correct 
>counts. However, sorting is done by the upper faceting not by the 
>unique(_root_):  
>
>
>curl  http://localhost:8985/solr/my_collection/query -d 
>'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
>json.facet={
>  filter_by_child_type :{
>    type:query,
>    q:"type_s:doc.enriched.text.keywords",
>    domain: { blockChildren : "type_s:doc" },
>    facet:{
>  top_keywords_text : {
>    type: terms,
>    field: text_t,
>    limit: 10,
>    facet: {
>   top_terms_by_doc: "unique(_root_)"
> }
>  }
>    }
>  }
>}'
>
>RETURNS 
>
>{
>  "responseHeader":{
>    "status":0,
>    "QTime":25,
>    "params":{
>  "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData 
>+Subject_t:california",
>  "json.facet":"{\n  filter_by_child_type :{\n    type:query,\n    
>q:\"type_s:doc.enriched.text.keywords\",\n    domain: { blockChildren : 
>\"type_s:doc\" },\n    facet:{\n  top_keywords_text : {\n    type: 
>terms,\n    field: text_t,\n    limit: 10,\n    facet: {\n 
>  top_terms_by_doc: \"unique(_root_)\"\n }\n  }\n    }\n  }\n}",
>  "rows":"0"}},
>  "response":{"numFound":19,"start":0,"docs":[]
>  },
>  "facets":{
>    "count":19,
>    "filter_by_child_type":{
>  "count":686,
>  "top_keywords_text":{
>    "buckets":[{
>    "val":"enron",
>    "count":57,
>    "top_terms_by_doc":9},
>  {
>    "val":"california",
>    "count":22,
>    "top_terms_by_doc":13},
>  {
>    "val":"power",
>    "count":21,
>    "top_terms_by_doc":7},
>  {
>    "val":"rate",
>    "count":15,
>    "top_terms_by_doc":5},
>  {
>    "val":"plan",
>    "count":13,
>    "top_terms_by_doc":3},
>  {
>    "val":"hou",
>    "count":12,
>    "top_terms_by_doc":5},
>  {
>    "val":"energy",
>    "count":11,
>    "top_terms_by_doc":5},
>  {
>    "val":"na",
>    "count":11,
>    "top_terms_by_doc":5},
>  {
>    "val":"mckinsey",
>    "count":10,
>    "top_terms_by_doc":1},
>  {
>    "val":"socal",
>    "count":10,
>    "top_terms_by_doc":4}]
>
>Nice, but I want them to be ordered by "top_terms_by_doc" frequencies,  not by 
>the "count" frequencies. 
>Any suggestions?
>
>Thanks,
>Alisa 
>
>
>
>
>
>>Monday, 28 March 2016, 15:39 -04:00 from Alisa Z. < prol...@mail.ru >:
>>
>>Hi all, 
>>
>>I am trying to perform faceting of parent docs by nested document fields. 

Re: [nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)

2016-03-28 Thread Alisa Z .
 OK, so for the 1st question, I think I'm getting closer: adding facet: 
{top_terms_by_doc: "unique(_root_)"} as indicated in 
http://blog.griddynamics.com/search/label/~Mikhail%20Khludnev returns correct 
counts. However, the buckets are still sorted by the terms facet's own "count", 
not by unique(_root_):


curl http://localhost:8985/solr/my_collection/query -d 
'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
json.facet={
  filter_by_child_type :{
    type:query,
    q:"type_s:doc.enriched.text.keywords",
    domain: { blockChildren : "type_s:doc" },
    facet:{
      top_keywords_text : {
        type: terms,
        field: text_t,
        limit: 10,
        facet: {
          top_terms_by_doc: "unique(_root_)"
        }
      }
    }
  }
}'

RETURNS 

{
  "responseHeader":{
    "status":0,
    "QTime":25,
    "params":{
  "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData 
+Subject_t:california",
  "json.facet":"{\n  filter_by_child_type :{\n    type:query,\n    
q:\"type_s:doc.enriched.text.keywords\",\n    domain: { blockChildren : 
\"type_s:doc\" },\n    facet:{\n  top_keywords_text : {\n    type: 
terms,\n    field: text_t,\n    limit: 10,\n    facet: {\n  
 top_terms_by_doc: \"unique(_root_)\"\n }\n  }\n    }\n  }\n}",
  "rows":"0"}},
  "response":{"numFound":19,"start":0,"docs":[]
  },
  "facets":{
    "count":19,
    "filter_by_child_type":{
  "count":686,
  "top_keywords_text":{
    "buckets":[{
    "val":"enron",
    "count":57,
    "top_terms_by_doc":9},
  {
    "val":"california",
    "count":22,
    "top_terms_by_doc":13},
  {
    "val":"power",
    "count":21,
    "top_terms_by_doc":7},
  {
    "val":"rate",
    "count":15,
    "top_terms_by_doc":5},
  {
    "val":"plan",
    "count":13,
    "top_terms_by_doc":3},
  {
    "val":"hou",
    "count":12,
    "top_terms_by_doc":5},
  {
    "val":"energy",
    "count":11,
    "top_terms_by_doc":5},
  {
    "val":"na",
    "count":11,
    "top_terms_by_doc":5},
  {
    "val":"mckinsey",
    "count":10,
    "top_terms_by_doc":1},
  {
    "val":"socal",
    "count":10,
    "top_terms_by_doc":4}]

Nice, but I want them to be ordered by "top_terms_by_doc" frequencies,  not by 
the "count" frequencies. 
Any suggestions?
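
One option that may help, sketched in Python: the terms facet of the JSON Facet API accepts a `sort` that names a nested aggregation, so sorting on `top_terms_by_doc` should order the buckets by the unique-root count. The helper name `build_facet` is made up for this sketch; the field names and queries are the ones from the request above.

```python
import json


def build_facet(sort_by_docs=True):
    """Build the json.facet body used above, optionally sorted by unique(_root_)."""
    terms = {
        "type": "terms",
        "field": "text_t",
        "limit": 10,
        # Count distinct root (top-level) documents per keyword.
        "facet": {"top_terms_by_doc": "unique(_root_)"},
    }
    if sort_by_docs:
        # Order buckets by the nested aggregation instead of the child-doc count.
        terms["sort"] = "top_terms_by_doc desc"
    return {
        "filter_by_child_type": {
            "type": "query",
            "q": "type_s:doc.enriched.text.keywords",
            "domain": {"blockChildren": "type_s:doc"},
            "facet": {"top_keywords_text": terms},
        }
    }


if __name__ == "__main__":
    # The printed JSON can be passed as the json.facet parameter of the curl call.
    print(json.dumps(build_facet(), indent=2))
```

The only change relative to the request above is the added `sort` key; everything else is kept as is.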

Thanks,
Alisa 





>Monday, 28 March 2016, 15:39 -04:00 from Alisa Z.:
>
>Hi all, 
>
>I am trying to perform faceting of parent docs by nested document fields. I've 
>tried 2 approaches as in subject, yet in first the results are not quite 
>correct and in the 2nd I cannot get the query right. So I need help on either 
>of them and any explication or documentation or blogs on the behavior is much 
>appreciated.   
>
>Verbally the query is as follows: "Find top 10 keywords for all documents with 
>"california" in email subject line"
>
>Here is the query with responses: 
>
> Json Facet API   
>
>curl http://localhost:8985/solr/my_collection/query -d 
>'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
>json.facet={
>  filter_by_child_type :{
>    type:query,
>    q:"type_s:doc.enriched.text.keywords",
>    domain: { blockChildren : "type_s:doc" },
>    facet:{
>  top_keywords_text : {
>    type: terms,
>    field: text_t,
>    limit: 10
>  }
>    }
>  }
>}'
>
>RETURNS:  
>
>{
>  "responseHeader":{
>    "status":0,
>    "QTime":134,
>    "params":{
>  "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData 
>+Subject_t:california",
>  "json.facet":"{\n  filter_by_child_type :{\n    type:query,\n    
>q:\"type_s:doc.enriched.text.keywo

[nesting] JSON Facet API vs. BlockJoin Faceting: need help on queries (Facet API facets by wrong doc level VS. BlockJoin Faceting does not return top 10 most frequent)

2016-03-28 Thread Alisa Z .
 Hi all, 

I am trying to facet parent docs by fields of their nested documents. I've 
tried the 2 approaches named in the subject, yet with the first the results are 
not quite correct and with the 2nd I cannot get the query right. So I need help 
with either of them, and any explanation, documentation or blog posts on this 
behavior would be much appreciated.   

In words, the query is: "Find the top 10 keywords for all documents with 
"california" in the email subject line"

Here is the query with responses: 

== JSON Facet API ==

curl http://localhost:8985/solr/my_collection/query -d 
'q={!parent%20which="type_s:doc"}type_s:doc.userData%20%2BSubject_t:california&rows=0&
json.facet={
  filter_by_child_type :{
    type:query,
    q:"type_s:doc.enriched.text.keywords",
    domain: { blockChildren : "type_s:doc" },
    facet:{
      top_keywords_text : {
        type: terms,
        field: text_t,
        limit: 10
      }
    }
  }
}'

RETURNS:  

{
  "responseHeader":{
    "status":0,
    "QTime":134,
    "params":{
  "q":"{!parent which=\"type_s:doc\"}type_s:doc.userData 
+Subject_t:california",
  "json.facet":"{\n  filter_by_child_type :{\n    type:query,\n    
q:\"type_s:doc.enriched.text.keywords\",\n    domain: { blockChildren : 
\"type_s:doc\" },\n    facet:{\n  top_keywords_text : {\n    type: 
terms,\n    field: text_t,\n    limit: 10\n  }\n    }\n  }\n}",
  "rows":"0"}},
  "response":{"numFound":19,"start":0,"docs":[]
  },
  "facets":{
    "count":19,
    "filter_by_child_type":{
  "count":686,
  "top_keywords_text":{
    "buckets":[{
    "val":"enron",
    "count":57},
  {
    "val":"california",
    "count":22},
  {
    "val":"power",
    "count":21},
  {
    "val":"rate",
    "count":15},
  {
    "val":"plan",
    "count":13},
  {
    "val":"hou",
    "count":12},
  {
    "val":"energy",
    "count":11},
  {
    "val":"na",
    "count":11},
  {
    "val":"mckinsey",
    "count":10},
  {
    "val":"socal",
    "count":10}]


QUESTION: where do the counts greater than 19 (the total number of top-level 
documents returned by the query) come from? How can I adjust the query to 
facet only on the top-level documents (so that no count is greater than 19)? 
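
A likely explanation, sketched with toy data in Python: once the domain is switched with blockChildren, each bucket "count" counts matching child documents, and one top-level document can own several keyword children with the same term. Counting distinct roots instead caps every bucket at the number of parents. The rows below are invented for illustration and are not taken from the Enron collection.

```python
from collections import Counter, defaultdict

# Toy child documents as (root id, keyword) pairs; invented for illustration.
children = [
    ("doc1", "enron"), ("doc1", "enron"), ("doc1", "power"),
    ("doc2", "enron"), ("doc2", "california"),
    ("doc3", "california"),
]


def child_counts(rows):
    """What a plain terms facet over child documents counts."""
    return Counter(keyword for _, keyword in rows)


def unique_root_counts(rows):
    """What unique(_root_) counts: distinct top-level documents per keyword."""
    roots = defaultdict(set)
    for root, keyword in rows:
        roots[keyword].add(root)
    return {keyword: len(ids) for keyword, ids in roots.items()}


if __name__ == "__main__":
    print(child_counts(children))
    print(unique_root_counts(children))
```

With three parents in the toy data, the child-level count for "enron" is 3 while only 2 distinct parents contain it.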


== BlockJoin Faceting ==
Following the example on  
https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting , I've 
tried this:  

/bjqfacet?q={!parent%20which=type_s:doc}type_s:doc.enriched.text.keywords&child.facet.field=text_t&child.facet.limit=10&child.facet.mincount=5&rows=0&fq={!parent%20which=type_s:doc}type_s:doc.userData%20%2BSubject_t:california&wt=json&indent=true

RETURNS: 

{
  "responseHeader":{
"status":0,
"QTime":1},
  "response":{"numFound":19,"start":0,"docs":[]
  },
  "facet_counts":[
"facet_fields",[
  "text_t",[
"128x",1,
"18xx",1,
"1x",1,
"2",2,
"30",1,
"60",1,
"78xx",1,
"82xx",1,
"ab",2,
"access",5,
"account",1,
"accounts",1,
...
"california",13,
...
"enron",9,
...
]]]}

QUESTION: This looks very close to what I want, yet why are 
child.facet.limit=10&child.facet.mincount=5 ignored? How can I get the top 10 
most frequent terms? 
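
Until the parameters are honored, the flat list returned above can be trimmed on the client. The Python sketch below is only a workaround and does not explain why the limit and mincount are ignored; the function name `top_terms` is made up, and the sample list is a short excerpt of the response shown above.

```python
def top_terms(flat, limit=10, mincount=1):
    """Turn Solr's flat [term, count, term, count, ...] list into the top terms."""
    pairs = list(zip(flat[0::2], flat[1::2]))
    kept = [(term, count) for term, count in pairs if count >= mincount]
    # Highest count first; ties are broken alphabetically for stable output.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept[:limit]


if __name__ == "__main__":
    sample = ["128x", 1, "2", 2, "ab", 2, "access", 5, "california", 13, "enron", 9]
    print(top_terms(sample, limit=10, mincount=5))
```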


Thank you for your help in advance! 

-- 
Alisa Zhila

Re[2]: Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?

2016-03-28 Thread Alisa Z .
 Oh, I apologize...
When I ran it the first time, I must have tried putting it in a different 
collection. Now that I saw it and put it into the correct collection (where the 
schema is adjusted properly), it worked! 

Thanks,  that was the solution.  
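
For reference, the other route suggested in this thread (posting directly to /update/json rather than letting bin/post pick /update/json/docs) can be sketched in Python. The host, port, collection name and the commit=true parameter are assumptions for illustration; the request is only built here, not sent.

```python
import urllib.request


def build_update_request(base_url, collection, payload_bytes):
    """Build a POST to the plain Solr JSON update handler (/update/json)."""
    url = f"{base_url}/solr/{collection}/update/json?commit=true"
    return urllib.request.Request(
        url,
        data=payload_bytes,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    body = b'[{"id": "example-1", "type_s": "doc"}]'  # placeholder payload
    req = build_update_request("http://localhost:8983", "my_collection", body)
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```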

 
>Monday, 28 March 2016, 14:44 -04:00 from Erik Hatcher:
>
>Alisa - sorry for not seeing this sooner, but I think Yonik is right… try 
>adding `-format solr` to the command-line of bin/post.
>
>Solr 5.5 is where the change to a different endpoint for JSON occurred.
>
>—
>Erik Hatcher, Senior Solutions Architect
>http://www.lucidworks.com
>
>
>
>>On Mar 28, 2016, at 2:04 PM, Alisa Z. < prol...@mail.ru > wrote:
>>@Yonik, thank you for your response. 
>>
>>I think that the issue is of a different kind because my upload used to work 
>>well on Solr 5.3.1 and does not want to work on Solr 5.5.0 because of some 
>>changes in dynamic schema recognition.  So maybe you could advise on 
>>reconsidering the data model that I am using. 
>>
>>I have the  type_s field serving as an indicator of different types of 
>>parents and children. However, in my data model, siblings at one level could 
>>be of different type/category, e.g.,:
>>
>>- 
>>type_s: PARENT
>>---/---|\
>>- type_s: child_type1 --  
>>type_s: child_type2   - type_s: child_type3
>>--/--\ 
>>- 
>>/--\---/        \ 
>>
>>type_s: grandchild_type4    type_s: grandchild_type5       grandchild_type6   
>>    grandchild_type4  grandchild_type7   grandchild_type5
>> 
>>So the hierarchy distinguishing field  type_s can have different values at 
>>different levels of the hierarchy because the nodes could be of different 
>>type.
>>
>>
>>Further, in Solr 5.3.1 
>>solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json 
>>doesn't produce any error and I can produce BlockJoin queries using  type_s 
>>field for indicating the nodes.   
>>
>>However, in Solr 5.5.0, when I try upload the same data in the same format 
>>(which was consumed perfectly in Solr 5.3.2):
>>solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
>>I get the following error:  
>>"msg":"ERROR: [parent=id1] multiple values encountered for non multiValued 
>>field _childDocuments_._childDocuments_.type_s: [grandchild_type4, 
>>grandchild_type5]" .
>>
>>
>>So now I feel that I should have either 2 types of fields for hierarchy 
>>description: one for hierarchy level specification and another for type of 
>>node specification; or make all single-valued fields multi-valued in 
>>descendants.  However, I am not sure whetherte 2nd option will uniquely  
>>specify a document. 
>>
>>Can anybody advise on the data modelling/schema approach for successful 
>>navigation a hierarchical data structure?  
>>I will be trying to adapt the approach outlined in " The Many Facets of 
>>Apache Solr " to my data. Yet I would like to hear any other practical tips 
>>for hierarchical data on Solr 5.5?
>>
>>Thank you in advance. 
>>--Alisa 
>>
>>
>>>Sat, 26 Mar 2016 -4:48:00 -0400 from Yonik Seeley < ysee...@gmail.com >:
>>>
>>>Found the JIRA:   https://issues.apache.org/jira/browse/SOLR-7042
>>>It looks like you can try adding
>>>   -format solr
>>>to your bin/post command line to get back to normal "solr JSON"
>>>
>>>-Yonik
>>>
>>>
>>>On Fri, Mar 25, 2016 at 8:43 PM, Yonik Seeley <  ysee...@gmail.com > wrote:
>>>>On Fri, Mar 25, 2016 at 6:19 PM, Alisa Z. <  prol...@mail.ru > wrote:
>>>>>Hi all,
>>>>>It is partially a question, partially a discussion.
>>>>>I am working with documents with deep levels of nesting. The documents are 
>>>>>in a single JSON file (see a sample below).
>>>>>
>>>>>When I was on Solr 5.3.1,
>>>>>solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
>>>>
>>>>I think i recall seeing a JIRA go by that changed the URL that
>>>>/bin/post hits from /update/json to /update/json/docs.
>>>>I know the latter does more processing and handles "custom" JSON, but
>>>>I don't know the details.  That would be my guess about what changed
>>>>and what's messing you up.
>>>>
>>>>You could try using curl directly to /update/json and see if that works 
>>>>better.
>>>>
>>>>-Yonik
>>
>



Re[2]: Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?

2016-03-28 Thread Alisa Z .
 @Yonik, thank you for your response. 

I think that the issue is of a different kind because my upload used to work 
well on Solr 5.3.1 and does not want to work on Solr 5.5.0 because of some 
changes in dynamic schema recognition.  So maybe you could advise on 
reconsidering the data model that I am using. 

I have the  type_s field serving as an indicator of different types of parents 
and children. However, in my data model, siblings at one level could be of 
different type/category, e.g.,:

type_s: PARENT
    type_s: child_type1
        type_s: grandchild_type4
        type_s: grandchild_type5
    type_s: child_type2
        type_s: grandchild_type6
        type_s: grandchild_type4
    type_s: child_type3
        type_s: grandchild_type7
        type_s: grandchild_type5
 
So the hierarchy distinguishing field  type_s can have different values at 
different levels of the hierarchy because the nodes could be of different type.


Further, in Solr 5.3.1 
solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json 
doesn't produce any error and I can produce BlockJoin queries using  type_s 
field for indicating the nodes.   

However, in Solr 5.5.0, when I try to upload the same data in the same format 
(which was consumed perfectly in Solr 5.3.1):
solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
I get the following error:  
"msg":"ERROR: [parent=id1] multiple values encountered for non multiValued 
field _childDocuments_._childDocuments_.type_s: [grandchild_type4, 
grandchild_type5]" .


So now I feel that I should either have 2 kinds of fields for describing the 
hierarchy (one for the hierarchy level and another for the type of node), or 
make all single-valued fields multi-valued in descendants.  
However, I am not sure whether the 2nd option will uniquely specify a document. 
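
The first option can be sketched as a preprocessing step in Python: walk the nested JSON before indexing and stamp every document with its depth and its full path. The field names `level_i` and `path_s` and the function name `annotate` are hypothetical; only `type_s` and `_childDocuments_` come from the data described above.

```python
def annotate(doc, parent_path="", level=0):
    """Add hypothetical path_s / level_i fields to a doc and its descendants, in place."""
    path = f"{parent_path}.{doc['type_s']}" if parent_path else doc["type_s"]
    doc["path_s"] = path    # e.g. PARENT.child_type1.grandchild_type4
    doc["level_i"] = level  # 0 for the root, 1 for children, and so on
    for child in doc.get("_childDocuments_", []):
        annotate(child, path, level + 1)
    return doc


if __name__ == "__main__":
    tree = {
        "type_s": "PARENT",
        "_childDocuments_": [
            {"type_s": "child_type1",
             "_childDocuments_": [{"type_s": "grandchild_type4"}]},
            {"type_s": "child_type2"},
        ],
    }
    annotate(tree)
    print(tree["_childDocuments_"][0]["_childDocuments_"][0]["path_s"])
```

With such fields, the node type stays in type_s while queries that need an unambiguous position in the hierarchy can filter on the path or level instead.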

Can anybody advise on a data modelling/schema approach for successfully 
navigating a hierarchical data structure?  
I will be trying to adapt the approach outlined in " The Many Facets of Apache 
Solr " to my data. Still, I would like to hear any other practical tips for 
hierarchical data on Solr 5.5.

Thank you in advance. 
--Alisa 


>Sat, 26 Mar 2016 -4:48:00 -0400 from Yonik Seeley:
>
>Found the JIRA:  https://issues.apache.org/jira/browse/SOLR-7042
>It looks like you can try adding
>   -format solr
>to your bin/post command line to get back to normal "solr JSON"
>
>-Yonik
>
>
>On Fri, Mar 25, 2016 at 8:43 PM, Yonik Seeley < ysee...@gmail.com > wrote:
>> On Fri, Mar 25, 2016 at 6:19 PM, Alisa Z. < prol...@mail.ru > wrote:
>>>  Hi all,
>>> It is partially a question, partially a discussion.
>>> I am working with documents with deep levels of nesting. The documents are 
>>> in a single JSON file (see a sample below).
>>>
>>> When I was on Solr 5.3.1,
>>> solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
>>
>> I think i recall seeing a JIRA go by that changed the URL that
>> /bin/post hits from /update/json to /update/json/docs.
>> I know the latter does more processing and handles "custom" JSON, but
>> I don't know the details.  That would be my guess about what changed
>> and what's messing you up.
>>
>> You could try using curl directly to /update/json and see if that works 
>> better.
>>
>> -Yonik



Re: Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?

2016-03-25 Thread Alisa Z .
 Further experiments:

-- updated the schema to account for multiple values: 

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-dynamic-field":{
 "name":"*type_s",
 "type":"string",
 "indexed":true, 
 "multiValued":true
 }
}' http://localhost:8985/solr/my_collection/schema

-- Re-ran indexing again: 
solr-5.5.0$ bin/post -c my_collection ../../data/data-solr.json -p 8985
java -classpath /Users//solr-5.5.0/dist/solr-core-5.5.0.jar -Dauto=yes 
-Dport=8985 -Dc=enron_path_w_ts -Ddata=files 
org.apache.solr.util.SimplePostTool ../../data/data-solr.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8985/solr/my_collection/update...
Entering auto mode. File endings considered are 
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file data-solr-path-w-ts-suffix.json (application/json) to 
[base]/json/docs
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: 
http://localhost:8985/solr/my_collection/update/json/docs
SimplePostTool: WARNING: Response: 
{"responseHeader":{"status":400,"QTime":12},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"ERROR:
 [doc=AVNzOoBsX6g-H6sC3dgo] multiple values encountered for non multiValued 
field  _childDocuments_._childDocuments_._childDocuments_.relevance_tf: 
[0.918377, 0.737646, 0.700964, 0.659539, 0.657294, 0.62809, 0.612241, 0.609963, 
0.873428, 0.764, 0.763825, 0.552016, 0.472819, 0.30331, 0.292935, 0.285799, 
0.278851, 0.936158, 0.790093, 0.722639, 0.649841, 0.576905, 0.570454, 0.445547, 
0.429439, 0.410347, 0.391091, 0.293075, 0.253883, 0.252494, 0.250084, 0.242866, 
0.24142, 0.239883, 0.239827, 0.239563, 0.239507, 0.238434, 0.238193, 0.237804, 
0.237769, 0.237022, 0.236955, 0.2364, 0.236164, 0.236129, 0.236025, 
0.235973]","code":400}}
SimplePostTool: WARNING: IOException while reading response: 
java.io.IOException: Server returned HTTP response code: 400 for URL: 
http://localhost:8985/solr/my_collection/update/json/docs
1 files indexed.
COMMITting Solr index changes to 
http://localhost:8985/solr/my_collection/update...
Time spent: 0:00:05.137

So now it dumps all the values of relevance_tf into one array, disregarding 
the type of the nested document they actually belonged to... It really does not 
seem to handle hierarchies with branches of different types properly.  :(  
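
The shape of that error can be reproduced with a toy model in Python. This is not Solr's code; it only assumes that the /update/json/docs endpoint treats the upload as generic JSON and names each field by its full path, so values from sibling objects end up under the same flattened name.

```python
def flatten(doc, prefix="", out=None):
    """Toy model: collect every leaf value under its dotted path name."""
    out = {} if out is None else out
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, list) and all(isinstance(v, dict) for v in value):
            # A list of nested objects: descend, extending the path prefix.
            for child in value:
                flatten(child, name + ".", out)
        else:
            out.setdefault(name, []).append(value)
    return out


if __name__ == "__main__":
    doc = {
        "type_s": "doc",
        "_childDocuments_": [
            {"type_s": "doc.userData"},
            {"type_s": "doc.enriched"},
        ],
    }
    for name, values in flatten(doc).items():
        print(name, values)
```

Under this model two sibling children produce two values for `_childDocuments_.type_s`, which is the kind of multi-value collision the error message reports.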

-- Alisa 


>Friday, 25 March 2016, 18:19 -04:00 from Alisa Z.:
>
>Hi all, 
>It is partially a question, partially a discussion. 
>I am working with documents with deep levels of nesting. The documents are in 
>a single JSON file (see a sample below).
>
>When I was on Solr 5.3.1, 
>solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
>caused no problems.
>
>Now, I am trying to run just the same on Solr-5.5.0: 
>
>solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
>java -classpath /Users//solr-5.5.0/dist/solr-core-5.5.0.jar 
>-Dauto=yes -Dc=enron_path_w_ts -Ddata=files 
>org.apache.solr.util.SimplePostTool ../data/data-solr.json
>SimplePostTool version 5.0.0
>Posting files to [base] url  http://localhost:8983/solr/my_collection/update 
>...
>Entering auto mode. File endings considered are 
>xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
>POSTing file data-solr.json (application/json) to [base]/json/docs
>SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: 
>http://localhost:8983/solr/my_collection/update/json/docs
>SimplePostTool: WARNING: Response: 
>{"responseHeader":{"status":400,"QTime":5},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"ERROR:
> [doc=AVNzOoBsX6g-H6sC3dgo] multiple values encountered for non multiValued 
>field _childDocuments_._childDocuments_.type_s: [doc.userData.parts, 
>doc.enriched.text]","code":400}}
>SimplePostTool: WARNING: IOException while reading response: 
>java.io.IOException: Server returned HTTP response code: 400 for URL: 
>http://localhost:8983/solr/my_collection/json/docs
>1 files indexed.
>COMMITting Solr index changes to  
>http://localhost:8983/solr/my_collection/update ..  .
>Time spent: 0:00:05.078
>
>So obviously I don't get my collection uploaded and indexed properly anymore.  
> 
>
>The question is: 
> - What to do?  
>

Solr-5.5.0 doesn't recognize different types of _childDocuments_ any more --degrading since 5.3.1?

2016-03-25 Thread Alisa Z .
 Hi all, 
It is partially a question, partially a discussion. 
I am working with documents with deep levels of nesting. The documents are in a 
single JSON file (see a sample below).

When I was on Solr 5.3.1, 
solr-5.3.1$ bin/post -c my_collection ../data/data-solr.json
caused no problems.

Now, I am trying to run just the same on Solr-5.5.0: 

solr-5.5.0$ bin/post -c my_collection ../data/data-solr.json
java -classpath /Users//solr-5.5.0/dist/solr-core-5.5.0.jar -Dauto=yes 
-Dc=enron_path_w_ts -Ddata=files org.apache.solr.util.SimplePostTool 
../data/data-solr.json
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/my_collection/update...
Entering auto mode. File endings considered are 
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file data-solr.json (application/json) to [base]/json/docs
SimplePostTool: WARNING: Solr returned an error #400 (Bad Request) for url: 
http://localhost:8983/solr/my_collection/update/json/docs
SimplePostTool: WARNING: Response: 
{"responseHeader":{"status":400,"QTime":5},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","org.apache.solr.common.SolrException"],"msg":"ERROR:
 [doc=AVNzOoBsX6g-H6sC3dgo] multiple values encountered for non multiValued 
field _childDocuments_._childDocuments_.type_s: [doc.userData.parts, 
doc.enriched.text]","code":400}}
SimplePostTool: WARNING: IOException while reading response: 
java.io.IOException: Server returned HTTP response code: 400 for URL: 
http://localhost:8983/solr/my_collection/json/docs
1 files indexed.
COMMITting Solr index changes to  
http://localhost:8983/solr/my_collection/update..  .
Time spent: 0:00:05.078

So obviously I don't get my collection uploaded and indexed properly anymore.   

The question is: 
 - What to do?  

The discussion is: 
- Is this proper behavior?  It used to be smooth on Solr 5.3.1: I did not need 
to know exactly how many levels of nesting I have, or to specify whether the 
_childDocuments_ were of the same type or not. 
 

A partial sample follows: 

[
  {
    "type_s": "doc",
    "_childDocuments_": [
      {
        "type_s": "doc.userData",
        "id": "AVNzOoBsX6g-H6sC3dgo-userData-23461",
        "Mime-Version_t": "1.0",
        "Content-Transfer-Encoding_t": "7bit",
        "_childDocuments_": [
          {
            "type_s": "doc.userData.parts",
            "id": "AVNzOoBsX6g-H6sC3dgo-parts-15557",
            "content_t": "- SOMETEXT",
            "contentType_t": "text/plain"
          }
        ]
      },
      {
        "type_s": "doc.enriched",
        "_childDocuments_": [
          {
            "type_s": "doc.enriched.text",
            "language_t": "english",
            "_childDocuments_": [
              {
                "type_s": "doc.enriched.text.docSentiment",
                "id": "AVNzOoBsX6g-H6sC3dgo-docSentiment-17692",
                "type_t": "positive"
              },
              {
                "type_s": "doc.enriched.text.taxonomy",
                "label_t": "/business",
                "id": "AVNzOoBsX6g-H6sC3dgo-taxonomy-12728"
              },
              {
                "type_s": "doc.enriched.text.concepts",
                "id": "AVNzOoBsX6g-H6sC3dgo-concepts-98530",
                "text_t": "Stephen",
                "_childDocuments_": [
                  {
                    "type_s": "doc.enriched.text.concepts.knowledgeGraph",
                    "id": "AVNzOoBsX6g-H6sC3dgo-knowledgeGraph-20811",
                    "typeHierarchy_t": "/people/children/stephen"
                  }
                ]
              },
              {
                "type_s": "doc.enriched.text.concepts",
                "id": "AVNzOoBsX6g-H6sC3dgo-concepts-12396",
                "text_t": "Thought",
                "_childDocuments_": [
                  {
                    "type_s": "doc.enriched.text.concepts.knowledgeGraph",
                    "id": "AVNzOoBsX6g-H6sC3dgo-knowledgeGraph-20316",
                    "typeHierarchy_t": "/people/ideas/thought"
                  }
                ]
              },
  

Re[2]: [nesting] Any way to return the whole hierarchical structure when doing Block Join queries?

2016-03-25 Thread Alisa Z .
 Mikhail, 
Thank you for the answer.  
I'd be happy to contribute tons of test cases on nested structures and their 
querying and faceting... 
I am working on moving very deeply nested data structures to Solr (the other 
option is ES...), but so far Solr seems to be quite behind... It's great 
to see that it is moving in that direction though. I am happy to provide the 
use cases (which are actually outside of e-commerce) and publicly available 
test cases.

Is it correct that the patch will appear in a release version no sooner than 
Solr 6.0 or even later?  

Thanks,
Alisa 

>Thursday, 24 March 2016, 15:52 -04:00 from Mikhail Khludnev:
>
>I think you can already kick the tires and contribute a test case to
>https://issues.apache.org/jira/browse/SOLR-8208; that's already reachable
>there, I believe, but I am still working on the core design.
>
>On Thu, Mar 24, 2016 at 10:02 PM, Alisa Z. < prol...@mail.ru > wrote:
>
>>  Hi all,
>>
>> I apologize for duplicating my previous message:
>> Solr 5.3:  anything similar to ChildDocTransformerFactory  that does not
>> flatten the hierarchical structure?
>>
>> However, it is still an open and interesting question:
>>
>> Following the example from  https://dzone.com/articles/using-solr-49-new
>> , let's say we are given multiple-level nested structure:
>>
>> 
>> id: 1
>> name: I am the parent
>> cat: PARENT
>>     id: 1.1
>>     name: I am the 1st child
>>     cat: CHILD
>>
>>     id: 1.2
>>     name: I am the 2nd child
>>     cat: CHILD
>>         id: 1.2.1
>>         name: I am a grandchildren
>>         cat: GRANDCHILD
>> 
>>
>>
>> Querying
>> q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child
>> parentFilter=cat:PARENT]
>>
>> will return flattened structure, where cat:CHILD and cat:GRANDCHILD
>> documents end up on the same level:
>> 
>> 1
>> I am the parent
>> PARENT
>> 
>> 1.1
>> I am the 1st child
>> CHILD
>> 
>> 
>> 1.2
>> I am the 2nd child
>> CHILD
>> 
>> 
>> 1.2.1
>> I am a grandchildren
>> GRANDCHILD
>> 
>>  Indeed, the JAVAdocs for ChildDocTransformerFactory say: "This
>> transformer returns all descendants of each parent document in a flat list
>> nested inside the parent document".
>>
>> Yet is there any way to preserve the hierarchy in the response? I really
>> need to find the way to preserve the structure in the response.
>>
>> Thank you in advance!
>>
>> --
>> Alisa Zhila
>> --
>>
>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
>< http://www.griddynamics.com >
>< mkhlud...@griddynamics.com >



[nesting] Any way to return the whole hierarchical structure when doing Block Join queries?

2016-03-24 Thread Alisa Z .
 Hi all, 

I apologize for duplicating my previous message: 
Solr 5.3:  anything similar to ChildDocTransformerFactory  that does not 
flatten the hierarchical structure?    

However, it is still an open and interesting question:  

Following the example from  https://dzone.com/articles/using-solr-49-new , 
let's say we are given a multi-level nested structure: 


id: 1
name: I am the parent
cat: PARENT
    id: 1.1
    name: I am the 1st child
    cat: CHILD

    id: 1.2
    name: I am the 2nd child
    cat: CHILD
        id: 1.2.1
        name: I am a grandchildren
        cat: GRANDCHILD





Querying 
q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child 
parentFilter=cat:PARENT]

returns a flattened structure, where the cat:CHILD and cat:GRANDCHILD documents 
end up on the same level:

id: 1
name: I am the parent
cat: PARENT
    id: 1.1
    name: I am the 1st child
    cat: CHILD

    id: 1.2
    name: I am the 2nd child
    cat: CHILD

    id: 1.2.1
    name: I am a grandchildren
    cat: GRANDCHILD
  
 Indeed, the Javadocs for ChildDocTransformerFactory say: "This 
transformer returns all descendants of each parent document in a flat list 
nested inside the parent document". 

Is there any way to preserve the hierarchy in the response? I really need the 
nested structure to survive in the results.  
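
One client-side workaround can be sketched in Python, assuming (as in the example above) that every id encodes its ancestry with dots, so that "1.2.1" is a child of "1.2". That is an assumption about the data, not something Solr guarantees; the function name `nest` and the `children` key are made up for this sketch.

```python
def nest(parent, flat_children):
    """Rebuild a tree from a parent doc and its flattened descendants."""
    root = dict(parent, children=[])
    by_id = {root["id"]: root}
    # Visit shallow ids first so every parent is registered before its children.
    for doc in sorted(flat_children, key=lambda d: d["id"].count(".")):
        node = dict(doc, children=[])
        by_id[node["id"]] = node
        parent_id = node["id"].rsplit(".", 1)[0]
        by_id[parent_id]["children"].append(node)
    return root


if __name__ == "__main__":
    parent = {"id": "1", "name": "I am the parent", "cat": "PARENT"}
    flat = [
        {"id": "1.1", "name": "I am the 1st child", "cat": "CHILD"},
        {"id": "1.2", "name": "I am the 2nd child", "cat": "CHILD"},
        {"id": "1.2.1", "name": "I am a grandchildren", "cat": "GRANDCHILD"},
    ]
    tree = nest(parent, flat)
    print([child["id"] for child in tree["children"]])
```

If the ids do not encode the hierarchy, the same idea works with an explicit parent-id field stored on each child at index time.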

Thank you in advance! 

-- 
Alisa Zhila
--


Solr 5.3: anything similar to ChildDocTransformerFactory that does not flatten the hierarchical structure?

2016-03-22 Thread Alisa Z .
 Hi all, 

Following the example from  https://dzone.com/articles/using-solr-49-new , 
let's say we are given a multi-level nested structure: 


id: 1
name: I am the parent
cat: PARENT
    id: 1.1
    name: I am the 1st child
    cat: CHILD

    id: 1.2
    name: I am the 2nd child
    cat: CHILD
        id: 1.2.1
        name: I am a grandchildren
        cat: GRANDCHILD





Querying 
q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child 
parentFilter=cat:PARENT]

returns a flattened structure, where the cat:CHILD and cat:GRANDCHILD documents 
end up on the same level:

id: 1
name: I am the parent
cat: PARENT
    id: 1.1
    name: I am the 1st child
    cat: CHILD

    id: 1.2
    name: I am the 2nd child
    cat: CHILD

    id: 1.2.1
    name: I am a grandchildren
    cat: GRANDCHILD
  
 Indeed, the Javadocs for ChildDocTransformerFactory say: "This 
transformer returns all descendants of each parent document in a flat list 
nested inside the parent document". 

Is there any way to preserve the hierarchy in the response? I really need the 
nested structure to survive in the results.  

Thank you in advance! 

-- 
Alisa Zhila

date range faceting on the whole dataset

2016-03-21 Thread Alisa Z .
 Hello,

Is it possible to perform date range faceting on the whole dataset without 
indicating facet.range.start and facet.range.end? 
What if  I have no clue about when my data starts and when it ends (might be 
some point in the future)?  

A sample query: 
http://localhost:8983/solr/enron-path/select?q=*:*&rows=0&facet=true&facet.range=date_tdt&f.date_tdt.facet.range.start=NOW-20YEAR&f.date_tdt.facet.range.end=NOW-14YEARS&f.date_tdt.facet.range.gap=%2B1DAY&debugQuery=true

However, in this case I found the range.start and range.end points empirically, 
and there are still a lot of "blank" periods. Given that I actually need to 
step by day, how can I avoid unnecessary calculation on dates that are outside 
my data set?  
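
One approach that might work is a two-step request, sketched here in Python: first ask the stats component for the min and max of the date field, then use those values as the range bounds. The function names are made up, the example dates are placeholders, and I believe facet.mincount also applies to range facets and would hide the empty days, but please verify that on your version.

```python
from urllib.parse import urlencode


def stats_params(field):
    """Step 1: ask the stats component for min/max of the date field."""
    return {"q": "*:*", "rows": 0, "stats": "true", "stats.field": field}


def range_params(field, min_date, max_date, gap="+1DAY"):
    """Step 2: range-facet only between the observed min and max."""
    return {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.range": field,
        # Date math: round the start down to the day, push the end past the last day.
        f"f.{field}.facet.range.start": f"{min_date}/DAY",
        f"f.{field}.facet.range.end": f"{max_date}+1DAY/DAY",
        f"f.{field}.facet.range.gap": gap,
        "facet.mincount": 1,  # assumption: suppresses days with no documents
    }


if __name__ == "__main__":
    print(urlencode(stats_params("date_tdt")))
    # Placeholder dates standing in for the min/max returned by step 1.
    print(urlencode(range_params("date_tdt", "2000-01-01T08:00:00Z", "2002-06-30T17:00:00Z")))
```

urlencode takes care of encoding the "+" in the gap as %2B, as in the sample query above.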

Thanks,

-- 
Alisa Zhila

Re[2]: [nested] how to specify a path for multiple nesting?

2016-03-21 Thread Alisa Z .
 Thanks, Mikhail. 

I eventually added a distinguishing field "path" and queried unambiguously.  

>Thursday, 17 March 2016, 9:46 -04:00 from Mikhail Khludnev:
>
>Hello,
>
>Please find inline
>
>On Wed, Mar 16, 2016 at 10:10 PM, Alisa Z.  < prol...@mail.ru > wrote:
>> Hi all,
>>I have a deeply multi-level data structure (up to 6-7 levels deep) where due 
>>to the nature of the data some nested documents can have same type names at 
>>various levels. How to form a proper query on a nested field that would 
>>contain "a path"  that defines that field?
>>
>>I'll clarify with an example:
>>
>>Reduced dataset:
>>
>>[
>> {
>>    id : book1,
>>    type_s:book,
>>    title_t : "The Way of Kings",
>>    author_s : "Brandon Sanderson",
>>    _childDocuments_ : [
>>    {
>> id: book1_c1,
>>    type_s:body,
>>    text_t:"body text of the book... ",
>>    _childDocuments_:[
>>    {id: book2_c1_e1,
>>    type_s:"keywords",
>>    text_t:["The Matrix", "Neo", "character", "somebody", ...]}
>>    ]
>>    },
>>    { id: book1_c2,
>>    type_s:title,
>>    text_t:"This book was too long.",
>>    _childDocuments_:[
>>    {id: book2_c1_e1,
>>    type_s:"keywords",
>>    text_t:["The Matrix", "Neo"]}
>>    ]
>>  }
>>    ]
>> },
>> ...
>>]
>>
>>So there are different paths to text_t field:
>>*  book.body.keywords.text_t
>>*  book.title.keywords.text_t
>>I need to write a query that returns, say, all  books which have  keyword 
>>"Neo"  in their  title  (not body). 
>>I tried :
>>
>>(1)  q={!parent which=type_s:book}type_s:keywords AND text_t:Neo
>>which is obviously incorrect (returns both books whose body keywords and 
>>title keywords contain Neo):
>>
>>(2) q={!parent which=type_s:book}type_s:body^=0{!parent 
>>which=type_s:body}type_s:keywords AND text_t:Neo
>
>I'd say this might work; however, I prefer to use v=$foo to break up the query 
>unambiguously. See also 
>https://lucidworks.com/blog/2011/12/28/why-not-and-or-and-not/ but make sure 
>that + is encoded as %2B in the URL.
>
>q={!parent which=type_s:book v=$titles}&titles=+type_s:title^=0 +{!parent 
>which='type_s:(body title book)' v=$keywords}&keywords=+type_s:keywords^=0 
>+text_t:Neo
>
>Specifying all sibling scope discriminators is the black magic of block join 
>(if it ever works). Please get back with the parsed query (from debugQuery=true) 
>and the actual/expected results. Anyway, explicitly resolving scopes 
>(type_s:body_keywords, type_s:title_keywords) might be much more maintainable. 
>
>  which does not return correct results (and I am not quite sure what it 
>really does, I just saw it in another thread of this mailing list)
>>
>>Can you help me to understand whether it is possible?
>>Or do I have to give unique types for documents at different levels of 
>>nesting (e.g., type_s:body_keywords & type_s:title_keywords)? I am trying to 
>>avoid, finding a way to specify a path would be much much more preferable. 
>>
>>
>>Thank you in advance and looking forward to hearing from you
>>--
>>Alisa Zhila
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
>
>
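
The encoding advice in the reply above can be sketched in Python. This is only an illustration: the three parameter values are copied from the reply, while the host, port and core name in BASE_URL are placeholders that do not come from this thread. The sketch relies on urllib.parse.urlencode, which percent-encodes a literal + as %2B and writes spaces as +.

```python
from urllib.parse import urlencode, parse_qs

# Placeholder endpoint (assumption): adjust host, port and core name.
BASE_URL = "http://localhost:8983/solr/books/select"

# The main query only references $titles, which in turn references $keywords.
params = {
    "q": "{!parent which=type_s:book v=$titles}",
    "titles": "+type_s:title^=0 "
              "+{!parent which='type_s:(body title book)' v=$keywords}",
    "keywords": "+type_s:keywords^=0 +text_t:Neo",
    "debugQuery": "true",  # returns the parsed query for inspection
}

# urlencode escapes each value, so the mandatory-clause "+" survives as %2B.
query_string = urlencode(params)
url = BASE_URL + "?" + query_string
print(url)

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(query_string)
```

Building the string through urlencode rather than by hand avoids the case where an unescaped + is decoded by the server as a space and the clause silently stops being mandatory.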



[nested] how to specify a path for multiple nesting?

2016-03-19 Thread Alisa Z .
Hi all, 
I have a deeply nested data structure (up to 6-7 levels deep) where, due to 
the nature of the data, some nested documents can have the same type names at 
various levels. How can I form a query on a nested field that specifies 
"a path" identifying that field? 

I'll clarify with an example:

Reduced dataset: 

[
  {
    id: book1,
    type_s: book,
    title_t: "The Way of Kings",
    author_s: "Brandon Sanderson",
    _childDocuments_: [
      {
        id: book1_c1,
        type_s: body,
        text_t: "body text of the book... ",
        _childDocuments_: [
          {
            id: book2_c1_e1,
            type_s: "keywords",
            text_t: ["The Matrix", "Neo", "character", "somebody", ...]
          }
        ]
      },
      {
        id: book1_c2,
        type_s: title,
        text_t: "This book was too long.",
        _childDocuments_: [
          {
            id: book2_c1_e1,
            type_s: "keywords",
            text_t: ["The Matrix", "Neo"]
          }
        ]
      }
    ]
  },
  ...
]

So there are different paths to the text_t field: 
*  book.body.keywords.text_t
*  book.title.keywords.text_t
I need to write a query that returns, say, all books that have the keyword "Neo" 
in their title (not body). 
I tried:

(1)  q={!parent which=type_s:book}type_s:keywords AND text_t:Neo
which is obviously incorrect (it returns books whose body keywords or title 
keywords contain Neo).

(2) q={!parent which=type_s:book}type_s:body^=0{!parent 
which=type_s:body}type_s:keywords AND text_t:Neo
which does not return correct results (and I am not quite sure what it really 
does; I just saw it in another thread on this mailing list).

Can you help me understand whether this is possible? 
Or do I have to give unique types to documents at different levels of nesting 
(e.g., type_s:body_keywords & type_s:title_keywords)? I am trying to avoid 
that; a way to specify a path would be much preferable.  


Thank you in advance and looking forward to hearing from you
-- 
Alisa Zhila
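
If no native path syntax is available, the fallback mentioned above (a unique discriminator per nesting level) does not have to be maintained by hand. Below is a minimal sketch, assuming a hypothetical path_s field computed before indexing by joining the type_s values from the root down. The field name path_s, the helper tag_paths and the sample ids are made up for illustration, and the final query string has not been run against a real Solr instance.

```python
def tag_paths(doc, parent_path=""):
    """Return a copy of doc where every level has a dotted path_s field."""
    path = f"{parent_path}.{doc['type_s']}" if parent_path else doc["type_s"]
    # Copy all fields except the children, which are rebuilt recursively.
    tagged = {key: value for key, value in doc.items()
              if key != "_childDocuments_"}
    tagged["path_s"] = path
    children = doc.get("_childDocuments_")
    if children:
        tagged["_childDocuments_"] = [tag_paths(child, path)
                                      for child in children]
    return tagged


# Reduced sample with hypothetical ids, shaped like the dataset above.
book = {
    "id": "book1", "type_s": "book",
    "_childDocuments_": [
        {"id": "book1_body", "type_s": "body",
         "_childDocuments_": [
             {"id": "book1_body_kw", "type_s": "keywords",
              "text_t": ["character"]}]},
        {"id": "book1_title", "type_s": "title",
         "_childDocuments_": [
             {"id": "book1_title_kw", "type_s": "keywords",
              "text_t": ["The Matrix", "Neo"]}]},
    ],
}

tagged = tag_paths(book)
title_keywords = tagged["_childDocuments_"][1]["_childDocuments_"][0]
print(title_keywords["path_s"])  # book.title.keywords

# Assumed query shape: one block join from the root, filtered by path.
query = "{!parent which=type_s:book}+path_s:book.title.keywords +text_t:Neo"
```

With such a field the type names can stay identical across levels, since the path alone distinguishes keywords under a title from keywords under a body.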