Re: some parent documents

Arturas Mazeika Fri, 06 Apr 2018 06:16:04 -0700

Hi Mikhail et al,

I must say that this complexity question is still bugging me, and I wonder
if it is possible to get even partial answers in Big-O notation..


Say that we have N (for example 10^6) documents, each having 10 SKUs and
each in turn having 10  storage as well as every product having 10 vendors.
Consider then answer to be 1% large (there are 10 000 documents satisfying
the query). What would be the complexity of answering it?

Cheers,
Arturas

On Thu, Apr 5, 2018 at 11:47 AM, Arturas Mazeika <maze...@gmail.com> wrote:

> Hi Mikhail et al,
>
> Thanks a lot for sharing the code snippet. I would not have been able to
> dig this Java file myself to investigate the complexity of the search
> query. Scanning the code I get a feeling that it is well structured and
> well thought of. There is a concept like advance (Parent Approximation) as
> well as ParentPhaseTwo, matches, matchCost, BlockJoinScorer, Explanation,
> Query rewriting. Is there a documentation available how the architecture
> looks like and what school of thought/doctrine used here?
>
> W.r.t. to my complexity question, I expected to see an answer in the Big-O
> notation (rather than as Java code). Typically one makes assumptions there
> about the key parameters (e.g., number of Products to be N_P, number of
> SKUs to be N_Sk, number of storages to be N_St, number of vendors to be
> N_V, JOIN Selectivities (in terms of percentage) be  p(P,SK), p(SK,ST),
> p(P,V) between the corresponding entities and computes a formula.
>
> What is the complexity of this query in big-O notation?
>
> Cheers,
> Arturas
>
>
>
> On Wed, Apr 4, 2018 at 6:16 PM, Mikhail Khludnev <m...@apache.org> wrote:
>
>> >
>> > What's happening under the hood of
>> > solr in answering query [1] from [2]?
>>
>> https://github.com/apache/lucene-solr/blob/master/lucene/
>> join/src/java/org/apache/lucene/search/join/ToParentBlo
>> ckJoinQuery.java#L178
>>
>> On Wed, Apr 4, 2018 at 3:39 PM, Arturas Mazeika <maze...@gmail.com>
>> wrote:
>>
>> > Hi Mikhail et al,
>> >
>> > Thanks a lot for a very thorough answer. This is an impressive piece of
>> > knowledge you just shared.
>> >
>> > Not surprisingly, I was caught unprepared by the 'v=...' part of the
>> > answer. This brought me to the links you posted (starts with http). From
>> > those links I went to the more updated link (starts with https), which
>> > brought me to other very resourceful links. Combined with some
>> meditation
>> > session, it came into my mind that it is not possible to express block
>> > queries using mathematical logic only. The format of the input document
>> is
>> > deeply built into the query expression and answering. Expressing these
>> > queries mathematically / logically may give an impression that solr is
>> > capable of answering (NP-?) hard problems. I have a feeling though that
>> > solr answers to queries in polynomial (or even almost linear) times.
>> >
>> > Just to connect the remaining dots.. What's happening under the hood of
>> > solr in answering query [1] from [2]? Is it really so that inverted
>> index
>> > is used to identify the vectors of ids, that are scanned linearly in a
>> hope
>> > to get matches on _root_ and other internal variables?
>> >
>> > [1] q=+{!parent which=type_s:product v=$skuq} +{!parent
>> > which=type_s:product v=$vendorq}&skuq=+COLOR_s:Blue +SIZE_s:XL +{!parent
>> > which=type_s:sku v='+QTY_i:[10 TO *] +STATE_s:CA'}&vendorq=+NAME_s:Bob
>> > +PRICE_i:[20 TO 25]
>> > [2]
>> > https://blog.griddynamics.com/searching-grandchildren-and-
>> > siblings-with-solr-block-join/
>> >
>> > Thanks!
>> > Arturas
>> >
>> > On Wed, Apr 4, 2018 at 12:36 PM, Mikhail Khludnev <m...@apache.org>
>> wrote:
>> >
>> > > q=+{!parent which=ntype:p v='+msg:Hello +person:Arturas'} +{!parent
>> > which=
>> > > ntype:p v='+msg:ciao +person:Vai'}
>> > >
>> > > On Wed, Apr 4, 2018 at 12:19 PM, Arturas Mazeika <maze...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Mikhail et al,
>> > > >
>> > > > It seems to me that the nested documents must include nodes that
>> encode
>> > > the
>> > > > level of nodes (within the document). Therefore, the minimal example
>> > must
>> > > > include the node type. Is the following structure sufficient?
>> > > >
>> > > > {
>> > > >     "id":1,
>> > > >     "ntype":"p",
>> > > >     "_childDocuments_":
>> > > >     [
>> > > >         {"id":"1_1", "ntype":"c", "person":"Vai",     "time":"3:14",
>> > > > "msg":"Hello"},
>> > > >         {"id":"1_2", "ntype":"c", "person":"Arturas", "time":"3:14",
>> > > > "msg":"Hello"},
>> > > >         {"id":"1_3", "ntype":"c", "person":"Vai",     "time":"3:15",
>> > > > "msg":"Coz Mathias is working on another system- different
>> screen."},
>> > > >         {"id":"1_4", "ntype":"c", "person":"Vai",     "time":"3:15",
>> > > > "msg":"It can get annoying"},
>> > > >         {"id":"1_5", "ntype":"c", "person":"Arturas", "time":"3:15",
>> > > > "msg":"Thank you. this is very nice of you"},
>> > > >         {"id":"1_6", "ntype":"c", "person":"Vai",     "time":"3:16",
>> > > > "msg":"ciao"},
>> > > >         {"id":"1_7", "ntype":"c", "person":"Arturas", "time":"3:16",
>> > > > "msg":"ciao"}
>> > > >     ]
>> > > > },
>> > > > {
>> > > >     "id":2,
>> > > >     "ntype":"p",
>> > > >     "_childDocuments_":
>> > > >     [
>> > > >         {"id":"2_1", "ntype":"c", "person":"Vai",     "time":"4:14",
>> > > > "msg":"Hi"},
>> > > >         {"id":"2_2", "ntype":"c", "person":"Arturas", "time":"4:14",
>> > > > "msg":"IBM Watson"},
>> > > >         {"id":"2_3", "ntype":"c", "person":"Vai",     "time":"4:15",
>> > > > "msg":"need to retain content"},
>> > > >         {"id":"2_4", "ntype":"c", "person":"Vai",     "time":"4:15",
>> > > > "msg":"It can get annoying"},
>> > > >         {"id":"2_5", "ntype":"c", "person":"Arturas", "time":"4:15",
>> > > > "msg":"You can make all your meetings more access"},
>> > > >         {"id":"2_6", "ntype":"c", "person":"Vai",     "time":"4:16",
>> > > > "msg":"Make every meeting a Skype meeting"},
>> > > >         {"id":"2_7", "ntype":"c", "person":"Arturas", "time":"4:16",
>> > > > "msg":"ciao"}
>> > > >     ]
>> > > > }
>> > > >
>> > > > How would a query look like that has a Hello from Person Arturas and
>> > ciao
>> > > > from Person Vai?
>> > > >
>> > > > Cheers,
>> > > > Arturas
>> > > >
>> > > >
>> > > > On Tue, Apr 3, 2018 at 5:21 PM, Arturas Mazeika <maze...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > Hi Mikhail,
>> > > > >
>> > > > > Thanks a lot for the reply.
>> > > > >
>> > > > > You mentioned that
>> > > > >
>> > > > > q=+{!parent which.. v='+text:hello +person:A'} +{!parent
>> > > > > which..v='+text:ciao +person:B'}
>> > > > >
>> > > > > is the way to go. How would it look like precisely for the
>> following
>> > > > > collection?
>> > > > >
>> > > > > {
>> > > > >     "id":1,
>> > > > >     "_childDocuments_":
>> > > > >     [
>> > > > >         {"id":"1_1", "person":"Vai"         , "time":"3:14",
>> > > > > "msg":"Hello"},
>> > > > >         {"id":"1_2", "person":"Arturas"     , "time":"3:14",
>> > > > > "msg":"Hello"},
>> > > > >         {"id":"1_3", "person":"Vai"         , "time":"3:15",
>> > "msg":"Coz
>> > > > > Mathias is working on another system- different screen."},
>> > > > >         {"id":"1_4", "person":"Vai"         , "time":"3:15",
>> > "msg":"It
>> > > > can
>> > > > > get annoying"},
>> > > > >         {"id":"1_5", "person":"Arturas"     , "time":"3:15",
>> > > "msg":"Thank
>> > > > > you. this is very nice of you"},
>> > > > >         {"id":"1_6", "person":"Vai"         , "time":"3:16",
>> > > > "msg":"ciao"},
>> > > > >         {"id":"1_7", "person":"Arturas"     , "time":"3:16",
>> > > > "msg":"ciao"}
>> > > > >     ]
>> > > > > },
>> > > > > {
>> > > > >     "id":2,
>> > > > >     "_childDocuments_":
>> > > > >     [
>> > > > >         {"id":"2_1", "person":"Vai"         , "time":"4:14",
>> > > > > "msg":"Hello"},
>> > > > >         {"id":"2_2", "person":"Arturas"     , "time":"4:14",
>> > "msg":"IBM
>> > > > > Watson"},
>> > > > >         {"id":"2_3", "person":"Vai"         , "time":"4:15",
>> > > "msg":"need
>> > > > > to retain content"},
>> > > > >         {"id":"2_4", "person":"Vai"         , "time":"4:15",
>> > "msg":"It
>> > > > can
>> > > > > get annoying"},
>> > > > >         {"id":"2_5", "person":"Arturas"     , "time":"4:15",
>> > "msg":"You
>> > > > > can make all your meetings more access"},
>> > > > >         {"id":"2_6", "person":"Vai"         , "time":"4:16",
>> > > "msg":"Make
>> > > > > every meeting a Skype meeting"},
>> > > > >         {"id":"2_7", "person":"Arturas"     , "time":"4:16",
>> > > > "msg":"ciao"}
>> > > > >     ]
>> > > > > }
>> > > > >
>> > > > > Cheers,
>> > > > > Arturas
>> > > > >
>> > > > >
>> > > > > On Tue, Apr 3, 2018 at 4:33 PM, Mikhail Khludnev <m...@apache.org
>> >
>> > > > wrote:
>> > > > >
>> > > > >> Hello, Arturas.
>> > > > >>
>> > > > >> TLDR; Please find inline below.
>> > > > >>
>> > > > >> On Tue, Apr 3, 2018 at 5:14 PM, Arturas Mazeika <
>> maze...@gmail.com>
>> > > > >> wrote:
>> > > > >>
>> > > > >> > Hi Solr Fans,
>> > > > >> >
>> > > > >> > I am trying to make sense of information retrieval using
>> > expressions
>> > > > >> like
>> > > > >> > "some parent", "*only parent*", " *all parent*". I am also
>> trying
>> > to
>> > > > >> > understand the syntax "!parent which" and "!child of". On the
>> > > > technical
>> > > > >> > level, I am reading the following documents:
>> > > > >> >
>> > > > >> > [1]
>> > > > >> > https://lucene.apache.org/solr/guide/7_2/other-parsers.
>> > > > >> > html#block-join-query-parsers
>> > > > >> > [2]
>> > > > >> > https://lucene.apache.org/solr/guide/7_2/uploading-data-
>> > > > >> > with-index-handlers.html#nested-child-documents
>> > > > >> > [3] http://yonik.com/solr-nested-objects/
>> > > > >> >
>> > > > >> > and I am confused to read:
>> > > > >> >
>> > > > >> > This parser takes a query that matches some parent documents
>> and
>> > > > returns
>> > > > >> > their children. The syntax for this parser is: q={!child
>> > > > >> > of=<allParents>}<someParents>. The parameter allParents is a
>> > filter
>> > > > that
>> > > > >> > matches *only parent documents*; here you would define the
>> field
>> > and
>> > > > >> value
>> > > > >> > that you used to identify *all parent documents*. The parameter
>> > > > >> someParents
>> > > > >> > identifies a query that will match some of the parent
>> documents.
>> > The
>> > > > >> output
>> > > > >> > is the children.
>> > > > >> >
>> > > > >> > The first sentence talks about "matching" but does not define
>> what
>> > > > that
>> > > > >> > means (and why it is only some parents matching?). The second
>> > > sentence
>> > > > >> > introduces a syntax of the parser, but blurs the understanding
>> as
>> > > > "some"
>> > > > >> > and "all" of parents are combined into one sentence. My
>> > > understanding
>> > > > is
>> > > > >> > that all documents are retrieve that satisfy a query. The query
>> > must
>> > > > >> > express some constraints on the parent node and some on the
>> child
>> > > > node.
>> > > > >> I
>> > > > >> > have a feeling that "only parent documents" reads "criteria is
>> > > > >> formulated
>> > > > >> > over the parent part of {parent document}->{child document} of
>> > > entity.
>> > > > >> > My simplified conceptual world of solr looks in the following
>> way:
>> > > > >> >
>> > > > >> > 1. Every document has an ID.
>> > > > >> > 2. Every document may have additional attributes
>> > > > >> > 3. Text attributes is what's at stake in solr. Sure we can
>> search
>> > > for
>> > > > >> > products that costs at most X, but this is the added
>> > functionality.
>> > > > For
>> > > > >> > simplicity I am neglecting those here.
>> > > > >> > 4. The user has an information need. She expresses it with
>> > > (key)words
>> > > > >> and
>> > > > >> > hopes to find matching documents. For simplicity, I am skipping
>> > all
>> > > > >> issues
>> > > > >> > related to the information presentation of the documents
>> > > > >> > 5. Analysis chain (and inverse index) are the key technologies
>> > solr
>> > > is
>> > > > >> > based upon. Once the chain-processing is applied, mathematical
>> > logic
>> > > > >> kicks
>> > > > >> > in, retrieving the documents (that are a set of processed,
>> > > normalized,
>> > > > >> > enriched tokens) matching the query (processed, normalized and
>> > > > enriched
>> > > > >> > tokens). Clearly, the logic function can be a fancy one (at
>> least
>> > > one
>> > > > of
>> > > > >> > query token is in the document set of tokens, etc.), ranking is
>> > used
>> > > > to
>> > > > >> > sort the results.
>> > > > >> > 6. A nested document concept is introduced in solr. It needs
>> to be
>> > > > >> uploaded
>> > > > >> > into the index structure using a specific handlers [2]. A
>> nested
>> > > > >> documents
>> > > > >> > is a tree. A root may contain children documents, which may be
>> > > parents
>> > > > >> of
>> > > > >> > grandchildren documents.
>> > > > >> > 7. Querying nested documents is supported in the following
>> manner:
>> > > > >> >     7.1 Child documents are return that satisfies {parent
>> > > > >> > document}->{document}
>> > > > >> >     7.2 Parent documents are return that satisfy
>> > {document}->{child
>> > > > >> > document}
>> > > > >> >
>> > > > >> > Would I be very wrong to have this conceptual picture?
>> > > > >> >
>> > > > >> > From this point, the situation is a bit bury in my head. At the
>> > > core,
>> > > > I
>> > > > >> do
>> > > > >> > not really understand what "a document" is anymore (since the
>> > > complete
>> > > > >> json
>> > > > >> > or xml, so is a sub-json and sub-xml are documents, every
>> document
>> > > > must
>> > > > >> > have an ID, does that meant the the subdocuments must have and
>> ID
>> > > too,
>> > > > >> or
>> > > > >> > sub-ids are also fine?), how to formulate mathematical
>> expressions
>> > > > over
>> > > > >> > documents and what it means that the document satisfies my
>> > (key)word
>> > > > >> query?
>> > > > >> > Can we define a document to be the largest entity of
>> information
>> > > that
>> > > > >> does
>> > > > >> > not contain any other nested documents [4]? If this is defined
>> and
>> > > > >> > communicated like this already where can I find it? There is a
>> use
>> > > of
>> > > > >> the
>> > > > >> > clarification, as the concept of the document means different
>> > things
>> > > > in
>> > > > >> > different contexts (e.g., you can update only the "complete
>> > > document"
>> > > > in
>> > > > >> > the index vs. parent document, etc.).
>> > > > >> >
>> > > > >> > Is it possible to formulate what's going on using mathematical
>> > > logic?
>> > > > >> Can
>> > > > >> > one express something like
>> > > > >> >
>> > > > >> > { give documents d : d is a document, d is parent of document
>> c, d
>> > > > >> > satisfies logical criteria C1,....,CN, c satisfies logical
>> > criteria
>> > > > >> > C1',...,CM'}
>> > > > >> > { give documents c : c is a document, d is parent of document
>> c, d
>> > > > >> > satisfies logical criteria C1,....,CN, c satisfies logical
>> > criteria
>> > > > >> > C1',...,CM'}
>> > > > >> >
>> > > > >> > here the meaning of document is as in definition [4] above.
>> > > > >> >
>> > > > >> > 1. Is it possible to retrieve all parent documents that have
>> two
>> > > > >> children
>> > > > >> > c1 and c2? Consider a document that is a skype chat, and
>> children
>> > > are
>> > > > >> > individual lines of communication in the chat. I would be
>> looking
>> > > for
>> > > > >> the
>> > > > >> > (parent) documents that have "hello" said by person A and
>> "ciao"
>> > > said
>> > > > by
>> > > > >> > person B (as two different sub-documents).
>> > > > >> >
>> > > > >>
>> > > > >> q=+{!parent which.. v='+text:hello +person:A'} +{!parent which..
>> > > > >> v='+text:ciao +person:B'}
>> > > > >> The query syntax is really tricky and cumbersome.
>> > > > >>
>> > > > >>
>> > > > >> >
>> > > > >> > 2. Is it possible to search for documents such that they have a
>> > > > >> grandchild
>> > > > >> > and the grandchild has the word "hello"?
>> > > > >> >
>> > > > >>
>> > > > >> http://blog-archive.griddynamics.com/2013/12/grandchildren-
>> > > > >> and-siblings-with-block.html
>> > > > >>
>> > > > >>
>> > > > >> >
>> > > > >> > 3. Is it possible to search for documents that do not have
>> > children?
>> > > > >> >
>> > > > >> q=-{!parent which..}type:child
>> > > > >> Beware that mixing parents and childfree products is not
>> supported
>> > and
>> > > > >> causes pain. as a workaround you need to put empty child
>> placeholder
>> > > > doc.
>> > > > >> Sic. Sorry.
>> > > > >>
>> > > > >>
>> > > > >> > Is this the right venue to discuss documentation of solr?
>> > > > >> >
>> > > > >> > Thanks!
>> > > > >> > Arturas
>> > > > >> >
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Sincerely yours
>> > > > >> Mikhail Khludnev
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Sincerely yours
>> > > Mikhail Khludnev
>> > >
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
>

Re: some parent documents

Reply via email to