Hello,

    I recently asked one question about Solr limitations and another about
joins. It turns out that this question is about both at the same time.
    I am trying to figure out how to denormalize my data so that I need just
1 document in my index instead of performing a join. I figure one way of
doing this is to store an entity as a multivalued field, instead of storing
different fields.
    Let me give an example. Consider the entities:

User:
    id: 1
    name: Joan of Arc
    age: 27

Webpage:
    id: 1
    url: http://wiki.apache.org/solr/Join
    category: Technical
    user_id: 1

    id: 2
    url: http://stackoverflow.com
    category: Technical
    user_id: 1

    Instead of creating 1 document for the user, 1 for webpage 1 and 1 for
webpage 2 (1 parent and 2 children), I could store each webpage in a
multivalued field on the user document, as follows:

User:
    id: 1
    name: Joan of Arc
    age: 27
    webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join", "category: Technical"]
    webpage2: ["id:2", "url: http://stackoverflow.com", "category: Technical"]
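
    To make the idea concrete, here is a minimal sketch in plain Python of how
the parent and child records could be folded into a single document before
indexing. The function name denormalize, the "key: value" string layout and
the webpageN field naming are just my assumptions for illustration. Nothing
here calls Solr, it only builds the dictionary I would send to it.

```python
def denormalize(user, webpages):
    """Fold every webpage owned by `user` into one multivalued field each."""
    doc = dict(user)  # copy so the original user record is left untouched
    owned = [w for w in webpages if w["user_id"] == user["id"]]
    for n, page in enumerate(owned, start=1):
        # One list of "key: value" strings per webpage; the foreign key is
        # dropped because the parent document already identifies the user.
        doc[f"webpage{n}"] = [
            f"{key}: {value}" for key, value in page.items() if key != "user_id"
        ]
    return doc


user = {"id": 1, "name": "Joan of Arc", "age": 27}
webpages = [
    {"id": 1, "url": "http://wiki.apache.org/solr/Join",
     "category": "Technical", "user_id": 1},
    {"id": 2, "url": "http://stackoverflow.com",
     "category": "Technical", "user_id": 1},
]

doc = denormalize(user, webpages)
```

    Note that the number of fields in the document grows with the number of
webpages, which is exactly what worries me below.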

    It would probably perform better than the join, right? However, it made
me think about Solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values in a single field, for example if I need to index every HTML
DOM element (div, a, etc.) of each web page the user visited.
    I mean, if I need to run the query and it is a business requirement no
matter what, then although denormalizing could be better than using
query-time joins, I wonder if distributing the data held in this single
document across the cluster wouldn't give me better performance. And this is
something I won't get with block joins or multivalued fields...
    I guess there is probably no right answer to this question (at least
not a known one), and I know I should create a POC to check how each option
performs... But do you think such a large number of values in a single
document could make denormalization impossible in an extreme case like
this? Would you agree with me if I said denormalization is not always
the right option?
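
    A rough way to picture the difference, again in plain Python: a single
denormalized document has one id and therefore lands on one shard, while one
document per webpage can spread over all of them. The shard count, the id
naming and the MD5-based shard_for function are assumptions I made up for
this sketch. It is only a stand-in for hash-based routing in general, not
Solr's actual router.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size, chosen only for illustration


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Option A: one big denormalized user document, so one id and one shard.
single = Counter([shard_for("user-1")])

# Option B: one document per webpage, so the ids spread across the shards.
spread = Counter(shard_for(f"webpage-{n}") for n in range(1, 1001))
```

    In option A, however many values the document holds, all of the work for
that user stays on a single node.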

Best regards,
-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
