On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle <mvall...@gmail.com> wrote:

> Hello,
>
>     I recently asked a question about Solr limitations and another one
> about joins. It turns out this question is about both at the same time.
>     I am trying to figure out how to denormalize my data so that I will
> need just 1 document in my index instead of performing a join. One way of
> doing this, I figure, is storing an entity as a multivalued field instead
> of storing separate fields.
>     Let me give an example. Consider the entities:
>
> User:
>     id: 1
>     name: Joan of Arc
>     age: 27
>
> Webpage:
>     id: 1
>     url: http://wiki.apache.org/solr/Join
>     category: Technical
>     user_id: 1
>
>     id: 2
>     url: http://stackoverflow.com
>     category: Technical
>     user_id: 1
>
>     Instead of creating 1 document for the user, 1 for webpage 1, and 1
> for webpage 2 (1 parent and 2 children), I could store the webpages in
> multivalued fields on the user document, as follows:
>
> User:
>     id: 1
>     name: Joan of Arc
>     age: 27
>     webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join";, "category:
> Technical"]
>     webpage2: ["id:2", "url: http://stackoverflow.com";, "category:
> Technical"]
>
>     It would probably perform better than the join, right? However, it made
> me think about Solr limitations again. What if I have 200 million webpages
> (200 million fields) per user? Or imagine a case where I could have 200
> million values in a single field, as when I need to index every HTML DOM
> element (div, a, etc.) of every web page the user visited.
>     I mean, if I need to do the query and this is a business requirement no
> matter what, then although denormalizing could be better than using
> query-time joins, I wonder whether distributing the data present in this
> single document across the cluster wouldn't give me better performance. And
> that is something I won't get with block joins or multivalued fields...
>

Indeed, and when you think about it, there are really only two alternatives:

1. let your distributed search cluster have knowledge of the relations
2. denormalize & duplicate the data
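
To make alternative 2 concrete, here is a minimal SolrJ sketch of what the
flattened user document could look like. The field names (webpage_url,
webpage_category) and the core URL are only my assumptions, not your actual
schema, and those fields would have to be declared multiValued="true":

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DenormalizedIndexer {
    public static void main(String[] args) throws Exception {
        // hypothetical core URL - adjust to your setup
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/users");

        // one document per user; the webpages are folded into multivalued fields
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("name", "Joan of Arc");
        doc.addField("age", 27);
        // calling addField repeatedly on the same name makes the field multivalued
        doc.addField("webpage_url", "http://wiki.apache.org/solr/Join");
        doc.addField("webpage_category", "Technical");
        doc.addField("webpage_url", "http://stackoverflow.com");
        doc.addField("webpage_category", "Technical");

        solr.add(doc);
        solr.commit();
    }
}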


>     I guess there is probably no right answer for this question (at least
> not a known one), and I know I should create a POC to check how each
> perform... But do you think a so large number of values in a single
> document could make denormalization not possible in an extreme case like
> this? Would you share my thoughts if I said denormalization is not always
> the right option?
>

Aren't words of natural language (and whatever crap comes with them in the
fulltext) similar? You may not want to retrieve relations between every word
you indexed, but you can still index millions of unique tokens (well, 200
million does seem too high). And if you really have such a high number of
unique values, you can think of indexing hash values instead - a search for
'near-duplicates' could be acceptable too.
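
Just a sketch of what I mean by indexing hash values - MD5 here is only an
example (a real near-duplicate scheme would more likely use shingles or
MinHash), but the point is that the index then holds short fixed-size tokens
instead of 200 million long unique strings:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ValueHasher {
    // reduce an arbitrary value (e.g. the text of one DOM element) to a
    // short, fixed-size token and index that instead of the raw string
    static String hash(String value) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(value.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // identical elements collide to the same token, which is what makes
        // this kind of lookup cheap
        System.out.println(hash("<div class=\"nav\">...</div>"));
    }
}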

And so, with Lucene, only denormalization will get you anywhere close to
acceptable search speed. If you look at the code that executes the join
search, you will see that the values for the first-order search are harvested
first, and then a new search (or lookup) is performed - so it is almost
always going to be slower than a plain inverted index lookup.
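
For illustration, the two query shapes side by side (SolrJ, with the field
names taken from the schema I assumed above):

import org.apache.solr.client.solrj.SolrQuery;

public class JoinVsFlat {
    public static void main(String[] args) {
        // query-time join: harvest the matching webpage docs first,
        // then look up the parent user docs - two passes over the index
        SolrQuery joined = new SolrQuery("{!join from=user_id to=id}category:Technical");

        // denormalized index: a single inverted-index lookup on the
        // multivalued field of the user document itself
        SolrQuery flat = new SolrQuery("webpage_category:Technical");

        System.out.println(joined.getQuery());
        System.out.println(flat.getQuery());
    }
}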

roman


>
> Best regards,
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>
