Re: amount of values in a multi value field - is denormalization always the best option?
I also have a similar scenario, where fundamentally I have to retrieve all URLs where a userid has been found. So, in my schema, I designed the URL as a (string) key with a (possibly huge) list of attributes automatically mapped to strings. For example:

Url1 (key):
- language: en
- content: userid1
- content: userid1
- content: userid1 (i.e. 3 occurrences for user 1)
- content: userid2
- content: userid3
- author: userid4

and so on and so forth. So, if I understood correctly, you're saying that this is a bad design? How would you fix my schema in that case?

Best,
Flavio

On Wed, Jul 10, 2013 at 11:53 PM, Jack Krupansky j...@basetechnology.com wrote:
[snip]
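Flavio's url-keyed design can be sketched as a document shape (a rough sketch only; the field names are the ones from his example, and whether `content` should repeat per occurrence or carry a count is exactly the open design question):

```python
from collections import Counter

# One document per URL; each userid occurrence appends another value to a
# multivalued field, so user 1 appears three times under "content".
url_doc = {
    "id": "Url1",
    "language": "en",
    "content": ["userid1", "userid1", "userid1", "userid2", "userid3"],
    "author": ["userid4"],
}

# The query Flavio needs is the reverse mapping: all URLs where a userid
# occurs. With an inverted index that is a plain term lookup; the repeated
# values only matter if occurrence counts are needed.
occurrences = Counter(url_doc["content"])
```

Here `occurrences["userid1"]` comes out as 3, mirroring the three `content: userid1` lines in the example.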
Re: amount of values in a multi value field - is denormalization always the best option?
Hello Flavio,

Out of curiosity, are you already using this in prod? Would you share your results / benchmarks with us (if you have any)? I wonder how it is performing for you. I was thinking of using a very similar schema to yours.

The thing is: each option has drawbacks; there is no universally good or bad schema, if I understood things correctly. Even joins, which are something we should generally avoid in a NoSQL technology like Solr, may be a good option in some cases. I guess sometimes the only things that can answer such questions are POCs and benchmarks. I am not a Solr expert, and there are several committers on this list who can help you much better than I can, but the way I see it, you should try your solution, see how it performs, and keep looking for alternatives that perform better, if possible. As I said, I am not an expert, but I wouldn't call your model a bad model that needs fixing. It's a possible model, and who knows, maybe another model could perform better. As with algorithms, we should assume we can always do better...

Best regards,
Marcelo.

2013/7/11 Flavio Pompermaier pomperma...@okkam.it:
[snip]
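The POC-and-benchmark approach Marcelo recommends could start from a harness as small as this (a sketch; the commented usage lines assume a running Solr instance and a hypothetical client such as pysolr):

```python
import time

def bench(label, fn, runs=50):
    """Time fn() over several runs and report the median, which is less
    noisy than the mean when benchmarking queries."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    samples.sort()
    median = samples[len(samples) // 2]
    print(f"{label}: median {median * 1000:.2f} ms over {runs} runs")
    return median

# Usage idea, comparing the two schemas under discussion:
#   bench("join", lambda: solr.search("{!join from=user_id to=id}category:Technical"))
#   bench("flat", lambda: solr.search("webpage_category:Technical"))
```

Run each variant against the same corpus size and query mix; a median over warm caches is the number worth comparing.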
Re: amount of values in a multi value field - is denormalization always the best option?
Yeah, probably you're right... I have to test different configurations! That is why I'd like to know the available solutions in advance. Fortunately I'm still developing, so I'm still in a position to investigate alternatives. Obviously I'll do some benchmarking, but I should know the alternatives first... so I asked the list! I'm sure someone will give me some hints, or at least I hope so :)

Best,
Flavio

On Thu, Jul 11, 2013 at 3:46 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote:
[snip]
Re: amount of values in a multi value field - is denormalization always the best option?
Again, generally: if the number of values is relatively modest, you don't need to discriminate (tell which one matches on a search), and you don't edit the list, a multivalued field makes perfect sense. But if any of those requirements does not hold, then you need to represent the items as discrete Solr documents. It all depends on your particular data and your particular requirements.

-- Jack Krupansky

-----Original Message----- From: Flavio Pompermaier Sent: Thursday, July 11, 2013 7:50 AM To: solr-user@lucene.apache.org Subject: Re: amount of values in a multi value field - is denormalization always the best option?
[snip]
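Jack's "discrete Solr documents" alternative amounts to splitting one fat denormalized document into a parent plus one child per list entry. A minimal sketch of that transform (the helper name and field names are hypothetical, following the thread's user/webpage example):

```python
def explode(user_doc, list_field="webpages"):
    """Split one denormalized user document into discrete Solr documents:
    one parent document plus one child document per entry in the
    multivalued field."""
    parent = {k: v for k, v in user_doc.items() if k != list_field}
    parent["type"] = "user"
    children = []
    for i, page in enumerate(user_doc.get(list_field, []), start=1):
        child = dict(page)
        child["type"] = "webpage"
        child["user_id"] = user_doc["id"]      # back-reference for query-time join
        child.setdefault("id", f"{user_doc['id']}-page-{i}")
        children.append(child)
    return parent, children

parent, kids = explode({
    "id": "user-1", "name": "Joan of Arc", "age": 27,
    "webpages": [
        {"url": "http://wiki.apache.org/solr/Join", "category": "Technical"},
        {"url": "http://stackoverflow.com", "category": "Technical"},
    ],
})
```

Each child can now be updated or deleted individually, which is precisely what a multivalued field does not allow.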
Re: amount of values in a multi value field - is denormalization always the best option?
Simple answer: avoid a large number of values in a single document. There should be only a modest to moderate number of fields in a single document.

Is the data relatively static, or subject to frequent updates? Updating any field of a single document, even with atomic update, requires Solr to read and rewrite every field of the document. So, lots of smaller documents are best for a frequent-update scenario.

Multivalued fields are great for storing a relatively small list of values. You can add to the list easily, but under the hood Solr must read and rewrite the full list as well as the full document. And there is no way to address or synchronize individual elements of a multivalued field.

Joins are great... if used in moderation. Heavy use of joins is not a great idea.

-- Jack Krupansky

-----Original Message----- From: Marcelo Elias Del Valle Sent: Wednesday, July 10, 2013 5:37 PM To: solr-user@lucene.apache.org Subject: amount of values in a multi value field - is denormalization always the best option?

Hello,

I have asked a question recently about Solr limitations and some about joins. It turns out this question is about both at the same time. I am trying to figure out how to denormalize my data so I will need just 1 document in my index instead of performing a join. I figure one way of doing this is storing an entity as a multivalued field, instead of storing different fields. Let me give an example. Consider the entities:

User:
- id: 1
- name: Joan of Arc
- age: 27

Webpage:
- id: 1, url: http://wiki.apache.org/solr/Join, category: Technical, user_id: 1
- id: 2, url: http://stackoverflow.com, category: Technical, user_id: 1

Instead of creating 1 document for the user, 1 for webpage 1, and 1 for webpage 2 (1 parent and 2 children), I could store the webpages in multivalued fields of the user, as follows:

User:
- id: 1
- name: Joan of Arc
- age: 27
- webpage1: [id: 1, url: http://wiki.apache.org/solr/Join, category: Technical]
- webpage2: [id: 2, url: http://stackoverflow.com, category: Technical]

It would probably perform better than the join, right? However, it made me think about Solr limitations again. What if I have 200 million webpages (200 million fields) per user? Or imagine a case where I could have 200 million values in a field, as in the case where I need to index every HTML DOM element (div, a, etc.) of each web page a user visited. I mean, if I need to do the query and this is a business requirement no matter what, then although denormalizing could be better than using query-time joins, I wonder if distributing the data present in this single document across the cluster wouldn't give me better performance. And that is something I won't get with block joins or multivalued fields...

I guess there is probably no right answer for this question (at least not a known one), and I know I should create a POC to check how each option performs... But do you think such a large number of values in a single document could make denormalization infeasible in an extreme case like this? Would you agree if I said denormalization is not always the right option?

Best regards,
-- Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
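Marcelo's two candidate schemas can be put side by side as document shapes (a sketch only; the `{!join}` string is Solr's standard query-time join parser, and the field names follow the example in the thread):

```python
# Option 1: normalized - one document per entity, related at query time.
normalized_docs = [
    {"id": "user-1", "type": "user", "name": "Joan of Arc", "age": 27},
    {"id": "page-1", "type": "webpage", "url": "http://wiki.apache.org/solr/Join",
     "category": "Technical", "user_id": "user-1"},
    {"id": "page-2", "type": "webpage", "url": "http://stackoverflow.com",
     "category": "Technical", "user_id": "user-1"},
]
# "Users who have a Technical webpage", via Solr's query-time join:
join_query = "{!join from=user_id to=id}category:Technical"

# Option 2: denormalized - one document per user, with the webpages folded
# into parallel multivalued fields; no join needed, but the document grows
# without bound as webpages are added.
denormalized_doc = {
    "id": "user-1", "name": "Joan of Arc", "age": 27,
    "webpage_url": ["http://wiki.apache.org/solr/Join", "http://stackoverflow.com"],
    "webpage_category": ["Technical", "Technical"],
}
flat_query = "webpage_category:Technical"
```

Option 2 answers the query with a single term lookup, at the cost Jack describes: every addition rewrites the whole user document, and there is no way to address one webpage individually.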
Re: amount of values in a multi value field - is denormalization always the best option?
Jack,

When you say "large number of values in a single document", you also mean a block in a block join, right? Exactly the same thing, agreed? In my case, I have just 1 insert and no updates. Even in this case, do you think a large document or block would be a really bad idea? I am more worried about search time.

Best regards,
Marcelo.

2013/7/10 Jack Krupansky j...@basetechnology.com:
[snip]

-- Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
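For reference, the block-join approach Marcelo alludes to indexes parent and children contiguously as one block and queries them with a parent query parser. A sketch of both sides (the `_childDocuments_` key and `{!parent}` parser landed in Solr releases after this thread, so treat the details as approximate):

```python
# A document block: the children are listed inside the parent under the
# special _childDocuments_ key and indexed together in one update call.
block = {
    "id": "user-1", "type_s": "user", "name_s": "Joan of Arc",
    "_childDocuments_": [
        {"id": "page-1", "type_s": "webpage", "category_s": "Technical"},
        {"id": "page-2", "type_s": "webpage", "category_s": "Technical"},
    ],
}

# "Parents that have a Technical child", via the block-join parent parser;
# 'which' must match all parent documents and only parent documents.
q = '{!parent which="type_s:user"}category_s:Technical'
```

Unlike a query-time join, this resolves the relation by index position rather than a per-query lookup, but the whole block must be reindexed together, so Jack's large-document caution applies to large blocks as well.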
Re: amount of values in a multi value field - is denormalization always the best option?
On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote:
[snip]
> I wonder if distributing the data present in this single document across
> the cluster wouldn't give me better performance. And this is something I
> won't get with block joins or multivalued fields...

Indeed, and when you think of it, there are only (2?) alternatives:
1. let your distributed search cluster have the knowledge of relations
2. denormalize / duplicate the data

> But do you think such a large number of values in a single document could
> make denormalization not possible in an extreme case like this? Would you
> agree if I said denormalization is not always the right option?

Aren't words of natural language (and whatever crap comes with them in the fulltext) similar? You may not want to retrieve relations between every word that you indexed, but you can still index millions of unique tokens (well, 200 million does seem too high). And if you had such a high number of unique values, you could consider indexing hash values - searching for 'near-duplicates' could be acceptable too. So, with Lucene, only denormalization will get you anywhere close to acceptable search speed. If you look at the code that executes the join search, you will see that the values for the first-order search are harvested, and then a new search (or lookup) is performed - so it almost always has to be slower than an inverted index lookup.

roman
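Roman's hashing idea can be sketched in a few lines (the helper name, token width, and field names are assumptions for illustration):

```python
import hashlib

def hash_token(value, bits=64):
    """Map an arbitrary value to a short fixed-size token before indexing,
    for very high-cardinality fields. Collisions are possible, so matches
    are 'near-duplicate' candidates, not exact hits."""
    digest = hashlib.sha1(value.encode("utf-8")).hexdigest()
    return digest[: bits // 4]  # each hex character carries 4 bits

# Index the hashes instead of the raw values; query by hashing the probe.
doc = {"id": "url-1", "content_hash": [hash_token("userid1"), hash_token("userid2")]}
```

The same input always hashes to the same token, so a lookup for `hash_token("userid1")` finds the document; the trade-off is a small, quantifiable false-positive rate from collisions.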
Re: amount of values in a multi value field - is denormalization always the best option?
Join is a query operation - it has nothing to do with the number of values (fields and multivalued fields) in a Solr/Lucene document. Block insert isn't available yet anyway, so we don't have any clear assessment of its performance.

Generally, any kind of large block of data is not a great idea.
1. Break things down.
2. Keep things simple.
3. Join is not simple.
4. Only use non-simple features in careful moderation.

There is no reasonable shortcut to doing a robust data model. Shortcuts may seem enticing in the short run, but will eat you alive in the long run.

-- Jack Krupansky

-----Original Message----- From: Marcelo Elias Del Valle Sent: Wednesday, July 10, 2013 6:52 PM To: solr-user@lucene.apache.org Subject: Re: amount of values in a multi value field - is denormalization always the best option?
[snip]

-- Marcelo Elias Del Valle
http://mvalle.com - @mvallebr