Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-11 Thread Flavio Pompermaier
I also have a similar scenario, where fundamentally I have to retrieve all
urls where a userid has been found.
So, in my schema, I designed the url as the (string) key with a (possibly huge)
list of attributes automatically mapped to strings.
For example:

Url1 (key):
 - language: en
 - content:userid1
 - content:userid1
 - content:userid1 (i.e. 3 times actually for user 1)
 - content:userid2
 - content:userid3
 - author:userid4

and so on and so forth.
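In schema.xml terms, this layout would look roughly like the sketch below (the field types and attributes here are my assumption, not the actual schema):

```xml
<!-- the URL acts as the unique key -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="language" type="string" indexed="true" stored="true"/>
<!-- multivalued: one entry per occurrence of a userid on the page -->
<field name="content" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="author" type="string" indexed="true" stored="true" multiValued="true"/>

<uniqueKey>id</uniqueKey>
```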
So, if I understood correctly, you're saying that this is a bad design? How
should I fix my schema, in your opinion, in that case?

Best,
Flavio


On Wed, Jul 10, 2013 at 11:53 PM, Jack Krupansky j...@basetechnology.com wrote:

 Simple answer: avoid a large number of values in a single document. There
 should only be a modest to moderate number of fields in a single document.

 Is the data relatively static, or subject to frequent updates? To update
 any field of a single document, even with atomic update, requires Solr to
 read and rewrite every field of the document. So, lots of smaller documents
 are best for a frequent update scenario.

 Multivalued fields are great for storing a relatively small list of
 values. You can add to the list easily, but under the hood, Solr must read
 and rewrite the full list as well as the full document. And, there is no
 way to address or synchronize individual elements of multivalued fields.

 Joins are great... if used in moderation. Heavy use of joins is not a
 great idea.

 -- Jack Krupansky

 -Original Message- From: Marcelo Elias Del Valle
 Sent: Wednesday, July 10, 2013 5:37 PM
 To: solr-user@lucene.apache.org
 Subject: amount of values in a multi value field - is denormalization
 always the best option?


 Hello,

   I have asked a question recently about Solr limitations and some about
 joins. It turns out that this question is about both at the same time.
   I am trying to figure out how to denormalize my data so I will need just one
 document in my index instead of performing a join. I figure one way of
 doing this is storing an entity as a multivalued field, instead of storing
 different fields.
Let me give an example. Consider the entities:

 User:
id: 1
    name: Joan of Arc
age: 27

 Webpage:
id: 1
    url: http://wiki.apache.org/solr/Join
category: Technical
user_id: 1

id: 2
url: http://stackoverflow.com
category: Technical
user_id: 1

   Instead of creating 1 document for the user, 1 for webpage 1 and 1 for
 webpage 2 (1 parent and 2 children), I could store the webpages in a user
 multivalued field, as follows:

 User:
id: 1
name: Joan of Arc
age: 27
    webpage1: [id: 1, url: http://wiki.apache.org/solr/Join, category: Technical]
    webpage2: [id: 2, url: http://stackoverflow.com, category: Technical]
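As JSON update documents, the two modelings would look roughly like this (the flattened `webpage_url` / `webpage_category` parallel-field names are only an illustration, not an established convention):

```python
import json

# Option (a): one document per entity, related at query time via user_id.
normalized = [
    {"id": "user-1", "name": "Joan of Arc", "age": 27},
    {"id": "page-1", "url": "http://wiki.apache.org/solr/Join",
     "category": "Technical", "user_id": "user-1"},
    {"id": "page-2", "url": "http://stackoverflow.com",
     "category": "Technical", "user_id": "user-1"},
]

# Option (b): one denormalized document; the webpages collapse into
# parallel multivalued fields (losing per-webpage addressability).
denormalized = {
    "id": "user-1", "name": "Joan of Arc", "age": 27,
    "webpage_url": ["http://wiki.apache.org/solr/Join",
                    "http://stackoverflow.com"],
    "webpage_category": ["Technical", "Technical"],
}

print(json.dumps(denormalized, indent=2))
```

The trade-off is visible in the data itself: option (b) avoids the join but can only grow, never address, its flattened lists.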

   It would probably perform better than the join, right? However, it made
 me think about Solr limitations again. What if I have 200 million webpages
 (200 million fields) per user? Or imagine a case where I could have 200
 million values in a field, as in the case where I need to index every HTML DOM
 element (div, a, etc.) of each web page the user visited.
   I mean, if I need to do the query and this is a business requirement no
 matter what, then although denormalizing could be better than using query-time
 joins, I wonder if distributing the data present in this single document
 across the cluster wouldn't give me better performance. And this is
 something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer for this question (at least
 not a known one), and I know I should create a POC to check how each
 performs... But do you think such a large number of values in a single
 document could make denormalization infeasible in an extreme case like
 this? Would you agree if I said denormalization is not always
 the right option?

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr



Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-11 Thread Marcelo Elias Del Valle
Hello Flavio,

    Out of curiosity, are you already using this in production? Would you share
your results / benchmarks with us? (Not sure if you have any.) I wonder
how it is performing for you.
    I was thinking of using a schema very similar to yours. The
thing is: each option has drawbacks; there is no good or bad schema, if
I understood things correctly. Even joins, which we should
avoid in a NoSQL technology like Solr, may be a good option in some
cases. I guess sometimes the only things that can answer some questions are
POCs and benchmarks. I am not a Solr expert, and there are several committers on
this list who can help you much better than I can, but the way I see it, you
should try your solution, see how it performs, and keep looking for
alternatives that perform better, if possible.
    As I said, I am not an expert, but I wouldn't call your model a bad
model that needs fixing. It's a possible model, and who knows, maybe another
model could perform better. It's like with an algorithm: we
should assume we can always do better...

Best regards,
Marcelo.



Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-11 Thread Flavio Pompermaier
Yeah, you're probably right... I have to test different configurations!
That is why I'd like to know the available solutions in advance. Fortunately I'm
still developing, so I'm still in a position to investigate the
solution.
Obviously I'll do some benchmarking on it, but I should know the
alternatives... so I asked the list!
I'm sure someone will give me some hints, or at least I hope so :)

Best,
Flavio




Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-11 Thread Jack Krupansky

Again, generally, if the number of values is relatively modest, you don't
need to discriminate (tell which one matched in a search), and you don't edit
the list, a multivalued field makes perfect sense; but if any of those
conditions is not met, then you need to represent the items as discrete
Solr documents.
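For the url/userid case above, "discrete documents" would mean something like one small document per observation instead of one giant list per URL (the field names and id scheme here are only illustrative):

```python
# One small document per (url, role, userid) observation; retrieving all
# urls for a user becomes a plain query on the "user" field, and each
# observation can be added or deleted without rewriting a huge document.
docs = [
    {"id": "Url1|content|userid1|1", "url": "Url1", "role": "content", "user": "userid1"},
    {"id": "Url1|content|userid1|2", "url": "Url1", "role": "content", "user": "userid1"},
    {"id": "Url1|author|userid4|1",  "url": "Url1", "role": "author",  "user": "userid4"},
]

# e.g. the equivalent of the query user:userid1
matches = [d["url"] for d in docs if d["user"] == "userid1"]
print(sorted(set(matches)))
```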

But, it does all depend on your particular data and particular requirements.

-- Jack Krupansky




Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Jack Krupansky
Simple answer: avoid a large number of values in a single document. There
should only be a modest to moderate number of fields in a single document.


Is the data relatively static, or subject to frequent updates? To update any 
field of a single document, even with atomic update, requires Solr to read 
and rewrite every field of the document. So, lots of smaller documents are 
best for a frequent update scenario.


Multivalued fields are great for storing a relatively small list of values. 
You can add to the list easily, but under the hood, Solr must read and 
rewrite the full list as well as the full document. And, there is no way to 
address or synchronize individual elements of multivalued fields.


Joins are great... if used in moderation. Heavy use of joins is not a great 
idea.
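A concrete illustration of the read-and-rewrite point: even an atomic update that merely appends one value to a multivalued field causes Solr to re-read and re-index the whole document. A minimal sketch of such a payload, using Flavio's field names (the core name and endpoint are assumptions):

```python
import json

# Atomic-update syntax: {"add": value} appends to a multivalued field.
# Under the hood, Solr fetches all stored fields of document "Url1",
# applies the change, and rewrites the entire document.
def append_value(doc_id, field, value):
    return json.dumps([{"id": doc_id, field: {"add": value}}])

payload = append_value("Url1", "content", "userid5")
print(payload)  # POST to /solr/<core>/update with Content-Type: application/json
```

The syntax is convenient, but it is not a cheap in-place edit; the cost grows with the size of the document being patched.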


-- Jack Krupansky




Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Marcelo Elias Del Valle
Jack,

 When you say "large number of values in a single document", you also
mean a block in a block join, right? Exactly the same thing, agree?
 In my case, I have just one insert and no updates. Even in this case,
do you think a large document or block would be a really bad idea? I am more
worried about the search time.

Best regards,
Marcelo.



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr


Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Roman Chyla
On Wed, Jul 10, 2013 at 5:37 PM, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:

 It would probably perform better than the join, right? However, it made
 me think about Solr limitations again. What if I have 200 million webpages
 (200 million fields) per user? Or imagine a case where I could have 200
 million values in a field, as in the case where I need to index every HTML DOM
 element (div, a, etc.) of each web page the user visited.
 I mean, if I need to do the query and this is a business requirement no
 matter what, then although denormalizing could be better than using query-time
 joins, I wonder if distributing the data present in this single document
 across the cluster wouldn't give me better performance. And this is
 something I won't get with block joins or multivalued fields...


Indeed, and when you think of it, there are only two (?) alternatives:

1. let your distributed search cluster have knowledge of the relations
2. denormalize & duplicate the data


 I guess there is probably no right answer for this question (at least
 not a known one), and I know I should create a POC to check how each
 perform... But do you think a so large number of values in a single
 document could make denormalization not possible in an extreme case like
 this? Would you share my thoughts if I said denormalization is not always
 the right option?


Aren't words of natural language (and whatever crap comes with them
in the fulltext) similar? You may not want to retrieve relations between
every word that you indexed, but you can still index millions of unique
tokens (well, 200 million seems too high). But if you do have
such a high number of unique values, you can think of indexing hash values
- searching for 'near-duplicates' might be acceptable too.
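The hashing idea could be sketched like this (the digest choice and truncation length are arbitrary here, not a recommendation):

```python
import hashlib

# Index a short, fixed-size digest of each value instead of the raw value;
# equal values map to the same token, so exact lookups keep working, at the
# cost of a small collision probability.
def value_token(value: str) -> str:
    return hashlib.sha1(value.encode("utf-8")).hexdigest()[:16]

print(value_token("userid1"))
```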

And so, with Lucene, only denormalization will get you anywhere close
to acceptable search speed. If you look at the code that executes a join
search, you will see that the values for the first-order search are harvested,
and then a second search (or lookup) is performed - so it is almost always
slower than a plain inverted-index lookup.
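That two-phase execution is what Solr's query-time join parser does; using the field names from Marcelo's example, a request like the following first harvests user_id values from webpage documents matching category:Technical, then looks those up against the id field (collection name assumed):

```python
from urllib.parse import urlencode

# Phase 1: match webpage docs on category:Technical and collect user_id.
# Phase 2: look those values up against the id field of the user docs.
params = urlencode({
    "q": "{!join from=user_id to=id}category:Technical",
    "fl": "id,name",
    "wt": "json",
})
print("/solr/collection/select?" + params)
```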

roman






Re: amount of values in a multi value field - is denormalization always the best option?

2013-07-10 Thread Jack Krupansky
Join is a query operation - it has nothing to do with the number of values 
(fields and multivalued fields) in a Solr/Lucene document.


Block insert isn't available yet anyway, so we don't have any clear 
assessments of its performance.


Generally, any kind of large block of data is not a great idea.

1. Break things down.
2. Keep things simple.
3. Join is not simple.
4. Only use non-simple features in careful moderation.

There is no reasonable shortcut to a robust data model. Shortcuts may
seem enticing in the short run, but they will eat you alive in the long run.


-- Jack Krupansky
