Re: How to define my data in schema.xml

2013-06-19 Thread Mysurf Mail
Well,
Avoiding flattening the DB into a flat table sounds like a great plan.
I found this solution:
http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example

It imports through a join, so there is no flat table to handle.



On Tue, Jun 18, 2013 at 5:53 PM, Jack Krupansky j...@basetechnology.com wrote:

 You can in fact have multiple collections in Solr and do a limited amount
 of joining, and Solr has multivalued fields as well, but none of those
 techniques should be used to avoid the process of flattening and
 denormalizing a relational data model. It is hard work, but yes, it is
 required to use Solr effectively.

 Again, start with the queries - what problem are you trying to solve?
 Nobody stores data just for the sake of storing it - how will the data be
 used?


 -- Jack Krupansky


Re: How to define my data in schema.xml

2013-06-18 Thread Mysurf Mail
Thanks for your reply.
I have tried the simplest approach and it works absolutely fantastically.
Huge table - 0s to result.

There are still the problems I described earlier, which are what I am trying to solve:
1. I create a flat table just for Solr. This requires maintenance and
development. Can I run Solr over my regular tables?
This would be my simplest approach: working over my relational tables.
2. When you query a flat table by school name, as I described, if the
school has 300 students and 300 teachers, with 300 teacherCourses per
teacher and 300 studentHobbies per student,
you get 8.1 billion rows (300*300*300*300). I am sure this will work
great on Solr, but searching for the school name will retrieve 8.1 billion rows.
3. Let's say all my searches are user-generated free-text searches over
the name and comments columns.
Thanks.
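
For reference, the 8.1 billion figure is what a mechanical flat join yields when two independent hierarchies (students with hobbies, teachers with courses) are crossed under one school. Below is a minimal sketch of that arithmetic, checked against a tiny in-memory SQLite join. The table names and the helpers `flat_row_count` and `joined_rows_in_sqlite` are made up for illustration; they are not the poster's actual database.

```python
import sqlite3


def flat_row_count(per_level: int) -> int:
    """Rows per school when two independent two-level hierarchies are joined flat."""
    student_rows = per_level * per_level  # students x hobbies per student
    teacher_rows = per_level * per_level  # teachers x courses per teacher
    return student_rows * teacher_rows    # the two branches multiply


def joined_rows_in_sqlite(per_level: int) -> int:
    """Build a one-school toy database and count the rows of the flat join."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE student(id INTEGER PRIMARY KEY);
        CREATE TABLE teacher(id INTEGER PRIMARY KEY);
        CREATE TABLE hobby(student_id INTEGER, name TEXT);
        CREATE TABLE course(teacher_id INTEGER, name TEXT);
        """
    )
    for i in range(per_level):
        con.execute("INSERT INTO student VALUES (?)", (i,))
        con.execute("INSERT INTO teacher VALUES (?)", (i,))
        for j in range(per_level):
            con.execute("INSERT INTO hobby VALUES (?, ?)", (i, f"hobby{j}"))
            con.execute("INSERT INTO course VALUES (?, ?)", (i, f"course{j}"))
    (count,) = con.execute(
        """
        SELECT COUNT(*)
        FROM student s
        JOIN hobby h ON h.student_id = s.id
        CROSS JOIN teacher t
        JOIN course c ON c.teacher_id = t.id
        """
    ).fetchone()
    con.close()
    return count


if __name__ == "__main__":
    print(flat_row_count(300))       # the figure quoted in the message
    print(joined_rows_in_sqlite(3))  # same shape at toy scale
```

The blow-up comes from crossing the student branch with the teacher branch. Each branch on its own is only 300 * 300 = 90,000 rows, which is why the replies keep asking what the real numbers and the real queries are.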


On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty g...@mimirtech.com wrote:

 On 18 June 2013 01:10, Mysurf Mail stammail...@gmail.com wrote:
  Thanks for your quick reply. Here are some notes:
  
  1. Consider that all tables in my example have two columns, Name and
  Description, which I would like to index and search.
  2. I have no other reason to create a flat table other than for Solr. So I
  would like to see if I can avoid it.
  3. If in my example I have a flat table, then obviously it will hold
  a lot of rows for a single school.
  By searching the exact school name I will likely receive a lot of rows.
  (My flat table has its own PK.)

 Yes, all of this is definitely the case, but in practice
 it does not matter. Solr can efficiently search through
 millions of rows. To start with, just try the simplest
 approach, and only complicate things as and when
 needed.

  That is something I would like to avoid, and I thought I could avoid this
  by defining teachers and students as multivalued or something like that,
  and then teacherCourses and studentHobbies as 1:n respectively.
  This is quite similar to my real-life requirement, so I came here to get
  some tips as a Solr noob.

 You have still not described what are the searches that
 you would want to do. Again, I would suggest starting
 with the most straightforward approach.

 Regards,
 Gora



Re: How to define my data in schema.xml

2013-06-18 Thread Jack Krupansky
It sounds like you still have a lot of work to do on your data model. No 
matter how you slice it, 8 billion rows/fields/whatever is still way too 
much for any engine to search on a single server. If you have 8 billion of 
anything, a heavily sharded SolrCloud cluster is probably warranted. Don't 
plan ahead to put more than 100 million rows on a single node; plan on a 
proof of concept implementation to determine that number.


When we in Solr land say flattened or denormalized, we mean in an 
intelligent, smart, thoughtful sense, not a mindless, mechanical 
flattening. It is an opportunity for you to reconsider your data models, 
both old and new.


Maybe data modeling is beyond your skill set. If so, have a chat with your 
boss and ask for some assistance, training, whatever.


Actually, I am suspicious of your 8 billion number - change each of those 
300's to realistic, average numbers. Each teacher teaches 300 courses? 
Right. Each Student has 300 hobbies? If you say so, but...


Don't worry about schema.xml until you get your data model under control.

For an initial focus, try envisioning the use cases for user queries. That 
will guide you in thinking about how the data would need to be organized to 
satisfy those user queries.


-- Jack Krupansky
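
As a rough illustration of the sizing advice in this message: if 100 million documents per node is taken as a planning ceiling, the shard count follows by ceiling division. This is only a sketch. The function name `shards_needed` is hypothetical, and the 100 million default is the rule of thumb quoted above, which the message itself says should be replaced by a number measured in a proof of concept.

```python
def shards_needed(total_docs: int, docs_per_node: int = 100_000_000) -> int:
    """Smallest number of nodes so that no node exceeds docs_per_node documents."""
    if total_docs < 0 or docs_per_node <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_node must be > 0")
    # Integer ceiling division avoids floating-point rounding on large counts.
    return (total_docs + docs_per_node - 1) // docs_per_node


if __name__ == "__main__":
    print(shards_needed(8_100_000_000))  # the 8.1 billion row scenario
    print(shards_needed(90_000))         # a far smaller, denormalized index
```

With the 8.1 billion figure this gives 81 shards, which is the kind of heavily sharded cluster the message warns about; a data model that avoids the cross product fits on a single node.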

-Original Message- 
From: Mysurf Mail

Sent: Tuesday, June 18, 2013 2:20 AM
To: solr-user@lucene.apache.org
Subject: Re: How to define my data in schema.xml

Thanks for your reply.
I have tried the simplest approach and it works absolutely fantastically.
Huge table - 0s to result.

There are still the problems I described earlier, which are what I am trying to solve:
1. I create a flat table just for Solr. This requires maintenance and
development. Can I run Solr over my regular tables?
This would be my simplest approach: working over my relational tables.
2. When you query a flat table by school name, as I described, if the
school has 300 students and 300 teachers, with 300 teacherCourses per
teacher and 300 studentHobbies per student,
you get 8.1 billion rows (300*300*300*300). I am sure this will work
great on Solr, but searching for the school name will retrieve 8.1 billion rows.
3. Let's say all my searches are user-generated free-text searches over
the name and comments columns.
Thanks.







Re: How to define my data in schema.xml

2013-06-18 Thread Mysurf Mail
Hi Jack,
Thanks for your kind comment.

I am truly at the beginning of modeling my schema over an existing
working DB.
I used the school-teachers-students DB as an example scenario.
(a. I wrote that as a disclaimer in my first post. b. I really do not
know anyone who has 300 hobbies either.)

In real life my DB is obviously much different.
I just used this as an example of the potential pitfalls that will occur if I
use my old DB data modeling notions.
Obviously, the old relational modeling idioms do not apply here.

Now, my question was referring to the fact that I would really like to
avoid a flat table/join/view for the reasons listed above.
So, my scenario is answering plain user-generated text searches over an
MSSQL DB that contains a few 1:n relationships (and a few 1:n:n relationships).

So, I come here for tips. Should I use one combined index (treating it as a
NoSQL source), separate indices, or something else? Are there any other ways
to define relational data?
Thanks.



On Tue, Jun 18, 2013 at 4:30 PM, Jack Krupansky j...@basetechnology.com wrote:

 It sounds like you still have a lot of work to do on your data model. No
 matter how you slice it, 8 billion rows/fields/whatever is still way too
 much for any engine to search on a single server. If you have 8 billion of
 anything, a heavily sharded SolrCloud cluster is probably warranted. Don't
 plan ahead to put more than 100 million rows on a single node; plan on a
 proof of concept implementation to determine that number.

 When we in Solr land say flattened or denormalized, we mean in an
 intelligent, smart, thoughtful sense, not a mindless, mechanical
 flattening. It is an opportunity for you to reconsider your data models,
 both old and new.

 Maybe data modeling is beyond your skill set. If so, have a chat with your
 boss and ask for some assistance, training, whatever.

 Actually, I am suspicious of your 8 billion number - change each of those
 300's to realistic, average numbers. Each teacher teaches 300 courses?
 Right. Each Student has 300 hobbies? If you say so, but...

 Don't worry about schema.xml until you get your data model under control.

 For an initial focus, try envisioning the use cases for user queries. That
 will guide you in thinking about how the data would need to be organized to
 satisfy those user queries.

 -- Jack Krupansky






Re: How to define my data in schema.xml

2013-06-18 Thread Jack Krupansky
You can in fact have multiple collections in Solr and do a limited amount of 
joining, and Solr has multivalued fields as well, but none of those 
techniques should be used to avoid the process of flattening and 
denormalizing a relational data model. It is hard work, but yes, it is 
required to use Solr effectively.


Again, start with the queries - what problem are you trying to solve? Nobody 
stores data just for the sake of storing it - how will the data be used?


-- Jack Krupansky

-Original Message- 
From: Mysurf Mail

Sent: Tuesday, June 18, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: How to define my data in schema.xml

Hi Jack,
Thanks for your kind comment.

I am truly at the beginning of modeling my schema over an existing
working DB.
I used the school-teachers-students DB as an example scenario.
(a. I wrote that as a disclaimer in my first post. b. I really do not
know anyone who has 300 hobbies either.)

In real life my DB is obviously much different.
I just used this as an example of the potential pitfalls that will occur if I
use my old DB data modeling notions.
Obviously, the old relational modeling idioms do not apply here.

Now, my question was referring to the fact that I would really like to
avoid a flat table/join/view for the reasons listed above.
So, my scenario is answering plain user-generated text searches over an
MSSQL DB that contains a few 1:n relationships (and a few 1:n:n relationships).

So, I come here for tips. Should I use one combined index (treating it as a
NoSQL source), separate indices, or something else? Are there any other ways
to define relational data?
Thanks.




How to define my data in schema.xml

2013-06-17 Thread Mysurf Mail
Hi,
I have created a flat table from my DB and defined a Solr core on it.
It works excellently so far.

My problem is that my table has two hierarchies, so when flattened it is too
big.
Let's consider the following example scenario.

My tables are:

School
Students (1:n with School)
Teachers (1:n with School)

Now, each school has many students and teachers, but each student/teacher
has another multivalued field, i.e. the following tables:

studentHobbies - 1:n with Students
teacherCourses - 1:n with Teachers

My main entity is School, and that is what I want to get in the result.
Flattening does not help me much and is very expensive.

Can you direct me to how I define 1:n relationships (and 1:n:n)
in data-config.xml?
Thanks.
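
One way to picture "School as the main entity" is a single document per school in which the child rows are folded into list-valued (multivalued) fields. This is roughly the shape that nested entities in a DataImportHandler configuration produce. The sketch below uses an in-memory SQLite database as a stand-in for the real source; every table, column, and field name in it is hypothetical, and it only illustrates the document shape, not an actual Solr import.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Create a tiny in-memory database with hypothetical tables and rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE school  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE student (id INTEGER PRIMARY KEY, school_id INTEGER, name TEXT);
        CREATE TABLE teacher (id INTEGER PRIMARY KEY, school_id INTEGER, name TEXT);
        CREATE TABLE student_hobby  (student_id INTEGER, name TEXT);
        CREATE TABLE teacher_course (teacher_id INTEGER, name TEXT);

        INSERT INTO school  VALUES (1, 'Hilltop School');
        INSERT INTO student VALUES (10, 1, 'Ann'), (11, 1, 'Bob');
        INSERT INTO teacher VALUES (20, 1, 'Ms. Lee');
        INSERT INTO student_hobby  VALUES (10, 'chess'), (11, 'rowing');
        INSERT INTO teacher_course VALUES (20, 'algebra'), (20, 'physics');
        """
    )
    return con


def column(con, sql, params):
    """Run a query and return its first column as a list."""
    return [row[0] for row in con.execute(sql, params).fetchall()]


def build_school_docs(con):
    """Return one dict per school; 1:n and 1:n:n child rows become lists."""
    docs = []
    schools = con.execute("SELECT id, name FROM school ORDER BY id").fetchall()
    for school_id, school_name in schools:
        key = (school_id,)
        docs.append({
            "id": school_id,
            "school_name": school_name,
            "student_name": column(
                con, "SELECT name FROM student WHERE school_id = ? ORDER BY id", key),
            "teacher_name": column(
                con, "SELECT name FROM teacher WHERE school_id = ? ORDER BY id", key),
            "student_hobby": column(
                con,
                "SELECT h.name FROM student_hobby h "
                "JOIN student s ON s.id = h.student_id "
                "WHERE s.school_id = ? ORDER BY s.id, h.name", key),
            "teacher_course": column(
                con,
                "SELECT c.name FROM teacher_course c "
                "JOIN teacher t ON t.id = c.teacher_id "
                "WHERE t.school_id = ? ORDER BY t.id, c.name", key),
        })
    return docs


if __name__ == "__main__":
    for doc in build_school_docs(demo_connection()):
        print(doc)
```

With this shape a search on the school name returns one hit per school rather than one per joined row, and the list-valued fields would be declared with multiValued="true" in schema.xml. The trade-off is that the lists are independent of each other: the document no longer records which student has which hobby, so whether this is acceptable depends on the queries that need to be answered.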


Re: How to define my data in schema.xml

2013-06-17 Thread Gora Mohanty
On 17 June 2013 21:39, Mysurf Mail stammail...@gmail.com wrote:
 Hi,
 I have created a flat table from my DB and defined a Solr core on it.
 It works excellently so far.

 My problem is that my table has two hierarchies, so when flattened it is too
 big.

What do you mean by too big? Have you actually tried
indexing the data into Solr, and does the performance
not meet your needs, or are you guessing from the size
of the tables?

 Let's consider the following example scenario.

 My tables are:

 School
 Students (1:n with School)
 Teachers (1:n with School)
[...]

Um, all of this crucially depends on what your 'n' is.
Plus, you need to describe your use case in much
more detail. At the moment, you are asking us to
guess at what you are trying to do, which is inefficient,
and unlikely to solve your problem.

Regards,
Gora


Re: How to define my data in schema.xml

2013-06-17 Thread Mysurf Mail
Thanks for your quick reply. Here are some notes:

1. Consider that all tables in my example have two columns, Name and
Description, which I would like to index and search.
2. I have no other reason to create a flat table other than for Solr. So I
would like to see if I can avoid it.
3. If in my example I have a flat table, then obviously it will hold a
lot of rows for a single school.
By searching the exact school name I will likely receive a lot of rows.
(My flat table has its own PK.)
That is something I would like to avoid, and I thought I could avoid this
by defining teachers and students as multivalued or something like that,
and then teacherCourses and studentHobbies as 1:n respectively.
This is quite similar to my real-life requirement, so I came here to get
some tips as a Solr noob.


On Mon, Jun 17, 2013 at 9:08 PM, Gora Mohanty g...@mimirtech.com wrote:

 On 17 June 2013 21:39, Mysurf Mail stammail...@gmail.com wrote:
  Hi,
  I have created a flat table from my DB and defined a Solr core on it.
  It works excellently so far.
  
  My problem is that my table has two hierarchies, so when flattened it is
  too big.

 What do you mean by too big? Have you actually tried
 indexing the data into Solr, and does the performance
 not meet your needs, or are you guessing from the size
 of the tables?

  Let's consider the following example scenario.
  
  My tables are:
  
  School
  Students (1:n with School)
  Teachers (1:n with School)
 [...]

 Um, all of this crucially depends on what your 'n' is.
 Plus, you need to describe your use case in much
 more detail. At the moment, you are asking us to
 guess at what you are trying to do, which is inefficient,
 and unlikely to solve your problem.

 Regards,
 Gora



Re: How to define my data in schema.xml

2013-06-17 Thread Gora Mohanty
On 18 June 2013 01:10, Mysurf Mail stammail...@gmail.com wrote:
 Thanks for your quick reply. Here are some notes:

 1. Consider that all tables in my example have two columns, Name and
 Description, which I would like to index and search.
 2. I have no other reason to create a flat table other than for Solr. So I
 would like to see if I can avoid it.
 3. If in my example I have a flat table, then obviously it will hold a
 lot of rows for a single school.
 By searching the exact school name I will likely receive a lot of rows.
 (My flat table has its own PK.)

Yes, all of this is definitely the case, but in practice
it does not matter. Solr can efficiently search through
millions of rows. To start with, just try the simplest
approach, and only complicate things as and when
needed.

 That is something I would like to avoid, and I thought I could avoid this
 by defining teachers and students as multivalued or something like that,
 and then teacherCourses and studentHobbies as 1:n respectively.
 This is quite similar to my real-life requirement, so I came here to get
 some tips as a Solr noob.

You have still not described what are the searches that
you would want to do. Again, I would suggest starting
with the most straightforward approach.

Regards,
Gora