Re: How to define my data in schema.xml
Well, avoiding flattening the DB into a flat table sounds like a great plan. I found this solution: http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example
It imports through a join, without handling a flat table.

On Tue, Jun 18, 2013 at 5:53 PM, Jack Krupansky j...@basetechnology.com wrote:

> You can in fact have multiple collections in Solr and do a limited amount of joining, and Solr has multivalued fields as well, but none of those techniques should be used to avoid the process of flattening and denormalizing a relational data model. It is hard work, but yes, it is required to use Solr effectively.
>
> Again, start with the queries - what problem are you trying to solve? Nobody stores data just for the sake of storing it - how will the data be used?
>
> -- Jack Krupansky
Re: How to define my data in schema.xml
Thanks for your reply. I have tried the simplest approach and it works absolutely fantastically. Huge table, 0s to result.

Two problems remain, as I described earlier, and they are what I am trying to solve:

1. I created a flat table just for Solr. This requires maintenance and development. Can I run Solr over my regular tables? That would be my simplest approach: working over my relational tables.
2. When you query a flat table by school name, as I described, if the school has 300 students and 300 teachers, each with 300 studentHobbies or 300 teacherCourses, you get 8.1 billion rows (300*300*300*300). Although I am sure this will work great in Solr, searching for the school name will retrieve 8.1 billion rows.
3. Let's say all my searches are user-generated free-text searches over the name and comments columns.

Thanks.

On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty g...@mimirtech.com wrote:

> On 18 June 2013 01:10, Mysurf Mail stammail...@gmail.com wrote:
>> Thanks for your quick reply. Here are some notes:
>> 1. Consider that all tables in my example have two columns, Name and Description, which I would like to index and search.
>> 2. I have no reason to create a flat table other than for Solr, so I would like to see if I can avoid it.
>> 3. If I have a flat table in my example, then obviously it will hold a lot of rows for a single school. By searching for the exact school name I will likely receive a lot of rows. (My flat table has its own PK.)
>
> Yes, all of this is definitely the case, but in practice it does not matter. Solr can efficiently search through millions of rows. To start with, just try the simplest approach, and only complicate things as and when needed.
>
>> That is something I would like to avoid, and I thought I could avoid it by defining teachers and students as multivalued or something like that, and then teacherCourses and studentHobbies as 1:n respectively. This is quite similar to my real-life requirement, so I came here to get some tips as a Solr noob.
>
> You have still not described what searches you would want to do. Again, I would suggest starting with the most straightforward approach.
>
> Regards, Gora
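As a quick check of the row count quoted in this message, the arithmetic can be sketched as follows. The fan-out figures are the poster's own example numbers; the variable names are made up for illustration. The sketch contrasts the cross product produced by one join across both hierarchies with the number of rows the normalized tables would hold for the same school.

```python
# Fan-out figures taken from the example in the message above.
students_per_school = 300
teachers_per_school = 300
hobbies_per_student = 300
courses_per_teacher = 300

# A single join across both hierarchies yields the full cross product.
cross_product_rows = (
    students_per_school * hobbies_per_student
    * teachers_per_school * courses_per_teacher
)

# Rows held by the normalized tables for that one school:
# 1 school + students + teachers + hobby rows + course rows.
hobby_rows = students_per_school * hobbies_per_student
course_rows = teachers_per_school * courses_per_teacher
normalized_rows = (
    1 + students_per_school + teachers_per_school + hobby_rows + course_rows
)

print(f"{cross_product_rows:,}")  # 8,100,000,000
print(f"{normalized_rows:,}")     # 180,601
```

The gap between the two numbers comes entirely from joining two independent 1:n:n branches into one row set; the underlying data is about 180 thousand rows.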
Re: How to define my data in schema.xml
It sounds like you still have a lot of work to do on your data model.

No matter how you slice it, 8 billion rows/fields/whatever is still way too much for any engine to search on a single server. If you have 8 billion of anything, a heavily sharded SolrCloud cluster is probably warranted. Don't plan ahead to put more than 100 million rows on a single node; plan on a proof-of-concept implementation to determine that number.

When we in Solr land say flattened or denormalized, we mean in an intelligent, smart, thoughtful sense, not a mindless, mechanical flattening. It is an opportunity for you to reconsider your data models, both old and new.

Maybe data modeling is beyond your skill set. If so, have a chat with your boss and ask for some assistance, training, whatever.

Actually, I am suspicious of your 8 billion number - change each of those 300's to realistic, average numbers. Each teacher teaches 300 courses? Right. Each student has 300 hobbies? If you say so, but...

Don't worry about schema.xml until you get your data model under control.

For an initial focus, try envisioning the use cases for user queries. That will guide you in thinking about how the data would need to be organized to satisfy those user queries.

-- Jack Krupansky

-----Original Message-----
From: Mysurf Mail
Sent: Tuesday, June 18, 2013 2:20 AM
To: solr-user@lucene.apache.org
Subject: Re: How to define my data in schema.xml

Thanks for your reply. I have tried the simplest approach and it works absolutely fantastically. Huge table, 0s to result.

Two problems remain, as I described earlier, and they are what I am trying to solve:

1. I created a flat table just for Solr. This requires maintenance and development. Can I run Solr over my regular tables? That would be my simplest approach: working over my relational tables.
2. When you query a flat table by school name, as I described, if the school has 300 students and 300 teachers, each with 300 studentHobbies or 300 teacherCourses, you get 8.1 billion rows (300*300*300*300). Although I am sure this will work great in Solr, searching for the school name will retrieve 8.1 billion rows.
3. Let's say all my searches are user-generated free-text searches over the name and comments columns.

Thanks.
Re: How to define my data in schema.xml
Hi Jack,

Thanks for your kind comment. I am truly at the beginning of modeling my schema over an existing, working DB. I used the school-teachers-students DB as an example scenario. (a. I wrote that as a disclaimer in my first post. b. I really do not know anyone who has 300 hobbies either.) In real life my DB is obviously much different; I just used this as an example of the potential pitfalls that will occur if I apply my old DB data-modeling notions. Obviously, the old relational modeling idioms do not apply here.

Now, my question referred to the fact that I would really like to avoid a flat table/join/view for the reasons listed above. So, my scenario is answering a plain, user-generated text search over an MSSQL DB that contains a few 1:n relations (and a few 1:n:n relationships).

So I come here for tips. Should I use one combined index (treating it as a NoSQL source), separate indices, or something else? Are there any other ways to define relational data?

Thanks.

On Tue, Jun 18, 2013 at 4:30 PM, Jack Krupansky j...@basetechnology.com wrote:

> It sounds like you still have a lot of work to do on your data model.
>
> No matter how you slice it, 8 billion rows/fields/whatever is still way too much for any engine to search on a single server. If you have 8 billion of anything, a heavily sharded SolrCloud cluster is probably warranted. Don't plan ahead to put more than 100 million rows on a single node; plan on a proof-of-concept implementation to determine that number.
>
> When we in Solr land say flattened or denormalized, we mean in an intelligent, smart, thoughtful sense, not a mindless, mechanical flattening. It is an opportunity for you to reconsider your data models, both old and new.
>
> Maybe data modeling is beyond your skill set. If so, have a chat with your boss and ask for some assistance, training, whatever.
>
> Actually, I am suspicious of your 8 billion number - change each of those 300's to realistic, average numbers. Each teacher teaches 300 courses? Right. Each student has 300 hobbies? If you say so, but...
>
> Don't worry about schema.xml until you get your data model under control.
>
> For an initial focus, try envisioning the use cases for user queries. That will guide you in thinking about how the data would need to be organized to satisfy those user queries.
>
> -- Jack Krupansky
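To make the "one combined index" option raised in this message more concrete, here is a minimal sketch in plain Python. It is not Solr code and not something proposed in the thread. It assumes each table row becomes its own document, tagged with a hypothetical `doc_type` field and a `school_id` pointing back to the parent, so that a free-text hit on any entity can be mapped back to its school. All field names and sample rows are invented, and the substring match is only a stand-in for a real text query.

```python
# Hypothetical source rows; names and columns are invented for illustration.
schools = [{"id": 1, "name": "Springfield High", "description": "Public school"}]
students = [{"id": 10, "school_id": 1, "name": "Ann", "description": "Plays chess"}]
teachers = [{"id": 20, "school_id": 1, "name": "Ms. Clark", "description": "Teaches physics"}]


def to_documents(rows, doc_type):
    """Turn the rows of one table into index documents with a type tag."""
    documents = []
    for row in rows:
        documents.append({
            # Prefix the key so ids from different tables cannot collide.
            "id": f"{doc_type}-{row['id']}",
            "doc_type": doc_type,
            # A school is its own parent; child rows point back to theirs.
            "school_id": row.get("school_id", row["id"]),
            "name": row["name"],
            "description": row["description"],
        })
    return documents


combined_index = (
    to_documents(schools, "school")
    + to_documents(students, "student")
    + to_documents(teachers, "teacher")
)


def schools_matching(term):
    """Naive stand-in for a free-text query: ids of schools with any hit."""
    term = term.lower()
    return sorted({
        doc["school_id"]
        for doc in combined_index
        if term in doc["name"].lower() or term in doc["description"].lower()
    })


print(schools_matching("chess"))  # [1]
```

With this layout the index size grows with the sum of the table sizes rather than their product. The cost is a second step to turn matching child documents into a list of schools.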
Re: How to define my data in schema.xml
You can in fact have multiple collections in Solr and do a limited amount of joining, and Solr has multivalued fields as well, but none of those techniques should be used to avoid the process of flattening and denormalizing a relational data model. It is hard work, but yes, it is required to use Solr effectively.

Again, start with the queries - what problem are you trying to solve? Nobody stores data just for the sake of storing it - how will the data be used?

-- Jack Krupansky

-----Original Message-----
From: Mysurf Mail
Sent: Tuesday, June 18, 2013 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: How to define my data in schema.xml

Hi Jack,

Thanks for your kind comment. I am truly at the beginning of modeling my schema over an existing, working DB. I used the school-teachers-students DB as an example scenario. (a. I wrote that as a disclaimer in my first post. b. I really do not know anyone who has 300 hobbies either.) In real life my DB is obviously much different; I just used this as an example of the potential pitfalls that will occur if I apply my old DB data-modeling notions. Obviously, the old relational modeling idioms do not apply here.

Now, my question referred to the fact that I would really like to avoid a flat table/join/view for the reasons listed above. So, my scenario is answering a plain, user-generated text search over an MSSQL DB that contains a few 1:n relations (and a few 1:n:n relationships).

So I come here for tips. Should I use one combined index (treating it as a NoSQL source), separate indices, or something else? Are there any other ways to define relational data?

Thanks.
How to define my data in schema.xml
Hi,

I have created a flat table from my DB and defined a Solr core on it. It works excellently so far. My problem is that my table has two hierarchies, so when flattened it is too big.

Let's consider the following example scenario. My tables are:

School
Students (1:n with School)
Teachers (1:n with School)

Now, each school has many students and teachers, but each student/teacher has another multivalued field, i.e. the following tables:

studentHobbies - 1:n with Students
teacherCourses - 1:n with Teachers

My main entity is School, and that is what I want to get in the result. Flattening does not help me much and is very expensive. Can you direct me to how I define 1:n relationships (and 1:n:n) in data-config.xml?

Thanks.
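The shape being asked about here, one result per school with the 1:n and 1:n:n relations attached as multivalued fields, can be illustrated with a small sketch in plain Python. It does not use data-config.xml or any Solr API. The sample rows, the key columns (`id`, `school_id`, `student_id`, `teacher_id`), and the output field names are all assumptions invented for the example.

```python
from collections import defaultdict

# Hypothetical rows, as they might come back from the relational tables.
schools = [{"id": 1, "name": "Springfield High"}]
students = [
    {"id": 10, "school_id": 1, "name": "Ann"},
    {"id": 11, "school_id": 1, "name": "Bob"},
]
student_hobbies = [
    {"student_id": 10, "name": "chess"},
    {"student_id": 10, "name": "rowing"},
    {"student_id": 11, "name": "chess"},
]
teachers = [{"id": 20, "school_id": 1, "name": "Ms. Clark"}]
teacher_courses = [
    {"teacher_id": 20, "name": "Algebra"},
    {"teacher_id": 20, "name": "Physics"},
]


def group_by(rows, key):
    """Index child rows by their foreign key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


def build_school_documents():
    """Return one dict per school; every 1:n relation becomes a list."""
    students_by_school = group_by(students, "school_id")
    teachers_by_school = group_by(teachers, "school_id")
    hobbies_by_student = group_by(student_hobbies, "student_id")
    courses_by_teacher = group_by(teacher_courses, "teacher_id")

    documents = []
    for school in schools:
        school_students = students_by_school[school["id"]]
        school_teachers = teachers_by_school[school["id"]]
        documents.append({
            "id": school["id"],
            "school_name": school["name"],
            "student_names": [s["name"] for s in school_students],
            "teacher_names": [t["name"] for t in school_teachers],
            # 1:n:n relations are collapsed to distinct values per school.
            "student_hobbies": sorted({
                h["name"]
                for s in school_students
                for h in hobbies_by_student[s["id"]]
            }),
            "teacher_courses": sorted({
                c["name"]
                for t in school_teachers
                for c in courses_by_teacher[t["id"]]
            }),
        })
    return documents


docs = build_school_documents()
print(len(docs))                   # 1
print(docs[0]["student_hobbies"])  # ['chess', 'rowing']
```

Note the trade-off in this sketch: collapsing to one document per school keeps the result set small, but it no longer records which student has which hobby. Whether that matters depends on the queries that have to be answered.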
Re: How to define my data in schema.xml
On 17 June 2013 21:39, Mysurf Mail stammail...@gmail.com wrote:
> Hi,
> I have created a flat table from my DB and defined a Solr core on it. It works excellently so far. My problem is that my table has two hierarchies, so when flattened it is too big.

What do you mean by too big? Have you actually tried indexing the data into Solr, and does the performance not meet your needs, or are you guessing from the size of the tables?

> Let's consider the following example scenario. My tables are:
> School
> Students (1:n with School)
> Teachers (1:n with School)
[...]

Um, all of this crucially depends on what your 'n' is. Plus, you need to describe your use case in much more detail. At the moment, you are asking us to guess at what you are trying to do, which is inefficient, and unlikely to solve your problem.

Regards,
Gora
Re: How to define my data in schema.xml
Thanks for your quick reply. Here are some notes:

1. Consider that all tables in my example have two columns, Name and Description, which I would like to index and search.
2. I have no reason to create a flat table other than for Solr, so I would like to see if I can avoid it.
3. If I have a flat table in my example, then obviously it will hold a lot of rows for a single school. By searching for the exact school name I will likely receive a lot of rows. (My flat table has its own PK.)

That is something I would like to avoid, and I thought I could avoid it by defining teachers and students as multivalued or something like that, and then teacherCourses and studentHobbies as 1:n respectively. This is quite similar to my real-life requirement, so I came here to get some tips as a Solr noob.

On Mon, Jun 17, 2013 at 9:08 PM, Gora Mohanty g...@mimirtech.com wrote:

> On 17 June 2013 21:39, Mysurf Mail stammail...@gmail.com wrote:
>> Hi,
>> I have created a flat table from my DB and defined a Solr core on it. It works excellently so far. My problem is that my table has two hierarchies, so when flattened it is too big.
>
> What do you mean by too big? Have you actually tried indexing the data into Solr, and does the performance not meet your needs, or are you guessing from the size of the tables?
>
>> Let's consider the following example scenario. My tables are:
>> School
>> Students (1:n with School)
>> Teachers (1:n with School)
> [...]
>
> Um, all of this crucially depends on what your 'n' is. Plus, you need to describe your use case in much more detail. At the moment, you are asking us to guess at what you are trying to do, which is inefficient, and unlikely to solve your problem.
>
> Regards,
> Gora
Re: How to define my data in schema.xml
On 18 June 2013 01:10, Mysurf Mail stammail...@gmail.com wrote:
> Thanks for your quick reply. Here are some notes:
> 1. Consider that all tables in my example have two columns, Name and Description, which I would like to index and search.
> 2. I have no reason to create a flat table other than for Solr, so I would like to see if I can avoid it.
> 3. If I have a flat table in my example, then obviously it will hold a lot of rows for a single school. By searching for the exact school name I will likely receive a lot of rows. (My flat table has its own PK.)

Yes, all of this is definitely the case, but in practice it does not matter. Solr can efficiently search through millions of rows. To start with, just try the simplest approach, and only complicate things as and when needed.

> That is something I would like to avoid, and I thought I could avoid it by defining teachers and students as multivalued or something like that, and then teacherCourses and studentHobbies as 1:n respectively. This is quite similar to my real-life requirement, so I came here to get some tips as a Solr noob.

You have still not described what searches you would want to do. Again, I would suggest starting with the most straightforward approach.

Regards,
Gora