Re: Indexing and searching of sharded/ partitioned databases and tables
Thanks guys. Now I can easily search thru 10TB of my personal photos, videos, music and other stuff :) At some point I had split them into multiple db and tables and inserts to a single db/ table were taking too much time once the index grew beyond 1gig. I was storing all the possible metadata about the media. I used two hex characters for naming tables/dbs and ended up with 256 db, each with 256 tables :D . Don't ask me why I had done it this way. Let's just say I was exploring sharding some years ago and got too excited and did that :D. Alas, never touched it again to finish the search portion till now when I really wanted to find a particular photo :) The pk is unique across all the tables so no issues there. I think I should be able to run it off a single server at my home. Thanks and Best Regards, Jayant On Wed, Oct 7, 2009 at 4:52 AM, Shalin Shekhar Mangar wrote: > On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore > wrote: > >> >> You can write an automated program which will change the DB conf details in >> that xml and fire the full import command. You can use >> http://localhost:8983/solr/dataimport url to check the status of the data >> import. >> >> > Also note that full-import deletes all existing documents. So if you write > such a program which changes DB conf details, make sure you invoke the > "import" command (new in Solr 1.4) to avoid deleting the other documents. > > -- > Regards, > Shalin Shekhar Mangar. > -- www.jkg.in | http://www.jkg.in/contact-me/ Jayant Kr. Gandhi
Re: Indexing and searching of sharded/ partitioned databases and tables
On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore wrote: > > You can write an automated program which will change the DB conf details in > that xml and fire the full import command. You can use > http://localhost:8983/solr/dataimport url to check the status of the data > import. > > Also note that full-import deletes all existing documents. So if you write such a program which changes DB conf details, make sure you invoke the "import" command (new in Solr 1.4) to avoid deleting the other documents. -- Regards, Shalin Shekhar Mangar.
Re: Indexing and searching of sharded/ partitioned databases and tables
On Wed, Oct 7, 2009 at 5:09 PM, Sandeep Tagore wrote: > > Hi Jayant, > You can use Solr to achieve your objective. > The data-config.xml which you posted is incomplete. > > Sandeep, the data-config that Jayant posted is not incomplete. The declaration is not necessary if the name of the column in the database and the field name in schema.xml is the same. > I would like to suggest you a way to index the full data. > Try to index a database at a time. Sample xml conf. > > url="jdbc:mysql://localhost/Db1" user="user-name" password="password" /> > > > > > > > > > > > > > > > > > > > > You can write an automated program which will change the DB conf details in > that xml and fire the full import command. You can use > http://localhost:8983/solr/dataimport url to check the status of the data > import. > > You could do that but I don't think it is required. If you do want to do this, it is possible to post the data-config.xml to /dataimport (this is how the dataimport.jsp works) > But be careful while declaring the field. Make sure that you > are > not overwriting the records. > Yes, good point. That is a typical problem with sharded databases with auto-increment primary key. If you do not have unique keys, you can concatenate the shard name with the value of the primary key. -- Regards, Shalin Shekhar Mangar.
Re: Indexing and searching of sharded/ partitioned databases and tables
Hi Jayant, You can use Solr to achieve your objective. The data-config.xml which you posted is incomplete. I would like to suggest you a way to index the full data. Try to index a database at a time. Sample xml conf. You can write an automated program which will change the DB conf details in that xml and fire the full import command. You can use http://localhost:8983/solr/dataimport url to check the status of the data import. But be careful while declaring the field. Make sure that you are not overwriting the records. And if you are working on large data sets, you can use Solr Sharding concept. Let us know if you have any issues. Regards, Sandeep Tagore -- View this message in context: http://www.nabble.com/Indexing-and-searching-of-sharded--partitioned-databases-and-tables-tp25782544p25783916.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing and searching of sharded/ partitioned databases and tables
Comments inline: On Wed, Oct 7, 2009 at 2:01 PM, Jayant Kumar Gandhi wrote: > > Lets say I have 3 mysql databases each with 3 tables. > > Db1 : Tbl1, Tbl2, Tbl3 > Db2 : Tbl1, Tbl2, Tbl3 > Db3 : Tbl1, Tbl2, Tbl3 > > All databases have the same number of tables with same table names as > shown above. All tables have exactly the same structure as well. Each > table has three fields: > id, name, category > > Since the data is distributed this way, I don't have a way to search > for a particular record using 'name'. I must look for it in all the 9 > tables. This is not scalable when lets say I have 20 databases each > with 20 tables, meaning 400 queries needed to find a single record. > > Solr seemed like the solution to help. > > I followed the wiki tutorials: > http://wiki.apache.org/solr/DataImportHandler > http://wiki.apache.org/solr/DIHQuickStart > http://wiki.apache.org/solr/DataImportHandlerFaq > > The following are my config files so far: > > solrconfig.xml > > class="org.apache.solr.handler.dataimport.DataImportHandler"> > > data-config.xml > > > > > dataconfig.xml (so far) > > > driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db1" > user="user-name" password="password" /> > driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db2" > user="user-name" password="password" /> > driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db3" > user="user-name" password="password" /> > > > > > > > > > > > > > > > Doubts/ Questions: > > > - Is this the right away to achieve indexing this data? > - Is there a better way to achieve this? Imagine 20 databases with 20 > tables each translates to 400 lines in the XML. This doesn't scale for > something like 200 databases and 200 tables each. Will solr continue > to work/ index properly if I had 4 entity rows without going out > of memory? > Seems OK. Your original database is sharded so I'm guessing the amount of data is quite large. The number of root entities does not matter. What matters is the total number of documents. As you go from indexing 20 database shards to 200 shards, you will likely cross a point where indexing all of them on a single Solr box is either impossible (due to the large number of documents) or very slow. Similarly, response times may also suffer. Solr supports distributed search wherein you can shard your Solr index each having a disjoint set of documents. You can continue to query Solr normally (except for providing an additional shards request parameter) and Solr will make sure it gets results from all shards, merges and returns them as if you were querying a single Solr instance. See http://wiki.apache.org/solr/DistributedSearch for more details. > - I will really want that I can search thru the complete database for > a 'name' and do things like 'category' filtering etc easily > independent of the entity name/ datasource. For me they are all > records of the same type. > > That is very much possible out of the box. -- Regards, Shalin Shekhar Mangar.
Indexing and searching of sharded/ partitioned databases and tables
Hi All, I am new to Solr. I looking forward for Solr to index data that is partitioned into multiple databases and tables and have questions regarding dataconfig.xml. I have given the doubts at the end. Lets say I have 3 mysql databases each with 3 tables. Db1 : Tbl1, Tbl2, Tbl3 Db2 : Tbl1, Tbl2, Tbl3 Db3 : Tbl1, Tbl2, Tbl3 All databases have the same number of tables with same table names as shown above. All tables have exactly the same structure as well. Each table has three fields: id, name, category Since the data is distributed this way, I don't have a way to search for a particular record using 'name'. I must look for it in all the 9 tables. This is not scalable when lets say I have 20 databases each with 20 tables, meaning 400 queries needed to find a single record. Solr seemed like the solution to help. I followed the wiki tutorials: http://wiki.apache.org/solr/DataImportHandler http://wiki.apache.org/solr/DIHQuickStart http://wiki.apache.org/solr/DataImportHandlerFaq The following are my config files so far: solrconfig.xml data-config.xml dataconfig.xml (so far) Doubts/ Questions: - Is this the right away to achieve indexing this data? - Is there a better way to achieve this? Imagine 20 databases with 20 tables each translates to 400 lines in the XML. This doesn't scale for something like 200 databases and 200 tables each. Will solr continue to work/ index properly if I had 4 entity rows without going out of memory? - I will really want that I can search thru the complete database for a 'name' and do things like 'category' filtering etc easily independent of the entity name/ datasource. For me they are all records of the same type. Thanks and Best Regards, Jayant -- www.jkg.in | http://www.jkg.in/contact-me/ Jayant Kr. Gandhi