Re: Multi-core support for indexing multiple servers
I have two sources/servers, and one of them is Magento. Since Magento has a more-or-less out-of-the-box integration with Solr, my thought was to run the Solr server from the Magento instance and then use DIH to get/merge content from the other source/server. Does that seem feasible/appropriate? I spec'd it out and it seems to make sense...

R

On Nov 11, 2013, at 11:25 PM, Liu Bo diabl...@gmail.com wrote:
> Like Erick said, merging data from different data sources can be very difficult. [...]
Re: Multi-core support for indexing multiple servers
As far as I know, Magento's DB schema is designed for extensible property storage, and the relationships between its DB tables are fairly complex. A product has attribute sets and properties that are stored in different tables. A configurable product may have different attribute values for each of its simple sub-products.

Handling relationships like this in DIH won't be easy, especially when you want to group the attributes of a configurable product into one document.

But if you only need to search on name and description and not on other attributes, you can try writing a DIH config against the catalog_product_flat_x tables; Magento may have several of them.

We used to use Lucene core to provide search on Magento products. We used the SOAP service provided by Magento to fetch products and then converted them to Lucene documents. Indexes were updated daily. This hides a lot of Magento implementation details, but it's rather slow.

On 12 November 2013 22:41, Robert Veliz rob...@mavenbridge.com wrote:
> I have two sources/servers, and one of them is Magento. [...] my thought was to run the Solr server from the Magento instance and then use DIH to get/merge content from the other source/server. [...]

--
All the best

Liu Bo
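To make the "group attributes of a configurable product into one document" problem concrete, here is a minimal Python sketch. It is not Magento or DIH code; the record shapes, the `magento-` id prefix, and the helper name `build_product_doc` are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_product_doc(parent, children):
    """Flatten a configurable product and its simple sub-products into one document."""
    doc = {"id": f"magento-{parent['sku']}", "name": parent["name"]}
    merged = defaultdict(set)
    for child in children:
        for attr, value in child["attributes"].items():
            merged[attr].add(value)
    # Sorted lists keep the output deterministic; each one would map to a
    # multiValued field in the index.
    for attr, values in merged.items():
        doc[attr] = sorted(values)
    return doc


# Hypothetical sample data, not real Magento rows.
parent = {"sku": "TSHIRT", "name": "T-Shirt"}
children = [
    {"sku": "TSHIRT-S-RED", "attributes": {"size": "S", "color": "red"}},
    {"sku": "TSHIRT-M-RED", "attributes": {"size": "M", "color": "red"}},
]
doc = build_product_doc(parent, children)
```

Doing this grouping in application code is what makes the SOAP or SolrJ style of indexing easier to reason about than nested DIH entities, at the cost of writing and running a separate indexer.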
Re: Multi-core support for indexing multiple servers
Like Erick said, merging data from different data sources can be very difficult. SolrJ is much easier to use, but it may need another application to handle the indexing process if you don't want to extend Solr much.

I eventually ended up with a customized request handler that uses SolrWriter from the DIH package to index data, so that I can fully control the indexing process. It is quite like SolrJ: you write code to convert your data into SolrInputDocument objects and then pass them to SolrWriter, which handles the rest.

On 8 November 2013 21:46, Erick Erickson erickerick...@gmail.com wrote:
> Yep, you can define multiple data sources for use with DIH. Combining data from those multiple sources into a single index can be a bit tricky with DIH [...]

--
All the best

Liu Bo
Re: Multi-core support for indexing multiple servers
Yep, you can define multiple data sources for use with DIH. Combining data from those multiple sources into a single index can be a bit tricky with DIH. I tend to prefer SolrJ, especially if I want to get some parallelism going, but that's mostly personal preference. Whatever works.

Erick

On Thu, Nov 7, 2013 at 11:17 PM, manju16832003 manju16832...@gmail.com wrote:
> Erick, just a question :-) Wouldn't it be easy to use DIH to pull data from multiple data sources? I do use DIH to do that comfortably. [...]
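The parallelism point can be sketched without SolrJ. The Python below is only an illustration of the idea (fetch from each source concurrently, then combine), not SolrJ code; the two fetch functions are hypothetical stand-ins for whatever actually reads the marketing site and the ecommerce site.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_marketing():
    # Hypothetical stand-in for reading pages from the marketing site.
    return [{"id": "marketing-1", "title": "About us"}]


def fetch_products():
    # Hypothetical stand-in for reading products from the ecommerce site.
    return [{"id": "magento-42", "title": "Blue T-Shirt"}]


def gather_documents(fetchers):
    """Run every fetcher in its own thread and concatenate results in fetcher order."""
    with ThreadPoolExecutor(max_workers=len(fetchers)) as pool:
        futures = [pool.submit(fetch) for fetch in fetchers]
        docs = []
        for future in futures:
            docs.extend(future.result())
    return docs


docs = gather_documents([fetch_marketing, fetch_products])
```

Collecting results in submission order rather than completion order keeps the combined batch deterministic, which makes indexing runs easier to compare.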
Re: Multi-core support for indexing multiple servers
Hi Rob,

The multi-core approach is different. You could have two cores, let's say:

marketing-core [has its own schema.xml and data-config.xml]
magento-core [has its own schema.xml and data-config.xml]

Each core has its own schema.xml and data-config.xml.

If you go with the multi-core approach, I guess you won't be able to achieve what you described or what you need. You can query across two cores, but that is expensive and tedious.

What you described, with a document type, is just a single core (a single index) in which you differentiate documents by their type, let's say document_type=marketing OR document_type=magento.

I think you could go with a single index (single core) with document_type as the differentiator.

Also note that if you have common fields between the two databases, you don't need to define those fields twice. You can use the same field for both databases. Let's say you have a field 'title' in both the marketing database and the Magento database. You can have one 'title' field defined in schema.xml; there is no need to define two title fields.

Also look carefully at each field's default value in schema.xml. Let's say you have some fields in the marketing database that do not exist in the Magento DB. When you're done with indexing, fields that have no value will not show up in the results. If that is what you want, you don't need to define default="". If you still want the field to appear regardless of whether there is data, you would have to specify default="".

Ex:
<field name="year" type="int" indexed="true" stored="true" multiValued="false" default=""/>

To index two databases together, you can try DataImportHandler. In DataImportHandler you can query multiple data sources. The good thing about DataImportHandler is that your data source could be a database (MySQL, MS-SQL, etc.), a URLDataSource, etc.

Hope that is helpful.

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099746.html
Sent from the Solr - User mailing list archive at Nabble.com.
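The single-index layout described above can be sketched as a small mapping function. This is a rough Python illustration, not part of any Solr client; the helper name `to_solr_doc`, the id-prefix scheme, and the sample records are assumptions. Only `document_type`, `title`, and `year` come from the message.

```python
def to_solr_doc(document_type, record):
    """Map a source record onto the shared schema of a single index."""
    doc = {
        # Prefixing the id keeps keys from the two sources from colliding.
        "id": f"{document_type}-{record['id']}",
        "document_type": document_type,
        "title": record["title"],  # one shared 'title' field for both sources
    }
    # Copy optional fields only when the source actually has a value, so a
    # document without 'year' simply omits the field.
    if record.get("year") is not None:
        doc["year"] = record["year"]
    return doc


marketing = to_solr_doc("marketing", {"id": 7, "title": "Our story", "year": 2013})
product = to_solr_doc("magento", {"id": 7, "title": "Blue T-Shirt"})
```

With documents shaped like this, a search can cover everything by default, or be narrowed to one source with a filter query such as fq=document_type:magento.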
Re: Multi-core support for indexing multiple servers
Rob:

What I think you're missing is that you are responsible for pulling the data from your separate sources and pushing it to Solr via an update command. You can do this in SolrJ, PHP, or any other package that supports a Solr client. You simply address your requests (both update and query) to the right core on your central server, e.g.

http://myserver:8983/solr/core/update
http://myserver:8983/solr/othercore/update

etc.

As far as using DIH is concerned, you have to supply credentials, including the connection URL in the database case.

But I think you're right, you're not using multiple cores. You'll probably have to write something (I use SolrJ) that can talk to your two data sources, then combine the information into Solr documents and push them to your Solr server. From there, querying is usually fronted by an application, and the indexes are entirely self-contained on the Solr server, so no reaching out is necessary.

Best,
Erick

On Thu, Nov 7, 2013 at 3:50 AM, manju16832003 manju16832...@gmail.com wrote:
> Hi Rob, the multi-core approach is different. [...] I think you could go with a single index (single core) with document_type as the differentiator. [...]
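As a rough illustration of "pushing it to Solr via an update command", the Python sketch below builds such a request with the standard library only. It assumes the core's /update handler accepts a JSON array of documents (some older configurations expose this at /update/json instead); the helper name and the sample document are hypothetical, and the request is built but deliberately not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url, core, docs, commit=True):
    """Build (but do not send) a JSON update request for one core."""
    query = urlencode({"commit": "true" if commit else "false"})
    url = f"{base_url}/{core}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_update_request(
    "http://myserver:8983/solr",
    "core",
    [{"id": "marketing-1", "title": "About us"}],
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs
# without a Solr server.
```

Either web application can run code like this against the same central server, which is how data gets into Solr even though Solr lives on a different machine.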
Re: Multi-core support for indexing multiple servers
Erick,

Just a question :-) Wouldn't it be easy to use DIH to pull data from multiple data sources? I do use DIH to do that comfortably. I have three data sources:

- MySQL
- A URLDataSource that returns XML from a .NET application
- A URLDataSource that connects to an API and returns XML

Here is part of the data source settings in data-config:

<dataSource type="JdbcDataSource" name="solr" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/employeeDB" batchSize="-1" user="root" password="root"/>
<dataSource name="CRMServer" type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="1"/>
<dataSource name="ImageServer" type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="1"/>

Of course, I do the same in the application: to construct my results, I connect to MySQL and those two data sources.

Basically we have two points of indexing:

- Using DIH for one-time indexing
- In the application, whenever there is a transaction on the details that we store in Solr

--
View this message in context: http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099933.html
Sent from the Solr - User mailing list archive at Nabble.com.
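For the one-time DIH indexing step, the import is normally started with an HTTP request to the handler. The Python sketch below only builds that URL. It assumes the handler is registered at /dataimport in solrconfig.xml, which is the usual convention but not something the message states, and the core name "employees" is hypothetical.

```python
from urllib.parse import urlencode


def dataimport_url(base_url, core, command="full-import", **params):
    """Build the URL for a DataImportHandler command on one core."""
    query = urlencode({"command": command, **params})
    return f"{base_url}/{core}/dataimport?{query}"


url = dataimport_url(
    "http://localhost:8983/solr", "employees", clean="true", commit="true"
)
```

Requesting the same path with command=status is a common way to check on a running import, while the application-side updates mentioned above go through the normal update handler instead.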
Re: Multi-core support for indexing multiple servers
On 11/6/2013 11:38 PM, Rob Veliz wrote:
> Trying to find specific information to support the following scenario:
>
> - I have one site running on one server with marketing content, blog, etc. I want to index.
> - I have another site running on Magento on a different server with ecommerce content (products).
> - Both servers live in completely different environments.
> - I would like to create one single search index between both sites and make that index searchable from both sites.
>
> I think I can/should use the multi-core approach and spin off a new server to host Solr but can anyone verify this is the best/most appropriate approach? Are there any other details I need to consider? Can anyone provide a step by step for making this happen to validate my own technical plan? Any help appreciated...was initially thinking I needed SolrCloud but that seems like overkill for my primary use case.

SolrCloud makes for *easy* redundancy. There is a three-server minimum if you want it to be fault-tolerant for both Solr and Zookeeper. The third server would only run Zookeeper and could be an extremely inexpensive machine. The other two servers would run both Solr and Zookeeper.

Redundancy without cloud is possible, it's just not as automated, and can be done with two servers.

It is highly recommended that redundant servers are not separated geographically. This is especially important with SolrCloud, as Zookeeper redundancy requires that a majority of the servers be operational. That can be extremely difficult to guarantee in a multi-datacenter model, if one assumes that an entire datacenter can disappear from the network.

If you don't care about redundancy, then you'd just run a single server, and SolrCloud wouldn't provide much benefit.

Multiple cores is a good way to go -- the two indexes would be logically separate, but you'd be able to use either one. With SolrCloud, it would be multiple collections.

Thanks,
Shawn
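The majority requirement behind the three-server minimum is simple arithmetic, sketched here in Python. The function names are made up for illustration and nothing here talks to Zookeeper.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def survives(ensemble_size, servers_lost):
    """True if the remaining servers still form a majority."""
    return ensemble_size - servers_lost >= quorum_size(ensemble_size)


# Three servers tolerate one failure but not two.
three_lose_one = survives(3, 1)
three_lose_two = survives(3, 2)

# Two servers cannot tolerate any failure, which is why two is not enough.
two_lose_one = survives(2, 1)
```

The same arithmetic shows the multi-datacenter problem: if three servers are split 2 + 1 across two sites, losing the site that holds two of them is the `survives(3, 2)` case, so the ensemble stops being operational.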
Re: Multi-core support for indexing multiple servers
Great feedback, thanks. So the multi-core structure I have then is a single Solr server setup, essentially hosted by one domain owner (but to be used by both). My question is: how does that Solr server connect to the 2 Web applications to create the 1 master index (to be used when searching on either Web app)?

It feels like I just reference the Solr server from within the Web app search templates (e.g. PHP files). That is logical in terms of pulling the data into the Web apps, but it's still not clear to me how the data from those 2 Web apps actually gets into the Solr server if the Solr server doesn't live on the same server as the Web app(s). Any thoughts?

On Wed, Nov 6, 2013 at 10:57 PM, Shawn Heisey s...@elyograg.org wrote:
> [...] Multiple cores is a good way to go -- the two indexes would be logically separate, but you'd be able to use either one. With SolrCloud, it would be multiple collections.

--
*Rob Veliz*, Founder | *Mavenbridge* | rob...@mavenbridge.com | M: +1 (206) 909 - 3490
Follow us at: http://twitter.com/mavenbridge
Re: Multi-core support for indexing multiple servers
On 11/7/2013 12:07 AM, Rob Veliz wrote:
> [...] it's still not clear to me how the data from those 2 Web apps actually gets into the Solr server if Solr server doesn't live on the same server as the Web app(s). Any thoughts?

Solr uses HTTP calls. It is REST-like, though there has been some recent work to make parts of it use true REST, and that paradigm might later be extended to the entire interface.

There are a number of Solr API packages for PHP that give you an object-oriented interface to Solr that won't require learning Solr's HTTP interface - you write PHP code to access Solr. These are two of them that I have heard about. I've not actually used them, as I have little personal experience with writing PHP:

http://pecl.php.net/package/solr
http://www.solarium-project.org/

If you are planning a single master index, that's not multicore. Having more than one document type in a single index is possible; they just have to overlap on at least one field - whatever field is the uniqueKey for the index.

Thanks,
Shawn
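Since the interface is plain HTTP, a query from either web application comes down to requesting a URL on the central server. The Python sketch below only builds that URL. The host and core name are borrowed from the placeholder URLs used elsewhere in this thread, and the helper name is an assumption.

```python
from urllib.parse import urlencode


def select_url(base_url, core, query, rows=10):
    """Build a query URL for one core, asking for a JSON response."""
    query_string = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{query_string}"


url = select_url("http://myserver:8983/solr", "core", "foo")
```

Because both sites would request the same URL against the same index, they would see the same results for the same search, and a client library such as Solarium would be issuing an equivalent request on the application's behalf.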
Re: Multi-core support for indexing multiple servers
I've been reading about Solarium, definitely useful. Could you elaborate here:

> If you are planning a single master index, that's not multicore. Having more than one document type in a single index is possible, they just have to overlap on at least one field - whatever field is the uniqueKey for the index.

What I'm trying to do is index marketing pages from one server AND index product pages from a different ecommerce server and then combine those results into a single index, so when I search for "foo" from either site, I get the exact same results for "foo". If that's not multi-core, what's the right approach to accomplish this?

On Wed, Nov 6, 2013 at 11:29 PM, Shawn Heisey s...@elyograg.org wrote:
> Solr uses HTTP calls. [...]

--
*Rob Veliz*, Founder | *Mavenbridge* | rob...@mavenbridge.com | M: +1 (206) 909 - 3490
Follow us at: http://twitter.com/mavenbridge