Re: Multi-core support for indexing multiple servers

2013-11-12 Thread Robert Veliz
I have two sources/servers--one of them is Magento. Since Magento has a more or 
less out of the box integration with Solr, my thought was to run Solr server 
from the Magento instance and then use DIH to get/merge content from the other 
source/server. Seem feasible/appropriate?  I spec'd it out and it seems to make 
sense...

R

 On Nov 11, 2013, at 11:25 PM, Liu Bo diabl...@gmail.com wrote:
 
 Like Erick said, merging data from different data sources could be very
 difficult. SolrJ is much easier to use, but you may need another application
 to handle the indexing process if you don't want to extend Solr much.
 
 I eventually ended up with a customized request handler which uses SolrWriter
 from the DIH package to index data, so that I can fully control the indexing
 process. Quite like SolrJ, you can write code to convert your data into
 SolrInputDocument and then post it to SolrWriter; SolrWriter handles the rest.
 
 
 On 8 November 2013 21:46, Erick Erickson erickerick...@gmail.com wrote:
 
 Yep, you can define multiple data sources for use with DIH.
 
 Combining data from those multiple sources into a single
 index can be a bit tricky with DIH; personally I tend to prefer
 SolrJ, but that's mostly personal preference, especially if
 I want to get some parallelism going.
 
 But whatever works
 
 Erick
 
 
 On Thu, Nov 7, 2013 at 11:17 PM, manju16832003 manju16832...@gmail.com
 wrote:
 
 Erick,
 Just a question :-), wouldn't it be easier to use DIH to pull data from
 multiple data sources?
 
 I do use DIH to do that comfortably. I have three data sources:
 - MySQL
 - a URLDataSource that returns XML from a .NET application
 - a URLDataSource that connects to an API and returns XML
 
 Here is part of the data-config data source settings:
 <dataSource type="JdbcDataSource" name="solr" driver="com.mysql.jdbc.Driver"
 url="jdbc:mysql://localhost/employeeDB" batchSize="-1" user="root"
 password="root"/>
 <dataSource name="CRMServer" type="URLDataSource" encoding="UTF-8"
 connectionTimeout="5000" readTimeout="1"/>
 <dataSource name="ImageServer" type="URLDataSource" encoding="UTF-8"
 connectionTimeout="5000" readTimeout="1"/>
 
 Of course, in the application I do the same.
 To construct my results, I connect to MySQL and those two data sources.
 
 Basically we have two points of indexing:
 - using DIH for one-time indexing
 - in the application, whenever there is a transaction on the details that we
 are storing in Solr.
 
 
 
 
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099933.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
 -- 
 All the best
 
 Liu Bo


Re: Multi-core support for indexing multiple servers

2013-11-12 Thread Liu Bo
As far as I know about Magento, its DB schema is designed for extensible
property storage, and the relationships between DB tables are kind of complex.

A product has its attribute sets and properties, which are stored in different
tables. A configurable product may have different attribute values for each
of its simple sub-products.

Handling relationships like this in DIH won't be easy, especially when you
want to group the attributes of a configurable product into one document.

But if you just need to search on name and description and not other
attributes, you can try writing a DIH config on the catalog_product_flat_x
tables; Magento may have several of them.

We used to use Lucene core to provide search on Magento products: what we did
was use the SOAP service provided by Magento to get products and then convert
them to Lucene documents. The indexes were updated daily. This hides lots of
Magento implementation details, but it's kind of slow.
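
For illustration, a rough sketch (Java, Lucene 4.x-style API) of that
conversion step, assuming the SOAP response has already been flattened into a
simple map of attribute name to value; the field names here are made up, not
Magento's actual attribute codes:

import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class ProductDocumentMapper {

    // Convert one product record (already fetched via Magento's SOAP API and
    // flattened into a map) into a Lucene Document for the daily batch index.
    public Document toDocument(Map<String, String> product) {
        Document doc = new Document();
        // keep the SKU stored but not tokenized, so results can link back to the shop
        doc.add(new StringField("sku", nullToEmpty(product.get("sku")), Field.Store.YES));
        // name and description are the searchable free-text fields
        doc.add(new TextField("name", nullToEmpty(product.get("name")), Field.Store.YES));
        doc.add(new TextField("description", nullToEmpty(product.get("description")), Field.Store.YES));
        return doc;
    }

    private static String nullToEmpty(String s) {
        return s == null ? "" : s;
    }
}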




On 12 November 2013 22:41, Robert Veliz rob...@mavenbridge.com wrote:

 I have two sources/servers--one of them is Magento. Since Magento has a
 more or less out of the box integration with Solr, my thought was to run
 Solr server from the Magento instance and then use DIH to get/merge content
 from the other source/server. Seem feasible/appropriate?  I spec'd it out
 and it seems to make sense...

 R

  On Nov 11, 2013, at 11:25 PM, Liu Bo diabl...@gmail.com wrote:
 
  Like Erick said, merging data from different data sources could be very
  difficult. SolrJ is much easier to use, but you may need another application
  to handle the indexing process if you don't want to extend Solr much.
  
  I eventually ended up with a customized request handler which uses SolrWriter
  from the DIH package to index data, so that I can fully control the indexing
  process. Quite like SolrJ, you can write code to convert your data into
  SolrInputDocument and then post it to SolrWriter; SolrWriter handles the rest.
 
 
  On 8 November 2013 21:46, Erick Erickson erickerick...@gmail.com
 wrote:
 
  Yep, you can define multiple data sources for use with DIH.
 
  Combining data from those multiple sources into a single
  index can be a bit tricky with DIH; personally I tend to prefer
  SolrJ, but that's mostly personal preference, especially if
  I want to get some parallelism going.
 
  But whatever works
 
  Erick
 
 
  On Thu, Nov 7, 2013 at 11:17 PM, manju16832003 manju16832...@gmail.com
  wrote:
 
  Erick,
  Just a question :-), wouldn't it be easier to use DIH to pull data from
  multiple data sources?
  
  I do use DIH to do that comfortably. I have three data sources:
  - MySQL
  - a URLDataSource that returns XML from a .NET application
  - a URLDataSource that connects to an API and returns XML
  
  Here is part of the data-config data source settings:
  <dataSource type="JdbcDataSource" name="solr" driver="com.mysql.jdbc.Driver"
  url="jdbc:mysql://localhost/employeeDB" batchSize="-1" user="root"
  password="root"/>
  <dataSource name="CRMServer" type="URLDataSource" encoding="UTF-8"
  connectionTimeout="5000" readTimeout="1"/>
  <dataSource name="ImageServer" type="URLDataSource" encoding="UTF-8"
  connectionTimeout="5000" readTimeout="1"/>
  
  Of course, in the application I do the same.
  To construct my results, I connect to MySQL and those two data sources.
  
  Basically we have two points of indexing:
  - using DIH for one-time indexing
  - in the application, whenever there is a transaction on the details that we
  are storing in Solr.
 
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099933.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 
  --
  All the best
 
  Liu Bo




-- 
All the best

Liu Bo


Re: Multi-core support for indexing multiple servers

2013-11-11 Thread Liu Bo
Like Erick said, merging data from different data sources could be very
difficult. SolrJ is much easier to use, but you may need another application
to handle the indexing process if you don't want to extend Solr much.

I eventually ended up with a customized request handler which uses SolrWriter
from the DIH package to index data, so that I can fully control the indexing
process. Quite like SolrJ, you can write code to convert your data into
SolrInputDocument and then post it to SolrWriter; SolrWriter handles the rest.
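
For reference, the plain-SolrJ equivalent of that convert-and-post step looks
roughly like the sketch below (Java). It does not show the SolrWriter-based
request handler itself; the URL, field names, and Record type are placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {

    // stand-in for whatever rows/objects your data source returns
    public static class Record {
        public String id, title, body;
    }

    public void index(List<Record> records) throws SolrServerException, IOException {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        for (Record r : records) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", r.id);
            doc.addField("title", r.title);
            doc.addField("body", r.body);
            docs.add(doc);
        }
        solr.add(docs);   // send the batch
        solr.commit();    // make it searchable
        solr.shutdown();
    }
}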


On 8 November 2013 21:46, Erick Erickson erickerick...@gmail.com wrote:

 Yep, you can define multiple data sources for use with DIH.

 Combining data from those multiple sources into a single
 index can be a bit tricky with DIH, personally I tend to prefer
 SolrJ, but that's mostly personal preference, especially if
 I want to get some parallelism going on.

 But whatever works

 Erick


 On Thu, Nov 7, 2013 at 11:17 PM, manju16832003 manju16832...@gmail.com
 wrote:

  Erick,
  Just a question :-), wouldn't it be easier to use DIH to pull data from
  multiple data sources?
  
  I do use DIH to do that comfortably. I have three data sources:
   - MySQL
   - a URLDataSource that returns XML from a .NET application
   - a URLDataSource that connects to an API and returns XML
  
  Here is part of the data-config data source settings:
  <dataSource type="JdbcDataSource" name="solr" driver="com.mysql.jdbc.Driver"
  url="jdbc:mysql://localhost/employeeDB" batchSize="-1" user="root"
  password="root"/>
  <dataSource name="CRMServer" type="URLDataSource" encoding="UTF-8"
  connectionTimeout="5000" readTimeout="1"/>
  <dataSource name="ImageServer" type="URLDataSource" encoding="UTF-8"
  connectionTimeout="5000" readTimeout="1"/>
  
  Of course, in the application I do the same.
  To construct my results, I connect to MySQL and those two data sources.
  
  Basically we have two points of indexing:
   - using DIH for one-time indexing
   - in the application, whenever there is a transaction on the details that we
  are storing in Solr.
 
 
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099933.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 




-- 
All the best

Liu Bo


Re: Multi-core support for indexing multiple servers

2013-11-08 Thread Erick Erickson
Yep, you can define multiple data sources for use with DIH.

Combining data from those multiple sources into a single
index can be a bit tricky with DIH; personally I tend to prefer
SolrJ, but that's mostly personal preference, especially if
I want to get some parallelism going.
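
A rough sketch of what that can look like with SolrJ (Java) -- one task per
source feeding a ConcurrentUpdateSolrServer; the URL, queue size, and thread
counts are placeholder values, not recommendations:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

public class ParallelIndexing {

    public static void main(String[] args) throws Exception {
        // buffers adds and flushes them to Solr from background threads
        final ConcurrentUpdateSolrServer solr =
                new ConcurrentUpdateSolrServer("http://localhost:8983/solr/collection1", 1000, 4);

        ExecutorService pool = Executors.newFixedThreadPool(2);  // one task per source
        pool.submit(new Runnable() {
            public void run() { indexSource(solr, "marketing"); }
        });
        pool.submit(new Runnable() {
            public void run() { indexSource(solr, "magento"); }
        });
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        solr.blockUntilFinished();  // wait for queued adds to reach Solr
        solr.commit();
        solr.shutdown();
    }

    static void indexSource(ConcurrentUpdateSolrServer solr, String sourceName) {
        // fetch rows from the source, build SolrInputDocuments, call solr.add(...) - omitted
    }
}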

But whatever works

Erick


On Thu, Nov 7, 2013 at 11:17 PM, manju16832003 manju16832...@gmail.comwrote:

 Erick,
 Just a question :-), wouldn't it be easier to use DIH to pull data from
 multiple data sources?
 
 I do use DIH to do that comfortably. I have three data sources:
  - MySQL
  - a URLDataSource that returns XML from a .NET application
  - a URLDataSource that connects to an API and returns XML
 
 Here is part of the data-config data source settings:
 <dataSource type="JdbcDataSource" name="solr" driver="com.mysql.jdbc.Driver"
 url="jdbc:mysql://localhost/employeeDB" batchSize="-1" user="root"
 password="root"/>
 <dataSource name="CRMServer" type="URLDataSource" encoding="UTF-8"
 connectionTimeout="5000" readTimeout="1"/>
 <dataSource name="ImageServer" type="URLDataSource" encoding="UTF-8"
 connectionTimeout="5000" readTimeout="1"/>
 
 Of course, in the application I do the same.
 To construct my results, I connect to MySQL and those two data sources.
 
 Basically we have two points of indexing:
  - using DIH for one-time indexing
  - in the application, whenever there is a transaction on the details that we
 are storing in Solr.





 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099933.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multi-core support for indexing multiple servers

2013-11-07 Thread manju16832003
Hi Rob,
The multi-core approach is different. You could have two cores, let's say:
marketing-core [has its own schema.xml and data-config.xml]
magento-core [has its own schema.xml and data-config.xml]

Each core has its own schema.xml and data-config.xml.
If you go by the multi-core approach, I guess you won't be able to achieve
what you described or what you need. You can query across two cores, but that
is expensive and tedious.

The one you explained, with a document type, is just a single core (single
index) where you differentiate each document by its type,

let's say document_type=marketing OR document_type=magento

I think you could go with a single index (single core) with document_type
as the differentiator.
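
For example, a minimal SolrJ (Java) sketch of querying such a combined index;
the core name and field values are just placeholders, and each site would add
(or omit) the filter it needs:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TypedSearch {

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/combined");

        SolrQuery q = new SolrQuery("foo");              // the user's search terms
        q.addFilterQuery("document_type:magento");       // drop this to search both sites

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.shutdown();
    }
}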

Also note that if you have common fields between the two databases, you don't
need to re-define those fields.
You can make use of the same field for both databases.

Let's say you have a field 'title' in the marketing database and the Magento
database. You could have one 'title' field defined in schema.xml; there is no
need to define two title fields. Also, carefully look at each field's default
value in schema.xml.
Let's say you have some fields in the marketing database that do not exist in
the Magento DB. When you are done with indexing, if those fields do not have
values they will not show up in the result. If you want it that way, you don't
need to define default="". If you still want the field to appear regardless of
whether it has data, you would have to specify default="".
Ex:
<field name="year" type="int" indexed="true" stored="true"
multiValued="false" default=""/>

To index the two databases together, you can try the DataImportHandler.
In the DataImportHandler you can query multiple data sources. The good thing
about the DataImportHandler is that your data source could be a database
(MySQL, MS-SQL, etc.), a URLDataSource, etc.

Hope that is helpful



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099746.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multi-core support for indexing multiple servers

2013-11-07 Thread Erick Erickson
Rob:

What I think you're missing is that you are responsible
for pulling the data from your separate sources and
pushing it to solr via an update command. You can
do this in SolrJ, PHP, or any other package that supports
a Solr client. You simply address your requests (both
update and query) to the right core on your central server,
e.g. http://myserver:8983/solr/core/update
  http://myserver:8983/solr/othercore/update

etc.
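
For instance, with SolrJ (Java) you would point one client at each core's base
URL and let the client append the /update and /select paths; the host and core
names below follow the example URLs above, and the field values are made up:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PerCoreClients {

    public static void main(String[] args) throws Exception {
        // one client per core, each pointed at the core's base URL
        HttpSolrServer marketingCore = new HttpSolrServer("http://myserver:8983/solr/core");
        HttpSolrServer magentoCore = new HttpSolrServer("http://myserver:8983/solr/othercore");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "page-1");
        doc.addField("title", "Example marketing page");

        marketingCore.add(doc);     // ends up at .../solr/core/update
        marketingCore.commit();

        // the other application would push its documents through magentoCore the same way
        marketingCore.shutdown();
        magentoCore.shutdown();
    }
}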

As far as using DIH is concerned, you have to supply
credentials, including the connection URL in the database
case.

But I think you're right, you're not using multiple cores.
You'll probably have to write something (I use SolrJ) that
can talk to your two data sources, then combine the
information into Solr documents and push them to your
Solr server.

From there, querying is usually fronted by an application
and the indexes are entirely self-contained on the Solr
server so no reaching out is necessary.

Best,
Erick


On Thu, Nov 7, 2013 at 3:50 AM, manju16832003 manju16832...@gmail.comwrote:

 Hi Rob,
 The multi-core approach is different. You could have two cores, let's say:
 marketing-core [has its own schema.xml and data-config.xml]
 magento-core [has its own schema.xml and data-config.xml]

 Each core has its own schema.xml and data-config.xml.
 If you go by the multi-core approach, I guess you won't be able to achieve
 what you described or what you need. You can query across two cores, but
 that is expensive and tedious.

 The one you explained, with a document type, is just a single core (single
 index) where you differentiate each document by its type,

 let's say document_type=marketing OR document_type=magento

 I think you could go with a single index (single core) with document_type
 as the differentiator.

 Also note that if you have common fields between the two databases, you
 don't need to re-define those fields.
 You can make use of the same field for both databases.

 Let's say you have a field 'title' in the marketing database and the Magento
 database. You could have one 'title' field defined in schema.xml; there is
 no need to define two title fields. Also, carefully look at each field's
 default value in schema.xml.
 Let's say you have some fields in the marketing database that do not exist
 in the Magento DB. When you are done with indexing, if those fields do not
 have values they will not show up in the result. If you want it that way,
 you don't need to define default="". If you still want the field to appear
 regardless of whether it has data, you would have to specify default="".
 Ex:
 <field name="year" type="int" indexed="true" stored="true"
 multiValued="false" default=""/>

 To index the two databases together, you can try the DataImportHandler.
 In the DataImportHandler you can query multiple data sources. The good thing
 about the DataImportHandler is that your data source could be a database
 (MySQL, MS-SQL, etc.), a URLDataSource, etc.

 Hope that is helpful



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099746.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multi-core support for indexing multiple servers

2013-11-07 Thread manju16832003
Erick,
Just a question :-), wouldn't it be easier to use DIH to pull data from
multiple data sources?

I do use DIH to do that comfortably. I have three data sources:
 - MySQL
 - a URLDataSource that returns XML from a .NET application
 - a URLDataSource that connects to an API and returns XML

Here is part of the data-config data source settings:
<dataSource type="JdbcDataSource" name="solr" driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/employeeDB" batchSize="-1" user="root"
password="root"/>
<dataSource name="CRMServer" type="URLDataSource" encoding="UTF-8"
connectionTimeout="5000" readTimeout="1"/>
<dataSource name="ImageServer" type="URLDataSource" encoding="UTF-8"
connectionTimeout="5000" readTimeout="1"/>

Of course, in the application I do the same.
To construct my results, I connect to MySQL and those two data sources.

Basically we have two points of indexing:
 - using DIH for one-time indexing
 - in the application, whenever there is a transaction on the details that we
are storing in Solr.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multi-core-support-for-indexing-multiple-servers-tp4099729p4099933.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Multi-core support for indexing multiple servers

2013-11-06 Thread Shawn Heisey
On 11/6/2013 11:38 PM, Rob Veliz wrote:
 Trying to find specific information to support the following scenario:
 
 - I have one site running on one server with marketing content, blog, etc.
 I want to index.
 - I have another site running on Magento on a different server with
 ecommerce content (products).
 - Both servers live in completely different environments.
 - I would like to create one single search index between both sites and
 make that index searchable from both sites.
 
 I think I can/should use the multi-core approach and spin off a new server
 to host Solr but can anyone verify this is the best/most appropriate
 approach?  Are there any other details I need to consider?  Can anyone
 provide a step by step for making this happen to validate my own technical
 plan?  Any help appreciated...was initially thinking I needed SolrCloud but
 that seems like overkill for my primary use case.

SolrCloud makes for *easy* redundancy.  There is a three-server minimum
if you want it to be fault-tolerant for both Solr and Zookeeper.  The
third server would only run zookeeper and could be an extremely
inexpensive machine.  The other two servers would run both Solr and
Zookeeper.  Redundancy without cloud is possible, it's just not as
automated, and can be done with two servers.

It is highly recommended that redundant servers are not separated
geographically.  This is especially important with SolrCloud, as
Zookeeper redundancy requires that a majority of the servers be
operational.  That can be extremely difficult to guarantee in a
multi-datacenter model, if one assumes that an entire datacenter can
disappear from the network.

If you don't care about redundancy, then you'd just run a single server,
and SolrCloud wouldn't provide much benefit.

Multiple cores is a good way to go -- the two indexes would be logically
separate, but you'd be able to use either one.  With SolrCloud, it would
be multiple collections.

Thanks,
Shawn



Re: Multi-core support for indexing multiple servers

2013-11-06 Thread Rob Veliz
Great feedback, thanks.  So the multi-core structure I have then is a
single Solr server set up, essentially hosted by one domain owner (but to
be used by both).  My question is how does that Solr server connect to the
2 Web applications to create the 1 master index (to be used when searching
on either Web app)?  It feels like I just reference the Solr server from
within the Web app search templates (e.g. PHP files).  That is logical in
terms of pulling the data into the Web apps, but it's still not clear to me
how the data from those 2 Web apps actually gets into the Solr server if
Solr server doesn't live on the same server as the Web app(s).  Any
thoughts?


On Wed, Nov 6, 2013 at 10:57 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/6/2013 11:38 PM, Rob Veliz wrote:
  Trying to find specific information to support the following scenario:
 
  - I have one site running on one server with marketing content, blog,
 etc.
  I want to index.
  - I have another site running on Magento on a different server with
  ecommerce content (products).
  - Both servers live in completely different environments.
  - I would like to create one single search index between both sites and
  make that index searchable from both sites.
 
  I think I can/should use the multi-core approach and spin off a new
 server
  to host Solr but can anyone verify this is the best/most appropriate
  approach?  Are there any other details I need to consider?  Can anyone
  provide a step by step for making this happen to validate my own
 technical
  plan?  Any help appreciated...was initially thinking I needed SolrCloud
 but
  that seems like overkill for my primary use case.

 SolrCloud makes for *easy* redundancy.  There is a three-server minimum
 if you want it to be fault-tolerant for both Solr and Zookeeper.  The
 third server would only run zookeeper and could be an extremely
 inexpensive machine.  The other two servers would run both Solr and
 Zookeeper.  Redundancy without cloud is possible, it's just not as
 automated, and can be done with two servers.

 It is highly recommended that redundant servers are not separated
 geographically.  This is especially important with SolrCloud, as
 Zookeeper redundancy requires that a majority of the servers be
 operational.  That can be extremely difficult to guarantee in a
 multi-datacenter model, if one assumes that an entire datacenter can
 disappear from the network.

 If you don't care about redundancy, then you'd just run a single server,
 and SolrCloud wouldn't provide much benefit.

 Multiple cores is a good way to go -- the two indexes would be logically
 separate, but you'd be able to use either one.  With SolrCloud, it would
 be multiple collections.

 Thanks,
 Shawn




-- 
*Rob Veliz*, Founder | *Mavenbridge* | rob...@mavenbridge.com | M: +1 (206)
909 - 3490

Follow us at: http://twitter.com/mavenbridge


Re: Multi-core support for indexing multiple servers

2013-11-06 Thread Shawn Heisey
On 11/7/2013 12:07 AM, Rob Veliz wrote:
 Great feedback, thanks.  So the multi-core structure I have then is a
 single Solr server set up, essentially hosted by one domain owner (but to
 be used by both).  My question is how does that Solr server connect to the
 2 Web applications to create the 1 master index (to be used when searching
 on either Web app)?  It feels like I just reference the Solr server from
 within the Web app search templates (e.g. PHP files).  That is logical in
 terms of pulling the data into the Web apps, but it's still not clear to me
 how the data from those 2 Web apps actually gets into the Solr server if
 Solr server doesn't live on the same server as the Web app(s).  Any
 thoughts?

Solr uses HTTP calls.  It is REST-like, though there has been some
recent work to make parts of it actually use true REST; that paradigm
might later be extended to the entire interface.

There are a number of Solr API packages for PHP that give you an
object-oriented interface to Solr that won't require learning Solr's HTTP
that I have heard about.  I've not actually used these, as I have little
personal experience with writing PHP:

http://pecl.php.net/package/solr
http://www.solarium-project.org/

If you are planning a single master index, that's not multicore.  Having
more than one document type in a single index is possible; they just
have to overlap on at least one field - whatever field is the uniqueKey
for the index.
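
To illustrate in Java/SolrJ: documents from both sources can share one index
as long as they all fill in the uniqueKey field -- assumed here to be "id",
with a source prefix as one hypothetical way to keep the two sources from
colliding:

import org.apache.solr.common.SolrInputDocument;

public class SharedUniqueKey {

    public static SolrInputDocument marketingPage(String pageId, String title) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "marketing-" + pageId);   // uniqueKey, shared by both types
        doc.addField("document_type", "marketing");
        doc.addField("title", title);
        return doc;
    }

    public static SolrInputDocument magentoProduct(String sku, String name) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "magento-" + sku);        // uniqueKey, shared by both types
        doc.addField("document_type", "magento");
        doc.addField("title", name);
        return doc;
    }
}
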

Thanks,
Shawn



Re: Multi-core support for indexing multiple servers

2013-11-06 Thread Rob Veliz
I've been reading about Solarium--definitely useful.  Could you elaborate
here:

If you are planning a single master index, that's not multicore.  Having
more than one document type in a single index is possible, they just
have to overlap on at least one field - whatever field is the uniqueKey
for the index.

What I'm trying to do is index marketing pages from one server AND index
product pages from a different ecommerce server and then combine those
results into a single index, so when I search for foo from either site, I
get the exact same results for foo.  If that's not multi-core, what's the
right approach to accomplish this?


On Wed, Nov 6, 2013 at 11:29 PM, Shawn Heisey s...@elyograg.org wrote:

 On 11/7/2013 12:07 AM, Rob Veliz wrote:
  Great feedback, thanks.  So the multi-core structure I have then is a
  single Solr server set up, essentially hosted by one domain owner (but to
  be used by both).  My question is how does that Solr server connect to
 the
  2 Web applications to create the 1 master index (to be used when
 searching
  on either Web app)?  It feels like I just reference the Solr server from
  within the Web app search templates (e.g. PHP files).  That is logical in
  terms of pulling the data into the Web apps, but it's still not clear to
 me
  how the data from those 2 Web apps actually gets into the Solr server if
  Solr server doesn't live on the same server as the Web app(s).  Any
  thoughts?

 Solr uses HTTP calls.  It is REST-like, though there has been some
 recent work to make parts of it actually use true REST; that paradigm
 might later be extended to the entire interface.

 There are a number of Solr API packages for PHP that give you an
 object-oriented interface to Solr that won't require learning Solr's HTTP
 interface - you write PHP code to access Solr.  These are two of them
 that I have heard about.  I've not actually used these, as I have little
 personal experience with writing PHP:

 http://pecl.php.net/package/solr
 http://www.solarium-project.org/

 If you are planning a single master index, that's not multicore.  Having
 more than one document type in a single index is possible; they just
 have to overlap on at least one field - whatever field is the uniqueKey
 for the index.

 Thanks,
 Shawn




-- 
*Rob Veliz*, Founder | *Mavenbridge* | rob...@mavenbridge.com | M: +1 (206)
909 - 3490

Follow us at: http://twitter.com/mavenbridge