Re: Specifying multiple documents in DataImportHandler dataConfig

Bertie Shen Sat, 07 Nov 2009 09:44:17 -0800

I have the same problem. I had thought we could specify multiple
<document>...</document> elements, each mapping one table in the RDBMS.
But I found that is not the case: DIH only picks up the first
<document>...</document> for indexing.


I think Rupert's request and mine are pretty common. Basically, there are
multiple tables in the RDBMS, and we want each row in each table to become a
document in the Lucene index. How can we write one data-config.xml file that
lets DataImportHandler import multiple tables at the same time?

Rupert, have you figured out a way to do it?

Thanks.


On Tue, Sep 8, 2009 at 3:42 PM, Rupert Fiasco <rufia...@gmail.com> wrote:

> Maybe I should be more clear: I have multiple tables in my DB that I
> need to save to my Solr index. In my app code I have logic to persist
> each table, which maps to an application model to Solr. This is fine.
> I am just trying to speed up indexing time by using DIH instead of
> going through my application. From what I understand of DIH I can
> specify one dataSource element and then a series of document/entity
> sets, for each of my models. But like I said before, DIH only appears
> to want to index the first document declared under the dataSource tag.
>
> -Rupert
>
> On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco <rufia...@gmail.com> wrote:
> > I am using the DataImportHandler with a JDBC datasource. From my
> > understanding of DIH, for each of my "content types" e.g. Blog posts,
> > Mesh Categories, etc I would construct a series of document/entity
> > sets, like
> >
> > <dataConfig>
> > <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />
> >
> >    <!-- BLOG ENTRIES -->
> >    <document name="blog_entries">
> >      <entity name="blog_entries" query="select
> > id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type
> > from blog_entries">
> >        <field column="id" name="pk_i" />
> >        <field column="id" name="id" />
> >        <field column="title" name="text_t" />
> >        <field column="data" name="text_t" />
> >      </entity>
> >    </document>
> >
> >    <!-- MESH CATEGORIES -->
> >    <document name="mesh_category">
> >      <entity name="mesh_categories" query="select
> > id,name,node_key,name as name_fc,'MeshCategory' as type from
> > mesh_categories">
> >        <field column="id" name="pk_i" />
> >        <field column="id" name="id" />
> >        <field column="name" name="text_t" />
> >        <field column="node_key" name="string" />
> >        <field column="name_fc" name="facet_value" />
> >        <field column="type" name="type_t" />
> >      </entity>
> >    </document>
> > </dataConfig>
> >
> >
> > Solr parses this just fine and allows me to issue a
> > /dataimport?command=full-import and it runs, but it only runs against
> > the "first" document (blog_entries). It doesn't run against the 2nd
> > document (mesh_categories).
> >
> > If I remove the 2 document elements and wrap both entity sets in just
> > one document tag, then both sets get indexed, which seemingly achieves
> > my goal. This just doesn't make sense from my understanding of how DIH
> > works. My 2 content types are indeed separate so they logically
> > represent two document types, not one.
> >
> > Is this correct? What am I missing here?
> >
> > Thanks
> > -Rupert
> >
>
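
For reference, the arrangement Rupert describes as working (both entity
sets wrapped in a single document tag) might look roughly like the sketch
below. It reuses the queries and field mappings from his config as-is and
keeps the elided JDBC URL elided. Indentation and comments are mine, and I
have not run this against a live Solr instance, so treat it as an
illustration, not a verified config.

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://...." />

  <!-- One document element only. Each entity directly under it is a root
       entity, so every row it returns becomes its own Solr document. -->
  <document>

    <!-- BLOG ENTRIES -->
    <entity name="blog_entries"
            query="select id,title,keywords,summary,data,title as name_fc,'BlogEntry' as type from blog_entries">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="title" name="text_t" />
      <field column="data" name="text_t" />
    </entity>

    <!-- MESH CATEGORIES -->
    <entity name="mesh_categories"
            query="select id,name,node_key,name as name_fc,'MeshCategory' as type from mesh_categories">
      <field column="id" name="pk_i" />
      <field column="id" name="id" />
      <field column="name" name="text_t" />
      <field column="node_key" name="string" />
      <field column="name_fc" name="facet_value" />
      <field column="type" name="type_t" />
    </entity>

  </document>
</dataConfig>
```

With this layout, /dataimport?command=full-import should run both entities.
DIH also accepts an entity request parameter, so something like
/dataimport?command=full-import&entity=mesh_categories&clean=false should
import just one table. The clean=false part matters because full-import
cleans the index by default. One caveat, assuming "id" is the uniqueKey in
the schema: rows from the two tables that share an id value would overwrite
each other.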
