I'm not going to be much help here since DIH is a mystery to me, I usually go
with a SolrJ program when DIH gets beyond simple cases. But have you
seen:
http://wiki.apache.org/solr/DataImportHandler#interactive
It's a tool that helps you see what's going on with your query.
Best
Erick
On Mon, Jan 9, 2012 at 8:39 PM, Mike O'Leary tmole...@uw.edu wrote:
I am trying to index the contents of a database for the first time, and I am
only getting the primary key of the table represented by the top level entity
in my data-config.xml file to be indexed. The database I am starting with has
three tables:
The table called docs has columns called doc_id, type and last_modified. The
primary key is doc_id.
The table called codes has columns called id, doc_id, origin, type, code and
last_modified. The primary key is id. doc_id is a foreign key to the doc_id
column in the docs table.
The table called texts has columns called id, doc_id, origin, type, text and
last_modified. The primary key is id. doc_id is a foreign key to the doc_id
column in the docs table.
My data-config.xml file looks like this:
dataConfig
dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost:3306/bioscope user=db_user
password=/
document name=bioscope
entity name=docs pk=doc_id query=SELECT doc_id, type FROM
bioscope.docs
deltaQuery=SELECT doc_id FROM bioscope.docs where last_modified
'${dataimporter.last_index_time}'
field column=doc_id name=DOC_ID/
field column=type name=DOC_TYPE/
entity name=codes pk=id query=SELECT id, origin, type, code FROM
bioscope.codes WHERE doc_id='${docs.doc_id}'
deltaQuery=SELECT doc_id FROM bioscope.codes WHERE
last_modified '${dataimporter.last_index_time}'
parentDeltaQuery=SELECT doc_id from bioscope.docs WHERE
doc_id='${codes.doc_id}'
field column=id name=CODE_ID/
field column=doc_id name=DOC_ID/
field column=origin name=CODE_ORIGIN/
field column=type name=CODE_TYPE/
field column=code name=CODE_VALUE/
/entity
entity name=notes pk=id query=SELECT id, origin, type, text FROM
bioscope.texts WHERE doc_id='${docs.doc_id}'
deltaQuery=SELECT doc_id FROM bioscope.texts WHERE
last_modified '${dataimporter.last_index_time}'
parentDeltaQuery=SELECT doc_id from bioscope.docs WHERE
doc_id='${texts.doc_id}'
field column=id name=NOTE_ID/
field column=doc_id name=DOC_ID/
field column=origin name=NOTE_ORIGIN/
field column=type name=NOTE_TYPE/
field column=text name=NOTE_TEXT/
/entity
/entity
/document
/dataConfig
I added these lines to the schema.xml file:
field name=DOC_ID type=string indexed=true omitNorms=true
stored=true/
field name=DOC_TYPE type=string indexed=true omitNorms=true
stored=true/
field name=CODE_ID type=string indexed=true omitNorms=true
stored=true/
field name=CODE_ORIGIN type=string indexed=true omitNorms=true
stored=true/
field name=CODE_TYPE type=string indexed=true omitNorms=true
stored=true/
field name=CODE_VALUE type=string indexed=true omitNorms=true
stored=true/
field name=NOTE_ID type=string indexed=true omitNorms=true
stored=true/
field name=NOTE_ORIGIN type=string indexed=true omitNorms=true
stored=true/
field name=NOTE_TYPE type=string indexed=true omitNorms=true
stored=true/
field name=NOTE_TEXT type=text_ws indexed=true omitNorms=true
stored=true/
...
uniqueKeyDOC_ID/uniqueKey
defaultSearchFieldNOTE_TEXT/defaultSearchField
When I run the full-import operation, only the DOC_ID values are written to
the index. When I run a program that dumps the index contents as an xml
string, the output looks like this:
?xml version=1.0 ?
documents
document
field name=DOC_ID value=97634811
/field
/document
document
field name=DOC_ID value=97634910
/field
/document
...
/documents
Since this is new to me, I am sure that I have simply left something out or
specified something the wrong way, but I haven't been able to spot what I
have been doing wrong when I have gone over the configuration files that I am
using. Can anyone help me figure out why the other database contents are not
being indexed?
Thanks,
Mike