Re: Getting started with indexing a database

2012-01-15 Thread Rakesh Varna
Hi Mike,
   Can you try removing ' field column=doc_id name=DOC_ID / from the
nested entities? Just keep it in the top level entity.

Regards,
Rakesh Varna

On Wed, Jan 11, 2012 at 7:26 AM, Gora Mohanty g...@mimirtech.com wrote:

 On Tue, Jan 10, 2012 at 7:09 AM, Mike O'Leary tmole...@uw.edu wrote:
 [...]
  My data-config.xml file looks like this:
 
  dataConfig
   dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
   url=jdbc:mysql://localhost:3306/bioscope user=db_user
 password=/
   document name=bioscope
 entity name=docs pk=doc_id query=SELECT doc_id, type FROM
 bioscope.docs
 deltaQuery=SELECT doc_id FROM bioscope.docs where
 last_modified  '${dataimporter.last_index_time}'
   field column=doc_id name=DOC_ID/
   field column=type name=DOC_TYPE/

 Your SELECT above does not include the field type

   entity name=codes pk=id query=SELECT id, origin, type, code
 FROM bioscope.codes WHERE doc_id='${docs.doc_id}'
  ^^ This should be: WHERE id=='${docs.doc_id}' as 'id' is
 what
you are selecting in this entity.

 Same issue for the second nested entity, i.e., replace doc_id= with id=

 Regards,
 Gora



Re: Getting started with indexing a database

2012-01-11 Thread Erick Erickson
I'm not going to be much help here since DIH is a mystery to me, I usually go
with a SolrJ program when DIH gets beyond simple cases. But have you
seen:
http://wiki.apache.org/solr/DataImportHandler#interactive

It's a tool that helps you see what's going on with your query.

Best
Erick

On Mon, Jan 9, 2012 at 8:39 PM, Mike O'Leary tmole...@uw.edu wrote:
 I am trying to index the contents of a database for the first time, and I am 
 only getting the primary key of the table represented by the top level entity 
 in my data-config.xml file to be indexed. The database I am starting with has 
 three tables:

 The table called docs has columns called doc_id, type and last_modified. The 
 primary key is doc_id.
 The table called codes has columns called id, doc_id, origin, type, code and 
 last_modified. The primary key is id. doc_id is a foreign key to the doc_id 
 column in the docs table.
 The table called texts has columns called id, doc_id, origin, type, text and 
 last_modified. The primary key is id. doc_id is a foreign key to the doc_id 
 column in the docs table.

 My data-config.xml file looks like this:

 dataConfig
  dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
              url=jdbc:mysql://localhost:3306/bioscope user=db_user 
 password=/
  document name=bioscope
    entity name=docs pk=doc_id query=SELECT doc_id, type FROM 
 bioscope.docs
            deltaQuery=SELECT doc_id FROM bioscope.docs where last_modified  
 '${dataimporter.last_index_time}'
      field column=doc_id name=DOC_ID/
      field column=type name=DOC_TYPE/
      entity name=codes pk=id query=SELECT id, origin, type, code FROM 
 bioscope.codes WHERE doc_id='${docs.doc_id}'
              deltaQuery=SELECT doc_id FROM bioscope.codes WHERE 
 last_modified  '${dataimporter.last_index_time}'
              parentDeltaQuery=SELECT doc_id from bioscope.docs WHERE 
 doc_id='${codes.doc_id}'
        field column=id name=CODE_ID/
        field column=doc_id name=DOC_ID/
        field column=origin name=CODE_ORIGIN/
        field column=type name=CODE_TYPE/
        field column=code name=CODE_VALUE/
      /entity
      entity name=notes pk=id query=SELECT id, origin, type, text FROM 
 bioscope.texts WHERE doc_id='${docs.doc_id}'
              deltaQuery=SELECT doc_id FROM bioscope.texts WHERE 
 last_modified  '${dataimporter.last_index_time}'
              parentDeltaQuery=SELECT doc_id from bioscope.docs WHERE 
 doc_id='${texts.doc_id}'
        field column=id name=NOTE_ID/
        field column=doc_id name=DOC_ID/
        field column=origin name=NOTE_ORIGIN/
        field column=type name=NOTE_TYPE/
        field column=text name=NOTE_TEXT/
      /entity
    /entity
  /document
 /dataConfig

 I added these lines to the schema.xml file:

 field name=DOC_ID type=string indexed=true omitNorms=true 
 stored=true/
 field name=DOC_TYPE type=string indexed=true omitNorms=true 
 stored=true/

 field name=CODE_ID type=string indexed=true omitNorms=true 
 stored=true/
 field name=CODE_ORIGIN type=string indexed=true omitNorms=true 
 stored=true/
 field name=CODE_TYPE type=string indexed=true omitNorms=true 
 stored=true/
 field name=CODE_VALUE type=string indexed=true omitNorms=true 
 stored=true/

 field name=NOTE_ID type=string indexed=true omitNorms=true 
 stored=true/
 field name=NOTE_ORIGIN type=string indexed=true omitNorms=true 
 stored=true/
 field name=NOTE_TYPE type=string indexed=true omitNorms=true 
 stored=true/
 field name=NOTE_TEXT type=text_ws indexed=true omitNorms=true 
 stored=true/

 ...

 uniqueKeyDOC_ID/uniqueKey
 defaultSearchFieldNOTE_TEXT/defaultSearchField

 When I run the full-import operation, only the DOC_ID values are written to 
 the index. When I run a program that dumps the index contents as an xml 
 string, the output looks like this:

 ?xml version=1.0 ?
 documents
  document
    field name=DOC_ID value=97634811
    /field
  /document
  document
    field name=DOC_ID value=97634910
    /field
  /document
 ...
 /documents

 Since this is new to me, I am sure that I have simply left something out or 
 specified something the wrong way, but I haven't been able to spot what I 
 have been doing wrong when I have gone over the configuration files that I am 
 using. Can anyone help me figure out why the other database contents are not 
 being indexed?
 Thanks,
 Mike



Re: Getting started with indexing a database

2012-01-11 Thread Gora Mohanty
On Tue, Jan 10, 2012 at 7:09 AM, Mike O'Leary tmole...@uw.edu wrote:
[...]
 My data-config.xml file looks like this:

 dataConfig
  dataSource type=JdbcDataSource driver=com.mysql.jdbc.Driver
              url=jdbc:mysql://localhost:3306/bioscope user=db_user 
 password=/
  document name=bioscope
    entity name=docs pk=doc_id query=SELECT doc_id, type FROM 
 bioscope.docs
            deltaQuery=SELECT doc_id FROM bioscope.docs where last_modified  
 '${dataimporter.last_index_time}'
      field column=doc_id name=DOC_ID/
      field column=type name=DOC_TYPE/

Your SELECT above does not include the field type

      entity name=codes pk=id query=SELECT id, origin, type, code FROM 
 bioscope.codes WHERE doc_id='${docs.doc_id}'
 ^^ This should be: WHERE id=='${docs.doc_id}' as 'id' is what
you are selecting in this entity.

Same issue for the second nested entity, i.e., replace doc_id= with id=

Regards,
Gora