Tika Integration problem with DIH and JDBC

Dan Davis Fri, 10 Oct 2014 11:18:44 -0700

I want to pull a URL out of an Oracle database, and then use
TikaEntityProcessor and BinURLDataSource to fetch and process that
URL. I'm hitting a problem that seems general to combining JDBC with
Tika. I get the following exception:


Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query: http://www.cdc.gov/healthypets/pets/wildlife.html Processing Document # 14
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
...

Steps to reproduce the problem:


   - Run the import with the XML config and verify that you get two documents
   and that they contain text (check the text field in the schema browser).
   - Run it with a JDBC SQLite3 dataSource and verify that you get an
   exception, then advise me what may be wrong in my configuration ...

I've tried this three ways:


   - My Oracle database - fails as above
   - An SQLite3 database to see if it is Oracle specific - fails with
   "Unable to execute query", but doesn't have the URL as part of the message.
   - An XML file listing two URLs - succeeds without error.

For the SQL attempts, setting onError="skip" lets the data from the
database be indexed, but the exception is still logged for each root entity.
I can tell that nothing from the text extraction is indexed by browsing the
"text" field in the schema browser and seeing how few terms there are.
The exceptions also give it away, but it is good to be careful :)

This is using:

   - Tomcat 7.0.55
   - Solr 4.10.1
   - and JDBC drivers
      - ojdbc7.jar
      - sqlite-jdbc-3.7.2.jar

Excerpt of solrconfig.xml:

  <!-- Data Import Handler for Health Topics -->
  <requestHandler name="/dih-healthtopics" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-healthtopics.xml</str>
    </lst>
  </requestHandler>

  <!-- Data Import Handler that imports a single URL via Tika -->
  <requestHandler name="/dih-smallxml" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-smallxml.xml</str>
    </lst>
  </requestHandler>

  <!-- Data Import Handler that reads URLs from SQLite and imports them via Tika -->
  <requestHandler name="/dih-smallsqlite" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-smallsqlite.xml</str>
    </lst>
  </requestHandler>


The data import handler configs and a copy-paste from the Solr log are attached.

Exception in entity : extract:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to execute query:  Processing Document # 1
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:71)
        at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:283)
        at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:240)
        at org.apache.solr.handler.dataimport.JdbcDataSource.getData(JdbcDataSource.java:44)
        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:188)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:112)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:502)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:232)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:416)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:480)
        at org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody(DataImportHandler.java:189)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
        at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
        at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2440)
        at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2429)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: [SQLITE_MISUSE]  Library used incorrectly (not an error)
        at org.sqlite.DB.newSQLException(DB.java:383)
        at org.sqlite.DB.newSQLException(DB.java:387)
        at org.sqlite.DB.execute(DB.java:339)
        at org.sqlite.Stmt.exec(Stmt.java:65)
        at org.sqlite.Stmt.execute(Stmt.java:114)
        at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.<init>(JdbcDataSource.java:276)
        ... 35 more

SQLite DIH config (dih-smallsqlite.xml):

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>

    <!-- Local SQLite test database -->
    <dataSource name="db" driver="org.sqlite.JDBC" url="jdbc:sqlite:/Q:/Documents/MedlinePlus/smalldb.sqlite3"/>

    <script><![CDATA[ 
        function mungeId(row) { 
            var id = row.get('topic_id');
            if (id === null) {
                row.remove('topic_id');
            } else {
                row.put('topic_id', "sqlite-"+id);
            }
            return row;
        }  
    ]]></script>
 
    <document>
       <entity name="topic" rootEntity="true" dataSource="db"
                query="SELECT topic_id, title, url FROM topicsites"
                transformer="script:mungeId">

            <field name="id" column="topic_id" />
            <field name="url" column="url" />
            <field name="title" column="title" />

            <!-- To use Apache Tika to parse whatever we need to parse, HTML or PDF -->
            <dataSource name="bin" driver="BinURLDataSource" />

            <!-- tikaconfig.xml is required, without it, the binary data source doesn't load -->
            <entity name="extract" dataSource="bin" rootEntity="false" processor="TikaEntityProcessor" url="${topic.URL}" query="null" onError="skip">
                <field column="Author" meta="true" name="author"/>
                <field column="text" name="text"/>
            </entity>

        </entity>
    </document>
</dataConfig>
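
A possible lead, not a confirmed fix: in the trace above, TikaEntityProcessor.nextRow ends up in JdbcDataSource.getData, which suggests that "bin" is being created as a JdbcDataSource rather than a BinURLDataSource. DIH picks the data source class from the type attribute and falls back to JdbcDataSource when it is absent, so the URL would be handed to the JDBC driver as a query, which matches "Unable to execute query: http://...". The sketch below assumes that is the cause. It declares "bin" with type= directly under dataConfig, as the working XML-file config does, and uses ${topic.url} on the assumption that the SQLite column key is lowercase (the SQLite error message contains no URL, so ${topic.URL} may be resolving to an empty string). The mungeId transformer and query="null" are left out for brevity.

```xml
<dataConfig>
    <dataSource name="db" driver="org.sqlite.JDBC"
                url="jdbc:sqlite:/Q:/Documents/MedlinePlus/smalldb.sqlite3"/>

    <!-- Assumption: type= (not driver=) is what selects BinURLDataSource -->
    <dataSource name="bin" type="BinURLDataSource"/>

    <document>
        <entity name="topic" rootEntity="true" dataSource="db"
                query="SELECT topic_id, title, url FROM topicsites">

            <field name="id" column="topic_id"/>
            <field name="url" column="url"/>
            <field name="title" column="title"/>

            <!-- Assumption: the variable name must match the column key's case -->
            <entity name="extract" dataSource="bin" rootEntity="false"
                    processor="TikaEntityProcessor" url="${topic.url}" onError="skip">
                <field column="Author" meta="true" name="author"/>
                <field column="text" name="text"/>
            </entity>

        </entity>
    </document>
</dataConfig>
```

The Oracle config declares "bin" the same way, so the same change would apply there if this turns out to be the cause.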

Oracle DIH config (dih-healthtopics.xml):

<dataConfig>

    <!-- Requires the java system property for Oracle -->
    <dataSource name="oltp01_prod" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@oltp01_prod" user="impromptu" password="nlm1public_user"/>

    <script><![CDATA[ 
        function mungeId(row) { 
            var id = row.get('MEDSITE_ID');
            if (id === null || true === id.isEmpty() || id === '') {
                row.remove('MEDSITE_ID');
            } else {
                row.put('MEDSITE_ID', "medsite-"+id);
            }
            return row;
        }  
    ]]></script>
 
    <document>
       <entity name="healthtopic" rootEntity="true" dataSource="oltp01_prod"
                query="SELECT medsite_id, title, url, text_description FROM MEDPLUS.PUBLIC_TOPIC_SITES_US_V WHERE url NOT LIKE '%.pdf%'"
                transformer="script:mungeId">

            <field name="description" column="TEXT_DESCRIPTION" />
            <field name="id" column="MEDSITE_ID" />
            <field name="url" column="URL" />

            <!-- To use Apache Tika to parse whatever we need to parse, HTML or PDF -->
            <dataSource name="bin" driver="BinURLDataSource" />

            <!-- tikaconfig.xml is required, without it, the binary data source doesn't load -->
            <entity name="extract" dataSource="bin" rootEntity="false" processor="TikaEntityProcessor" url="${healthtopic.URL}" query="null" onError="skip">
                <field column="Author" meta="true" name="author"/>
                <field column="text" name="text"/>
            </entity>

        </entity>
    </document>
</dataConfig>

XML file listing two URLs (simple.xml):

<?xml version="1.0" encoding="UTF-8"?>
<topics>
	<topic>
		<medsite_id>77070</medsite_id>
		<title>Swine Influenza/Variant Influenza Viruses</title>
		<url>http://www.cdc.gov/flu/swineflu/</url>
	</topic>
	<topic>
		<medsite_id>101871</medsite_id>
		<title>Osteoarthritis: Questions to Discuss with Your Doctor</title>
		<url>http://www.health.harvard.edu/fhg/doctor/osteoarth.shtml</url>
	</topic>
</topics>

XML-file DIH config (dih-smallxml.xml):

<dataConfig>
    <dataSource type="BinURLDataSource" name="data"/>
    <dataSource type="FileDataSource" name="f"/>
    <document>
        <entity name="rec" rootEntity="true" processor="XPathEntityProcessor" url="simple.xml" forEach="/topics/topic" dataSource="f">
            <field column="id" xpath="//medsite_id" name="id"/>
            <field column="url" xpath="//url" name="url"/>
            <field column="title" xpath="//title" name="title"/>
            <entity name="extract" processor="TikaEntityProcessor" url="${rec.url}" dataSource="data" rootEntity="false">
                <field column="text" name="text" />
                <field column="Author" name="author" meta="true" />
            </entity>
        </entity>
    </document>
</dataConfig>
