On 6/18/2010 2:42 PM, Chris Hostetter wrote:
: > I don't think DIH can do that, but who knows, let's see what others say.

: Looks like the ExtractingRequestHandler uses Tika as well.  I might just use
: this but I'm wondering if there will be a large performance difference between
: using it to batch content in over rolling my own Transformer?

I'm confused ... You're using DIH, and some of your fields are URLs to documents that you want to parse with Tika?

Why would you need a custom Transformer?

http://wiki.apache.org/solr/DataImportHandler#Tika_Integration
http://wiki.apache.org/solr/TikaEntityProcessor

-Hoss

OK, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using Solr 1.4.0 and getting the following error:

java.lang.ClassNotFoundException: Unable to load BinURLDataSource or org.apache.solr.handler.dataimport.BinURLDataSource

curl -s http://test.html | curl 'http://localhost:9080/solr/update/extract?extractOnly=true' --data-binary @- -H 'Content-type:text/html'

... works fine, so presumably my Tika processor is working.


My data-config.xml looks like this:

<dataConfig>
  <dataSource type="JdbcDataSource"
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@whatever:12345:whatever"
    user="me"
    name="ds-db"
    password="secret"/>

  <dataSource type="BinURLDataSource"
    name="ds-url"/>

  <document>
    <entity name="my_database"
     dataSource="ds-db"
     query="select * from my_database where rownum &lt;=2">
      <field column="CONTENT_ID"                name="content_id"/>
      <field column="CMS_TITLE"                 name="cms_title"/>
      <field column="FORM_TITLE"                name="form_title"/>
      <field column="FILE_SIZE"                 name="file_size"/>
      <field column="KEYWORDS"                  name="keywords"/>
      <field column="DESCRIPTION"               name="description"/>
      <field column="CONTENT_URL"               name="content_url"/>
    </entity>

    <entity name="my_database_url"
     dataSource="ds-url"
     query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
     <entity processor="TikaEntityProcessor"
      dataSource="ds-url"
      format="text"
      url="http://www.mysite.com/${my_database.content_url}">
      <field column="text"/>
     </entity>
    </entity>

  </document>
</dataConfig>

I added the entity name="my_database_url" section alongside an existing (working) database entity so that Tika can index the content pointed to by content_url.
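
For comparison, a variant that nests the Tika entity inside the database entity, so that it runs once per database row, might look like the untested sketch below. It assumes a Solr build that actually ships BinURLDataSource and TikaEntityProcessor, a schema field named text, and that the JDBC driver returns the column name in upper case; the entity name tika_content is made up for illustration.

```xml
<dataConfig>
  <!-- Same two data sources as in the config above -->
  <dataSource type="JdbcDataSource"
    name="ds-db"
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@whatever:12345:whatever"
    user="me"
    password="secret"/>
  <dataSource type="BinURLDataSource"
    name="ds-url"/>

  <document>
    <!-- Parent entity: one row per document, read over JDBC -->
    <entity name="my_database"
      dataSource="ds-db"
      query="select * from my_database where rownum &lt;=2">
      <field column="CONTENT_ID"  name="content_id"/>
      <field column="CONTENT_URL" name="content_url"/>

      <!-- Child entity (hypothetical name): evaluated once per parent row.
           The URL is built from the parent row's CONTENT_URL column;
           the upper-case column name is an assumption. -->
      <entity name="tika_content"
        processor="TikaEntityProcessor"
        dataSource="ds-url"
        format="text"
        url="http://www.mysite.com/${my_database.CONTENT_URL}">
        <!-- "text" is the extracted body; assumes a schema field named text -->
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

In this layout the SQL query stays on the JDBC data source and only the Tika entity reads from ds-url, so no entity sends SQL to a URL data source.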

Is there anything obviously wrong with what I've tried so far?


Thanks - Tod
