I can do it like this, but then the content isn't copied to text. It only ends 
up in text_test.

<entity name="tika" processor="TikaEntityProcessor" 
url="${rec.path}${rec.file}" dataSource="dataUrl" >
        <field column="text" name="text_test" />
        <copyField source="text_test" dest="text" />
</entity>
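
For reference, a minimal sketch of where a copyField would normally live. As far as I know, DIH does not act on a <copyField> placed inside data-config.xml, because copy fields are declared in schema.xml. The sketch assumes both text_test and text are defined there; the field type and the indexed/stored/multiValued settings are placeholders, not taken from the real schema.

```xml
<!-- schema.xml (sketch): type and attributes are assumed, adjust to the real schema -->
<field name="text_test" type="text_general" indexed="true" stored="true" />
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true" />

<!-- copy whatever the import writes into text_test over to text at index time -->
<copyField source="text_test" dest="text" />
```

With that in place, the tika entity should only need the <field column="text" name="text_test" /> mapping.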


On 22. Aug 2013, at 6:12 PM, Andreas Owen wrote:

> I put it on the tika entity as an attribute, but it doesn't change anything. My 
> bigger concern is why text_test isn't populated at all.
> 
> On 22. Aug 2013, at 5:27 PM, Alexandre Rafalovitch wrote:
> 
>> Can you try SOLR-4530 switch:
>> https://issues.apache.org/jira/browse/SOLR-4530
>> 
>> Specifically, setting htmlMapper="identity" on the entity definition. This
>> will tell Tika to send full HTML rather than a seriously stripped one.
>> 
>> Regards,
>> Alex.
>> 
>> Personal website: http://www.outerthoughts.com/
>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> - Time is the quality of nature that keeps events from happening all at
>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>> 
>> 
>> On Thu, Aug 22, 2013 at 11:02 AM, Andreas Owen <a...@conx.ch> wrote:
>> 
>>> I'm trying to index an HTML page and only use the div with the
>>> id="content". Unfortunately nothing is working within the tika entity; only
>>> the standard text (content) is populated.
>>> 
>>>       Do I have to use copyField for text_test to get the data?
>>>       Or is there a problem with the entity hierarchy?
>>>       Or is the xpath wrong, even though I've tried it without and just
>>> using text?
>>>       Or should I use the updateextractor?
>>> 
>>> data-config.xml:
>>> 
>>> <dataConfig>
>>>       <dataSource type="BinFileDataSource" name="data"/>
>>>       <dataSource type="BinURLDataSource" name="dataUrl"/>
>>>       <dataSource type="URLDataSource"
>>> baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
>>> <document>
>>>       <entity name="rec" processor="XPathEntityProcessor"
>>> url="docImportUrl.xml" forEach="/docs/doc" dataSource="main">
>>>               <field column="title" xpath="//title" />
>>>               <field column="id" xpath="//id" />
>>>               <field column="file" xpath="//file" />
>>>               <field column="path" xpath="//path" />
>>>               <field column="url" xpath="//url" />
>>>               <field column="Author" xpath="//author" />
>>> 
>>>               <entity name="tika" processor="TikaEntityProcessor"
>>> url="${rec.path}${rec.file}" dataSource="dataUrl" >
>>>                       <!-- <copyField source="text" dest="text_test" />
>>> -->
>>>                       <field column="text_test"
>>> xpath="//div[@id='content']" />
>>>               </entity>
>>>       </entity>
>>> </document>
>>> </dataConfig>
>>> 
>>> docImporterUrl.xml:
>>> 
>>> <?xml version="1.0" encoding="utf-8"?>
>>> <docs>
>>> <doc>
>>>               <id>5</id>
>>>               <author>tkb</author>
>>>               <title>Startseite</title>
>>>               <description>blabla ...</description>
>>>               <file>http://localhost/tkb/internet/index.cfm</file>
>>>               <url>http://localhost/tkb/internet/index.cfm/url</url>
>>>               <path2>http\specialConf</path2>
>>>       </doc>
>>>       <doc>
>>>               <id>6</id>
>>>               <author>tkb</author>
>>>               <title>Eigenheim</title>
>>>               <description>Machen Sie sich erste Gedanken über den
>>> Erwerb von Wohneigentum? Oder haben Sie bereits konkrete Pläne oder gar ein
>>> spruchreifes Projekt? Wir beraten Sie gerne in allen Fragen rund um den
>>> Erwerb oder Bau von Wohneigentum, damit Ihr Vorhaben auch in finanzieller
>>> Hinsicht gelingt.</description>
>>>               <file>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm</file>
>>>               <url>http://127.0.0.1/tkb/internet/private/beratung/eigenheim.htm/url</url>
>>>       </doc>
>>> </docs>
