Re: DIH load only selected documents with XPathEntityProcessor

Gora Mohanty Sat, 08 Jan 2011 05:39:02 -0800

On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:
> Hello list,
>
> is it possible to load only selected documents with XPathEntityProcessor?
> While loading docs I want to drop/skip/ignore documents with missing URL.
>
> Example:
> <documents>
>    <document>
>        <title>first title</title>
>        <id>identifier_01</id>
>        <link>http://www.foo.com/path/bar.html</link>
>    </document>
>    <document>
>        <title>second title</title>
>        <id>identifier_02</id>
>        <link></link>
>    </document>
> </documents>
>
> The first document should be loaded, the second document should be ignored
> because it has an empty link (should also work for missing link field).
[...]


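You can use a ScriptTransformer that sets one of the special DIH
flags, $skipRow or $skipDoc, on any row whose link is missing or
empty. $skipRow drops only the current row, while $skipDoc drops the
whole document. E.g., something like this for your data import
configuration file: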

<dataConfig>
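    <script><![CDATA[
      // Flag rows whose link is missing or empty so that DIH skips them.
      function skipRow(row) {
        var link = row.get( 'link' );
        if( link == null || link == '' ) {
          // '$skipRow' is a special DIH column; 'true' drops this row.
          row.put( '$skipRow', 'true' );
        }
        return row;
      }
    ]]></script>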
    <dataSource type="FileDataSource" />
    <document>
        <entity name="f" processor="FileListEntityProcessor"
                baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
                recursive="true" rootEntity="false" dataSource="null">
            <entity name="top" processor="XPathEntityProcessor"
                    forEach="/documents/document" url="${f.fileAbsolutePath}"
                    transformer="script:skipRow">
                <field column="link" xpath="/documents/document/link"/>
                <field column="title" xpath="/documents/document/title"/>
                <field column="id" xpath="/documents/document/id"/>
            </entity>
        </entity>
    </document>
</dataConfig>
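
If you want to check the script logic outside Solr first, here is a
minimal sketch in plain JavaScript. It assumes only that the row behaves
like a map with get() and put(). The makeRow helper and the
identifier_03 row are hypothetical stand-ins for illustration and are
not part of DIH:

```javascript
// Hypothetical stand-in for the map-like row that DIH passes to the
// script; only get() and put() are modelled here.
function makeRow(fields) {
  var data = Object.assign({}, fields);
  return {
    get: function (key) {
      // Return null for absent keys, as a Java Map would.
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    put: function (key, value) { data[key] = value; }
  };
}

// Same logic as the skipRow function in the config above.
function skipRow(row) {
  var link = row.get('link');
  if (link == null || link == '') {
    row.put('$skipRow', 'true');
  }
  return row;
}

var kept = skipRow(makeRow({ id: 'identifier_01', link: 'http://www.foo.com/path/bar.html' }));
var emptyLink = skipRow(makeRow({ id: 'identifier_02', link: '' }));
var missingLink = skipRow(makeRow({ id: 'identifier_03' })); // hypothetical row without a link

console.log(kept.get('$skipRow'));        // null (row is kept)
console.log(emptyLink.get('$skipRow'));   // true (row is flagged)
console.log(missingLink.get('$skipRow')); // true (row is flagged)
```

Both the empty and the missing link end up flagged, which matches the
two cases in your example.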

Regards,
Gora
