I've been troubleshooting an issue loading documents through DIH's URLDataSource and XPathEntityProcessor, where we want to use the $hasMore feature to request a follow-up URL.

I've been tinkering with this using a very simple example, two XML files -

solr.xml:
  <add>
    <doc>
     <field name="id">SOLR1000</field>
    </doc>
    <doc>
     <field name="id">**HASMORE**</field>
    </doc>
  </add>

solr2.xml:
  <add>
    <doc>
      <field name="id">SOLR2k</field>
    </doc>
  </add>

My DIH config is:

<?xml version="1.0"?>
<dataConfig>
<dataSource type="URLDataSource" baseUrl="file:///Users/erikhatcher/dev/solr/example/exampledocs/"
             readTimeout="180000" connectionTimeout="60000"/>

 <script>
   <![CDATA[
     function checkForMore(row, context) {
       print("### checkForMore: " + row);
       if (row.get('id') == '**HASMORE**') {
         print("#### hasMore ####");
         row.put('$hasMore', 'true');
         row.put('$nextUrl', 'file:///Users/erikhatcher/dev/solr/example/exampledocs/solr2.xml');
         row.put('$skipRow', 'true');
       } else {
         row.put('$hasMore', 'false');
       }
       return row;
     }
   ]]>
 </script>

 <document name="docs">
   <entity name="doc"
           processor="XPathEntityProcessor"
           url="solr.xml"
           forEach="/add/doc"
           stream="true"
           transformer="DateFormatTransformer,TemplateTransformer,script:checkForMore"
           onError="abort">
     <field column="id" xpath="/add/doc/fie...@name='id']"/>
   </entity>
 </document>
</dataConfig>

Without the else clause in checkForMore that sets $hasMore to false, an infinite loop occurs and solr2.xml is requested repeatedly. This is because once $hasMore is set on a row, XPathEntityProcessor#readUsefulVars stores it in entity scope and it never gets unset. Is this intentional? Shouldn't $hasMore be reset once the next URL has been requested?
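
To make the behaviour concrete, here is a rough standalone model of the paging loop as I understand it. This is my own simplification for illustration, not the actual Solr source: simulateImport, pages, resetAfterFetch and maxFetches are all made-up names, and maxFetches exists only so the runaway case terminates.

```javascript
// Simplified model of the $hasMore paging loop; NOT the actual Solr code.
// All names here are hypothetical and exist only for this illustration.
function simulateImport(pages, startUrl, transform, resetAfterFetch, maxFetches) {
  const fetched = [];   // URLs requested, in order
  let hasMore = false;  // entity-scope flag: survives from row to row
  let nextUrl = null;
  let url = startUrl;

  while (url !== null && fetched.length < maxFetches) {
    fetched.push(url);
    for (const id of pages[url]) {
      const row = transform({ id: id });
      // The flag is only overwritten when a row actually carries the variable.
      if ('$hasMore' in row) hasMore = row['$hasMore'] === 'true';
      if ('$nextUrl' in row) nextUrl = row['$nextUrl'];
    }
    if (hasMore) {
      url = nextUrl;
      if (resetAfterFetch) hasMore = false;  // the reset I am asking about
    } else {
      url = null;
    }
  }
  return fetched;
}

const pages = {
  'solr.xml': ['SOLR1000', '**HASMORE**'],
  'solr2.xml': ['SOLR2k']
};

// Transformer without the else clause: $hasMore is never set back to false.
const withoutElse = (row) => {
  if (row.id === '**HASMORE**') {
    row['$hasMore'] = 'true';
    row['$nextUrl'] = 'solr2.xml';
  }
  return row;
};

// Transformer with the else clause, as in the config above.
const withElse = (row) => {
  if (row.id === '**HASMORE**') {
    row['$hasMore'] = 'true';
    row['$nextUrl'] = 'solr2.xml';
  } else {
    row['$hasMore'] = 'false';
  }
  return row;
};

console.log(simulateImport(pages, 'solr.xml', withoutElse, false, 5));
console.log(simulateImport(pages, 'solr.xml', withElse, false, 5));
console.log(simulateImport(pages, 'solr.xml', withoutElse, true, 5));
```

In this model the first call keeps requesting solr2.xml until the maxFetches cap stops it, while the other two stop after solr2.xml has been read once. The third call shows that a reset after the follow-up request would make the else clause unnecessary.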

On a related note, it would seem useful to allow $hasMore/$skipRow/$nextUrl to be controlled from the XML data itself rather than solely from a transformer. But $-prefixed fields are ignored by DIH, right?
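
Short of that, one possible workaround might be a generic transformer that promotes ordinary columns to the $ variables. The following is an untested sketch: it assumes the XML carried plain fields that I am calling hasMore and nextUrl (made-up names) mapped as columns in the entity, and that the row carrying them is a marker row rather than a real document. FakeRow is only a stand-in for the Map that DIH passes in, so the function can be tried outside Solr.

```javascript
// Hypothetical transformer: copies the plain columns 'hasMore' and 'nextUrl'
// (assumed names, not anything DIH defines) into DIH's $ variables.
function promoteControlFields(row, context) {
  var more = String(row.get('hasMore')) === 'true';
  // Always set the flag so a stale 'true' from an earlier row is overwritten.
  row.put('$hasMore', more ? 'true' : 'false');
  if (more) {
    row.put('$nextUrl', String(row.get('nextUrl')));
    row.put('$skipRow', 'true');  // assumes the marker row is not a real document
  }
  return row;
}

// Stand-in for the row object, exposing only get/put as used above.
function FakeRow(values) { this.values = values; }
FakeRow.prototype.get = function (key) {
  return Object.prototype.hasOwnProperty.call(this.values, key) ? this.values[key] : null;
};
FakeRow.prototype.put = function (key, value) { this.values[key] = value; };

var marker = promoteControlFields(new FakeRow({ hasMore: 'true', nextUrl: 'solr2.xml' }), null);
var plain = promoteControlFields(new FakeRow({ id: 'SOLR2k' }), null);
console.log(marker.values, plain.values);
```

It would be wired in the same way as checkForMore above, via script:promoteControlFields in the transformer attribute.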

I'm still looking for that holy grail of a good example leveraging $hasMore/$nextUrl! :)

Thanks,
        Erik
