For anyone interested: I think my issue was caused by declaring the url field as multivalued. I wasn't able to create a test case that reproduces the problem, so this guess is based on gradual fiddling with my configs.
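
To illustrate why a multivalued field could produce the bracketed value, here is a minimal Java sketch. It assumes (not confirmed from the DataImportHandler source) that a multivalued field is held as a java.util.List and that variable resolution simply stringifies it. The class BracketDemo and its methods resolve and urlError are hypothetical names made up for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

public class BracketDemo {
    // Hypothetical stand-in for variable resolution: stringify whatever the field holds.
    // This is an assumption about the mechanism, not the actual DataImportHandler code.
    static String resolve(Object fieldValue) {
        return String.valueOf(fieldValue);
    }

    // Returns null if the string parses as a URL, otherwise the exception message.
    static String urlError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Single-valued field: a plain String resolves to itself.
        String single = resolve("http://publicdomain.ca/content/Sample.pdf");
        // Multivalued field: a List's toString() wraps its elements in square brackets.
        String multi = resolve(List.of("http://publicdomain.ca/content/Sample.pdf"));

        System.out.println(single + " -> " + urlError(single));
        System.out.println(multi + " -> " + urlError(multi));
    }
}
```

If that assumption holds, the single-valued form parses cleanly while the list form fails URL parsing, which would be consistent with the MalformedURLException in the log quoted below.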

My concern is no longer pressing, but I do have a couple of questions for the devs to think about:

  1. How should a multivalued field be treated in a child entity?  The
     use case would be the one I presented where I intend url to be
     multivalued.  I'm thinking a for-each type construct should apply.
  2. How should a multivalued field be formatted or custom formatted if
     you intend to use the content of a field in another field,
     possibly nested?
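
To make the two questions concrete, here is a rough Java sketch of the two treatments I have in mind. Nothing here is existing DataImportHandler behaviour: the class MultiValuedSketch, the methods forEachValue and joined, and the separator parameter are all hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiValuedSketch {
    // Question 1 (hypothetical for-each semantics): expand the template once per value,
    // so a child entity would run once for every value of the multivalued field.
    static List<String> forEachValue(String template, String placeholder, List<String> values) {
        List<String> expanded = new ArrayList<>();
        for (String value : values) {
            expanded.add(template.replace(placeholder, value));
        }
        return expanded;
    }

    // Question 2 (hypothetical custom formatting): substitute all values at once,
    // joined with a caller-chosen separator instead of the default [a, b] rendering.
    static String joined(String template, String placeholder, List<String> values, String separator) {
        return template.replace(placeholder, String.join(separator, values));
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://publicdomain.ca/content/Sample.pdf",
                "http://publicdomain.ca/content/Other.pdf"); // second URL is made up for the example

        System.out.println(forEachValue("${x.url}", "${x.url}", urls));
        System.out.println(joined("urls=${x.url}", "${x.url}", urls, ";"));
    }
}
```

With for-each semantics each value would reach the child entity as a bare string, so no brackets would appear. The joined form is only meant for cases where the combined content is wanted in a single field.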



Tricia Williams wrote:
Hi All,

The DataImportHandler is the most fantastic thing that has recently come to Solr. Thank you.

I'm noticing that when I use variables in nested entities, the resolved value is wrapped in square brackets. For example, ${x.url} used in the "tika" entity below resolves as [http://publicdomain.ca/content/Sample.pdf] (note the square brackets), so I get this error in my log:

SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: [http://publicdomain.ca/content/Sample.pdf]
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:78)
        at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:98)

I encountered this previously when I tried to concatenate fields from different entities into one field, and I worked around it by gathering the fields with an XSL stylesheet. Not being able to resolve the URL for Tika is more problematic.

*Is this a bug? If not, how do I remove the brackets so that I can use my variable as it was meant?*

<dataConfig>
   <dataSource type="BinURLDataSource" name="bin"/>
   <dataSource type="FileDataSource" name="fileReader"/>
   <document>
      <entity name="f" processor="FileListEntityProcessor" baseDir="/home/pgwillia/content" dataSource="null" fileName=".*xml" rootEntity="false">
         <entity name="x" processor="XPathEntityProcessor" dataSource="fileReader" transformer="TemplateTransformer,RegexTransformer" forEach="/RDF/Description" url="${f.fileAbsolutePath}">
            ...
            <field column="url" xpath="/RDF/Description/identifier" regex="http://privatedomain:8080/content/" replaceWith="http://publicdomain.ca/content/"/>
            <entity name="tika" processor="TikaEntityProcessor" url="${x.url}" dataSource="bin" format="text">
               <field column="fulltext" name="text"/>
            </entity>
         </entity>
      </entity>
   </document>
</dataConfig>


Many thanks,
Tricia

