OK - Thanks for the doc.

Is it possible to just provide an empty value to preImportDeleteQuery to disable the delete prior to import?

Will the data still be deleted for each entity if I run a delta-import instead of a full-import?

Is there any capability in the handler to unzip an XML file fetched from a URL prior to reading it, or can I perhaps hook in a custom pre-processing handler?

Regards,

Joe


On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:
https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

The Admin UI has the interface, so you can play there once you define it.

You do have to use curl; there is no built-in scheduler.
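If you do want it on a schedule, an external cron job calling curl is the usual workaround. A rough sketch (the core and entity names are placeholders for whatever you defined; clean=false keeps the rest of the index from being wiped):

    0 */2 * * * curl -s 'http://localhost:8983/solr/nvd/dataimport?command=full-import&entity=nvd-rss&clean=false'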

Regards,
    Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 13:29, Carl Roberts <carl.roberts.zap...@gmail.com> wrote:
Hi Alex,

If I am understanding this correctly, I can define multiple entities like
this?

<document>
     <entity/>
     <entity/>
     <entity/>
     ...
</document>

How would I trigger loading certain entities during start?

How would I trigger loading other entities during update?

Is there a way to set an auto-update for certain entities so that I don't
have to invoke an update via curl?

Where / how do I specify the preImportDeleteQuery to avoid deleting
everything upon each update?

Is there an example or doc that shows how to do all this?

Regards,

Joe


On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
You can define both multiple entities in the same file and nested
entities if your list comes from an external source (e.g. a text file
of URLs).
You can also trigger DIH with the name of a specific entity to load
just that one.
You can even pass a DIH configuration file when you trigger the
processing start, so you can have completely different files for the
initial load and for updates. Though you can just do the same with
entities.
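For example, something along these lines (core, entity, and file names are placeholders; the second call assumes the handler's config default from solrconfig.xml can be overridden per request, which is how handler defaults normally behave):

    curl 'http://localhost:8983/solr/nvd/dataimport?command=full-import&entity=nvd-rss'
    curl 'http://localhost:8983/solr/nvd/dataimport?command=full-import&config=initial-load-config.xml'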

The only thing to be aware of is that before an entity definition is
processed, a delete command is run. By default it's "delete all", so
executing one entity will delete everything and then populate only
that one entity's results. You can avoid that by defining
preImportDeleteQuery and having a clear identifier on the content
generated by each entity (e.g. a source field, either extracted or
manually added with TemplateTransformer).
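Roughly, building on your config, that could look like this (the source field and its value are just an example marker, assuming such a field exists in your schema):

    <entity name="nvd-rss"
            pk="link"
            url="https://nvd.nist.gov/download/nvd-rss.xml"
            processor="XPathEntityProcessor"
            forEach="/RDF/item"
            transformer="DateFormatTransformer,TemplateTransformer"
            preImportDeleteQuery="source:nvd-rss">
        <!-- constant marker stamped on every document from this entity -->
        <field column="source" template="nvd-rss" />
        ...
    </entity>

Then a full-import of that entity only deletes documents matching source:nvd-rss instead of wiping the whole index.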

Regards,
     Alex.

----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 11:15, Carl Roberts <carl.roberts.zap...@gmail.com> wrote:
Hi,

I have the RSS DIH example working with my own RSS feed - here is the
configuration for it.

<dataConfig>
      <dataSource type="URLDataSource" />
      <document>
          <entity name="nvd-rss"
                  pk="link"
                  url="https://nvd.nist.gov/download/nvd-rss.xml"
                  processor="XPathEntityProcessor"
                  forEach="/RDF/item"
                  transformer="DateFormatTransformer">

              <field column="id" xpath="/RDF/item/title"
commonField="true" />
              <field column="link" xpath="/RDF/item/link"
commonField="true"
/>
              <field column="summary" xpath="/RDF/item/description"
commonField="true" />
              <field column="date" xpath="/RDF/item/date"
commonField="true"
/>

          </entity>
      </document>
</dataConfig>

However, my problem is that I also have to load multiple XML feeds into
the same core.  Here is one example (there are about 10 of them):

http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip


Is there any built-in functionality that would allow me to do this?
Basically, the use-case is to load and index all the XML ZIP files first,
and then check the RSS feed every two hours and update the indexes with
any new ones.

Regards,

Joe


