OK, I did a little testing: with full-import and clean=false, the record count grows each time I import the same XML file. I have also checked my schema, and my uniqueKey appears to be defined correctly.

Here are my fields in schema.xml:

   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <field name="id" type="text_general" indexed="true" stored="true"/>
   <field name="cve" type="text_general" indexed="true" stored="true"/>
   <field name="cwe" type="text_general" indexed="true" stored="true"/>
   <field name="vulnerable-configuration" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="vulnerable-software" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="product" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="published" type="text_general" indexed="true" stored="true"/>
   <field name="modified" type="text_general" indexed="true" stored="true"/>
   <field name="summary" type="text_general" indexed="true" stored="true"/>
   <field name="cvss-score" type="text_general" indexed="true" stored="true"/>
   <field name="cvss-access-vector" type="text_general" indexed="true" stored="true"/>
   <field name="cvss-access-complexity" type="text_general" indexed="true" stored="true"/>
   <field name="cvss-authentication" type="text_general" indexed="true" stored="true"/>
   <field name="cvss-confidentiality-impact" type="text_general" indexed="true" stored="true"/>
   <field name="cvss-integrity-impact" type="text_general" indexed="true" stored="true"/>
   <field name="cvss-availability-impact" type="text_general" indexed="true" stored="true"/>
   <field name="reference" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="security-protection" type="text_general" indexed="true" stored="true"/>

And here is uniqueKey in schema.xml:

<uniqueKey>id</uniqueKey>


Here is my rss-data-config.xml:

<dataConfig>
    <dataSource type="ZIPURLDataSource" connectionTimeout="15000" readTimeout="30000"/>
    <document>
        <entity name="cve-2002"
                pk="id"
                url="https://nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2002.xml.zip"
                processor="XPathEntityProcessor"
                forEach="/nvd/entry"
                transformer="RegexTransformer">
            <field column="id" xpath="/nvd/entry/@id" commonField="false"/>
            <field column="cve" xpath="/nvd/entry/cve-id" commonField="false"/>
            <field column="cwe" xpath="/nvd/entry/cwe/@id" commonField="false"/>
            <field column="vulnerable-configuration" xpath="/nvd/entry/vulnerable-configuration/logical-test/fact-ref/@name" commonField="false"/>
            <field column="vulnerable-software" xpath="/nvd/entry/vulnerable-software-list/product" commonField="false"/>
            <field column="product" sourceColName="vulnerable-software" commonField="false" regex="cpe:/.:" replaceWith=""/>
            <field column="product" commonField="false" regex=":" replaceWith=" "/>
            <field column="published" xpath="/nvd/entry/published-datetime" commonField="false"/>
            <field column="modified" xpath="/nvd/entry/last-modified-datetime" commonField="false"/>
            <field column="summary" xpath="/nvd/entry/summary" commonField="false"/>
            <field column="cvss-score" xpath="/nvd/entry/cvss/base_metrics/score" commonField="false"/>
            <field column="cvss-access-vector" xpath="/nvd/entry/cvss/base_metrics/access-vector" commonField="false"/>
            <field column="cvss-access-complexity" xpath="/nvd/entry/cvss/base_metrics/access-complexity" commonField="false"/>
            <field column="cvss-authentication" xpath="/nvd/entry/cvss/base_metrics/authentication" commonField="false"/>
            <field column="cvss-confidentiality-impact" xpath="/nvd/entry/cvss/base_metrics/confidentiality-impact" commonField="false"/>
            <field column="cvss-integrity-impact" xpath="/nvd/entry/cvss/base_metrics/integrity-impact" commonField="false"/>
            <field column="cvss-availability-impact" xpath="/nvd/entry/cvss/base_metrics/availability-impact" commonField="false"/>
            <field column="reference" xpath="/nvd/entry/references/reference/@href" commonField="false"/>
            <field column="security-protection" xpath="/nvd/entry/security-protection" commonField="false"/>
        </entity>
    </document>
</dataConfig>

Here is the import command the first time:

curl "http://127.0.0.1:8983/solr/nvd-rss/dataimport?command=full-import&entity=cve-2002&clean=true"

Here is the command that outputs the count of records:

curl "http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=*:*&start=0&rows=0&fl=*"

And here is the output:

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "fl":"*",
      "indent":"true",
      "start":"0",
      "q":"*:*",
      "wt":"json",
      "rows":"0"}},
  "response":{"numFound":6717,"start":0,"docs":[]
  }}

Now here is the next full-import command with clean=false:

curl "http://127.0.0.1:8983/solr/nvd-rss/dataimport?command=full-import&entity=cve-2002&clean=false"

And here is the new count:

curl "http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=*:*&start=0&rows=0&fl=*"

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "fl":"*",
      "indent":"true",
      "start":"0",
      "q":"*:*",
      "wt":"json",
      "rows":"0"}},
  "response":{"numFound":13434,"start":0,"docs":[]
  }}

The count has exactly doubled (6717 to 13434), so the second run clearly imported the same records again instead of replacing them.


What is even more puzzling is that if I search for an id value that is unique in the imported XML, the query matches every record in the index:

curl "http://localhost:8983/solr/nvd-rss/select?wt=json&indent=true&q=id:CVE-1999-0001&start=0&rows=0&fl=*"
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "fl":"*",
      "indent":"true",
      "start":"0",
      "q":"id:CVE-1999-0001",
      "wt":"json",
      "rows":"0"}},
  "response":{"numFound":13434,"start":0,"docs":[]
  }}
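
One possible explanation for both symptoms, offered as an assumption rather than a confirmed diagnosis: text_general is a tokenized type, so an id such as CVE-1999-0001 may be indexed as several terms (for example cve, 1999 and 0001) instead of one exact value. A query on id:CVE-1999-0001 could then match every document that contains the term cve, and Solr would have no single exact key value to overwrite on re-import. A minimal sketch of the key field declared with an untokenized type, assuming a "string" fieldType backed by solr.StrField is available in this schema.xml:

```xml
<!-- Assumption: a "string" fieldType backed by solr.StrField is defined in this
     schema, as it is in the example schemas shipped with Solr. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Key field kept as one untokenized value so a re-import can replace by id. -->
<field name="id" type="string" indexed="true" stored="true" required="true"/>

<uniqueKey>id</uniqueKey>
```

After a change like this, the existing documents would have to be re-indexed (for example with one full-import using clean=true) before the counts could be compared again.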

On 1/27/15, 2:03 PM, Carl Roberts wrote:
Hi Alex, thanks for clarifying this for me. I'll take a look at my setup of the uniqueKey. Perhaps I did not set it right.


On 1/27/15, 12:09 PM, Alexandre Rafalovitch wrote:
What do you mean by "update"? If you mean partial update, DIH does not
do it AFAIK. If you mean replace, it should.

If you are getting duplicate records, maybe your uniqueKey is not set correctly?

clean=false looks to me like the right approach for incremental updates.

Regards,
    Alex.


On 27 January 2015 at 11:43, Carl Roberts <carl.roberts.zap...@gmail.com> wrote:
Also, if I try full-import and clean=false with the same XML file, I end up with more records each time the import runs. How can I make Solr add only the records that are new by id, and update the ones whose id matches a record already in the index?



On 1/27/15, 11:32 AM, Carl Roberts wrote:
Hi,

What is the recommended way to import and update index records?

I've read the documentation and I've experimented with full-import and
delta-import and I am not seeing the desired results.

Basically, I have 15 RSS feeds that I am importing through
rss-data-config.xml.

The first RSS feed should be a full import. The feeds that follow may contain ids that are already in the index, in which case the existing record should be updated from the record in the new feed. They may also contain new records, in which case I want those added to the index.

When I try full-import for each entity, the index is cleared and I just
end up with the records for the last import.

When I try full-import for each entity with the clean=false parameter, all the records from each entity are added to the index and I end up with duplicate records.

When I try delta-import for the entities that follow the first one, I don't
get any new index records.

How should I do this?

Regards,

Joe


