Re: Delete IDs with JDBC connector

Karl Wright Wed, 26 Apr 2017 08:21:17 -0700

Oh, never mind.  I see the issue, which is that without the version query,
documents that don't appear in the result list *at all* are never removed
from the map.  I'll create a ticket.


Karl


On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddy...@gmail.com> wrote:

> Hi Julien,
>
> The delete logic in the connector is as follows:
>
> >>>>>>
>     // Now, go through the original id's, and see which ones are still in the map.  These
>     // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
>     for (String documentIdentifier : documentIdentifiers)
>     {
>       if (fetchDocuments.contains(documentIdentifier))
>       {
>         String documentVersion = map.get(documentIdentifier);
>         if (documentVersion != null)
>         {
>           // This means we did not see it (or data for it) in the result set.  Delete it!
>           activities.noDocument(documentIdentifier,documentVersion);
>           activities.recordActivity(null, ACTIVITY_FETCH,
>             null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
>         }
>       }
>     }
> <<<<<<
>
> For a JDBC job without a version query, fetchDocuments contains all the
> documents.  But map has the entries removed that were actually fetched.
> Documents that were *not* fetched for whatever reason therefore will not be
> cleaned up.  Here's the code that determines that:
>
> >>>>>>
>             String version = map.get(id);
>             if (version == null)
>               // Does not need refetching
>               continue;
>
>             // This document was marked as "not scan only", so we expect to find it.
>             if (Logging.connectors.isDebugEnabled())
>               Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
>             o = row.getValue(JDBCConstants.urlReturnColumnName);
>             if (o == null)
>             {
>               Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>               errorCode = activities.NULL_URL;
>               errorDesc = "Excluded because document had a null URL";
>               activities.noDocument(id,version);
>               continue;
>             }
>
>             // This is not right - url can apparently be a BinaryInput
>             String url = JDBCConnection.readAsString(o);
>             boolean validURL;
>             try
>             {
>               // Check to be sure url is valid
>               new java.net.URI(url);
>               validURL = true;
>             }
>             catch (java.net.URISyntaxException e)
>             {
>               validURL = false;
>             }
>
>             if (!validURL)
>             {
>               Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>               errorCode = activities.BAD_URL;
>               errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>               activities.noDocument(id,version);
>               continue;
>             }
>
>             // Process the document itself
>             Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
>             // Null data is allowed; we just ignore these
>             if (contents == null)
>             {
>               Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>               errorCode = "NULLDATA";
>               errorDesc = "Excluded because document had null data";
>               activities.noDocument(id,version);
>               continue;
>             }
>
>             // We will ingest something, so remove this id from the map in order that we know what we still
>             // need to delete when all done.
>             map.remove(id);
> <<<<<<
>
> As you see, activities.noDocument() is called for all cases, except the
> one where the document version is null (which cannot happen since all
> document versions for this case will be the empty string).  So I am at a
> loss to understand why the delete is not happening.
>
> The only explanation I can think of is that you clicked one of the buttons
> on the output connection's view page that tell MCF to "forget" all the
> history for that connection.
>
> Thanks,
> Karl
>
>
>
> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> I was manually starting the job for testing purposes, but even if I schedule
>> it with job invocation "Complete" and "Scan every document once", the
>> missing IDs from the database are not deleted in my Solr index (no trace of
>> any 'document deletion' event in the history).
>> I should mention that I only use the 'Seeding query' and 'Data query' and
>> I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
>> query.
>>
>> Julien
>>
>> On 26.04.2017 16:05, Karl Wright wrote:
>>
>> Hi Julien,
>>
>> How are you starting the job?  If you use "Start minimal", deletion would
>> not take place.  If your job is a continuous one, this is also the case.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
>>
>>> Hi MCF community,
>>>
>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database
>>> and index the data into a Solr server, and it works very well. However,
>>> when I perform a delta re-crawl, the new IDs are correctly retrieved from
>>> the database, but those that have been deleted are not "detected" by the
>>> connector and are thus still present in my Solr index.
>>> I would like to know whether this should normally work and I have perhaps
>>> missed something in the configuration of the job, or whether it is not
>>> implemented.
>>> The only way I have found to solve this issue is to reset the seeding of
>>> the job, but that is very time- and resource-consuming.
>>>
>>> Best regards,
>>> Julien Massiera
>>
>>
>>
>
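
The map bookkeeping discussed in this thread can be sketched in isolation as follows. This is only a simplified illustration under stated assumptions, not the connector's code: the class name DeleteBookkeepingSketch and the helper idsToDelete are hypothetical, the result set is reduced to a plain list of ids, and the calls to activities.noDocument() are replaced by collecting ids into a list. It models the intended behaviour described in the quoted message (every version is the empty string when there is no version query) and does not try to reproduce the defect mentioned in the top reply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DeleteBookkeepingSketch {

    // Hypothetical helper: returns the ids the cleanup loop would delete.
    static List<String> idsToDelete(List<String> documentIdentifiers,
                                    Set<String> fetchDocuments,
                                    List<String> idsInResultSet) {
        // Assumption: no version query, so every document to be fetched
        // is recorded with the empty string as its version.
        Map<String, String> map = new HashMap<>();
        for (String id : fetchDocuments) {
            map.put(id, "");
        }

        // Each id that comes back from the data query (and is ingested)
        // is removed from the map.
        for (String id : idsInResultSet) {
            if (map.get(id) == null) {
                continue; // does not need refetching
            }
            map.remove(id);
        }

        // Ids still in the map were not seen in the result set and are
        // presumed to be gone from the database.
        List<String> deleted = new ArrayList<>();
        for (String id : documentIdentifiers) {
            if (fetchDocuments.contains(id) && map.get(id) != null) {
                deleted.add(id);
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1", "2", "3");
        Set<String> fetch = new HashSet<>(ids);
        // Row "2" has disappeared from the database.
        System.out.println(idsToDelete(ids, fetch, Arrays.asList("1", "3"))); // prints [2]
    }
}
```

In this sketch an id counts as deleted only if it is both in fetchDocuments and still present in the map after the result rows have been processed, which mirrors the two conditions in the quoted cleanup loop.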
