Oh, never mind. I see the issue, which is that without the version query, documents that don't appear in the result list *at all* are never removed from the map. I'll create a ticket.
Karl

On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddy...@gmail.com> wrote:

> Hi Julien,
>
> The delete logic in the connector is as follows:
>
> >>>>>>
>     // Now, go through the original id's, and see which ones are still in the map.  These
>     // did not appear in the result and are presumed to be gone from the database, and thus must be deleted.
>     for (String documentIdentifier : documentIdentifiers)
>     {
>       if (fetchDocuments.contains(documentIdentifier))
>       {
>         String documentVersion = map.get(documentIdentifier);
>         if (documentVersion != null)
>         {
>           // This means we did not see it (or data for it) in the result set.  Delete it!
>           activities.noDocument(documentIdentifier,documentVersion);
>           activities.recordActivity(null, ACTIVITY_FETCH,
>             null, documentIdentifier, "NOTFETCHED", "Document was not seen by processing query", null);
>         }
>       }
>     }
> <<<<<<
>
> For a JDBC job without a version query, fetchDocuments contains all the
> documents.  But map has the entries removed that were actually fetched.
> Documents that were *not* fetched for whatever reason therefore will not be
> cleaned up.  Here's the code that determines that:
>
> >>>>>>
>     String version = map.get(id);
>     if (version == null)
>       // Does not need refetching
>       continue;
>
>     // This document was marked as "not scan only", so we expect to find it.
>     if (Logging.connectors.isDebugEnabled())
>       Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
>     o = row.getValue(JDBCConstants.urlReturnColumnName);
>     if (o == null)
>     {
>       Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>       errorCode = activities.NULL_URL;
>       errorDesc = "Excluded because document had a null URL";
>       activities.noDocument(id,version);
>       continue;
>     }
>
>     // This is not right - url can apparently be a BinaryInput
>     String url = JDBCConnection.readAsString(o);
>     boolean validURL;
>     try
>     {
>       // Check to be sure url is valid
>       new java.net.URI(url);
>       validURL = true;
>     }
>     catch (java.net.URISyntaxException e)
>     {
>       validURL = false;
>     }
>
>     if (!validURL)
>     {
>       Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>       errorCode = activities.BAD_URL;
>       errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>       activities.noDocument(id,version);
>       continue;
>     }
>
>     // Process the document itself
>     Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
>     // Null data is allowed; we just ignore these
>     if (contents == null)
>     {
>       Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>       errorCode = "NULLDATA";
>       errorDesc = "Excluded because document had null data";
>       activities.noDocument(id,version);
>       continue;
>     }
>
>     // We will ingest something, so remove this id from the map in order that we know what we still
>     // need to delete when all done.
>     map.remove(id);
> <<<<<<
>
> As you see, activities.noDocument() is called for all cases, except the
> one where the document version is null (which cannot happen since all
> document versions for this case will be the empty string).  So I am at a
> loss to understand why the delete is not happening.
>
> The only way I can think of is if you clicked one of the buttons on
> the output connection's view page that told MCF to "forget" all the history
> for that connection.
>
> Thanks,
> Karl
>
> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
>
>> Hi Karl,
>>
>> I was manually starting the job for testing purposes, but even if I schedule
>> it with job invocation "Complete" and "Scan every document once", the
>> missing IDs from the database are not deleted in my Solr index (no trace of
>> any 'document deletion' event in the history).
>> I should mention that I only use the 'Seeding query' and 'Data query' and
>> I am not using the $(STARTTIME) and $(ENDTIME) variables in my seeding
>> query.
>>
>> Julien
>>
>> On 26.04.2017 16:05, Karl Wright wrote:
>>
>> Hi Julien,
>>
>> How are you starting the job?  If you use "Start minimal", deletion would
>> not take place.  If your job is a continuous one, this is also the case.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
>>
>>> Hi the MCF community,
>>>
>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle Database
>>> and index the data into a Solr server, and it works very well. However,
>>> when I perform a delta re-crawl, the new IDs are correctly retrieved from
>>> the Database but those that have been deleted are not "detected" by the
>>> connector and thus are still present in my Solr index.
>>> I would like to know whether this should normally work and I have maybe
>>> missed something in the configuration of the job, or whether this is not
>>> implemented?
>>> The only way I found to solve this issue is to reset the seeding of the
>>> job, but it is very time and resource consuming.
>>>
>>> Best regards,
>>> Julien Massiera
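
The bookkeeping discussed in this thread (remove each id from the map once its row is ingested, then treat whatever is still in the map as gone from the database) can be illustrated in isolation. The class below is a hypothetical, stand-alone sketch and not ManifoldCF code: `DeleteSweepSketch`, `findUnseen`, and `resultIds` are names invented here for illustration. It leaves out the null-URL, invalid-URL, and null-data branches of the real connector, and it assumes, as Karl notes, that every document version is the empty string when no version query is configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in ManifoldCF.
final class DeleteSweepSketch {

  /**
   * Returns the ids that were scheduled for fetching but never appeared in
   * the result rows. In the quoted connector code these are the ids that
   * would be passed to activities.noDocument() in the final loop.
   */
  static List<String> findUnseen(List<String> documentIdentifiers,
                                 Set<String> fetchDocuments,
                                 Map<String, String> versions,
                                 List<String> resultIds) {
    // Work on a copy so the caller's map is left untouched.
    Map<String, String> map = new HashMap<>(versions);

    // Pass 1: each id that came back from the data query is removed from the map.
    for (String id : resultIds) {
      if (map.get(id) == null) {
        continue; // not tracked, nothing to do
      }
      map.remove(id);
    }

    // Pass 2: ids still present in the map were not seen in the result set.
    // An empty-string version is non-null, so it still counts as "present".
    List<String> unseen = new ArrayList<>();
    for (String id : documentIdentifiers) {
      if (fetchDocuments.contains(id) && map.get(id) != null) {
        unseen.add(id);
      }
    }
    return unseen;
  }

  public static void main(String[] args) {
    List<String> ids = Arrays.asList("a", "b", "c");
    Map<String, String> versions = new HashMap<>();
    for (String id : ids) {
      versions.put(id, ""); // assumed: empty version when there is no version query
    }
    List<String> unseen =
        findUnseen(ids, new HashSet<>(ids), versions, Arrays.asList("a", "c"));
    System.out.println(unseen); // prints [b]
  }
}
```

With ids a, b, and c tracked and only a and c returned by the data query, the sweep reports b as the one to delete. The sketch shows only the intended pattern; it does not reproduce the defect Karl describes, whose details belong to the ticket he mentions.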