I committed a fix to trunk, and also uploaded a patch to the ticket. Please let me know if it works for you.
Thanks,
Karl

On Wed, Apr 26, 2017 at 11:24 AM, <julien.massi...@francelabs.com> wrote:

> Oh OK, so I finally don't have to investigate :)
>
> Thanks Karl!
>
> Julien
>
> On 26.04.2017 17:20, Karl Wright wrote:
>
> Oh, never mind. I see the issue, which is that without the version query,
> documents that don't appear in the result list *at all* are never removed
> from the map. I'll create a ticket.
>
> Karl
>
> On Wed, Apr 26, 2017 at 11:10 AM, Karl Wright <daddy...@gmail.com> wrote:
>
>> Hi Julien,
>>
>> The delete logic in the connector is as follows:
>>
>> >>>>>>
>> // Now, go through the original id's, and see which ones are still in
>> // the map. These did not appear in the result and are presumed to be
>> // gone from the database, and thus must be deleted.
>> for (String documentIdentifier : documentIdentifiers)
>> {
>>   if (fetchDocuments.contains(documentIdentifier))
>>   {
>>     String documentVersion = map.get(documentIdentifier);
>>     if (documentVersion != null)
>>     {
>>       // This means we did not see it (or data for it) in the result set. Delete it!
>>       activities.noDocument(documentIdentifier,documentVersion);
>>       activities.recordActivity(null, ACTIVITY_FETCH,
>>         null, documentIdentifier, "NOTFETCHED",
>>         "Document was not seen by processing query", null);
>>     }
>>   }
>> }
>> <<<<<<
>>
>> For a JDBC job without a version query, fetchDocuments contains all the
>> documents, but map has the entries removed that were actually fetched.
>> Documents that were *not* fetched, for whatever reason, therefore will
>> not be cleaned up. Here's the code that determines that:
>>
>> >>>>>>
>> String version = map.get(id);
>> if (version == null)
>>   // Does not need refetching
>>   continue;
>>
>> // This document was marked as "not scan only", so we expect to find it.
>> if (Logging.connectors.isDebugEnabled())
>>   Logging.connectors.debug("JDBC: Document data result found for '"+id+"'");
>> o = row.getValue(JDBCConstants.urlReturnColumnName);
>> if (o == null)
>> {
>>   Logging.connectors.debug("JDBC: Document '"+id+"' has a null url - skipping");
>>   errorCode = activities.NULL_URL;
>>   errorDesc = "Excluded because document had a null URL";
>>   activities.noDocument(id,version);
>>   continue;
>> }
>>
>> // This is not right - url can apparently be a BinaryInput
>> String url = JDBCConnection.readAsString(o);
>> boolean validURL;
>> try
>> {
>>   // Check to be sure url is valid
>>   new java.net.URI(url);
>>   validURL = true;
>> }
>> catch (java.net.URISyntaxException e)
>> {
>>   validURL = false;
>> }
>>
>> if (!validURL)
>> {
>>   Logging.connectors.debug("JDBC: Document '"+id+"' has an illegal url: '"+url+"' - skipping");
>>   errorCode = activities.BAD_URL;
>>   errorDesc = "Excluded because document had illegal URL ('"+url+"')";
>>   activities.noDocument(id,version);
>>   continue;
>> }
>>
>> // Process the document itself
>> Object contents = row.getValue(JDBCConstants.dataReturnColumnName);
>> // Null data is allowed; we just ignore these
>> if (contents == null)
>> {
>>   Logging.connectors.debug("JDBC: Document '"+id+"' seems to have null data - skipping");
>>   errorCode = "NULLDATA";
>>   errorDesc = "Excluded because document had null data";
>>   activities.noDocument(id,version);
>>   continue;
>> }
>>
>> // We will ingest something, so remove this id from the map in order
>> // that we know what we still need to delete when all done.
>> map.remove(id);
>> <<<<<<
>>
>> As you can see, activities.noDocument() is called in all cases except the
>> one where the document version is null (which cannot happen, since all
>> document versions for this case will be the empty string). So I am at a
>> loss to understand why the delete is not happening.
>>
>> The only other explanation I can think of is that you clicked one of the
>> buttons on the output connection's view page that told MCF to "forget"
>> all the history for that connection.
>>
>> Thanks,
>> Karl
>>
>> On Wed, Apr 26, 2017 at 10:42 AM, <julien.massi...@francelabs.com> wrote:
>>
>>> Hi Karl,
>>>
>>> I was manually starting the job for test purposes, but even if I schedule
>>> it with job invocation "Complete" and "Scan every document once", the
>>> missing IDs from the database are not deleted from my Solr index (there
>>> is no trace of any 'document deletion' event in the history).
>>> I should mention that I only use the 'Seeding query' and 'Data query',
>>> and I am not using the $(STARTTIME) and $(ENDTIME) variables in my
>>> seeding query.
>>>
>>> Julien
>>>
>>> On 26.04.2017 16:05, Karl Wright wrote:
>>>
>>> Hi Julien,
>>>
>>> How are you starting the job? If you use "Start minimal", deletion
>>> would not take place. If your job is a continuous one, this is also
>>> the case.
>>>
>>> Thanks,
>>> Karl
>>>
>>> On Wed, Apr 26, 2017 at 9:52 AM, <julien.massi...@francelabs.com> wrote:
>>>
>>>> Hi MCF community,
>>>>
>>>> I am using MCF 2.6 with the JDBC connector to crawl an Oracle database
>>>> and index the data into a Solr server, and it works very well. However,
>>>> when I perform a delta re-crawl, the new IDs are correctly retrieved
>>>> from the database, but those that have been deleted are not "detected"
>>>> by the connector and thus are still present in my Solr index.
>>>> I would like to know whether this should normally work and I have maybe
>>>> missed something in the job configuration, or whether this is simply
>>>> not implemented.
>>>> The only way I have found to solve this issue is to reset the seeding
>>>> of the job, but that is very time- and resource-consuming.
>>>>
>>>> Best regards,
>>>> Julien Massiera
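
[For reference, the map-based delete-detection pattern discussed in this thread can be sketched in isolation. This is a minimal standalone sketch, not the actual MCF connector code; the `map`/`fetchDocuments` names follow the quoted snippet, and `findDeletions` is a hypothetical helper introduced here for illustration.]

```java
import java.util.*;

public class DeleteDetectionSketch {
  // Hypothetical helper mirroring the connector's delete pass: any seeded
  // identifier that still has a non-null entry in the map after processing
  // was never seen in the result set and is flagged for deletion.
  static List<String> findDeletions(List<String> documentIdentifiers,
                                    Set<String> fetchDocuments,
                                    Map<String, String> map) {
    List<String> deleted = new ArrayList<>();
    for (String documentIdentifier : documentIdentifiers) {
      if (fetchDocuments.contains(documentIdentifier)
          && map.get(documentIdentifier) != null)
        deleted.add(documentIdentifier);
    }
    return deleted;
  }

  public static void main(String[] args) {
    // Seeded documents; without a version query, all are fetch candidates.
    List<String> seeded = Arrays.asList("doc1", "doc2", "doc3");
    Set<String> fetchDocuments = new HashSet<>(seeded);

    // Every seeded document starts in the map with an empty-string version.
    Map<String, String> map = new HashMap<>();
    for (String id : seeded)
      map.put(id, "");

    // Suppose the data query returned rows only for doc1 and doc2;
    // each processed row is removed from the map.
    map.remove("doc1");
    map.remove("doc2");

    // doc3 was never seen in the result set, so it is flagged for deletion.
    System.out.println(findDeletions(seeded, fetchDocuments, map)); // [doc3]
  }
}
```

The bug discussed above was that, without a version query, documents absent from the result set were never removed from the map by one path through the code, so this pass never flagged them.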