indexer-solr fails to de-duplicate URL-encoded URLs. Nutch writes URLs to Solr in URL-encoded form, but SolrIndexWriter.java explicitly decodes the key when deleting. The decoded key no longer matches the encoded URL stored in Solr, so the document is never deleted.
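To illustrate the mismatch, here is a minimal standalone sketch, not Nutch code. It applies the same URLDecoder.decode(key, "UTF8") call that SolrIndexWriter.java makes (quoted further down) to a made-up encoded URL and compares the result with the original string. The class name DecodeMismatchDemo, the helper decodeKey, and the example URL are all hypothetical, and the sketch assumes the delete matches on the exact stored string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Nutch.
class DecodeMismatchDemo {

    // Mirrors the decode step quoted from SolrIndexWriter.java.
    // The checked exception is wrapped only to keep the demo short.
    static String decodeKey(String key) {
        try {
            return URLDecoder.decode(key, "UTF8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF8 not supported", e);
        }
    }

    public static void main(String[] args) {
        // Assumed example: the URL as it would be stored in Solr (encoded).
        String storedId = "http://example.com/a%20b";

        // The key after the workaround has decoded it for the delete.
        String deleteKey = decodeKey(storedId);

        System.out.println("stored id : " + storedId);
        System.out.println("delete key: " + deleteKey);
        // An exact-match delete on deleteKey cannot find storedId.
        System.out.println("match     : " + storedId.equals(deleteKey)); // prints "match     : false"
    }
}
```

For URLs that contain no percent-escapes or '+' characters, the decode is a no-op and the delete key still matches, which would explain why only URL-encoded URLs are affected.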
In SolrIndexWriter.java, there is a comment:

    // WORK AROUND FOR NOT REMOVING URL ENCODED URLS!!!

followed by this code:

    try {
      key = URLDecoder.decode(key, "UTF8");
    } catch (UnsupportedEncodingException e) {
      LOG.error("Error decoding: " + key);
      throw new IOException("UnsupportedEncodingException for " + key);
    } catch (IllegalArgumentException e) {
      LOG.warn("Could not decode: " + key + ", it probably wasn't encoded in the first place..");
    }

Commenting out the above resolves the issue, but I don't understand why this workaround was added in the first place. Please respond if you know the answer.

Thank you,
Michael Portnoy