indexer-solr fails to de-duplicate URL-encoded URLs. Nutch writes URLs to
Solr in URL-encoded form, but SolrIndexWriter.java explicitly decodes the
key before deleting. The decoded key no longer matches the URL stored in
Solr, so the documents are never deleted.
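
To illustrate the mismatch, here is a minimal standalone sketch. It is not
Nutch code: the class name, the decodeKey helper, and the sample URL are all
made up for illustration. It only mimics the decode step and compares the
result with the encoded form that would be stored as the document id.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeMismatchDemo {

    // Hypothetical helper that mimics the decode step applied to the key
    // before a delete. Falls back to the original key if decoding fails.
    static String decodeKey(String key) {
        try {
            return URLDecoder.decode(key, "UTF-8");
        } catch (UnsupportedEncodingException | IllegalArgumentException e) {
            return key;
        }
    }

    public static void main(String[] args) {
        // Assumed example: the id as it would be stored, still URL-encoded.
        String indexedId = "http://example.com/a%20b";

        // The id that a delete would use after decoding.
        String deleteId = decodeKey(indexedId);

        System.out.println("indexed id: " + indexedId);
        System.out.println("delete id:  " + deleteId);
        System.out.println("ids match:  " + indexedId.equals(deleteId));
    }
}
```

For any key containing an escape sequence, the two strings differ, so a
delete by exact id would not find the stored document. Keys without escape
sequences or '+' characters pass through unchanged, which may be why the
problem only shows up for some URLs.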

In SolrIndexWriter.java, there is a comment:

// WORK AROUND FOR NOT REMOVING URL ENCODED URLS!!!

and code:

    try {
      key = URLDecoder.decode(key, "UTF8");
    } catch (UnsupportedEncodingException e) {
      LOG.error("Error decoding: " + key);
      throw new IOException("UnsupportedEncodingException for " + key);
    } catch (IllegalArgumentException e) {
      LOG.warn("Could not decode: " + key
          + ", it probably wasn't encoded in the first place..");
    }

Commenting out the above resolves the issue, but I don't understand why
this workaround was added in the first place.

Please respond if you know the answer.

Thank you,
Michael Portnoy
