unsubscribe

On Tue, Jan 9, 2024 at 1:20 PM Steve Cohen <[email protected]> wrote:

> Hello,
>
> I am updating a Nutch crawl that reads files in directories whose names
> contain spaces. The URLs now show %20 instead of spaces. This doesn't seem
> to be how it behaved in the past.
>
> In Nutch 1.10 I get these results:
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: 2022/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date Unknown/ anchor: Shipment Date Unknown/
>
> In Nutch 1.19, I get this:
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ anchor: 2022/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ anchor: Shipment Date Unknown/
>
> We are uploading to Solr, and the links aren't right with the %20s in the
> URL. How do I remove the %20s?
>
> Thanks,
> Steve Cohen
>
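
On the "how do I remove the %20s" question: the %20 sequences are ordinary
percent-encoding of spaces, so one possible workaround is to decode the URL
strings in a separate step before or after they reach Solr. The following is
only a minimal sketch under that assumption. It is plain Python, not a Nutch
setting or plugin, and the function name decode_file_url is hypothetical.

```python
from urllib.parse import unquote


def decode_file_url(url: str) -> str:
    """Return the URL with percent-escapes such as %20 decoded.

    Hypothetical helper for post-processing crawled URLs; it is not
    part of Nutch or Solr.
    """
    return unquote(url)


if __name__ == "__main__":
    encoded = "file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/"
    print(decode_file_url(encoded))
    # prints: file:/nycor/10-15-2018 and on - Scanned/2018/
```

Note that decoding every escape also turns sequences such as %23 or %3F back
into "#" or "?", which changes how the URL parses, so this may only be
appropriate for paths like the ones quoted above.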
