unsubscribe

On Tue, Jan 9, 2024 at 1:20 PM Steve Cohen <[email protected]> wrote:
> Hello,
>
> I am updating a nutch crawl that read files in directories that have
> spaces. The urls show %20 instead of spaces. This doesn't seem to be what
> the behavior was in the past.
>
> In nutch 1.10 I get these results
>
> Nutch 1.10
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: 2022/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date Unknown/ anchor: Shipment Date Unknown/
>
> in Nutch 1.19, I get this
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ anchor: 2022/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ anchor: Shipment Date Unknown/
>
> We are uploading to solr and the links aren't right with the %20s in the
> url. How do I remove the %20s?
>
> Thanks,
> Steve Cohen
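
For reference, the %20 sequences in the 1.19 output are ordinary percent-encoding of the space character, so they can be turned back into spaces by URL-decoding the string. The following is only a minimal sketch of that decoding step in Python, not a Nutch or Solr setting: the function name decode_file_url is hypothetical, and it assumes the URLs can be rewritten in a separate step before or after they are sent to Solr.

```python
from urllib.parse import unquote


def decode_file_url(url: str) -> str:
    """Return the URL with percent-escapes (for example %20) decoded.

    Hypothetical helper for illustration; it is not part of Nutch.
    """
    # unquote() decodes every %XX escape, not only %20, so a literal
    # "%25" in a file name would also become "%".
    return unquote(url)


# Example URL taken from the Nutch 1.19 output quoted above.
encoded = "file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/"
print(decode_file_url(encoded))
```

Note that the decoded form contains raw spaces, so anything that later treats the value as a strict URL may need it encoded again.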

