[ https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-2370:
----------------------------------------
    Affects Version/s: 1.14

> Saving mapping of dumped file to URL
> ------------------------------------
>
>                 Key: NUTCH-2370
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2370
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dumpers
>    Affects Versions: 1.14
>            Reporter: Madhav Sharan
>            Priority: Minor
>             Fix For: 1.14
>
>
> - nutch dump [0] is a great tool for simply dumping all the crawled files
> from Nutch segments.
> - After a dump we lose the information about the URL from which each file
> was crawled. The URL is used to name the dumped file, but only in hashed
> form, so it cannot be read back from the name.
> - With the `reverseUrlDirs` option one can work out the URL from the file
> path, but even that is more complicated than consulting a simple mapping
> file.
> - With `flatdir` there is no way to recover the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled
> segment that maps each dumped file path to its URL.
> [0]
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
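
For readers following the issue, the per-segment mapping it proposes could look roughly like the sketch below. This is not the code from the PR: the class name `DumpMappingSketch`, the method names, and the example path and URL are all hypothetical, and the JSON is built by hand only to keep the example free of dependencies. It shows one way to collect "dumped file path -> source URL" pairs and serialize them as a flat JSON object.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the FileDumper change from the PR.
final class DumpMappingSketch {

  // Escapes quotes, backslashes and control characters for a JSON string.
  public static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"':  sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }

  // Serializes the path-to-URL map as a compact JSON object,
  // keeping the iteration order of the map.
  public static String toJson(Map<String, String> pathToUrl) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, String> e : pathToUrl.entrySet()) {
      if (!first) {
        sb.append(',');
      }
      sb.append('"').append(escape(e.getKey())).append("\":\"")
        .append(escape(e.getValue())).append('"');
      first = false;
    }
    return sb.append('}').toString();
  }

  public static void main(String[] args) {
    // One map per segment; the entry below is made-up example data.
    Map<String, String> mapping = new LinkedHashMap<>();
    mapping.put("dump/0a1b2c_index.html", "http://example.com/index.html");
    System.out.println(toJson(mapping));
  }
}
```

In a real dumper the map would presumably be filled as each record is written and the JSON saved next to the dumped files once the segment is done; a JSON library would be the more robust choice than hand-built strings.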