Hello,

I recently tried to install and run Nutch in a Cygwin environment,
following the Nutch tutorial
(<https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial>).
However, when running the first crawl command (bin/nutch inject
crawl/crawldb urls), I got the following error:

...
2024-11-19 20:21:21,246 INFO o.a.n.c.Injector [main] Injector: crawlDb: crawl/crawldb
2024-11-19 20:21:21,246 INFO o.a.n.c.Injector [main] Injector: urlDir: urls
2024-11-19 20:21:21,246 INFO o.a.n.c.Injector [main] Injector: Converting injected urls to crawl db entries.
2024-11-19 20:21:21,948 ERROR o.a.n.c.Injector [main] Injector: java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
        at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:788)
        at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:297)
        at org.apache.hadoop.util.Shell.getSetPermissionCommand(Shell.java:313)
...
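
In case it helps with diagnosis, here is a rough sketch of how the
HADOOP_HOME setting named in the error could be checked from the Cygwin
shell. The function name check_hadoop_home is hypothetical and not part
of Nutch or Hadoop. It only looks at the environment variable; the
hadoop.home.dir Java system property is not visible from the shell. The
winutils.exe check reflects where Hadoop on Windows looks for that
binary (the bin directory under HADOOP_HOME).

```shell
# Hypothetical helper, not part of Nutch or Hadoop: reports whether the
# HADOOP_HOME setting named in the error is usable from this shell.
check_hadoop_home() {
    # The error fires when neither HADOOP_HOME nor hadoop.home.dir is set;
    # only the environment variable can be inspected here.
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is unset"
        return 1
    fi
    # Hadoop on Windows expects winutils.exe under $HADOOP_HOME/bin.
    if [ ! -f "${HADOOP_HOME}/bin/winutils.exe" ]; then
        echo "winutils.exe not found in ${HADOOP_HOME}/bin"
        return 2
    fi
    echo "HADOOP_HOME looks usable: ${HADOOP_HOME}"
    return 0
}
```

Running check_hadoop_home in the same shell that launches bin/nutch
should show which of the two conditions, if either, applies.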

I suspect this is a common issue, but I couldn’t find any information
addressing it for recent versions of Nutch. Is running Nutch under
Cygwin a tested use case, or could this be a regression?

Additionally, are there verified steps for setting up and running Nutch on
Cygwin? If not, would you recommend an alternative approach for Windows,
such as WSL2, containers, or another solution?

Thank you for your assistance!

Best regards,
John
