Hi,
We are currently facing a problem when using NUTCH Rest API. We try to run Nutch API through Postman and It works perfectly fine if we don't define the segment pathway. This is the command we run in Postman. Inject { "type":"INJECT", "confId":"default", "crawlId":"crawl01", "args": {"url_dir":"/opt/apache-nutch-1.18/runtime/local/urls/seed.txt", "crawldb": "/tmp/crawl/crawldb" } } Generate { "type":"GENERATE", "confId":"default", "crawlId":"crawl01", "args": { "crawldb": "/tmp/crawl/crawldb", "segment_dir": "/tmp/crawl/segments" } } Fetch { "type":"FETCH", "confId":"default", "crawlId":"crawl01", "args": {"segment": "/tmp/crawl/segments"} } We try to define the pathway to store the crawled data in a specific directory. However, when come to fetch part, it cannot retrieve data from a specific folder (folder name that is generated by current date and time) under the segments folder. We have tried /tmp/crawl/segments/* and it can successfully retrieve the data, but it will also generate a new folder called *. Therefore, may we know if there is any way that could define the folder name in segments folder or is it got other way to change the output directory? Attached is our log for your reference. Kindly advise. Thanks in advance. Best Regards, Shi Wei
2021-12-24 17:27:01,852 INFO crawl.Injector - Injector: starting at 2021-12-24 17:27:01
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: crawlDb: /tmp/crawl/crawldb
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: urlDir: /opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,853 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2021-12-24 17:27:01,865 INFO crawl.Injector - Injecting seed URL file file:/opt/apache-nutch-1.18/runtime/local/urls/seed.txt
2021-12-24 17:27:01,866 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:01,871 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-24 17:27:01,971 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2021-12-24 17:27:01,971 INFO mapreduce.Job - Running job: job_local463605357_0260
2021-12-24 17:27:02,002 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2021-12-24 17:27:02,014 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: overwrite: false
2021-12-24 17:27:02,034 INFO crawl.Injector - Injector: update: false
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260 running in uber mode : false
2021-12-24 17:27:02,972 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:02,972 INFO mapreduce.Job - Job job_local463605357_0260 completed successfully
2021-12-24 17:27:02,973 INFO mapreduce.Job - Counters: 31
	File System Counters
		FILE: Number of bytes read=503885294
		FILE: Number of bytes written=747488148
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=3
		Map output records=3
		Map output bytes=260
		Map output materialized bytes=272
		Input split bytes=282
		Combine input records=0
		Combine output records=0
		Reduce input groups=3
		Reduce shuffle bytes=272
		Reduce input records=3
		Reduce output records=3
		Spilled Records=6
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=1995440128
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	injector
		urls_injected=3
	File Input Format Counters
		Bytes Read=0
	File Output Format Counters
		Bytes Written=622
2021-12-24 17:27:02,974 INFO crawl.Injector - Injector: Total urls rejected by filters: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected after normalization and filtering: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: Total new urls injected: 3
2021-12-24 17:27:02,975 INFO crawl.Injector - Injector: finished at 2021-12-24 17:27:02, elapsed: 00:00:01
2021-12-24 17:27:04,912 INFO crawl.Generator - Generator: starting at 2021-12-24 17:27:04
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: filtering: true
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: normalizing: true
2021-12-24 17:27:04,913 INFO crawl.Generator - Generator: running in local mode, generating exactly one partition.
2021-12-24 17:27:04,914 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:04,918 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-24 17:27:05,025 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2021-12-24 17:27:05,025 INFO mapreduce.Job - Running job: job_local1719362067_0261
2021-12-24 17:27:05,062 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2021-12-24 17:27:05,062 INFO crawl.AbstractFetchSchedule - defaultInterval=0
2021-12-24 17:27:05,062 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2021-12-24 17:27:05,074 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:05,094 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
2021-12-24 17:27:06,026 INFO mapreduce.Job - Job job_local1719362067_0261 running in uber mode : false
2021-12-24 17:27:06,026 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:06,026 INFO mapreduce.Job - Job job_local1719362067_0261 completed successfully
2021-12-24 17:27:06,027 INFO mapreduce.Job - Counters: 30
	File System Counters
		FILE: Number of bytes read=505778868
		FILE: Number of bytes written=750607198
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=3
		Map output records=3
		Map output bytes=347
		Map output materialized bytes=359
		Input split bytes=114
		Combine input records=0
		Combine output records=0
		Reduce input groups=1
		Reduce shuffle bytes=359
		Reduce input records=3
		Reduce output records=0
		Spilled Records=6
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=6
		Total committed heap usage (bytes)=1900019712
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=382
	File Output Format Counters
		Bytes Written=8
2021-12-24 17:27:06,027 INFO crawl.Generator - Generator: number of items rejected during selection:
2021-12-24 17:27:06,028 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
2021-12-24 17:27:07,029 INFO crawl.Generator - Generator: segment: /tmp/crawl/segments/20211224172707
2021-12-24 17:27:07,030 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:07,037 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-24 17:27:07,153 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2021-12-24 17:27:07,154 INFO mapreduce.Job - Running job: job_local209587332_0262
2021-12-24 17:27:07,194 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:08,154 INFO mapreduce.Job - Job job_local209587332_0262 running in uber mode : false
2021-12-24 17:27:08,154 INFO mapreduce.Job - map 100% reduce 100%
2021-12-24 17:27:08,154 INFO mapreduce.Job - Job job_local209587332_0262 completed successfully
2021-12-24 17:27:08,156 INFO mapreduce.Job - Counters: 30
	File System Counters
		FILE: Number of bytes read=507673248
		FILE: Number of bytes written=753716105
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=3
		Map output records=3
		Map output bytes=508
		Map output materialized bytes=520
		Input split bytes=160
		Combine input records=0
		Combine output records=0
		Reduce input groups=3
		Reduce shuffle bytes=520
		Reduce input records=3
		Reduce output records=3
		Spilled Records=6
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=1900019712
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=491
	File Output Format Counters
		Bytes Written=445
2021-12-24 17:27:08,156 INFO crawl.Generator - Generator: finished at 2021-12-24 17:27:08, elapsed: 00:00:03
2021-12-24 17:27:09,043 INFO fetcher.Fetcher - Fetcher: starting at 2021-12-24 17:27:09
2021-12-24 17:27:09,044 INFO fetcher.Fetcher - Fetcher: segment: /tmp/crawl/segments
2021-12-24 17:27:09,045 WARN impl.MetricsSystemImpl - JobTracker metrics system already initialized!
2021-12-24 17:27:09,051 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2021-12-24 17:27:09,066 ERROR fetcher.Fetcher - Fetcher: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/tmp/crawl/segments/crawl_generate
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:332)
	at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
	at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
	at org.apache.nutch.fetcher.Fetcher$InputFormat.getSplits(Fetcher.java:115)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:310)
	at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:327)
	at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1570)
	at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1567)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:1567)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1588)
	at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:498)
	at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:613)
	at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
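The log above shows the generator writing to /tmp/crawl/segments/20211224172707 while the fetcher is handed the parent directory /tmp/crawl/segments, which has no crawl_generate inside it. A minimal sketch of one possible workaround, assuming the REST server listens on http://localhost:8081 (the startserver default) and runs on the same machine as the segments directory: after the GENERATE job finishes, resolve the newest timestamp-named folder under the segments directory on disk and pass its full path as the FETCH segment argument. The run_job helper and the hard-coded host/port here are illustrative assumptions, not part of the original requests.

```python
import json
import os
import urllib.request

# Assumption: default port of `bin/nutch startserver`; adjust if yours differs.
NUTCH_API = "http://localhost:8081"


def run_job(payload):
    """POST a job definition to the Nutch REST job endpoint and return the parsed reply."""
    req = urllib.request.Request(
        NUTCH_API + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def latest_segment(segments_dir):
    """Return the newest timestamp-named segment, e.g. .../20211224172707.

    Nutch names segments yyyyMMddHHmmss, so the lexicographically greatest
    all-digit directory name is also the most recently generated one.
    """
    names = [n for n in os.listdir(segments_dir) if n.isdigit()]
    if not names:
        raise FileNotFoundError("no segments under " + segments_dir)
    return os.path.join(segments_dir, max(names))


# Usage sketch (after waiting for the GENERATE job to complete):
# run_job({"type": "GENERATE", "confId": "default", "crawlId": "crawl01",
#          "args": {"crawldb": "/tmp/crawl/crawldb",
#                   "segment_dir": "/tmp/crawl/segments"}})
# seg = latest_segment("/tmp/crawl/segments")
# run_job({"type": "FETCH", "confId": "default", "crawlId": "crawl01",
#          "args": {"segment": seg}})
```

Passing the resolved timestamped path avoids both the InvalidInputException and the stray * directory that the shell-style wildcard creates.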