I created https://issues.apache.org/jira/browse/NUTCH-2931 to track all of 
this work.
If you are interested in working on any of this, it would be great to 
collaborate.
There is much more we can do beyond the few tickets I created.
lewismc

On 2021/12/24 10:07:20 sw.l...@quandatics.com wrote:
> Hi,
>
> We are currently facing a problem when using the Nutch REST API. We run the
> Nutch API through Postman, and it works perfectly fine if we don't define
> the segment path. These are the requests we run in Postman.
>
> Inject
>
> {
>     "type": "INJECT",
>     "confId": "default",
>     "crawlId": "crawl01",
>     "args": {
>         "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
>         "crawldb": "/tmp/crawl/crawldb"
>     }
> }
>
> Generate
>
> {
>     "type": "GENERATE",
>     "confId": "default",
>     "crawlId": "crawl01",
>     "args": {
>         "crawldb": "/tmp/crawl/crawldb",
>         "segment_dir": "/tmp/crawl/segments"
>     }
> }
>
> Fetch
>
> {
>     "type": "FETCH",
>     "confId": "default",
>     "crawlId": "crawl01",
>     "args": {"segment": "/tmp/crawl/segments"}
> }
>
> We are trying to define the path where the crawled data is stored in a
> specific directory. However, at the fetch step, it cannot retrieve data from
> the specific folder (whose name is generated from the current date and time)
> under the segments folder. We have tried /tmp/crawl/segments/* and it can
> successfully retrieve the data, but it also creates a new folder
> called *.
>
> Therefore, may we know whether there is any way to define the folder name
> inside the segments folder, or whether there is another way to change the
> output directory?
>
> 
> Attached is our log for your reference. Kindly advise. Thanks in advance.
>
> Best Regards,
> Shi Wei
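
One possible client-side workaround for the fetch question in the quoted message, sketched in Python: list the segments directory, pick the newest timestamp-named subdirectory, and pass that full path as the "segment" argument of the FETCH request instead of the parent folder or a wildcard. This is only a sketch under assumptions, not a confirmed fix: it assumes segment folders are named with 14 digits (yyyyMMddHHmmss) and that the segments directory sits on a local filesystem the client can read. The helper names latest_segment and fetch_job_body are hypothetical and are not part of Nutch.

```python
import json
import re
from pathlib import Path

# Assumption: segment folders are named by timestamp, e.g. 20211224100720.
SEGMENT_NAME = re.compile(r"^\d{14}$")


def latest_segment(segments_dir):
    """Return the newest timestamp-named subdirectory of segments_dir."""
    candidates = [
        p for p in Path(segments_dir).iterdir()
        if p.is_dir() and SEGMENT_NAME.match(p.name)
    ]
    if not candidates:
        raise FileNotFoundError(f"no segment folders found in {segments_dir}")
    # Fixed-width timestamps sort chronologically when compared as strings.
    return max(candidates, key=lambda p: p.name)


def fetch_job_body(segments_dir, crawl_id="crawl01", conf_id="default"):
    """Build a FETCH request body that points at one concrete segment."""
    return {
        "type": "FETCH",
        "confId": conf_id,
        "crawlId": crawl_id,
        "args": {"segment": str(latest_segment(segments_dir))},
    }


def fetch_job_json(segments_dir):
    """Serialize the request body for pasting into Postman or an HTTP client."""
    return json.dumps(fetch_job_body(segments_dir), indent=4)
```

Calling fetch_job_json("/tmp/crawl/segments") after the GENERATE job has finished would give a body with the same fields as the Fetch request quoted above, except that "segment" names the single generated folder. Folders that do not match the timestamp pattern, such as the stray * folder, are skipped.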
