bin/nutch fetch/parse handle crawl/segments directory
-----------------------------------------------------

                 Key: NUTCH-1001
                 URL: https://issues.apache.org/jira/browse/NUTCH-1001
             Project: Nutch
          Issue Type: Improvement
            Reporter: Gabriele Kahlout
            Priority: Minor


I'm having trouble porting scripts across different systems, because each one 
needs its own way of extracting the latest (or only) segment produced by the 
generate phase. Variants include:
$ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
$ s1=`ls -d crawl/segments/2* | tail -1` #[2]
$ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o '[a-zA-Z0-9/-]*' | tail -1`
$ segment=`$HADOOP_HOME/bin/hdfs dfs -ls crawl/segments | tail -1 | grep -o '[a-zA-Z0-9/-]*' | tail -1`
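
For the local-filesystem case, the variants above could be collapsed into one 
small function. This is only a sketch: latest_segment is a hypothetical name, 
not part of Nutch, and it assumes segment names are timestamps (as in 
crawl/segments/2*), so the name that sorts last is the latest. It does not 
cover HDFS.

```shell
# Hypothetical helper, not part of bin/nutch: print the newest segment
# found under a local segments directory (default: crawl/segments).
# Assumes timestamp-style segment names, so the last glob match is the latest.
latest_segment() {
  segments_dir=${1:-crawl/segments}
  latest=
  for d in "$segments_dir"/*/; do
    [ -d "$d" ] || continue   # the glob stays literal when nothing matches
    latest=${d%/}             # glob results are sorted; keep the last one
  done
  [ -n "$latest" ] || return 1   # no segment found
  printf '%s\n' "$latest"
}
```

It would be used as: segment=`latest_segment crawl/segments`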

I'm also not sure what Windows users would have to do. Some users may make do 
with:
$ bin/nutch fetch crawl/segments/2*

But I don't see why the user should have to extract, or worry about, the latest 
(or only) segment, or why that step needs to be described in every Nutch 
tutorial. Moreover, only fetch and parse expect a single segment, while other 
commands are fine with the directory of segments.

Therefore, I think it's beneficial if fetch and parse also handle directories 
of segments. 
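
Until that is supported, the requested behaviour can be approximated with a 
wrapper. This is a rough sketch, not a proposed patch: run_on_segments and the 
NUTCH variable are hypothetical names, and it assumes a generated segment can be 
recognised by its crawl_generate subdirectory. It runs the command on every 
segment it finds and does not skip segments that were already fetched or parsed.

```shell
# Hypothetical wrapper, not part of bin/nutch.
# Usage: run_on_segments fetch|parse <segment-or-segments-dir>
# NUTCH may point at the nutch launcher; it defaults to bin/nutch.
run_on_segments() {
  cmd=$1
  path=$2
  # A single segment was given: pass it straight through.
  if [ -d "$path/crawl_generate" ]; then
    "${NUTCH:-bin/nutch}" "$cmd" "$path"
    return
  fi
  # Otherwise treat the path as a directory of segments.
  for seg in "$path"/*/; do
    [ -d "${seg}crawl_generate" ] || continue   # not a generated segment
    "${NUTCH:-bin/nutch}" "$cmd" "${seg%/}" || return
  done
}
```

With this, both "run_on_segments fetch crawl/segments" and 
"run_on_segments parse crawl/segments/<segment>" would work the same way on any 
system with a POSIX shell.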

[1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
[2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching

