[ https://issues.apache.org/jira/browse/NUTCH-1001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13131665#comment-13131665 ]
Lewis John McGibbney commented on NUTCH-1001:
---------------------------------------------

Gabriele, firstly I appreciate your concerns about the re-indenting issue. However, as we are moving to commit NUTCH-865, maybe we can open an issue and look into this in the near future. Until then I'm afraid it looks like this is a manual process :0)

With regard to a nutchgora patch, if this is not feasible at the moment, then please feel free to add it to the list which makes up NUTCH-1104, which Markus thoughtfully opened.

I think you raise a valid point about the errors which can result and the exceptions which may be thrown if it is not used correctly. However, one way to address this would be to add the relevant tests to testFetcher and testParseData(?) respectively. Although this is a pain, it would ensure that we catch this obvious area of concern.

With respect to the -dir option, I think that this is a slightly different issue.

I'm totally behind this issue and would like to see it go into 1.4, but I think we have some work to do on it.

> bin/nutch fetch/parse handle crawl/segments directory
> -----------------------------------------------------
>
>                 Key: NUTCH-1001
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1001
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Gabriele Kahlout
>            Priority: Minor
>             Fix For: 1.4, nutchgora
>
>         Attachments: NUTCH-1001.patch
>
>
> I'm having issues porting scripts across different systems to support the step of extracting the latest/only segments resulting from the generate phase.
> Variants include:
> $ export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1` #[1]
> $ s1=`ls -d crawl/segments/2* | tail -1` #[2]
> $ segment=`$HADOOP_HOME/bin/hadoop dfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> $ segment=`$HADOOP_HOME/bin/hdfs -ls crawl/segments | tail -1 | grep -o [a-zA-Z0-9/\-]* |tail -1`
> And I'm not sure what Windows users would have to do.
> Some users may also do with:
> bin/nutch fetch with crawl/segments/2*
> But I don't see a need to have the user extract or worry about the latest/only segment, or to have that as a described step in every Nutch tutorial. Moreover, only fetch and parse expect a segment, while other commands are fine with the directory of segments.
> Therefore, I think it's beneficial if fetch and parse also handle directories of segments.
> [1] http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
> [2] http://wiki.apache.org/nutch/NutchTutorial#Command_Line_Searching
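
Until fetch and parse accept a segments directory themselves, the "latest segment" step quoted above could be wrapped once in a small POSIX shell helper instead of being re-derived in every script. The following is only a rough sketch, not part of the attached patch: it assumes a local filesystem (not HDFS) and timestamp-named segment directories, so that sorted glob order matches creation order. The function names `latest_segment` and `fetch_and_parse_latest` and the `NUTCH` variable are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Sketch: pick the newest segment without parsing `ls` output.
# Assumption: segment directories are named by timestamp
# (e.g. 20111020123456), so sorted glob order is chronological order.
# Local filesystem only; HDFS would still need `hadoop dfs -ls`.

# NUTCH is a hypothetical variable pointing at the Nutch launcher script.
NUTCH="${NUTCH:-bin/nutch}"

# latest_segment DIR
# Print the last subdirectory of DIR in sorted glob order.
# Returns non-zero if DIR contains no subdirectories.
latest_segment() {
    latest=""
    for d in "$1"/*/; do
        # If the glob matched nothing it stays literal and fails this test.
        [ -d "$d" ] && latest="${d%/}"
    done
    [ -n "$latest" ] || return 1
    printf '%s\n' "$latest"
}

# fetch_and_parse_latest DIR
# Run fetch, then parse, on the newest segment under DIR.
fetch_and_parse_latest() {
    segment=$(latest_segment "$1") || {
        echo "no segments found under $1" >&2
        return 1
    }
    "$NUTCH" fetch "$segment" && "$NUTCH" parse "$segment"
}
```

After sourcing the file, `fetch_and_parse_latest crawl/segments` would stand in for the backtick one-liners. Relying on glob order rather than `ls -tr` avoids depending on modification times, which can change when a segment directory is copied between systems.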