Hi, I reported some typos and incomplete information in the Nutch 0.8 tutorial some time ago. It seems that all committers and volunteers are busy with more important issues, so I took the opportunity and am now proud to present my *first-small-humble-patch-ever*.
Please review the patch and let me know what I should do better next time. Note that I checked out the release-0.7.2 branch (I found that the source file for the 0.8 tutorial is located there) and generated an SVN patch after making my changes. As a result, the patch header contains an absolute file path from my computer (I am not an SVN expert, so any advice is welcome). I also added examples of the dedup and merge commands to the tutorial. Feel free to remove them if you think they don't fit the intent of the original tutorial.

Regards, Lukas
Index: /home/lukas/workspace/nutch-release-0.7.2/src/site/src/documentation/content/xdocs/tutorial8.xml
===================================================================
--- /home/lukas/workspace/nutch-release-0.7.2/src/site/src/documentation/content/xdocs/tutorial8.xml (revision 405528)
+++ /home/lukas/workspace/nutch-release-0.7.2/src/site/src/documentation/content/xdocs/tutorial8.xml (working copy)
@@ -243,16 +243,19 @@
 <p>Before indexing we first invert all of the links, so that we may
 index incoming anchor text with the pages.</p>
 
-<source>bin/nutch invertlinks crawl/linkdb crawl/segments</source>
+<source>bin/nutch invertlinks crawl/linkdb -dir crawl/segments</source>
 
 <p>To index the segments we use the <code>index</code> command, as follows:</p>
 
-<source>bin/nutch index indexes crawl/linkdb crawl/segments/*</source>
+<source>bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*</source>
+
+<p>Then, we need to delete duplicate pages. This is done with:</p>
 
-<!-- <p>Then, before we can search a set of segments, we need to delete -->
-<!-- duplicate pages. This is done with:</p> -->
+<source>bin/nutch dedup crawl/indexes</source>
 
-<!-- <source>bin/nutch dedup indexes</source> -->
+<p>In the end we merge all individual indexes into one index:</p>
+
+<source>bin/nutch merge crawl/index crawl/indexes</source>
 
 <p>Now we're ready to search!</p>
 
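
For anyone who wants to check the corrected steps without applying the patch, here is a minimal sketch of the command sequence as it reads in the patched tutorial. The `NUTCH` and `CRAWL_DIR` variables and the `index_steps` function are only my own illustration and are not part of the tutorial. The script just prints the four commands, so nothing is executed by accident.

```shell
#!/bin/sh
# Sketch only: prints the post-fetch steps from the patched tutorial in order.
# NUTCH and CRAWL_DIR are hypothetical variables, not part of the tutorial;
# the defaults match the paths used in the patch.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"

index_steps() {
  # 1. Invert links so that incoming anchor text can be indexed with the pages.
  echo "$NUTCH invertlinks $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments"
  # 2. Index the segments (the glob stays unexpanded here because it is quoted).
  echo "$NUTCH index $CRAWL_DIR/indexes $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $CRAWL_DIR/segments/*"
  # 3. Delete duplicate pages.
  echo "$NUTCH dedup $CRAWL_DIR/indexes"
  # 4. Merge the individual indexes into one index.
  echo "$NUTCH merge $CRAWL_DIR/index $CRAWL_DIR/indexes"
}

index_steps
```

To actually run the steps, the output can be piped to `sh` from the Nutch installation directory.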