[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045672#comment-14045672 ]
Hudson commented on NUTCH-1798:
-------------------------------

SUCCESS: Integrated in Nutch-nutchgora #1063 (See [https://builds.apache.org/job/Nutch-nutchgora/1063/])
NUTCH-1798 Crawl script not calling index command correctly (Aaron Bedward via jnioche) (jnioche: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1605975)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/bin/crawl

> Crawl script not calling index command correctly
> ------------------------------------------------
>
> Key: NUTCH-1798
> URL: https://issues.apache.org/jira/browse/NUTCH-1798
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 2.2.1
> Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
> Reporter: Aaron Bedward
> Fix For: 2.3
>
> Attachments: part-r-00000
>
>
> Hopefully this is something I am doing wrong. I have checked out 2.x as I
> would like to use the new metatag extraction features. I then ran ant
> runtime to build and updated nutch-site.xml like so:
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints plugin. By
> default Nutch includes crawling just HTML and plain text via HTTP,
> and basic indexing and search plugins. In order to use HTTPS please enable
> protocol-httpclient, but be aware of possible intermittent problems with
> the underlying commons-httpclient library.
> </description>
> </property>
> <property>
> <name>elastic.cluster</name>
> <value>elasticsearch</value>
> <description>The cluster name to discover. Either host and port must be
> defined or cluster.</description>
> </property>
>
> I then created a folder called urls and added seed.txt.
> I ran the following commands:
> bin/nutch inject urls
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
> bin/nutch index -all
> It runs with no errors; however, no documents have been indexed.
> I also tried the same setup with Solr, and no documents were indexed there either.
> Log:
> 2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
> 2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
> 2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
> 2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
> 2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
> 2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
> 2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
> 2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
> 2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
> 2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:11,581 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:11,716 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
> 2014-06-24 02:58:11,717 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
> 2014-06-24 02:58:11,717 INFO elasticsearch.node - [Silver] stopping ...
> 2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] stopped
> 2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] closing ...
> 2014-06-24 02:58:11,756 INFO elasticsearch.node - [Silver] closed
> 2014-06-24 02:58:11,759 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-06-24 02:58:12,511 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:12,511 INFO indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
>   elastic.cluster : elastic prefix cluster
>   elastic.host : hostname
>   elastic.port : port (default 9300)
>   elastic.index : elastic index command
>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> 2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] initializing ...
> 2014-06-24 02:58:12,555 INFO elasticsearch.plugins - [Lifeguard] loaded [], sites []
> 2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] initialized
> 2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] starting ...
> 2014-06-24 02:58:13,032 INFO elasticsearch.transport - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:16,063 INFO cluster.service - [Lifeguard] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:16,072 INFO elasticsearch.discovery - [Lifeguard] elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
> 2014-06-24 02:58:16,074 INFO elasticsearch.http - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:16,076 INFO elasticsearch.node - [Lifeguard] started
> 2014-06-24 02:58:16,076 INFO indexer.IndexingJob - IndexingJob: done.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
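
The quoted log reports success at every step, and the only sign that nothing was indexed is the ElasticIndexWriter summary line ending in "total docs = 0". A minimal sketch of checking for that automatically, assuming the log text is available as a string and that the summary line keeps the format shown in the log above; the function name total_docs_indexed is hypothetical and not part of Nutch:

```python
import re
from typing import List

# Matches the tail of the ElasticIndexWriter summary line as it appears in the
# quoted log, e.g. "[docs = 0, length = 0, total docs = 0]".
# Assumption: other Nutch versions may word this line differently.
TOTAL_DOCS = re.compile(r"ElasticIndexWriter - .*?total docs = (\d+)\]")


def total_docs_indexed(log_text: str) -> List[int]:
    """Return the 'total docs' count from each ElasticIndexWriter summary line."""
    return [int(m.group(1)) for m in TOTAL_DOCS.finditer(log_text)]


def indexing_looks_empty(log_text: str) -> bool:
    """True when summary lines exist and every one of them reports zero docs."""
    counts = total_docs_indexed(log_text)
    return bool(counts) and all(c == 0 for c in counts)
```

Calling indexing_looks_empty on the contents of the Nutch log after an index run would flag the situation described in this issue, where the job finishes with "IndexingJob: done." but no documents reach the index.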