[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043546#comment-14043546 ]
Julien Nioche commented on NUTCH-1798:
--------------------------------------

No problem Aaron! Ok, so it looks like you do have documents in the table that were fetched successfully. Unfortunately 2.x lacks many of the features that 1.x has (not to mention robustness) and that are useful for testing, e.g. indexer-dummy or [NUTCH-1758]. If you have good reasons to use 2.x and not 1.x, the best approach would be either to port these 2 patches to 2.x or to debug in local mode to see what's happening. See [http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse_.28NOT_VERIFIED.29] (no idea who thought it was not verified, it should work fine) for advice on how to debug.

> Unable to get any documents to index in elastic search
> ------------------------------------------------------
>
>                 Key: NUTCH-1798
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1798
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>         Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
>            Reporter: Aaron Bedward
>             Fix For: 2.3
>
>         Attachments: part-r-00000
>
>
> Hopefully this is something I am doing wrong. I have checked out 2.x as I
> would like to use the new metatag extraction features. I then ran ant
> runtime to build, and updated nutch-site.xml like so:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
> <property>
>   <name>elastic.cluster</name>
>   <value>elasticsearch</value>
>   <description>The cluster name to discover. Either host and port must be
>   defined or cluster.</description>
> </property>
>
> I then created a folder called urls and added seed.txt.
> I ran the following commands:
> bin/nutch inject urls
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
> bin/nutch index -all
> It runs with no errors, however no documents have been indexed.
> I also tried setting this up with Solr and no documents are indexed.
> Log:
> 2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
> 2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
> 2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
> 2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
> 2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
> 2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
> 2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
> 2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
> 2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
> 2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
> 2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:11,581 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:11,716 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
> 2014-06-24 02:58:11,717 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
> 2014-06-24 02:58:11,717 INFO elasticsearch.node - [Silver] stopping ...
> 2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] stopped
> 2014-06-24 02:58:11,751 INFO elasticsearch.node - [Silver] closing ...
> 2014-06-24 02:58:11,756 INFO elasticsearch.node - [Silver] closed
> 2014-06-24 02:58:11,759 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> 2014-06-24 02:58:12,511 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:12,511 INFO indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
>   elastic.cluster : elastic prefix cluster
>   elastic.host : hostname
>   elastic.port : port (default 9300)
>   elastic.index : elastic index command
>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> 2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:12,525 INFO elasticsearch.node - [Lifeguard] initializing ...
> 2014-06-24 02:58:12,555 INFO elasticsearch.plugins - [Lifeguard] loaded [], sites []
> 2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] initialized
> 2014-06-24 02:58:13,025 INFO elasticsearch.node - [Lifeguard] starting ...
> 2014-06-24 02:58:13,032 INFO elasticsearch.transport - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:16,063 INFO cluster.service - [Lifeguard] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:16,072 INFO elasticsearch.discovery - [Lifeguard] elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
> 2014-06-24 02:58:16,074 INFO elasticsearch.http - [Lifeguard] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:16,076 INFO elasticsearch.node - [Lifeguard] started
> 2014-06-24 02:58:16,076 INFO indexer.IndexingJob - IndexingJob: done.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
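
The quoted log reports "Processing remaining requests [docs = 0, length = 0, total docs = 0]", which suggests the index writer was never handed a document. As a quick way to check this after each run, here is a minimal Python sketch that pulls that counter out of a log. It is not part of Nutch: the function name `total_docs_indexed` is hypothetical, and the only assumption about the log is the message format shown in the quoted output.

```python
import re

# Summary line logged by ElasticIndexWriter in the quoted log, e.g.
# "Processing remaining requests [docs = 0, length = 0, total docs = 0]"
FLUSH_RE = re.compile(
    r"Processing remaining requests "
    r"\[docs = (\d+), length = (\d+), total docs = (\d+)\]"
)


def total_docs_indexed(log_text):
    """Return the 'total docs' value of every flush line found in log_text.

    Hypothetical helper, not part of Nutch. Runs of whitespace and mail
    quote markers ('>') are collapsed first, so the pattern still matches
    when an entry has been wrapped by a mail client.
    """
    flat = re.sub(r"[>\s]+", " ", log_text)
    return [int(m.group(3)) for m in FLUSH_RE.finditer(flat)]
```

An empty list means the writer never logged a flush at all. A list of zeros, as in the log above, means the job ran but no document reached the writer, which points at the steps before indexing rather than at the Elasticsearch connection. The function takes the log contents as a string, so it can be fed the text of whichever log file your setup writes.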