[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043546#comment-14043546
 ] 

Julien Nioche commented on NUTCH-1798:
--------------------------------------

No problem Aaron! OK, so it looks like you do have documents in the table that 
were fetched successfully. Unfortunately 2.x lacks many features that 1.x has 
(not to mention robustness) and that are useful for testing, e.g. 
indexer-dummy or [NUTCH-1758]. If you have good reasons to use 2.x rather than 
1.x, the best approach would be either to port these two patches to 2.x or to 
debug in local mode to see what's happening. 

See 
[http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse_.28NOT_VERIFIED.29]
 (no idea who marked it as not verified; it should work fine) for advice on 
how to debug.
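
As a rough sketch of what remote debugging in local mode could look like: the
JDWP agent option below is a standard JVM flag, but the port (8000) and the
assumption that bin/nutch passes NUTCH_OPTS through to the JVM are not taken
from this thread and should be checked against your checkout. The helper name
jdwp_opts is purely illustrative.

```shell
#!/bin/sh
# Sketch only: start a Nutch job in local mode with the JVM waiting for a debugger.
# DEBUG_PORT is a hypothetical choice; any free port will do.
DEBUG_PORT="${DEBUG_PORT:-8000}"

# Build the standard JDWP agent option for the given port.
# suspend=y makes the JVM wait until a debugger (e.g. Eclipse) attaches.
jdwp_opts() {
  printf '%s' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$1"
}

# Assumption: bin/nutch forwards NUTCH_OPTS to the java command line.
NUTCH_OPTS="$(jdwp_opts "$DEBUG_PORT")"
export NUTCH_OPTS
echo "NUTCH_OPTS=$NUTCH_OPTS"

# Then, from runtime/local (built with `ant runtime`), run the step to inspect
# and attach a "Remote Java Application" debug configuration to DEBUG_PORT:
#   bin/nutch index -all
```

A breakpoint in the indexing job's mapper would then show whether any rows
reach the index writer at all.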

> Unable to get any documents to index in elastic search
> ------------------------------------------------------
>
>                 Key: NUTCH-1798
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1798
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>         Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
>            Reporter: Aaron Bedward
>             Fix For: 2.3
>
>         Attachments: part-r-00000
>
>
> Hopefully this is something I am doing wrong. I have checked out 2.x as I 
> would like to use the new metatag extraction features. I then ran ant 
> runtime to build and updated nutch-site.xml like so:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>  <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable 
>   protocol-httpclient, but be aware of possible intermittent problems with the 
>   underlying commons-httpclient library.
>   </description>
> </property>
>   <property>
>       <name>elastic.cluster</name>
>       <value>elasticsearch</value>
>       <description>The cluster name to discover. Either host and port must be 
>         defined or cluster.</description>
>   </property>
>  
> I then created a folder called urls and added seed.txt.
> I ran the following commands:
> bin/nutch inject urls
> bin/nutch generate -topN 1000  
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
> bin/nutch index  -all 
> It runs with no errors, however no documents have been indexed.
> I also tried the same setup with Solr and no documents are indexed.
> Log:
> 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
> 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
> 2014-06-24 02:57:57, time elapsed: 00:00:06
> 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
> 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title 
> length for indexing set to: 100
> 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
> pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
> 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], 
> sites []
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
> 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
> {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master 
> [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
> added 
> {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
> zen-disco-receive(from master 
> [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] 
> elasticsearch/jXIC3VT6THukKDFB7GMw7Q
> 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address 
> {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
> 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title 
> length for indexing set to: 100
> 2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:11,581 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:11,716 INFO  elastic.ElasticIndexWriter - Processing 
> remaining requests [docs = 0, length = 0, total docs = 0]
> 2014-06-24 02:58:11,717 INFO  elastic.ElasticIndexWriter - Processing to 
> finalize last execute
> 2014-06-24 02:58:11,717 INFO  elasticsearch.node - [Silver] stopping ...
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] stopped
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] closing ...
> 2014-06-24 02:58:11,756 INFO  elasticsearch.node - [Silver] closed
> 2014-06-24 02:58:11,759 WARN  mapred.FileOutputCommitter - Output path is 
> null in cleanup
> 2014-06-24 02:58:12,511 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:12,511 INFO  indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
>       elastic.cluster : elastic prefix cluster
>       elastic.host : hostname
>       elastic.port : port  (default 9300)
>       elastic.index : elastic index command 
>       elastic.max.bulk.docs : elastic bulk index doc counts. (default 250) 
>       elastic.max.bulk.size : elastic bulk index length. (default 2500500 
> ~2.5MB)
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] 
> version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] initializing 
> ...
> 2014-06-24 02:58:12,555 INFO  elasticsearch.plugins - [Lifeguard] loaded [], 
> sites []
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] initialized
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] starting ...
> 2014-06-24 02:58:13,032 INFO  elasticsearch.transport - [Lifeguard] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
> {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:16,063 INFO  cluster.service - [Lifeguard] detected_master 
> [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
> added 
> {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
> zen-disco-receive(from master 
> [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:16,072 INFO  elasticsearch.discovery - [Lifeguard] 
> elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
> 2014-06-24 02:58:16,074 INFO  elasticsearch.http - [Lifeguard] bound_address 
> {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:16,076 INFO  elasticsearch.node - [Lifeguard] started
> 2014-06-24 02:58:16,076 INFO  indexer.IndexingJob - IndexingJob: done.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
