[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043462#comment-14043462
 ] 

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

I tried the following command:

./bin/crawl urls crawl nutch 2

I removed 

  echo "SOLR dedup -> $SOLRURL"
  $bin/nutch solrdedup $commonOptions $SOLRURL
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

from the crawl script; however, I left the following part unchanged:

echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi
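
A side note on the status checks quoted above: by the time "exit $?" runs, $? holds the status of the [ ... ] test (0, because the test succeeded), not the status of the nutch command, so the script would exit with 0 even when a step fails. A minimal sketch of a check that keeps the original status, assuming a hypothetical helper named run_step that is not part of the Nutch crawl script:

```shell
# Hypothetical helper (not in the Nutch crawl script): run a command and
# hand back its exit status unchanged, so the caller can stop on failure.
run_step() {
  "$@"
  rc=$?   # capture immediately, before any later command overwrites $?
  if [ "$rc" -ne 0 ]; then
    echo "Step failed with status $rc: $*" >&2
  fi
  return "$rc"
}

# Usage sketch, mirroring the solrindex call above:
# run_step "$bin/nutch" solrindex $commonOptions "$SOLRURL" -all -crawlId "$CRAWL_ID" || exit $?
```

With "|| exit $?" the value of $? is still the status returned by run_step, so the script exits with the failing step's code.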

Still no documents indexed. Log file:


2014-06-25 14:48:27,728 INFO  crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-25 14:48:27,728 INFO  crawl.AbstractFetchSchedule - 
defaultInterval=2592000
2014-06-25 14:48:27,728 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-25 14:48:27,761 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2014-06-25 14:48:28,470 INFO  crawl.DbUpdaterJob - DbUpdaterJob: finished at 
2014-06-25 14:48:28, time elapsed: 00:00:05
2014-06-25 14:48:29,792 INFO  indexer.IndexingJob - IndexingJob: starting
2014-06-25 14:48:30,901 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-25 14:48:30,906 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:31,777 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:31,788 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-25 14:48:31,789 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:34,123 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2014-06-25 14:48:35,116 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:35,526 INFO  elasticsearch.node - [Master of Vengeance] 
version[1.1.0], pid[10180], build[2181e11/2014-03-25T15:59:51Z]
2014-06-25 14:48:35,526 INFO  elasticsearch.node - [Master of Vengeance] 
initializing ...
2014-06-25 14:48:35,660 INFO  elasticsearch.plugins - [Master of Vengeance] 
loaded [], sites []
2014-06-25 14:48:38,837 INFO  elasticsearch.node - [Master of Vengeance] 
initialized
2014-06-25 14:48:38,837 INFO  elasticsearch.node - [Master of Vengeance] 
starting ...
2014-06-25 14:48:38,970 INFO  elasticsearch.transport - [Master of Vengeance] 
bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
{inet[/10.0.2.15:9301]}
2014-06-25 14:48:42,106 INFO  cluster.service - [Master of Vengeance] 
detected_master [Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]], added 
{[X-Treme][CKEftQrsShaXeNWbCV_ZAg][nutch][inet[/10.0.2.15:9300]],[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]],}, 
reason: zen-disco-receive(from master [[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-25 14:48:42,120 INFO  elasticsearch.discovery - [Master of Vengeance] 
elasticsearch/zH-oAjvTTEyg1l4aOJM_lg
2014-06-25 14:48:42,132 INFO  elasticsearch.http - [Master of Vengeance] 
bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address 
{inet[/10.0.2.15:9201]}
2014-06-25 14:48:42,135 INFO  elasticsearch.node - [Master of Vengeance] started
2014-06-25 14:48:42,142 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:42,142 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:42,247 INFO  elastic.ElasticIndexWriter - Processing remaining 
requests [docs = 0, length = 0, total docs = 0]
2014-06-25 14:48:42,247 INFO  elastic.ElasticIndexWriter - Processing to 
finalize last execute
2014-06-25 14:48:42,247 INFO  elasticsearch.node - [Master of Vengeance] 
stopping ...
2014-06-25 14:48:42,279 INFO  elasticsearch.node - [Master of Vengeance] stopped
2014-06-25 14:48:42,280 INFO  elasticsearch.node - [Master of Vengeance] 
closing ...
2014-06-25 14:48:42,286 INFO  elasticsearch.node - [Master of Vengeance] closed
2014-06-25 14:48:42,289 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2014-06-25 14:48:42,740 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:42,740 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port  (default 9300)
        elastic.index : elastic index command 
        elastic.max.bulk.docs : elastic bulk index doc counts. (default 250) 
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 
~2.5MB)


2014-06-25 14:48:42,751 INFO  elasticsearch.node - [Morbius] version[1.1.0], 
pid[10180], build[2181e11/2014-03-25T15:59:51Z]
2014-06-25 14:48:42,752 INFO  elasticsearch.node - [Morbius] initializing ...
2014-06-25 14:48:42,781 INFO  elasticsearch.plugins - [Morbius] loaded [], 
sites []
2014-06-25 14:48:43,272 INFO  elasticsearch.node - [Morbius] initialized
2014-06-25 14:48:43,272 INFO  elasticsearch.node - [Morbius] starting ...
2014-06-25 14:48:43,283 INFO  elasticsearch.transport - [Morbius] bound_address 
{inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
2014-06-25 14:48:46,321 INFO  cluster.service - [Morbius] detected_master 
[Speed Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
added {[X-Treme][CKEftQrsShaXeNWbCV_ZAg][nutch][inet[/10.0.2.15:9300]],[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]],}, 
reason: zen-disco-receive(from master [[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-25 14:48:46,330 INFO  elasticsearch.discovery - [Morbius] 
elasticsearch/NkD8sqGBTfWnwIi91a_0XA
2014-06-25 14:48:46,333 INFO  elasticsearch.http - [Morbius] bound_address 
{inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
2014-06-25 14:48:46,334 INFO  elasticsearch.node - [Morbius] started
2014-06-25 14:48:46,334 INFO  indexer.IndexingJob - IndexingJob: done.
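
To check quickly how many documents the writer reports sending, the "total docs" counters can be pulled out of the log. A minimal sketch, assuming the log lives at a path such as logs/hadoop.log (an assumption, adjust to your setup) and that each entry sits on a single line in the file rather than wrapped as in this email; total_docs is a hypothetical helper name:

```shell
# Hypothetical helper: print every "total docs = N" value found in a log file,
# one number per matching line.
total_docs() {
  sed -n 's/.*total docs = \([0-9][0-9]*\).*/\1/p' "$1"
}

# Usage sketch (the path is an assumption):
# total_docs logs/hadoop.log
```

A value of 0 on every line, as in the log above, would suggest that no documents reached the index writer at all, rather than Elasticsearch rejecting them.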



BTW, I am using HBase 0.94.9.

> Unable to get any documents to index in elastic search
> ------------------------------------------------------
>
>                 Key: NUTCH-1798
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1798
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 2.3
>         Environment: Ubuntu 13.10, Elasticsearch 1
>            Reporter: Aaron Bedward
>             Fix For: 2.3
>
>
> Hopefully this is something I am doing wrong. I checked out 2.x because I
> would like to use the new metatag extraction features. I then ran ant
> runtime to build, and updated nutch-site.xml like so:
> <property>
>   <name>plugin.includes</name>
>  
> <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
>  <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable 
>   protocol-httpclient, but be aware of possible intermittent problems with
>   the underlying commons-httpclient library.
>   </description>
> </property>
>   <property>
>       <name>elastic.cluster</name>
>       <value>elasticsearch</value>
>       <description>The cluster name to discover. Either host and port must be
>         defined or cluster.</description>
>   </property>
>  
> I then created a folder called urls and added seed.txt.
> I ran the following commands:
> bin/nutch inject urls
> bin/nutch generate -topN 1000  
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
> bin/nutch index  -all 
> It runs with no errors; however, no documents have been indexed.
> I also tried setting this up with Solr, and no documents were indexed either.
> Log:
> 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
> 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
> 2014-06-24 02:57:57, time elapsed: 00:00:06
> 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
> 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title 
> length for indexing set to: 100
> 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
> pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
> 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], 
> sites []
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
> 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
> 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
> {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master 
> [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
> added 
> {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
>  Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
> zen-disco-receive(from master 
> [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] 
> elasticsearch/jXIC3VT6THukKDFB7GMw7Q
> 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address 
> {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
> 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title 
> length for indexing set to: 100
> 2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2014-06-24 02:58:11,581 INFO  anchor.AnchorIndexingFilter - Anchor 
> deduplication is: off
> 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding 
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2014-06-24 02:58:11,716 INFO  elastic.ElasticIndexWriter - Processing 
> remaining requests [docs = 0, length = 0, total docs = 0]
> 2014-06-24 02:58:11,717 INFO  elastic.ElasticIndexWriter - Processing to 
> finalize last execute
> 2014-06-24 02:58:11,717 INFO  elasticsearch.node - [Silver] stopping ...
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] stopped
> 2014-06-24 02:58:11,751 INFO  elasticsearch.node - [Silver] closing ...
> 2014-06-24 02:58:11,756 INFO  elasticsearch.node - [Silver] closed
> 2014-06-24 02:58:11,759 WARN  mapred.FileOutputCommitter - Output path is 
> null in cleanup
> 2014-06-24 02:58:12,511 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2014-06-24 02:58:12,511 INFO  indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
>       elastic.cluster : elastic prefix cluster
>       elastic.host : hostname
>       elastic.port : port  (default 9300)
>       elastic.index : elastic index command 
>       elastic.max.bulk.docs : elastic bulk index doc counts. (default 250) 
>       elastic.max.bulk.size : elastic bulk index length. (default 2500500 
> ~2.5MB)
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] 
> version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
> 2014-06-24 02:58:12,525 INFO  elasticsearch.node - [Lifeguard] initializing 
> ...
> 2014-06-24 02:58:12,555 INFO  elasticsearch.plugins - [Lifeguard] loaded [], 
> sites []
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] initialized
> 2014-06-24 02:58:13,025 INFO  elasticsearch.node - [Lifeguard] starting ...
> 2014-06-24 02:58:13,032 INFO  elasticsearch.transport - [Lifeguard] 
> bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
> {inet[/10.0.2.15:9301]}
> 2014-06-24 02:58:16,063 INFO  cluster.service - [Lifeguard] detected_master 
> [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
> added 
> {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
>  Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
> zen-disco-receive(from master 
> [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
> 2014-06-24 02:58:16,072 INFO  elasticsearch.discovery - [Lifeguard] 
> elasticsearch/MWiqtTiqS5aC_M7QvGtfyg
> 2014-06-24 02:58:16,074 INFO  elasticsearch.http - [Lifeguard] bound_address 
> {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
> 2014-06-24 02:58:16,076 INFO  elasticsearch.node - [Lifeguard] started
> 2014-06-24 02:58:16,076 INFO  indexer.IndexingJob - IndexingJob: done.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
