[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-26 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045224#comment-14045224
 ] 

Aaron Bedward commented on NUTCH-1798:
--

Right... I have made a few observations (I may have misunderstood the 
architecture, but please bear with me).

I have managed to get 2.x indexing with ES by making the following change to 
the crawl script:

Line 149:  echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
-Line 150:  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
Line 150:  $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call: ./bin/crawl urls test http://localhost:9300  2

However, I believe the script should use $bin/nutch index -D 
solr.server.url=$SOLRURL 

Hope this helps anybody trying to use ES. I will commit my source code for 
MongoDB over the weekend.
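
To check which indexing command the crawl script would run before editing it, a dry-run helper can print the line instead of executing it. This is only a sketch: `print_index_command` is a hypothetical name, and its three arguments stand in for the script's `$bin`, `$SOLRURL` and `$CRAWL_ID` variables quoted above. It reproduces the modified Line 150 only and does not cover the `index -D solr.server.url=...` variant.

```shell
# Hypothetical dry-run helper: prints the modified Line 150 without running it.
# $1 = directory holding the nutch launcher ($bin in the crawl script)
# $2 = indexer URL ($SOLRURL)
# $3 = crawl id ($CRAWL_ID)
print_index_command() {
  printf '%s/nutch solrindex %s -all -crawlId %s\n' "$1" "$2" "$3"
}
```

For the example call above, `print_index_command ./bin http://localhost:9300 test` prints `./bin/nutch solrindex http://localhost:9300 -all -crawlId test`.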

 Unable to get any documents to index in elastic search
 --

 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
 Fix For: 2.3

 Attachments: part-r-0


 Hopefully this is something I am doing wrong.  I have checked out 2.x as I 
 would like to use the new metatag extraction features.  I then ran ant 
 runtime to build and updated nutch-site.xml like so:
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>elastic.cluster</name>
   <value>elasticsearch</value>
   <description>The cluster name to discover. Either host and port must be defined
   or cluster.</description>
 </property>
  
 I then created a folder called urls and added seed.txt.
 I ran the following commands:
 bin/nutch inject urls
 bin/nutch generate -topN 1000
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch index -all
 It runs without errors; however, no documents have been indexed.
 I also tried setting this up with Solr, and no documents are indexed either.
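
The six commands above can be chained so that a failing step stops the run and is reported. This is a minimal sketch, not part of Nutch: `run_step`, `crawl_once` and the `NUTCH` override variable are hypothetical names, and the arguments are copied verbatim from the list above. It only rules out a silently failing step; it would not by itself explain a run that exits cleanly with nothing indexed.

```shell
# Path to the nutch launcher; override NUTCH to point elsewhere (assumption).
NUTCH="${NUTCH:-bin/nutch}"

# Echo one step, run it, and report a non-zero exit status on stderr.
run_step() {
  echo "==> $*"
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "step failed with status $status: $*" >&2
  fi
  return "$status"
}

# The six commands from the report, stopping at the first failure.
crawl_once() {
  run_step "$NUTCH" inject urls &&
  run_step "$NUTCH" generate -topN 1000 &&
  run_step "$NUTCH" fetch -all &&
  run_step "$NUTCH" parse -all &&
  run_step "$NUTCH" updatedb &&
  run_step "$NUTCH" index -all
}
```

Calling `crawl_once` from the Nutch runtime directory runs the steps in order; each step is echoed first, so the last line printed shows where a run stopped.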
 Log:
 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
 2014-06-24 02:57:57, time elapsed: 00:00:06
 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title 
 length for indexing set to: 100
 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
 deduplication is: off
 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
 native-hadoop library for your platform... using builtin-java classes where 
 applicable
 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
 org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
 pid[21885], build[2181e11/2014-03-25T15:59:51Z]
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], 
 sites []
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] 
 bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
 {inet[/10.0.2.15:9301]}
 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master 
 [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
 added 
 {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
  Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
 zen-disco-receive(from master 
 

[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-26 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045224#comment-14045224
 ] 

Aaron Bedward edited comment on NUTCH-1798 at 6/26/14 9:32 PM:
---

Right... I have made a few observations (I may have misunderstood the 
architecture, but please bear with me).

I have managed to get 2.x indexing with ES by making the following change to 
the crawl script:

Line 149:  echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
-Line 150:  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
Line 150:  $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call: ./bin/crawl urls test http://localhost:9300  2

However, I believe the script should use $bin/nutch index -D 
solr.server.url=$SOLRURL 

Hope this helps anybody trying to use ES. I will commit my source code for 
MongoDB over the weekend.



[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043462#comment-14043462
 ] 

Aaron Bedward commented on NUTCH-1798:
--

I tried the following command:

./bin/crawl urls crawl nutch 2

I removed 

  echo SOLR dedup - $SOLRURL
  $bin/nutch solrdedup $commonOptions $SOLRURL
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

from the crawl script; however, I left the following part unchanged:

echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi
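
A side note on the quoted check: by the time `exit $?` runs, `$?` holds the status of the `[ ... ]` test (0, because the test succeeded), not the status of the nutch command, so the script would exit with 0 even after a failed step. The sketch below contrasts the two forms. The function names are hypothetical, and `return` is used in place of `exit` so the functions can be tried in an interactive shell.

```shell
# Same shape as the quoted check: $? is overwritten by the [ ... ] test.
check_naive() {
  "$@"
  if [ $? -ne 0 ]
    then return $?   # $? is now the status of the test above, i.e. 0
  fi
}

# Capture the command's status first, then test and propagate the copy.
check_fixed() {
  "$@"
  status=$?
  if [ "$status" -ne 0 ]; then
    return "$status"
  fi
}
```

With a command that fails, such as `false`, `check_naive false` returns 0 while `check_fixed false` returns 1.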

Still no documents indexed.  Log file:


2014-06-25 14:48:27,728 INFO  crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-25 14:48:27,728 INFO  crawl.AbstractFetchSchedule - 
defaultInterval=2592000
2014-06-25 14:48:27,728 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-25 14:48:27,761 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2014-06-25 14:48:28,470 INFO  crawl.DbUpdaterJob - DbUpdaterJob: finished at 
2014-06-25 14:48:28, time elapsed: 00:00:05
2014-06-25 14:48:29,792 INFO  indexer.IndexingJob - IndexingJob: starting
2014-06-25 14:48:30,901 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-25 14:48:30,906 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:31,777 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:31,788 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-25 14:48:31,789 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:34,123 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2014-06-25 14:48:35,116 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:35,526 INFO  elasticsearch.node - [Master of Vengeance] 
version[1.1.0], pid[10180], build[2181e11/2014-03-25T15:59:51Z]
2014-06-25 14:48:35,526 INFO  elasticsearch.node - [Master of Vengeance] 
initializing ...
2014-06-25 14:48:35,660 INFO  elasticsearch.plugins - [Master of Vengeance] 
loaded [], sites []
2014-06-25 14:48:38,837 INFO  elasticsearch.node - [Master of Vengeance] 
initialized
2014-06-25 14:48:38,837 INFO  elasticsearch.node - [Master of Vengeance] 
starting ...
2014-06-25 14:48:38,970 INFO  elasticsearch.transport - [Master of Vengeance] 
bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
{inet[/10.0.2.15:9301]}
2014-06-25 14:48:42,106 INFO  cluster.service - [Master of Vengeance] 
detected_master [Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]], added 
{[X-Treme][CKEftQrsShaXeNWbCV_ZAg][nutch][inet[/10.0.2.15:9300]],[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]],}, 
reason: zen-disco-receive(from master [[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-25 14:48:42,120 INFO  elasticsearch.discovery - [Master of Vengeance] 
elasticsearch/zH-oAjvTTEyg1l4aOJM_lg
2014-06-25 14:48:42,132 INFO  elasticsearch.http - [Master of Vengeance] 
bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address 
{inet[/10.0.2.15:9201]}
2014-06-25 14:48:42,135 INFO  elasticsearch.node - [Master of Vengeance] started
2014-06-25 14:48:42,142 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:42,142 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:42,247 INFO  elastic.ElasticIndexWriter - Processing remaining 
requests [docs = 0, length = 0, total docs = 0]
2014-06-25 14:48:42,247 INFO  elastic.ElasticIndexWriter - Processing to 
finalize last execute
2014-06-25 14:48:42,247 INFO  elasticsearch.node - [Master of Vengeance] 
stopping ...
2014-06-25 14:48:42,279 INFO  elasticsearch.node - [Master of Vengeance] stopped
2014-06-25 14:48:42,280 INFO  elasticsearch.node - [Master of Vengeance] 
closing ...
2014-06-25 14:48:42,286 INFO  elasticsearch.node - [Master of Vengeance] closed
2014-06-25 14:48:42,289 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2014-06-25 14:48:42,740 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:42,740 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
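
The line to watch in this log is the ElasticIndexWriter summary, which reports `total docs = 0`. A small filter can pull that number out of a log. This is a sketch that assumes the summary keeps the `total docs = N]` form shown above; `last_total_docs` is a hypothetical name and the log is read from standard input.

```shell
# Print the number from the last "total docs = N]" summary found on stdin.
# Prints nothing if no such summary is present.
last_total_docs() {
  sed -n 's/.*total docs = \([0-9][0-9]*\)].*/\1/p' | tail -n 1
}
```

Fed the log above, it prints `0`, which matches the observation that nothing reached the index.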

[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043462#comment-14043462
 ] 

Aaron Bedward edited comment on NUTCH-1798 at 6/25/14 1:52 PM:
---


[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043477#comment-14043477
 ] 

Aaron Bedward commented on NUTCH-1798:
--

Using the command 

$ ./bin/nutch readdb -dump db -crawlId testcrawl


Output


http://google.com/  key:com.google:http/
baseUrl:http://google.com/
status: 4 (status_redir_temp)
fetchTime:  1406290992771
prevFetchTime:  1403698972410
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: TEMP_MOVED, 
args=[http://www.google.co.uk/?gfe_rd=cr&ei=ML-qU-jpIOHR8gfw34CABw]
parseStatus:(null)
title:  null
score:  1.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
reprUrl:null
batchId:1403698977-26703
metadata _csh_ :
metadata _rs_ :8

http://wiki.apache.org/ key:org.apache.wiki:http/
baseUrl:http://wiki.apache.org/
status: 1 (status_unfetched)
fetchTime:  1403704161974
prevFetchTime:  1403704069065
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: TEMP_MOVED, args=[http://wiki.apache.org/general/]
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:http://wiki.apache.org/
batchId:1403704073-19409
metadata _csh_ :

http://wiki.apache.org/HttpComponents/FrontPage key:
org.apache.wiki:http/HttpComponents/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704161990
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/ant/FrontPage key:
org.apache.wiki:http/ant/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704161993
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/apachecon/FrontPage  key:
org.apache.wiki:http/apachecon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162001
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/avalon/FrontPage key:
org.apache.wiki:http/avalon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162004
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/beehive/FrontPage key:
org.apache.wiki:http/beehive/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162007
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/cassandra/FrontPage  key:
org.apache.wiki:http/cassandra/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162012
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/clerezza/FrontPage   key:
org.apache.wiki:http/clerezza/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162015
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/cocoon-lenya/FrontPage   key:
org.apache.wiki:http/cocoon-lenya/FrontPage
baseUrl:null
status: 1 

[jira] [Updated] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Bedward updated NUTCH-1798:
-

Attachment: part-r-0

 Unable to get any documents to index in elastic search
 --

 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1
Reporter: Aaron Bedward
 Fix For: 2.3

 Attachments: part-r-0


 Hopefully this is something I am doing wrong.  I have checked out 2.x as I 
 would like to use the new metatag extraction features.  I then ran ant 
 runtime to build and updated nutch-site.xml like so:
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>elastic.cluster</name>
   <value>elasticsearch</value>
   <description>The cluster name to discover. Either host and port must be defined
   or cluster.</description>
 </property>
  
 I then created a folder called urls and added seed.txt.
 I ran the following commands:
 bin/nutch inject urls
 bin/nutch generate -topN 1000
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch index -all
 It runs without errors; however, no documents have been indexed.
 I also tried setting this up with Solr, and no documents are indexed either.
 Log:
 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
 2014-06-24 02:57:57, time elapsed: 00:00:06
 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title 
 length for indexing set to: 100
 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
 deduplication is: off
 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
 native-hadoop library for your platform... using builtin-java classes where 
 applicable
 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
 org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
 pid[21885], build[2181e11/2014-03-25T15:59:51Z]
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], 
 sites []
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] 
 bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
 {inet[/10.0.2.15:9301]}
 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master 
 [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
 added 
 {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
  Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
 zen-disco-receive(from master 
 [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] 
 elasticsearch/jXIC3VT6THukKDFB7GMw7Q
 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address 
 {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title 
 length for indexing set to: 100
 2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 

[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043477#comment-14043477
 ] 

Aaron Bedward edited comment on NUTCH-1798 at 6/25/14 2:03 PM:
---

Using the command 

$ ./bin/nutch readdb -dump db -crawlId testcrawl

I have attached the output: part-r-0
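
A dump like this is easier to read as a tally of the `status:` lines, since pages that were never fetched and parsed have nothing to index. The following sketch assumes the dump keeps one `status: N (name)` line per record, as in the output posted earlier in this thread; `count_statuses` is a hypothetical name and the dump is read from standard input.

```shell
# Tally the "status:" lines of a readdb dump read from stdin.
# Output: one "<count><TAB><status>" line per distinct status, largest first.
count_statuses() {
  awk '/^status:/ { sub(/^status:[ \t]*/, ""); n[$0]++ }
       END { for (s in n) print n[s] "\t" s }' | sort -rn
}
```

Usage is `count_statuses < DUMPFILE`, where DUMPFILE is a placeholder for the dump produced by the readdb command above.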



was (Author: mrbedward):
Using the command 

$ ./bin/nutch readdb -dump db -crawlId testcrawl


Output


http://google.com/  key:com.google:http/
baseUrl:http://google.com/
status: 4 (status_redir_temp)
fetchTime:  1406290992771
prevFetchTime:  1403698972410
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: TEMP_MOVED, 
args=[http://www.google.co.uk/?gfe_rd=crei=ML-qU-jpIOHR8gfw34CABw]
parseStatus:(null)
title:  null
score:  1.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
reprUrl:null
batchId:1403698977-26703
metadata _csh_ :
metadata _rs_ :8

http://wiki.apache.org/ key:org.apache.wiki:http/
baseUrl:http://wiki.apache.org/
status: 1 (status_unfetched)
fetchTime:  1403704161974
prevFetchTime:  1403704069065
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: TEMP_MOVED, args=[http://wiki.apache.org/general/]
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:http://wiki.apache.org/
batchId:1403704073-19409
metadata _csh_ :

http://wiki.apache.org/HttpComponents/FrontPage key:
org.apache.wiki:http/HttpComponents/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704161990
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/ant/FrontPagekey:
org.apache.wiki:http/ant/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704161993
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/apachecon/FrontPage  key:
org.apache.wiki:http/apachecon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162001
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/avalon/FrontPage key:
org.apache.wiki:http/avalon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162004
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/beehive/FrontPage key:
org.apache.wiki:http/beehive/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162007
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/cassandra/FrontPage  key:
org.apache.wiki:http/cassandra/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162012
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/clerezza/FrontPage   key:
org.apache.wiki:http/clerezza/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162015
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:

[jira] [Updated] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Bedward updated NUTCH-1798:
-

Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9  (was: Ubuntu 
13.10, Elasticsearch 1)


[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043531#comment-14043531
 ] 

Aaron Bedward commented on NUTCH-1798:
--

Pardon me, I'm a newbie!

WebTable statistics start
Statistics for WebTable:
status 38 (status_notmodified): 1
status 2 (status_fetched):  78
min score:  0.0
retry 0:    1981
jobs:   
{[testcrawl]db_stats-job_local2072251870_0001={jobID=job_local2072251870_0001, 
jobName=[testcrawl]db_stats, counters={File Input Format Counters 
={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, 
MAP_INPUT_RECORDS=1984, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
MAP_OUTPUT_BYTES=105153, COMMITTED_HEAP_BYTES=535683072, CPU_MILLISECONDS=0, 
SPLIT_RAW_BYTES=793, COMBINE_INPUT_RECORDS=7936, REDUCE_INPUT_RECORDS=12, 
REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=7936}, 
FileSystemCounters={FILE_BYTES_READ=974939, FILE_BYTES_WRITTEN=1150567}, File 
Output Format Counters ={BYTES_WRITTEN=375
retry 1:    3
status 5 (status_redir_perm):   7
max score:  1.0
TOTAL urls: 1984
status 3 (status_gone): 6
status 4 (status_redir_temp):   8
status 1 (status_unfetched):    1884
avg score:  0.002016129
WebTable statistics: done
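
A quick way to read the statistics above is to compare the fetched count with the total. The snippet below is only a sketch: the three sample lines are copied from the output above, and the helper name count_for is made up for illustration. In practice the text would come from the readdb statistics output rather than a here-string.

```shell
# Sample lines copied from the WebTable statistics above.
stats='status 2 (status_fetched):  78
TOTAL urls: 1984
status 1 (status_unfetched):    1884'

# count_for is a hypothetical helper: print the number after the colon
# on the line that contains the given label.
count_for() {
  printf '%s\n' "$stats" |
    awk -F: -v label="$1" 'index($0, label) { gsub(/[^0-9]/, "", $2); print $2 }'
}

fetched=$(count_for 'status_fetched')
total=$(count_for 'TOTAL urls')
echo "fetched $fetched of $total URLs"
```

With these numbers only a small share of the table has been fetched, so only those rows can be candidates for parsing and indexing.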



[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043555#comment-14043555
 ] 

Aaron Bedward commented on NUTCH-1798:
--

OK, I chose this version because it already had Elasticsearch integration and I 
was hoping to extract meta tags out of the box.

I will try debugging and report back with my progress.


[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043555#comment-14043555
 ] 

Aaron Bedward edited comment on NUTCH-1798 at 6/25/14 2:45 PM:
---

Thank you for your help

OK, I chose this version because it already had Elasticsearch integration and I 
was hoping to extract meta tags out of the box.

I will try debugging and report back with my progress.


was (Author: mrbedward):
OK, I chose this version because it already had Elasticsearch integration and I 
was hoping to extract meta tags out of the box.

I will try debugging and report back with my progress.


[jira] [Created] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-24 Thread Aaron Bedward (JIRA)
Aaron Bedward created NUTCH-1798:


 Summary: Unable to get any documents to index in elastic search
 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1
Reporter: Aaron Bedward


Hopefully this is something I am doing wrong.  I have checked out 2.x as I 
would like to use the new metatag extraction features.  I then ran ant 
runtime to build and updated nutch-site.xml like so:


<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be
  defined or cluster.</description>
</property>
 

I then created a folder called urls and added seed.txt.

I ran the following commands:
bin/nutch inject urls
bin/nutch generate -topN 1000  
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb

bin/nutch index  -all 

It runs with no errors; however, no documents have been indexed.

I also tried the same setup with Solr, and no documents were indexed either.
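
For anyone repeating these steps, the command sequence above can be wrapped so that it stops at the first step that fails. This is a sketch only: the commands and flags are exactly the ones listed above, while the NUTCH variable and the run_cycle function name are illustrative assumptions, not part of Nutch.

```shell
# Path to the nutch launcher; NUTCH is an illustrative variable, override as needed.
NUTCH=${NUTCH:-bin/nutch}

# run_cycle is a hypothetical wrapper around the steps listed above.
# Each step runs only if the previous one exited with status 0.
run_cycle() {
  "$NUTCH" inject urls &&
  "$NUTCH" generate -topN 1000 &&
  "$NUTCH" fetch -all &&
  "$NUTCH" parse -all &&
  "$NUTCH" updatedb &&
  "$NUTCH" index -all
}
```

Calling run_cycle from the runtime directory runs one full round. A non-zero exit status only shows that some step failed outright, so it does not catch the case described here, where every step succeeds but nothing is indexed.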


Log:

2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
2014-06-24 02:57:57, time elapsed: 00:00:06
2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
pid[21885], build[2181e11/2014-03-25T15:59:51Z]
2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], sites 
[]
2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] bound_address 
{inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master 
[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
added 
{[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
 Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
zen-disco-receive(from master 
[[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] 
elasticsearch/jXIC3VT6THukKDFB7GMw7Q
2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address 
{inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-24 02:58:11,581 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-24 02:58:11,716 INFO  elastic.ElasticIndexWriter - Processing remaining 
requests [docs = 0, length = 0, total docs = 0]
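
The last log line is the one that shows the problem: the writer reports total docs = 0. A small check like the following can pull that number out of a log line. It is a sketch under assumptions: the sample line is copied from the log above, and in practice the line would be taken from the Nutch log file instead of a variable.

```shell
# Final ElasticIndexWriter line, copied from the log above and joined onto one line.
line='2014-06-24 02:58:11,716 INFO  elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]'

# Extract the number that follows "total docs = ".
total=$(printf '%s\n' "$line" | sed -n 's/.*total docs = \([0-9][0-9]*\).*/\1/p')

if [ "$total" -eq 0 ]; then
  echo "no documents were sent to Elasticsearch"
else
  echo "$total documents were sent to Elasticsearch"
fi
```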