[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-27 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045655#comment-14045655
 ] 

Julien Nioche commented on NUTCH-1798:
--

Good catch Aaron! 
If you look at the nutch script, the solrindex command is just an alias for 
org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1, i.e. it expects 
the argument that follows it to be the SOLR URL. The crawl script should either 
set the common options after the SOLR URL or use the generic index command 
instead, as you suggested (like in Nutch 1.x). In your case, since you are not 
using SOLR at all, you could modify the script so that it does not pass the 
SOLR config at all.
I will rename this issue to reflect the nature of the problem and commit a fix. 
Thanks!
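
To make the argument-order problem concrete, here is a minimal sketch. The function solrindex_cmdline is a hypothetical stand-in for the solrindex alias described above (it is not the real bin/nutch code), and the option and URL values are illustrative assumptions only.

```shell
# Hypothetical stand-in for the solrindex alias: it treats its first
# argument as the SOLR URL and passes the remaining arguments through.
solrindex_cmdline() {
  url="$1"
  shift
  echo "org.apache.nutch.indexer.IndexingJob -D solr.server.url=$url $*"
}

commonOptions="-D mapred.reduce.tasks=2"   # illustrative value only
SOLRURL="http://localhost:8983/solr"       # illustrative value only

# Common options first: "-D" is taken as the SOLR URL.
solrindex_cmdline $commonOptions $SOLRURL -all -crawlId test

# SOLR URL first: the URL lands where the alias expects it.
solrindex_cmdline $SOLRURL $commonOptions -all -crawlId test
```

In the first call the resulting property is solr.server.url=-D and the real URL is left among the trailing arguments, which would be consistent with the job running without indexing anything.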
 

 Unable to get any documents to index in elastic search
 --

 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
 Fix For: 2.3

 Attachments: part-r-0


 Hopefully this is something I am doing wrong.  I have checked out 2.x as I 
 would like to use the new metatag extraction features.  I then ran ant 
 runtime to build and updated nutch-site.xml like so:
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable 
   protocol-httpclient, but be aware of possible intermittent problems with the 
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>elastic.cluster</name>
   <value>elasticsearch</value>
   <description>The cluster name to discover. Either host and port must be 
   defined or cluster.</description>
 </property>
  
 I then created a folder called urls and added seed.txt.
 I ran the following commands:
 bin/nutch inject urls
 bin/nutch generate -topN 1000  
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch index  -all 
 It runs with no errors; however, no documents have been indexed.
 I also tried the same setup with Solr, and no documents were indexed.
 Log:
 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
 2014-06-24 02:57:57, time elapsed: 00:00:06
 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title 
 length for indexing set to: 100
 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
 deduplication is: off
 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
 native-hadoop library for your platform... using builtin-java classes where 
 applicable
 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
 org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
 pid[21885], build[2181e11/2014-03-25T15:59:51Z]
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], 
 sites []
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] 
 bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
 {inet[/10.0.2.15:9301]}
 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master 
 [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
 added 
 {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
  Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
 zen-disco-receive(from master 
 [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
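 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] 
 elasticsearch/jXIC3VT6THukKDFB7GMw7Q
 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address 
 {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title 
 length for indexing set to: 100
 2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.basic.BasicIndexingFilter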

[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-26 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045224#comment-14045224
 ] 

Aaron Bedward commented on NUTCH-1798:
--

Right... I have made a few observations (I may have misunderstood the 
architecture, but please bear with me).

I have managed to get 2.x indexing with ES by making the following change to 
the crawl script:

Line 149:  echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
-Line 150:  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
+Line 150:  $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call: ./bin/crawl urls test http://localhost:9300 2

However, I believe the script should use $bin/nutch index -D 
solr.server.url=$SOLRURL 

Hope this helps anybody trying to use ES. I will commit my MongoDB source code 
over the weekend.
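
As a rough sketch of the generic index form suggested above, the following prints the command rather than running it. The helper name index_cmdline and all variable values are illustrative assumptions; the variable names mirror those quoted from the crawl script.

```shell
# Dry run only: prints the indexing command instead of executing it.
bin="./bin"                                # assumed location of the nutch script
commonOptions="-D mapred.reduce.tasks=2"   # illustrative value only
SOLRURL="http://localhost:8983/solr"       # illustrative value only
CRAWL_ID="test"                            # illustrative value only

index_cmdline() {
  # All -D properties come before -all and -crawlId, so nothing is
  # interpreted by its position in the argument list.
  echo "$bin/nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId $CRAWL_ID"
}

index_cmdline
```

Because the URL is passed as an explicit property here, the common options can stay where the crawl script already puts them.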

[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043270#comment-14043270
 ] 

Julien Nioche commented on NUTCH-1798:
--

The parser took only 6 secs, which is slightly suspicious. It could be something 
to do with using -all vs specifying a given batch. Could you try running 
another crawl but using the crawl script instead? Just modify it so that it 
does not call the solr indexer. It will handle the batchIDs for you.
 
Alternatively, can you use the nutch readdb command and see what you are getting?
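
For illustration, a crawl round driven by an explicit batch ID rather than -all might look like the sketch below. It only prints the commands. The flag spellings (in particular -batchId on generate and a batch ID argument on updatedb) are assumptions about the 2.x tools and should be checked against each command's usage output; the crawl ID and the ID format are also illustrative.

```shell
CRAWL_ID="test"              # illustrative crawl id
batchId="$(date +%s)-$$"     # timestamp plus shell PID as a unique suffix (assumed format)

round_cmdlines() {
  # Each step names the same batch, so fetch and parse only touch the
  # rows that this round's generate step marked.
  echo "bin/nutch generate -topN 1000 -crawlId $CRAWL_ID -batchId $batchId"
  echo "bin/nutch fetch $batchId -crawlId $CRAWL_ID"
  echo "bin/nutch parse $batchId -crawlId $CRAWL_ID"
  echo "bin/nutch updatedb $batchId -crawlId $CRAWL_ID"
}

round_cmdlines
```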

[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043462#comment-14043462
 ] 

Aaron Bedward commented on NUTCH-1798:
--

I tried the following command:

./bin/crawl urls crawl nutch 2

I removed 

  echo SOLR dedup - $SOLRURL
  $bin/nutch solrdedup $commonOptions $SOLRURL
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

from the crawl script; however, I left the following part the same:

echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
  
  if [ $? -ne 0 ] 
   then exit $? 
  fi

Still no documents indexed.  Log file:


2014-06-25 14:48:27,728 INFO  crawl.FetchScheduleFactory - Using FetchSchedule 
impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-25 14:48:27,728 INFO  crawl.AbstractFetchSchedule - 
defaultInterval=2592000
2014-06-25 14:48:27,728 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-25 14:48:27,761 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2014-06-25 14:48:28,470 INFO  crawl.DbUpdaterJob - DbUpdaterJob: finished at 
2014-06-25 14:48:28, time elapsed: 00:00:05
2014-06-25 14:48:29,792 INFO  indexer.IndexingJob - IndexingJob: starting
2014-06-25 14:48:30,901 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-25 14:48:30,906 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:31,777 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:31,788 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-25 14:48:31,789 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:34,123 WARN  util.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
2014-06-25 14:48:35,116 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:35,526 INFO  elasticsearch.node - [Master of Vengeance] 
version[1.1.0], pid[10180], build[2181e11/2014-03-25T15:59:51Z]
2014-06-25 14:48:35,526 INFO  elasticsearch.node - [Master of Vengeance] 
initializing ...
2014-06-25 14:48:35,660 INFO  elasticsearch.plugins - [Master of Vengeance] 
loaded [], sites []
2014-06-25 14:48:38,837 INFO  elasticsearch.node - [Master of Vengeance] 
initialized
2014-06-25 14:48:38,837 INFO  elasticsearch.node - [Master of Vengeance] 
starting ...
2014-06-25 14:48:38,970 INFO  elasticsearch.transport - [Master of Vengeance] 
bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
{inet[/10.0.2.15:9301]}
2014-06-25 14:48:42,106 INFO  cluster.service - [Master of Vengeance] 
detected_master [Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]], added 
{[X-Treme][CKEftQrsShaXeNWbCV_ZAg][nutch][inet[/10.0.2.15:9300]],[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]],}, 
reason: zen-disco-receive(from master [[Speed 
Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-25 14:48:42,120 INFO  elasticsearch.discovery - [Master of Vengeance] 
elasticsearch/zH-oAjvTTEyg1l4aOJM_lg
2014-06-25 14:48:42,132 INFO  elasticsearch.http - [Master of Vengeance] 
bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address 
{inet[/10.0.2.15:9201]}
2014-06-25 14:48:42,135 INFO  elasticsearch.node - [Master of Vengeance] started
2014-06-25 14:48:42,142 INFO  basic.BasicIndexingFilter - Maximum title length 
for indexing set to: 100
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:42,142 INFO  anchor.AnchorIndexingFilter - Anchor 
deduplication is: off
2014-06-25 14:48:42,142 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:42,247 INFO  elastic.ElasticIndexWriter - Processing remaining 
requests [docs = 0, length = 0, total docs = 0]
2014-06-25 14:48:42,247 INFO  elastic.ElasticIndexWriter - Processing to 
finalize last execute
2014-06-25 14:48:42,247 INFO  elasticsearch.node - [Master of Vengeance] 
stopping ...
2014-06-25 14:48:42,279 INFO  elasticsearch.node - [Master of Vengeance] stopped
2014-06-25 14:48:42,280 INFO  elasticsearch.node - [Master of Vengeance] 
closing ...
2014-06-25 14:48:42,286 INFO  elasticsearch.node - [Master of Vengeance] closed
2014-06-25 14:48:42,289 WARN  mapred.FileOutputCommitter - Output path is null 
in cleanup
2014-06-25 14:48:42,740 INFO  indexer.IndexWriters - Adding 
org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:42,740 INFO  indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter

[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043466#comment-14043466
 ] 

Julien Nioche commented on NUTCH-1798:
--

And what's the output of 'nutch readdb'?

[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043477#comment-14043477
 ] 

Aaron Bedward commented on NUTCH-1798:
--

Using the command 

$ ./bin/nutch readdb -dump db -crawlId testcrawl


Output


http://google.com/  key:com.google:http/
baseUrl:http://google.com/
status: 4 (status_redir_temp)
fetchTime:  1406290992771
prevFetchTime:  1403698972410
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: TEMP_MOVED, 
args=[http://www.google.co.uk/?gfe_rd=cr&ei=ML-qU-jpIOHR8gfw34CABw]
parseStatus:(null)
title:  null
score:  1.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
reprUrl:null
batchId:1403698977-26703
metadata _csh_ :
metadata _rs_ :8

http://wiki.apache.org/ key:org.apache.wiki:http/
baseUrl:http://wiki.apache.org/
status: 1 (status_unfetched)
fetchTime:  1403704161974
prevFetchTime:  1403704069065
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: TEMP_MOVED, args=[http://wiki.apache.org/general/]
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:http://wiki.apache.org/
batchId:1403704073-19409
metadata _csh_ :

http://wiki.apache.org/HttpComponents/FrontPage key:
org.apache.wiki:http/HttpComponents/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704161990
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/ant/FrontPage key:
org.apache.wiki:http/ant/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704161993
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/apachecon/FrontPage  key:
org.apache.wiki:http/apachecon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162001
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/avalon/FrontPage key:
org.apache.wiki:http/avalon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162004
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/beehive/FrontPage key:
org.apache.wiki:http/beehive/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162007
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/cassandra/FrontPage  key:
org.apache.wiki:http/cassandra/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162012
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/clerezza/FrontPage   key:
org.apache.wiki:http/clerezza/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime:  1403704162015
prevFetchTime:  0
fetchInterval:  2592000
retriesSinceFetch:  0
modifiedTime:   0
prevModifiedTime:   0
protocolStatus: (null)
parseStatus:(null)
title:  null
score:  0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/cocoon-lenya/FrontPage   key:
org.apache.wiki:http/cocoon-lenya/FrontPage
baseUrl:null
status: 1 

[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043483#comment-14043483
 ] 

Julien Nioche commented on NUTCH-1798:
--

I bet it is large ;-) I meant to use readdb with the -stats option to get a 
global view of how many docs have been fetched. Maybe combine it with -crawlId 
to check the status of the docs for that specific crawlid
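
If a text dump has already been produced, a per-status tally can be approximated from it, as a rough complement to readdb -stats. This is a sketch only: count_statuses is a hypothetical helper, and it assumes the dump has one "status: N (name)" line per record, as in the output quoted earlier in this thread.

```shell
# Hypothetical helper: tallies the "status:" lines of a readdb text dump
# and prints "<count> <status name>", most frequent first.
count_statuses() {
  awk '
    /^status:/ {
      name = "unknown"
      # Take the name between parentheses, e.g. status_unfetched.
      if (match($0, /\([^)]*\)/)) name = substr($0, RSTART + 1, RLENGTH - 2)
      n[name]++
    }
    END { for (s in n) print n[s], s }
  ' "$1" | sort -rn
}

# Small demonstration on status lines copied from the dump above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
status: 4 (status_redir_temp)
status: 1 (status_unfetched)
status: 1 (status_unfetched)
EOF
count_statuses "$sample"
rm -f "$sample"
```

On a real dump the function would be pointed at the part file inside the dump directory; a result dominated by status_unfetched would suggest the fetch step did not process the generated batch.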

 Unable to get any documents to index in elastic search
 --

 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
 Fix For: 2.3

 Attachments: part-r-0


 Hopefully this is something i am doing wrong.  I have checked out 2.x as i 
 would like to use the new metatag extraction features.  I have then run ant 
 runtime to build,  I have updated the nutch-site.xml like so:
 property
   nameplugin.includes/name
  
 valueprotocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic/value
  descriptionRegular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable 
   protocol-httpclient, but be aware of possible intermittent problems with 
 the 
   underlying commons-httpclient library.
   /description
 /property
   property
   nameelastic.cluster/name
   valueelasticsearch/value
   descriptionThe cluster name to discover. Either host and potr must be 
 defined
 or cluster./description
   /property
  
 I have then created a folder called urls and added seed.txt.
 i ran the following commands 
 bin/nutch inject urls
 bin/nutch generate -topN 1000  
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch index  -all 
 it runs no errors however no documents have been index
 i also tried setting up the following with solr and no documents are indexed
 Log:
 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], sites []
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title length for indexing 

[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043531#comment-14043531
 ] 

Aaron Bedward commented on NUTCH-1798:
--

Pardon me, I'm a newbie! 

WebTable statistics start
Statistics for WebTable:
status 38 (status_notmodified): 1
status 2 (status_fetched): 78
min score: 0.0
retry 0: 1981
jobs:   
{[testcrawl]db_stats-job_local2072251870_0001={jobID=job_local2072251870_0001, 
jobName=[testcrawl]db_stats, counters={File Input Format Counters 
={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, 
MAP_INPUT_RECORDS=1984, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, 
MAP_OUTPUT_BYTES=105153, COMMITTED_HEAP_BYTES=535683072, CPU_MILLISECONDS=0, 
SPLIT_RAW_BYTES=793, COMBINE_INPUT_RECORDS=7936, REDUCE_INPUT_RECORDS=12, 
REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, 
REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=7936}, 
FileSystemCounters={FILE_BYTES_READ=974939, FILE_BYTES_WRITTEN=1150567}, File 
Output Format Counters ={BYTES_WRITTEN=375
retry 1: 3
status 5 (status_redir_perm): 7
max score: 1.0
TOTAL urls: 1984
status 3 (status_gone): 6
status 4 (status_redir_temp): 8
status 1 (status_unfetched): 1884
avg score: 0.002016129
WebTable statistics: done
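
The statistics above already answer one question: pages were fetched (78 with status_fetched out of 1984 URLs), so the table is not empty. As a rough sketch, the same check can be scripted by parsing the status lines. The function name and the regular expressions are illustrative assumptions about this output format, not part of Nutch.

```python
import re

# A few lines taken from the WebTable statistics above.
SAMPLE = """\
status 2 (status_fetched):  78
retry 0:1981
status 1 (status_unfetched):1884
TOTAL urls: 1984
"""

STATUS_LINE = re.compile(r"^status \d+ \((status_\w+)\):\s*(\d+)$")
TOTAL_LINE = re.compile(r"^TOTAL urls:\s*(\d+)$")


def summarize(text):
    """Return (total, {status_name: count}) parsed from the statistics text."""
    total, statuses = None, {}
    for line in text.splitlines():
        line = line.strip()
        match = STATUS_LINE.match(line)
        if match:
            statuses[match.group(1)] = int(match.group(2))
            continue
        match = TOTAL_LINE.match(line)
        if match:
            total = int(match.group(1))
    return total, statuses


if __name__ == "__main__":
    total, statuses = summarize(SAMPLE)
    fetched = statuses.get("status_fetched", 0)
    print("fetched %d of %d URLs" % (fetched, total))
```

Lines that are neither a status line nor the total (scores, retries, the jobs dump) are simply ignored.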



[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043546#comment-14043546
 ] 

Julien Nioche commented on NUTCH-1798:
--

No problem Aaron! OK, so it looks like you do have documents in the table that 
were successfully fetched. Unfortunately 2.x lacks many of the functionalities 
that 1.x has (not to mention robustness) and that are useful for testing, e.g. 
indexer-dummy or [NUTCH-1758]. If you have good reasons to use 2.x and not 1.x, 
the best approach would be to either port these 2 patches to 2.x or debug in 
local mode to see what's happening. 

See 
[http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse_.28NOT_VERIFIED.29]
 (no idea who thought it was not verified, it should work fine) for advice on 
how to debug.


[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043555#comment-14043555
 ] 

Aaron Bedward commented on NUTCH-1798:
--

OK, I chose this version because it already had Elasticsearch integration and I 
was hoping to extract meta tags out of the box.

I will try debugging and report back with my progress.


[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-25 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043910#comment-14043910
 ] 

Julien Nioche commented on NUTCH-1798:
--

You are welcome

bq. OK, I chose this version because it already had Elasticsearch integration 
and I was hoping to extract meta tags out of the box.

Just commenting on this so that other users don't get the wrong idea. Nutch 1.x 
does both indexing with ElasticSearch and meta tags extraction. 

bq. I will try debugging and report back with my progress.

Great. Would be good to get to the bottom of this one.



[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-24 Thread Fjodor Vershinin (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042030#comment-14042030
 ] 

Fjodor Vershinin commented on NUTCH-1798:
-

I just checked, I have the same issue with nutch 2.x and elasticsearch.

 Unable to get any documents to index in elastic search
 --

 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1
Reporter: Aaron Bedward

 Hopefully this is something I am doing wrong. I checked out 2.x because I 
 would like to use the new metatag extraction features. I then ran ant 
 runtime to build and updated nutch-site.xml like so:
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>elastic.cluster</name>
   <value>elasticsearch</value>
   <description>The cluster name to discover. Either host and port must be
   defined or cluster.</description>
 </property>
  
 I then created a folder called urls and added seed.txt.
 I ran the following commands:
 bin/nutch inject urls
 bin/nutch generate -topN 1000
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch index -all
 The commands run without errors, but no documents are indexed.
 I also tried the same setup with Solr, and no documents were indexed there either.
 Log:
 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], sites []
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
 2014-06-24 02:58:11,553 INFO  elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
 2014-06-24 02:58:11,562 INFO  elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
 2014-06-24 02:58:11,566 INFO  elasticsearch.node - [Silver] started
 2014-06-24 02:58:11,568 INFO  basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
 2014-06-24 02:58:11,569 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:11,581 INFO  indexer.IndexingFilters - Adding