[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045655#comment-14045655 ]

Julien Nioche commented on NUTCH-1798:
--------------------------------------

Good catch, Aaron! If you look at the nutch script, the solrindex command is just an alias for

org.apache.nutch.indexer.IndexingJob -D solr.server.url=$1

i.e. it expects the argument that follows to be the SOLR URL. The crawl script should either set the common options after the SOLR URL or use the generic index command instead, as you suggested (like in Nutch 1.x). In your case, since you are not using SOLR at all, you could modify the script so that it does not pass the SOLR config at all.

I will rename this issue to reflect the nature of the problem and commit a fix. Thanks!

Unable to get any documents to index in elastic search
------------------------------------------------------

Key: NUTCH-1798
URL: https://issues.apache.org/jira/browse/NUTCH-1798
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
Fix For: 2.3
Attachments: part-r-0

Hopefully this is something I am doing wrong. I have checked out 2.x as I would like to use the new metatag extraction features. I have then run ant runtime to build, and I have updated the nutch-site.xml like so:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
  </description>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be defined or cluster.</description>
</property>

I have then created a folder called urls and added seed.txt. I ran the following commands:

bin/nutch inject urls
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch index -all

It runs with no errors; however, no documents have been indexed. I also tried the same setup with Solr and no documents are indexed.

Log:

2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding
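To make the argument-order problem described in Julien's comment concrete, here is a minimal dry-run sketch. It does not call Nutch at all: `expand_solrindex` is a hypothetical stand-in for the alias expansion quoted above, and the option `mapred.reduce.tasks=2` and the URL `http://localhost:8983/solr` are illustrative values, not taken from the issue.

```shell
#!/bin/sh
# Hypothetical stand-in for the solrindex alias: the first argument is
# consumed as the Solr URL, everything else is passed through unchanged.
expand_solrindex() {
  solr_url="$1"
  shift
  printf '%s\n' "org.apache.nutch.indexer.IndexingJob -D solr.server.url=$solr_url $*"
}

# Common options first: "-D" is swallowed as the Solr URL.
expand_solrindex -D mapred.reduce.tasks=2 http://localhost:8983/solr -all

# Solr URL first, common options afterwards: the expansion is well formed.
expand_solrindex http://localhost:8983/solr -D mapred.reduce.tasks=2 -all
```

In the first call the printed command contains `solr.server.url=-D`, which is the kind of mangled argument list that could leave the indexing job with nothing to do.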
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045224#comment-14045224 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

Right... I have made a few observations (I may have misunderstood the architecture, but please bear with me).

I have managed to get 2.x indexing with ES by making the following changes to the crawl script:

Line 149: echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
-Line 150: $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
+Line 150: $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call:

./bin/crawl urls test http://localhost:9300 2

However, I believe the script should use

$bin/nutch index -D solr.server.url=$SOLRURL

Hope this helps anybody trying to use ES. I will commit my source code for MongoDB over the weekend.
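The indexing step proposed in Aaron's comment could be sketched as follows. This is a dry run that only prints the command it would execute; `build_index_cmd` is a hypothetical helper, and leaving the Solr setting out when no URL is given is one possible reading of the suggestion not to pass any SOLR config when Solr is not in use.

```shell
#!/bin/sh
# Dry-run sketch: prints the indexing command instead of running it.
# The two parameters mirror $SOLRURL and $CRAWL_ID from the crawl script;
# build_index_cmd itself is a hypothetical helper.
build_index_cmd() {
  solr_url="$1"
  crawl_id="$2"
  if [ -n "$solr_url" ]; then
    # Generic index command with the Solr URL passed as a property.
    printf '%s\n' "bin/nutch index -D solr.server.url=$solr_url -all -crawlId $crawl_id"
  else
    # No Solr in use (for example Elasticsearch only): pass no Solr setting.
    printf '%s\n' "bin/nutch index -all -crawlId $crawl_id"
  fi
}

build_index_cmd "http://localhost:8983/solr" test
build_index_cmd "" test
```

Printing rather than executing keeps the sketch safe to run on a machine without Nutch; the URL and crawl id are placeholders.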
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043270#comment-14043270 ]

Julien Nioche commented on NUTCH-1798:
--------------------------------------

The parser took only 6 secs, which is slightly suspicious. It could be something to do with using -all vs specifying a given batch. Could you try running another crawl, but using the crawl script instead? Just modify it so that it does not call the solr indexer. It will handle the batchIDs for you. Alternatively, can you use the nutch readdb command and see what you are getting?
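As a rough illustration of what "handling the batchIDs" means, the sketch below prints one crawl round in which generate, fetch and parse all refer to the same batch id instead of -all. It is a dry run with hypothetical helpers (`make_batch_id`, `print_round`); the `<epoch>-<number>` id shape is borrowed from the batchId values visible in the readdb dump elsewhere in this thread, and the exact flags accepted by each command are an assumption that should be checked against the usage output of bin/nutch for your version.

```shell
#!/bin/sh
# Dry-run sketch of one crawl round that shares a single batch id.
# Flags are assumptions; nothing here invokes Nutch.
make_batch_id() {
  # $1: epoch seconds, $2: any number that makes the id unique
  printf '%s-%s\n' "$1" "$2"
}

print_round() {
  batch_id="$1"
  crawl_id="$2"
  printf '%s\n' "bin/nutch generate -topN 1000 -crawlId $crawl_id -batchId $batch_id"
  printf '%s\n' "bin/nutch fetch $batch_id -crawlId $crawl_id"
  printf '%s\n' "bin/nutch parse $batch_id -crawlId $crawl_id"
  printf '%s\n' "bin/nutch updatedb -crawlId $crawl_id"
}

# The process id stands in for a random suffix.
print_round "$(make_batch_id "$(date +%s)" "$$")" testcrawl
```

Passing the same id to fetch and parse restricts each step to the URLs selected by the preceding generate, which is the difference from the -all runs in the original report.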
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043462#comment-14043462 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

Tried the following command:

./bin/crawl urls crawl nutch 2

I removed

echo SOLR dedup - $SOLRURL
$bin/nutch solrdedup $commonOptions $SOLRURL
if [ $? -ne 0 ]
then exit $?
fi

from the crawl script; however, I left the following part the same:

echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
$bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
if [ $? -ne 0 ]
then exit $?
fi

Still no documents indexed.

Log file:

2014-06-25 14:48:27,728 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2014-06-25 14:48:27,728 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2014-06-25 14:48:27,728 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2014-06-25 14:48:27,761 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2014-06-25 14:48:28,470 INFO crawl.DbUpdaterJob - DbUpdaterJob: finished at 2014-06-25 14:48:28, time elapsed: 00:00:05
2014-06-25 14:48:29,792 INFO indexer.IndexingJob - IndexingJob: starting
2014-06-25 14:48:30,901 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-25 14:48:30,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:31,777 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:31,788 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-25 14:48:31,789 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:34,123 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-25 14:48:35,116 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:35,526 INFO elasticsearch.node - [Master of Vengeance] version[1.1.0], pid[10180], build[2181e11/2014-03-25T15:59:51Z]
2014-06-25 14:48:35,526 INFO elasticsearch.node - [Master of Vengeance] initializing ...
2014-06-25 14:48:35,660 INFO elasticsearch.plugins - [Master of Vengeance] loaded [], sites []
2014-06-25 14:48:38,837 INFO elasticsearch.node - [Master of Vengeance] initialized
2014-06-25 14:48:38,837 INFO elasticsearch.node - [Master of Vengeance] starting ...
2014-06-25 14:48:38,970 INFO elasticsearch.transport - [Master of Vengeance] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
2014-06-25 14:48:42,106 INFO cluster.service - [Master of Vengeance] detected_master [Speed Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[X-Treme][CKEftQrsShaXeNWbCV_ZAg][nutch][inet[/10.0.2.15:9300]],[Speed Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]],}, reason: zen-disco-receive(from master [[Speed Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-25 14:48:42,120 INFO elasticsearch.discovery - [Master of Vengeance] elasticsearch/zH-oAjvTTEyg1l4aOJM_lg
2014-06-25 14:48:42,132 INFO elasticsearch.http - [Master of Vengeance] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
2014-06-25 14:48:42,135 INFO elasticsearch.node - [Master of Vengeance] started
2014-06-25 14:48:42,142 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-25 14:48:42,142 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-25 14:48:42,142 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-25 14:48:42,142 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-25 14:48:42,142 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-25 14:48:42,247 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
2014-06-25 14:48:42,247 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
2014-06-25 14:48:42,247 INFO elasticsearch.node - [Master of Vengeance] stopping ...
2014-06-25 14:48:42,279 INFO elasticsearch.node - [Master of Vengeance] stopped
2014-06-25 14:48:42,280 INFO elasticsearch.node - [Master of Vengeance] closing ...
2014-06-25 14:48:42,286 INFO elasticsearch.node - [Master of Vengeance] closed
2014-06-25 14:48:42,289 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2014-06-25 14:48:42,740 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-25 14:48:42,740 INFO indexer.IndexingJob - Active IndexWriters : ElasticIndexWriter
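The telling line in this log is the ElasticIndexWriter summary reporting "total docs = 0". A small filter like the one below can pull that number out of a longer log; `indexed_doc_counts` is a hypothetical helper name, and the sample line is copied from the log in this comment.

```shell
#!/bin/sh
# Reads log text on stdin and prints the number after "total docs = "
# for every ElasticIndexWriter "Processing remaining requests" line.
indexed_doc_counts() {
  sed -n 's/.*Processing remaining requests \[.*total docs = \([0-9][0-9]*\)].*/\1/p'
}

# Sample line copied from the log in this comment.
printf '%s\n' '2014-06-25 14:48:42,247 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]' | indexed_doc_counts
```

Against a real run it would be fed the job log, for example `indexed_doc_counts < logs/hadoop.log`, assuming the log is written to that location in your setup.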
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043466#comment-14043466 ]

Julien Nioche commented on NUTCH-1798:
--------------------------------------

And what's the output of 'nutch readdb'?
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043477#comment-14043477 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

Using the command

$ ./bin/nutch readdb -dump db -crawlId testcrawl

Output:

http://google.com/
key: com.google:http/
baseUrl: http://google.com/
status: 4 (status_redir_temp)
fetchTime: 1406290992771
prevFetchTime: 1403698972410
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: TEMP_MOVED, args=[http://www.google.co.uk/?gfe_rd=cr&ei=ML-qU-jpIOHR8gfw34CABw]
parseStatus: (null)
title: null
score: 1.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
reprUrl: null
batchId: 1403698977-26703
metadata _csh_ :
metadata _rs_ : 8

http://wiki.apache.org/
key: org.apache.wiki:http/
baseUrl: http://wiki.apache.org/
status: 1 (status_unfetched)
fetchTime: 1403704161974
prevFetchTime: 1403704069065
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: TEMP_MOVED, args=[http://wiki.apache.org/general/]
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: http://wiki.apache.org/
batchId: 1403704073-19409
metadata _csh_ :

http://wiki.apache.org/HttpComponents/FrontPage
key: org.apache.wiki:http/HttpComponents/FrontPage
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1403704161990
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: null
metadata _csh_ :

http://wiki.apache.org/ant/FrontPage
key: org.apache.wiki:http/ant/FrontPage
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1403704161993
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: null
metadata _csh_ :

http://wiki.apache.org/apachecon/FrontPage
key: org.apache.wiki:http/apachecon/FrontPage
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1403704162001
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: null
metadata _csh_ :

http://wiki.apache.org/avalon/FrontPage
key: org.apache.wiki:http/avalon/FrontPage
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1403704162004
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: null
metadata _csh_ :

http://wiki.apache.org/beehive/FrontPage
key: org.apache.wiki:http/beehive/FrontPage
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1403704162007
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: null
metadata _csh_ :

http://wiki.apache.org/cassandra/FrontPage
key: org.apache.wiki:http/cassandra/FrontPage
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1403704162012
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: null
metadata _csh_ :

http://wiki.apache.org/clerezza/FrontPage
key: org.apache.wiki:http/clerezza/FrontPage
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1403704162015
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0
markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl: null
metadata _csh_ :

http://wiki.apache.org/cocoon-lenya/FrontPage
key: org.apache.wiki:http/cocoon-lenya/FrontPage
baseUrl: null
status: 1
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043483#comment-14043483 ]

Julien Nioche commented on NUTCH-1798:
--------------------------------------

I bet it is large ;-) I meant to use readdb with the -stats option to get a global view of how many docs have been fetched. Maybe combine it with -crawlId to check the status of the docs for that specific crawlId.
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043531#comment-14043531 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

Pardon me, I'm a newbie!

    WebTable statistics start
    Statistics for WebTable:
    status 38 (status_notmodified): 1
    status 2 (status_fetched): 78
    min score: 0.0
    retry 0: 1981
    jobs: {[testcrawl]db_stats-job_local2072251870_0001={jobID=job_local2072251870_0001, jobName=[testcrawl]db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1984, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=105153, COMMITTED_HEAP_BYTES=535683072, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=793, COMBINE_INPUT_RECORDS=7936, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=7936}, FileSystemCounters={FILE_BYTES_READ=974939, FILE_BYTES_WRITTEN=1150567}, File Output Format Counters ={BYTES_WRITTEN=375
    retry 1: 3
    status 5 (status_redir_perm): 7
    max score: 1.0
    TOTAL urls: 1984
    status 3 (status_gone): 6
    status 4 (status_redir_temp): 8
    status 1 (status_unfetched): 1884
    avg score: 0.002016129
    WebTable statistics: done
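For the indexing question, the figure that matters in Aaron's WebTable statistics is the number of fetched pages (78 of 1984 URLs). A small sketch for pulling one status count out of such a statistics dump follows; status_count and the file name stats.txt are hypothetical, and the input is assumed to have one status per line in the form shown in the comment.

```shell
# Hypothetical helper: extract the count for one status from WebTable
# statistics text read on stdin, e.g. "status 2 (status_fetched): 78".
# The parentheses around the status name are matched literally, so
# status_fetched does not also match status_unfetched.
status_count() {
  sed -n "s/.*($1):[[:space:]]*\([0-9][0-9]*\).*/\1/p"
}

# Usage, assuming the statistics were saved to a file named stats.txt:
#   status_count status_fetched < stats.txt
```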
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043546#comment-14043546 ]

Julien Nioche commented on NUTCH-1798:
--------------------------------------

No problem Aaron!

OK, so it looks like you do have documents in the table that were successfully fetched. Unfortunately 2.x lacks many of the features that 1.x has (not to mention robustness) and that are useful for testing, e.g. indexer-dummy or [NUTCH-1758]. If you have good reasons to use 2.x and not 1.x, the best approach would be either to port these 2 patches to 2.x or to debug in local mode to see what's happening. See [http://wiki.apache.org/nutch/RunNutchInEclipse#Remote_Debugging_in_Eclipse_.28NOT_VERIFIED.29] (no idea who thought it was not verified, it should work fine) for advice on how to debug.
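As a rough illustration of the local-mode debugging Julien suggests, the indexing job can be started with the JVM waiting for a remote debugger. This is only a sketch: nutch_debug_opts is a hypothetical helper, port 8000 is an arbitrary choice, and it assumes the bin/nutch script passes the NUTCH_OPTS environment variable to the JVM when running locally. The wiki page linked above remains the reference.

```shell
# Hypothetical helper: JVM options that make a Java process wait for a
# remote debugger on the given port (standard JDWP agent syntax).
nutch_debug_opts() {
  port="${1:-8000}"
  printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=${port}"
}

# Usage, assuming bin/nutch forwards NUTCH_OPTS to the JVM in local mode:
#   NUTCH_OPTS="$(nutch_debug_opts 8000)" bin/nutch index -all
```

With suspend=y the job blocks until a debugger attaches, which leaves time to set a breakpoint in the indexing code before any document is processed.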
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043555#comment-14043555 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

OK, I chose this version because it already had Elasticsearch integration and I was hoping to extract meta tags out of the box. I will try debugging and report back with my progress.
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043910#comment-14043910 ]

Julien Nioche commented on NUTCH-1798:
--------------------------------------

You are welcome.

bq. OK, I chose this version because it already had Elasticsearch integration and I was hoping to extract meta tags out of the box.

Just commenting on this so that other users don't get the wrong idea: Nutch 1.x does both indexing with Elasticsearch and meta tag extraction.

bq. I will try debugging and report back with my progress.

Great. It would be good to get to the bottom of this one.
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042030#comment-14042030 ]

Fjodor Vershinin commented on NUTCH-1798:
-----------------------------------------

I just checked; I have the same issue with Nutch 2.x and Elasticsearch.

Unable to get any documents to index in elastic search
------------------------------------------------------

Key: NUTCH-1798
URL: https://issues.apache.org/jira/browse/NUTCH-1798
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Environment: Ubuntu 13.10, Elasticsearch 1
Reporter: Aaron Bedward

Hopefully this is something I am doing wrong. I have checked out 2.x as I would like to use the new metatag extraction features. I then ran ant runtime to build and updated nutch-site.xml like so:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
      <description>Regular expression naming plugin directory names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins. In order to use HTTPS please enable
      protocol-httpclient, but be aware of possible intermittent problems with the
      underlying commons-httpclient library.
      </description>
    </property>
    <property>
      <name>elastic.cluster</name>
      <value>elasticsearch</value>
      <description>The cluster name to discover. Either host and port must be defined or cluster.</description>
    </property>

I then created a folder called urls and added seed.txt. I ran the following commands:

    bin/nutch inject urls
    bin/nutch generate -topN 1000
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb
    bin/nutch index -all

It runs with no errors; however, no documents have been indexed. I also tried the same setup with Solr, and no documents were indexed there either.

Log:

    2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
    2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
    2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
    2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
    2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
    2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
    2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
    2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
    2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
    2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
    2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
    2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
    2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
    2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
    2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
    2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
    2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding