[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045224#comment-14045224 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

Right... I have made a few observations (I may have misunderstood the architecture, so please bear with me).

I have managed to get 2.x indexing with ES by making the following changes to the crawl script:

    Line 149: echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
    Line 150 (removed): $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
    Line 150 (added): $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call:

    ./bin/crawl urls test http://localhost:9300 2

However, I believe the script should use:

    $bin/nutch index -D solr.server.url=$SOLRURL

Hope this helps anybody trying to use ES. I will commit my source code for MongoDB over the weekend.

Unable to get any documents to index in elastic search
------------------------------------------------------

Key: NUTCH-1798
URL: https://issues.apache.org/jira/browse/NUTCH-1798
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
Fix For: 2.3
Attachments: part-r-0

Hopefully this is something I am doing wrong. I have checked out 2.x as I would like to use the new metatag extraction features. I then ran ant runtime to build and updated nutch-site.xml like so:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
  </description>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be defined or cluster.</description>
</property>

I then created a folder called urls, added seed.txt, and ran the following commands:

    bin/nutch inject urls
    bin/nutch generate -topN 1000
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb
    bin/nutch index -all

It runs with no errors; however, no documents have been indexed. I also tried setting this up with Solr, and no documents are indexed there either.

Log:

    2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
    2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
    2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
    2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
    2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
    2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
    2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
    2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
    2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
    2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
    2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
    2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
    2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master
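The plugin.includes value quoted in the report above is a regular expression over plugin directory names, so it can be useful to check which plugins it actually selects. A minimal Python sketch follows, assuming the whole plugin name has to match the expression; the helper name `is_included` and the candidate names in the demo are illustrative only and are not taken from Nutch's plugin loading code.

```python
import re

# plugin.includes value copied from the nutch-site.xml shown in the report
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika|metatags)|"
    r"index-(basic|anchor|more|metatags)|indexer-elasticsearch|"
    r"urlnormalizer-(pass|regex|basic)|scoring-opic"
)


def is_included(plugin_name: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin name matches the pattern.

    Assumption: a full match is required. This illustrates the regex,
    it is not a copy of how Nutch itself evaluates the property.
    """
    return re.fullmatch(pattern, plugin_name) is not None


if __name__ == "__main__":
    # Hypothetical candidate names, chosen only to exercise the pattern
    for name in ("indexer-elasticsearch", "indexer-solr", "index-metatags", "parse-js"):
        state = "included" if is_included(name) else "excluded"
        print(f"{name}: {state}")
```

Under that assumption, the expression selects indexer-elasticsearch but not indexer-solr, which is consistent with the log listing ElasticIndexWriter as the only index writer.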
[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045224#comment-14045224 ]

Aaron Bedward edited comment on NUTCH-1798 at 6/26/14 9:32 PM:
---------------------------------------------------------------

Right... I have made a few observations (I may have misunderstood the architecture, so please bear with me).

I have managed to get 2.x indexing with ES by making the following changes to the crawl script:

    Line 149: echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
    Line 150 (removed): $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
    Line 150 (added): $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call:

    ./bin/crawl urls test http://localhost:9300 2

However, I believe the script should use:

    $bin/nutch index -D solr.server.url=$SOLRURL

Hope this helps anybody trying to use ES. I will commit my source code for MongoDB over the weekend.
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043462#comment-14043462 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

Tried the following command:

    ./bin/crawl urls crawl nutch 2

I removed the following from the crawl script:

    echo SOLR dedup - $SOLRURL
    $bin/nutch solrdedup $commonOptions $SOLRURL
    if [ $? -ne 0 ]
      then exit $?
    fi

However, I left the following part the same:

    echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
    $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
    if [ $? -ne 0 ]
      then exit $?
    fi

Still no documents indexed.

Log file:

    2014-06-25 14:48:27,728 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
    2014-06-25 14:48:27,728 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
    2014-06-25 14:48:27,728 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
    2014-06-25 14:48:27,761 WARN mapred.FileOutputCommitter - Output path is null in cleanup
    2014-06-25 14:48:28,470 INFO crawl.DbUpdaterJob - DbUpdaterJob: finished at 2014-06-25 14:48:28, time elapsed: 00:00:05
    2014-06-25 14:48:29,792 INFO indexer.IndexingJob - IndexingJob: starting
    2014-06-25 14:48:30,901 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2014-06-25 14:48:30,906 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2014-06-25 14:48:31,777 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
    2014-06-25 14:48:31,788 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2014-06-25 14:48:31,789 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
    2014-06-25 14:48:34,123 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-06-25 14:48:35,116 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
    2014-06-25 14:48:35,526 INFO elasticsearch.node - [Master of Vengeance] version[1.1.0], pid[10180], build[2181e11/2014-03-25T15:59:51Z]
    2014-06-25 14:48:35,526 INFO elasticsearch.node - [Master of Vengeance] initializing ...
    2014-06-25 14:48:35,660 INFO elasticsearch.plugins - [Master of Vengeance] loaded [], sites []
    2014-06-25 14:48:38,837 INFO elasticsearch.node - [Master of Vengeance] initialized
    2014-06-25 14:48:38,837 INFO elasticsearch.node - [Master of Vengeance] starting ...
    2014-06-25 14:48:38,970 INFO elasticsearch.transport - [Master of Vengeance] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
    2014-06-25 14:48:42,106 INFO cluster.service - [Master of Vengeance] detected_master [Speed Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[X-Treme][CKEftQrsShaXeNWbCV_ZAg][nutch][inet[/10.0.2.15:9300]],[Speed Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]],}, reason: zen-disco-receive(from master [[Speed Demon][5baFhS21S42DEy_7d8bsJQ][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
    2014-06-25 14:48:42,120 INFO elasticsearch.discovery - [Master of Vengeance] elasticsearch/zH-oAjvTTEyg1l4aOJM_lg
    2014-06-25 14:48:42,132 INFO elasticsearch.http - [Master of Vengeance] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
    2014-06-25 14:48:42,135 INFO elasticsearch.node - [Master of Vengeance] started
    2014-06-25 14:48:42,142 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2014-06-25 14:48:42,142 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2014-06-25 14:48:42,142 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
    2014-06-25 14:48:42,142 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2014-06-25 14:48:42,142 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
    2014-06-25 14:48:42,247 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
    2014-06-25 14:48:42,247 INFO elastic.ElasticIndexWriter - Processing to finalize last execute
    2014-06-25 14:48:42,247 INFO elasticsearch.node - [Master of Vengeance] stopping ...
    2014-06-25 14:48:42,279 INFO elasticsearch.node - [Master of Vengeance] stopped
    2014-06-25 14:48:42,280 INFO elasticsearch.node - [Master of Vengeance] closing ...
    2014-06-25 14:48:42,286 INFO elasticsearch.node - [Master of Vengeance] closed
    2014-06-25 14:48:42,289 WARN mapred.FileOutputCommitter - Output path is null in cleanup
    2014-06-25 14:48:42,740 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
    2014-06-25 14:48:42,740 INFO indexer.IndexingJob - Active IndexWriters : ElasticIndexWriter
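The telling line in the log above is the ElasticIndexWriter summary, which reports zero documents. A small Python sketch that pulls those counters out of a log is shown here; the function name `index_writer_counts` is hypothetical, and the pattern is based only on the line format visible in this log, not on any documented Nutch log contract.

```python
import re

# Matches e.g. "Processing remaining requests [docs = 0, length = 0, total docs = 0]"
SUMMARY = re.compile(
    r"Processing remaining requests "
    r"\[docs = (\d+), length = (\d+), total docs = (\d+)\]"
)


def index_writer_counts(log_text: str):
    """Return a list of (docs, length, total_docs) tuples found in the log text."""
    return [tuple(int(n) for n in m.groups()) for m in SUMMARY.finditer(log_text)]


if __name__ == "__main__":
    # Line copied from the log in the comment above
    sample = (
        "2014-06-25 14:48:42,247 INFO elastic.ElasticIndexWriter - "
        "Processing remaining requests [docs = 0, length = 0, total docs = 0]"
    )
    for docs, length, total in index_writer_counts(sample):
        print(f"docs={docs} length={length} total docs={total}")
```

A total of 0 means the indexing job started and shut down cleanly but was never handed a document, which matches the "runs with no errors" symptom.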
[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043462#comment-14043462 ]

Aaron Bedward edited comment on NUTCH-1798 at 6/25/14 1:52 PM:
---------------------------------------------------------------

Tried the following command:

    ./bin/crawl urls crawl nutch 2

I removed the following from the crawl script:

    echo SOLR dedup - $SOLRURL
    $bin/nutch solrdedup $commonOptions $SOLRURL
    if [ $? -ne 0 ]
      then exit $?
    fi

However, I left the following part the same:

    echo Indexing $CRAWL_ID on SOLR index - $SOLRURL
    $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
    if [ $? -ne 0 ]
      then exit $?
    fi

Still no documents indexed.
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14043477#comment-14043477 ]

Aaron Bedward commented on NUTCH-1798:
--------------------------------------

Using the command:

    $ ./bin/nutch readdb -dump db -crawlId testcrawl

Output:

    http://google.com/ key: com.google:http/
    baseUrl: http://google.com/
    status: 4 (status_redir_temp)
    fetchTime: 1406290992771
    prevFetchTime: 1403698972410
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: TEMP_MOVED, args=[http://www.google.co.uk/?gfe_rd=crei=ML-qU-jpIOHR8gfw34CABw]
    parseStatus: (null)
    title: null
    score: 1.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
    reprUrl: null
    batchId: 1403698977-26703
    metadata _csh_ :
    metadata _rs_ :8

    http://wiki.apache.org/ key: org.apache.wiki:http/
    baseUrl: http://wiki.apache.org/
    status: 1 (status_unfetched)
    fetchTime: 1403704161974
    prevFetchTime: 1403704069065
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: TEMP_MOVED, args=[http://wiki.apache.org/general/]
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: http://wiki.apache.org/
    batchId: 1403704073-19409
    metadata _csh_ :

    http://wiki.apache.org/HttpComponents/FrontPage key: org.apache.wiki:http/HttpComponents/FrontPage
    baseUrl: null
    status: 1 (status_unfetched)
    fetchTime: 1403704161990
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: null
    metadata _csh_ :

    http://wiki.apache.org/ant/FrontPage key: org.apache.wiki:http/ant/FrontPage
    baseUrl: null
    status: 1 (status_unfetched)
    fetchTime: 1403704161993
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: null
    metadata _csh_ :

    http://wiki.apache.org/apachecon/FrontPage key: org.apache.wiki:http/apachecon/FrontPage
    baseUrl: null
    status: 1 (status_unfetched)
    fetchTime: 1403704162001
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: null
    metadata _csh_ :

    http://wiki.apache.org/avalon/FrontPage key: org.apache.wiki:http/avalon/FrontPage
    baseUrl: null
    status: 1 (status_unfetched)
    fetchTime: 1403704162004
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: null
    metadata _csh_ :

    http://wiki.apache.org/beehive/FrontPage key: org.apache.wiki:http/beehive/FrontPage
    baseUrl: null
    status: 1 (status_unfetched)
    fetchTime: 1403704162007
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: null
    metadata _csh_ :

    http://wiki.apache.org/cassandra/FrontPage key: org.apache.wiki:http/cassandra/FrontPage
    baseUrl: null
    status: 1 (status_unfetched)
    fetchTime: 1403704162012
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: null
    metadata _csh_ :

    http://wiki.apache.org/clerezza/FrontPage key: org.apache.wiki:http/clerezza/FrontPage
    baseUrl: null
    status: 1 (status_unfetched)
    fetchTime: 1403704162015
    prevFetchTime: 0
    fetchInterval: 2592000
    retriesSinceFetch: 0
    modifiedTime: 0
    prevModifiedTime: 0
    protocolStatus: (null)
    parseStatus: (null)
    title: null
    score: 0.0
    markers: org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
    reprUrl: null
    metadata _csh_ :

    http://wiki.apache.org/cocoon-lenya/FrontPage key: org.apache.wiki:http/cocoon-lenya/FrontPage
    baseUrl: null
    status: 1
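In the portion of the dump shown above, every complete record is either status_unfetched or status_redir_temp, and none has a parse status or title. A short Python sketch that tallies the status field of such a dump follows; the function name `status_counts` is illustrative, and the pattern assumes the `status: N (name)` layout seen in this output rather than any documented readdb format.

```python
import re
from collections import Counter

# Matches e.g. "status: 1 (status_unfetched)". The lowercase "status:" does not
# match "protocolStatus:" or "parseStatus:", and a truncated "status: 1" with
# no name in parentheses is skipped.
STATUS = re.compile(r"\bstatus:\s*(\d+)\s*\((\w+)\)")


def status_counts(dump_text: str) -> Counter:
    """Count how often each status name occurs in a readdb dump."""
    return Counter(name for _code, name in STATUS.findall(dump_text))


if __name__ == "__main__":
    # Two abbreviated records in the layout shown in the comment above
    sample = (
        "http://google.com/ key: com.google:http/ status: 4 (status_redir_temp) "
        "protocolStatus: TEMP_MOVED parseStatus:(null) "
        "http://wiki.apache.org/ key: org.apache.wiki:http/ status: 1 (status_unfetched) "
        "protocolStatus: (null) parseStatus:(null)"
    )
    for name, count in sorted(status_counts(sample).items()):
        print(f"{name}: {count}")
```

Running it over the attached part-r-0 file would show at a glance whether any page reached a fetched state before the index step ran.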
[jira] [Updated] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Bedward updated NUTCH-1798:
--------------------------------

Attachment: part-r-0

Unable to get any documents to index in elastic search
------------------------------------------------------

Key: NUTCH-1798
URL: https://issues.apache.org/jira/browse/NUTCH-1798
Project: Nutch
Issue Type: Bug
Affects Versions: 2.3
Environment: Ubuntu 13.10, Elasticsearch 1
Reporter: Aaron Bedward
Fix For: 2.3
Attachments: part-r-0

Hopefully this is something I am doing wrong. I have checked out 2.x as I would like to use the new metatag extraction features. I then ran ant runtime to build and updated nutch-site.xml like so:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
  </description>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be defined or cluster.</description>
</property>

I then created a folder called urls, added seed.txt, and ran the following commands:

    bin/nutch inject urls
    bin/nutch generate -topN 1000
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb
    bin/nutch index -all

It runs with no errors; however, no documents have been indexed. I also tried setting this up with Solr, and no documents are indexed there either.

Log:

    2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
    2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
    2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
    2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
    2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
    2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
    2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
    2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
    2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
    2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
    2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
    2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
    2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
    2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
    2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
    2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
    2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
    2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
    2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
    2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
    2014-06-24
[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043477#comment-14043477 ]

Aaron Bedward edited comment on NUTCH-1798 at 6/25/14 2:03 PM:
---

Using the command

    $ ./bin/nutch readdb -dump db -crawlId testcrawl

I have attached the output: part-r-0

was (Author: mrbedward):
Using the command

    $ ./bin/nutch readdb -dump db -crawlId testcrawl

Output

http://google.com/ key:com.google:http/
baseUrl:http://google.com/
status: 4 (status_redir_temp)
fetchTime: 1406290992771
prevFetchTime: 1403698972410
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: TEMP_MOVED, args=[http://www.google.co.uk/?gfe_rd=crei=ML-qU-jpIOHR8gfw34CABw]
parseStatus:(null)
title: null
score: 1.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
reprUrl:null
batchId:1403698977-26703
metadata _csh_ :
metadata _rs_ :8

http://wiki.apache.org/ key:org.apache.wiki:http/
baseUrl:http://wiki.apache.org/
status: 1 (status_unfetched)
fetchTime: 1403704161974
prevFetchTime: 1403704069065
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: TEMP_MOVED, args=[http://wiki.apache.org/general/]
parseStatus:(null)
title: null
score: 0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:http://wiki.apache.org/
batchId:1403704073-19409
metadata _csh_ :

http://wiki.apache.org/HttpComponents/FrontPage key: org.apache.wiki:http/HttpComponents/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime: 1403704161990
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus:(null)
title: null
score: 0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/ant/FrontPage key: org.apache.wiki:http/ant/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime: 1403704161993
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus:(null)
title: null
score: 0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/apachecon/FrontPage key: org.apache.wiki:http/apachecon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime: 1403704162001
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus:(null)
title: null
score: 0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/avalon/FrontPage key: org.apache.wiki:http/avalon/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime: 1403704162004
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus:(null)
title: null
score: 0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/beehive/FrontPage key: org.apache.wiki:http/beehive/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime: 1403704162007
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus:(null)
title: null
score: 0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/cassandra/FrontPage key: org.apache.wiki:http/cassandra/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime: 1403704162012
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus:(null)
title: null
score: 0.0
markers:org.apache.gora.persistency.impl.DirtyMapWrapper@2f0d94
reprUrl:null
metadata _csh_ :

http://wiki.apache.org/clerezza/FrontPage key: org.apache.wiki:http/clerezza/FrontPage
baseUrl:null
status: 1 (status_unfetched)
fetchTime: 1403704162015
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus:(null)
title: null
score: 0.0
markers:
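A readdb dump like the one quoted above gets hard to read once it holds more than a handful of records. The following is a minimal sketch, not part of Nutch, that tallies records per status from the dump text. It assumes every record carries a field of the form "status: N (status_name)" as in the output above; the function name count_statuses and the shortened sample string are made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "status: 4 (status_redir_temp)". The pattern is case-sensitive,
# so "protocolStatus:" and "parseStatus:" in the same dump are not counted.
STATUS_RE = re.compile(r"\bstatus:\s*(\d+)\s*\((\w+)\)")


def count_statuses(dump_text):
    """Return a Counter mapping status name to number of records."""
    return Counter(name for _code, name in STATUS_RE.findall(dump_text))


# Shortened sample built from three records of the dump quoted above.
sample = (
    "http://google.com/ key:com.google:http/ status: 4 (status_redir_temp) "
    "protocolStatus: TEMP_MOVED parseStatus:(null) "
    "http://wiki.apache.org/ key:org.apache.wiki:http/ status: 1 (status_unfetched) "
    "http://wiki.apache.org/ant/FrontPage key: org.apache.wiki:http/ant/FrontPage "
    "status: 1 (status_unfetched)"
)

counts = count_statuses(sample)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

To run it against a real dump, read the dumped file (part-r-0 in this case) into a string and pass it to count_statuses.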
[jira] [Updated] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Bedward updated NUTCH-1798:
-
    Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9 (was: Ubuntu 13.10, Elasticsearch 1)
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043531#comment-14043531 ]

Aaron Bedward commented on NUTCH-1798:
--

Pardon me, I'm a newbie!

WebTable statistics start
Statistics for WebTable:
status 38 (status_notmodified): 1
status 2 (status_fetched): 78
min score: 0.0
retry 0: 1981
jobs: {[testcrawl]db_stats-job_local2072251870_0001={jobID=job_local2072251870_0001, jobName=[testcrawl]db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=211, MAP_INPUT_RECORDS=1984, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=24, MAP_OUTPUT_BYTES=105153, COMMITTED_HEAP_BYTES=535683072, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=793, COMBINE_INPUT_RECORDS=7936, REDUCE_INPUT_RECORDS=12, REDUCE_INPUT_GROUPS=12, COMBINE_OUTPUT_RECORDS=12, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=12, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=7936}, FileSystemCounters={FILE_BYTES_READ=974939, FILE_BYTES_WRITTEN=1150567}, File Output Format Counters ={BYTES_WRITTEN=375
retry 1: 3
status 5 (status_redir_perm): 7
max score: 1.0
TOTAL urls: 1984
status 3 (status_gone): 6
status 4 (status_redir_temp): 8
status 1 (status_unfetched): 1884
avg score: 0.002016129
WebTable statistics: done
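The WebTable statistics in the comment above can be sanity-checked with a few lines of arithmetic. The sketch below is an illustration only: it copies the per-status counts and the TOTAL from that comment into a string, checks that the counts add up to the total, and reports what share of URLs reached status_fetched. The variable names are made up for this example.

```python
import re

# Per-status counts and total copied from the WebTable statistics above.
stats_text = """
status 38 (status_notmodified): 1
status 2 (status_fetched): 78
status 5 (status_redir_perm): 7
status 3 (status_gone): 6
status 4 (status_redir_temp): 8
status 1 (status_unfetched): 1884
TOTAL urls: 1984
"""

LINE_RE = re.compile(r"status\s+\d+\s+\((\w+)\):\s*(\d+)")
by_status = {name: int(n) for name, n in LINE_RE.findall(stats_text)}
total = int(re.search(r"TOTAL urls:\s*(\d+)", stats_text).group(1))

# The per-status counts should account for every URL in the table.
assert sum(by_status.values()) == total

fetched_share = by_status["status_fetched"] / total
print(f"fetched: {by_status['status_fetched']} of {total} ({fetched_share:.1%})")
print(f"unfetched: {by_status['status_unfetched']} of {total}")
```

In this crawl only 78 of the 1984 known URLs have been fetched, which bounds how many documents an indexing run could send in the first place.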
[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043555#comment-14043555 ]

Aaron Bedward commented on NUTCH-1798:
--

OK, I chose this version because it already had Elasticsearch integration and I was hoping to extract meta tags out of the box. I will try debugging and report back with my progress.
[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search
[ https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14043555#comment-14043555 ]

Aaron Bedward edited comment on NUTCH-1798 at 6/25/14 2:45 PM:
---

Thank you for your help.
OK, I chose this version because it already had Elasticsearch integration and I was hoping to extract meta tags out of the box. I will try debugging and report back with my progress.

was (Author: mrbedward):
OK, I chose this version because it already had Elasticsearch integration and I was hoping to extract meta tags out of the box. I will try debugging and report back with my progress.
[jira] [Created] (NUTCH-1798) Unable to get any documents to index in elastic search
Aaron Bedward created NUTCH-1798:

             Summary: Unable to get any documents to index in elastic search
                 Key: NUTCH-1798
                 URL: https://issues.apache.org/jira/browse/NUTCH-1798
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 2.3
         Environment: Ubuntu 13.10, Elasticsearch 1
            Reporter: Aaron Bedward

Hopefully this is something I am doing wrong.
I have checked out 2.x as I would like to use the new metatag extraction features. I then ran ant runtime to build and updated nutch-site.xml like so:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
<property>
  <name>elastic.cluster</name>
  <value>elasticsearch</value>
  <description>The cluster name to discover. Either host and port must be defined or cluster.</description>
</property>

I then created a folder called urls and added seed.txt.
I ran the following commands:

bin/nutch inject urls
bin/nutch generate -topN 1000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb
bin/nutch index -all

It runs with no errors; however, no documents have been indexed.
I also tried the same setup with Solr, and no documents are indexed there either.

Log:

2014-06-24 02:57:57,804 INFO parse.ParserJob - ParserJob: success
2014-06-24 02:57:57,805 INFO parse.ParserJob - ParserJob: finished at 2014-06-24 02:57:57, time elapsed: 00:00:06
2014-06-24 02:57:59,823 INFO indexer.IndexingJob - IndexingJob: starting
2014-06-24 02:58:00,815 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-24 02:58:00,815 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:01,774 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-24 02:58:01,776 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-24 02:58:01,776 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-24 02:58:03,946 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-24 02:58:04,920 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] version[1.1.0], pid[21885], build[2181e11/2014-03-25T15:59:51Z]
2014-06-24 02:58:05,261 INFO elasticsearch.node - [Silver] initializing ...
2014-06-24 02:58:05,377 INFO elasticsearch.plugins - [Silver] loaded [], sites []
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] initialized
2014-06-24 02:58:08,339 INFO elasticsearch.node - [Silver] starting ...
2014-06-24 02:58:08,431 INFO elasticsearch.transport - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/10.0.2.15:9301]}
2014-06-24 02:58:11,540 INFO cluster.service - [Silver] detected_master [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], added {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: zen-disco-receive(from master [[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]]])
2014-06-24 02:58:11,553 INFO elasticsearch.discovery - [Silver] elasticsearch/jXIC3VT6THukKDFB7GMw7Q
2014-06-24 02:58:11,562 INFO elasticsearch.http - [Silver] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/10.0.2.15:9201]}
2014-06-24 02:58:11,566 INFO elasticsearch.node - [Silver] started
2014-06-24 02:58:11,568 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: 100
2014-06-24 02:58:11,569 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2014-06-24 02:58:11,581 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2014-06-24 02:58:11,581 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2014-06-24 02:58:11,716 INFO elastic.ElasticIndexWriter - Processing remaining requests [docs = 0, length = 0, total docs = 0]
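The symptom in this report is easy to miss because the job finishes without errors; the only hint is the final ElasticIndexWriter line reporting "total docs = 0". As a minimal sketch (not part of Nutch), the snippet below pulls that counter out of the log text so a wrapper script could flag an empty indexing run. It assumes the summary line has exactly the wording shown in the log above; the function name total_docs_indexed is hypothetical.

```python
import re

# Wording taken from the last line of the log above.
SUMMARY_RE = re.compile(
    r"Processing remaining requests "
    r"\[docs = (\d+), length = (\d+), total docs = (\d+)\]"
)


def total_docs_indexed(log_text):
    """Return the last reported 'total docs' value, or None if absent."""
    matches = SUMMARY_RE.findall(log_text)
    if not matches:
        return None
    return int(matches[-1][2])


log = (
    "2014-06-24 02:58:11,716 INFO elastic.ElasticIndexWriter - "
    "Processing remaining requests [docs = 0, length = 0, total docs = 0]"
)

total = total_docs_indexed(log)
if total == 0:
    print("indexing job ran but sent no documents")
```

A None result means the summary line was not found at all, which is a different situation from an indexing run that reported zero documents.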