[jira] [Comment Edited] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116772#comment-15116772 ]

Tien Nguyen Manh edited comment on NUTCH-961 at 1/26/16 6:57 AM:

Ah yes. Could you explain why we need to parse it twice? With NUTCH-1233, can we use just one parse?

was (Author: tiennm):
Ah yes. Could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java,
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch,
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to
> extract boilerplate content from HTML pages. We should see how we can expose
> Boilerpipe in the Nutch configuration.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116772#comment-15116772 ]

Tien Nguyen Manh commented on NUTCH-961:

Ah yes. Could you explain why we need to parse it twice?

> Expose Tika's boilerpipe support
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java,
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch,
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to
> extract boilerplate content from HTML pages. We should see how we can expose
> Boilerpipe in the Nutch configuration.
[jira] [Updated] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2184:

Attachment: NUTCH-2184v2.patch

Updated patch for trunk. [~markus17], I am working to address your comments now. Thanks for the response; I must have missed them.

> Enable IndexingJob to function with no crawldb
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.12
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
> Sometimes when working with distributed team(s), we have found that we can
> 'lose' data structures which are currently considered critical, e.g.
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no
> accompanying crawldb or linkdb.
> Absence of the latter is OK, as linkdb is optional; however, currently in
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
> crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case
> where you ONLY have segments and want to force an index for every record
> present.
[jira] [Created] (NUTCH-2206) Provide example scoring.similarity.stopword.file
Lewis John McGibbney created NUTCH-2206:

Summary: Provide example scoring.similarity.stopword.file
Key: NUTCH-2206
URL: https://issues.apache.org/jira/browse/NUTCH-2206
Project: Nutch
Issue Type: Bug
Components: plugin, scoring
Affects Versions: 1.11
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.12

The scoring-similarity plugin does not provide an example file for the property scoring.similarity.stopword.file.
This is an issue for a number of reasons, namely
* A user does not know what it is meant to look like, and
* We always check for this file and will [throw an exception if it is not found|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/DocumentVector.java#L79-L80]; this may not be picked up by the user until much later.
I suggest a simple fix here: include the [standard English stop words taken from Lucene's StopAnalyzer|https://github.com/apache/lucene-solr/blob/3f38aba02ce37c6422875d8824ee034d42d635b9/solr/contrib/morphlines-core/src/test-files/solr/collection1/conf/lang/stopwords_en.txt].
The comments in that file will help people to easily customize the list to whatever they require.
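
To illustrate the failure mode described above, here is a minimal sketch of reading a one-word-per-line stop word file and failing straight away when the file is absent. It is not the plugin's actual DocumentVector code: the class name StopWordFile and its methods are hypothetical, and it assumes that '#' starts a comment, as in the linked stopwords_en.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch. Assumes one stop word per line
// and that '#' introduces a comment running to the end of the line.
final class StopWordFile {

  /** Strips comments, trims, lower-cases and skips blank lines. */
  static Set<String> parse(List<String> lines) {
    Set<String> words = new LinkedHashSet<>();
    for (String line : lines) {
      int hash = line.indexOf('#');
      String word = (hash >= 0 ? line.substring(0, hash) : line).trim();
      if (!word.isEmpty()) {
        words.add(word.toLowerCase(Locale.ROOT));
      }
    }
    return words;
  }

  /** Loads the file, failing immediately with a clear message if it is missing. */
  static Set<String> load(Path file) throws IOException {
    if (!Files.isRegularFile(file)) {
      throw new NoSuchFileException(file.toString(), null,
          "stop word file not found; check scoring.similarity.stopword.file");
    }
    return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
  }
}
```

Calling something like load(...) once at job setup, rather than inside a task, is one way to surface a missing file before any crawl work has been done.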
[jira] [Commented] (NUTCH-2206) Provide example scoring.similarity.stopword.file
[ https://issues.apache.org/jira/browse/NUTCH-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116491#comment-15116491 ]

Lewis John McGibbney commented on NUTCH-2206:

CC [~sujenshah]

> Provide example scoring.similarity.stopword.file
>
> Key: NUTCH-2206
> URL: https://issues.apache.org/jira/browse/NUTCH-2206
> Project: Nutch
> Issue Type: Bug
> Components: plugin, scoring
> Affects Versions: 1.11
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> The scoring-similarity plugin does not provide an example file for the
> property scoring.similarity.stopword.file.
> This is an issue for a number of reasons, namely
> * A user does not know what it is meant to look like, and
> * We always check for this file and will [throw an exception if it is not
> found|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine/DocumentVector.java#L79-L80];
> this may not be picked up by the user until much later.
> I suggest a simple fix here: include the [standard English stop words
> taken from Lucene's
> StopAnalyzer|https://github.com/apache/lucene-solr/blob/3f38aba02ce37c6422875d8824ee034d42d635b9/solr/contrib/morphlines-core/src/test-files/solr/collection1/conf/lang/stopwords_en.txt].
> The comments in that file will help people to easily customize the list to
> whatever they require.
[jira] [Created] (NUTCH-2207) Remove class duplication and smarten-up scoring-similarity plugin
Lewis John McGibbney created NUTCH-2207:

Summary: Remove class duplication and smarten-up scoring-similarity plugin
Key: NUTCH-2207
URL: https://issues.apache.org/jira/browse/NUTCH-2207
Project: Nutch
Issue Type: Improvement
Components: plugin, scoring
Affects Versions: 1.11
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.12

Right now it appears that DocumentVector.java is duplicated. There is also no license header on [ScoringFilterModel.java|https://github.com/apache/nutch/blob/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/ScoringFilterModel.java].
I think I've also spotted a number of places where imports are not being used.
Finally, Javadoc is virtually non-existent for the scoring-similarity plugin. It would help to add some documentation, and it would be very helpful if the [SimilarityScoringFilter wiki page|https://wiki.apache.org/nutch/SimilarityScoringFilter] were cited. We should also revisit the wiki page to ensure that all references are present.
CC [~sujenshah]
[jira] [Created] (NUTCH-2205) Nutch solrdedup error in solrcloud for doc
VictorHu created NUTCH-2205:

Summary: Nutch solrdedup error in solrcloud for doc
Key: NUTCH-2205
URL: https://issues.apache.org/jira/browse/NUTCH-2205
Project: Nutch
Issue Type: Bug
Components: indexer
Reporter: VictorHu
[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114989#comment-15114989 ]

Markus Jelsma commented on NUTCH-961:

That is probably due to the patch parsing twice: once with Boilerpipe (BP) for text, and once without for link extraction.

> Expose Tika's boilerpipe support
>
> Key: NUTCH-961
> URL: https://issues.apache.org/jira/browse/NUTCH-961
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Attachments: BoilerpipeExtractorRepository.java,
> NUTCH-961-1.11-1.patch, NUTCH-961-1.3-3.patch,
> NUTCH-961-1.3-tikaparser.patch, NUTCH-961-1.3-tikaparser1.patch,
> NUTCH-961-1.4-dombuilder-1.patch, NUTCH-961-1.5-1.patch,
> NUTCH-961-1.8-1.patch, NUTCH-961-2.1-v1.patch, NUTCH-961-2.1-v2.patch,
> NUTCH-961v2.patch, nutch-2.x-boilerpipe.patch
>
> Tika 0.8 comes with the Boilerpipe content handler, which can be used to
> extract boilerplate content from HTML pages. We should see how we can expose
> Boilerpipe in the Nutch configuration.
[jira] [Commented] (NUTCH-2205) Nutch solrdedup error in solrcloud for larger docs
[ https://issues.apache.org/jira/browse/NUTCH-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114991#comment-15114991 ]

Markus Jelsma commented on NUTCH-2205:

This looks like your cluster was down, not a Nutch error.

> Nutch solrdedup error in solrcloud for larger docs
>
> Key: NUTCH-2205
> URL: https://issues.apache.org/jira/browse/NUTCH-2205
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 2.3
> Environment: CentOS 6.5, JDK 1.7.0_75, Tomcat 8.0.9, Hadoop 2.5.2,
> ZooKeeper 3.4.6, HBase 0.98.8, Solr 4.8.1, Nutch 2.3.1
> Reporter: VictorHu
> Fix For: 2.4
>
> When the number of Solr docs is larger than 9000, Nutch's solrdedup
> breaks. This is the log:
> http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
> 16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: starting...
> 16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: Solr url: http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
> 16/01/25 17:02:39 INFO client.RMProxy: Connecting to ResourceManager at master.Itble/10.192.1.100:8032
> 16/01/25 17:02:43 INFO mapreduce.JobSubmitter: number of splits:1
> 16/01/25 17:02:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453104806095_0162
> 16/01/25 17:02:44 INFO impl.YarnClientImpl: Submitted application application_1453104806095_0162
> 16/01/25 17:02:44 INFO mapreduce.Job: The url to track the job: http://master.Itble:8088/proxy/application_1453104806095_0162/
> 16/01/25 17:02:44 INFO mapreduce.Job: Running job: job_1453104806095_0162
> 16/01/25 17:02:54 INFO mapreduce.Job: Job job_1453104806095_0162 running in uber mode : false
> 16/01/25 17:02:54 INFO mapreduce.Job: map 0% reduce 0%
> 16/01/25 17:03:02 INFO mapreduce.Job: Task Id : attempt_1453104806095_0162_m_00_0, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this
> request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2,
> http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2,
> http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1]
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>     at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
>     at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> 16/01/25 17:03:12 INFO mapreduce.Job: Task Id : attempt_1453104806095_0162_m_00_1, Status : FAILED
> Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> available to handle this
> request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2,
> http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2,
> http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1,
> http://10.192.1.102:8080/solr/myEnterpriseCollection_shard1_replica1]
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
>     at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
>     at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
>     at
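
One way to check the "cluster was down" diagnosis is to pull the replica URLs out of the "No live SolrServers" message and probe each one. The sketch below is only an illustration: the class ReplicaUrls and its methods are hypothetical, and the probe assumes each core answers on the /admin/ping handler, which depends on the solrconfig.xml in use.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of Nutch or SolrJ.
final class ReplicaUrls {

  // A URL runs until whitespace, a comma, or a square bracket.
  private static final Pattern URL = Pattern.compile("https?://[^\\s,\\[\\]]+");

  /** Returns the distinct URLs found in an exception message, in order of appearance. */
  static List<String> extract(String message) {
    List<String> urls = new ArrayList<>();
    Matcher m = URL.matcher(message);
    while (m.find()) {
      if (!urls.contains(m.group())) {
        urls.add(m.group());
      }
    }
    return urls;
  }

  /** Assumption: the core exposes /admin/ping. Any error counts as "not reachable". */
  static boolean ping(String coreUrl, int timeoutMillis) {
    try {
      HttpURLConnection conn =
          (HttpURLConnection) URI.create(coreUrl + "/admin/ping").toURL().openConnection();
      conn.setConnectTimeout(timeoutMillis);
      conn.setReadTimeout(timeoutMillis);
      try {
        return conn.getResponseCode() == 200;
      } finally {
        conn.disconnect();
      }
    } catch (IOException | IllegalArgumentException e) {
      return false;
    }
  }
}
```

If every replica listed in the message fails the probe from the machine running the map task, the problem is more likely in the Solr cluster or the network than in SolrDeleteDuplicates itself.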
[jira] [Updated] (NUTCH-2205) Nutch solrdedup error in solrcloud for larger docs
[ https://issues.apache.org/jira/browse/NUTCH-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

VictorHu updated NUTCH-2205:

Affects Version/s: 2.3
Environment: CentOS 6.5, JDK 1.7.0_75, Tomcat 8.0.9, Hadoop 2.5.2, ZooKeeper 3.4.6, HBase 0.98.8, Solr 4.8.1, Nutch 2.3.1
Fix Version/s: 2.4
Description:
When the number of Solr docs is larger than 9000, Nutch's solrdedup breaks. This is the log:
http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: starting...
16/01/25 17:02:38 INFO solr.SolrDeleteDuplicates: SolrDeleteDuplicates: Solr url: http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2
16/01/25 17:02:39 INFO client.RMProxy: Connecting to ResourceManager at master.Itble/10.192.1.100:8032
16/01/25 17:02:43 INFO mapreduce.JobSubmitter: number of splits:1
16/01/25 17:02:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453104806095_0162
16/01/25 17:02:44 INFO impl.YarnClientImpl: Submitted application application_1453104806095_0162
16/01/25 17:02:44 INFO mapreduce.Job: The url to track the job: http://master.Itble:8088/proxy/application_1453104806095_0162/
16/01/25 17:02:44 INFO mapreduce.Job: Running job: job_1453104806095_0162
16/01/25 17:02:54 INFO mapreduce.Job: Job job_1453104806095_0162 running in uber mode : false
16/01/25 17:02:54 INFO mapreduce.Job: map 0% reduce 0%
16/01/25 17:03:02 INFO mapreduce.Job: Task Id : attempt_1453104806095_0162_m_00_0, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2, http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2, http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 17:03:12 INFO mapreduce.Job: Task Id : attempt_1453104806095_0162_m_00_1, Status : FAILED
Error: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://10.192.1.100:8080/solr/myEnterpriseCollection_shard2_replica2, http://10.192.1.101:8080/solr/myEnterpriseCollection_shard1_replica2, http://10.192.1.103:8080/solr/myEnterpriseCollection_shard2_replica1, http://10.192.1.102:8080/solr/myEnterpriseCollection_shard1_replica1]
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:554)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
    at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.createRecordReader(SolrDeleteDuplicates.java:291)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
16/01/25 17:03:22 INFO