[jira] [Created] (NUTCH-1702) Port HostNormalizer to 2.x
Tien Nguyen Manh created NUTCH-1702: --- Summary: Port HostNormalizer to 2.x Key: NUTCH-1702 URL: https://issues.apache.org/jira/browse/NUTCH-1702 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: NUTCH-1702.patch Port HostNormalizer to 2.x -- Key: NUTCH-1702 URL: https://issues.apache.org/jira/browse/NUTCH-1702 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Fix For: 2.3 Attachments: NUTCH-1702.patch Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Fix Version/s: 2.3 Port HostNormalizer to 2.x -- Key: NUTCH-1702 URL: https://issues.apache.org/jira/browse/NUTCH-1702 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Fix For: 2.3 Attachments: NUTCH-1702.patch Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: NUTCH-1702.patch Port HostNormalizer to 2.x -- Key: NUTCH-1702 URL: https://issues.apache.org/jira/browse/NUTCH-1702 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Fix For: 2.3 Attachments: NUTCH-1702.patch Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1702: Attachment: (was: NUTCH-1702.patch) Port HostNormalizer to 2.x -- Key: NUTCH-1702 URL: https://issues.apache.org/jira/browse/NUTCH-1702 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Fix For: 2.3 Attachments: NUTCH-1702.patch Port NUTCH-1319 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1703) Nutch ignores alt text of images
Canan Girgin created NUTCH-1703: --- Summary: Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3 When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
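To illustrate the behaviour the report asks for, here is a minimal sketch that is not the attached patch and does not use Nutch's own DOM utilities. It assumes a well-formed XHTML link fragment and walks it with the JDK's DOM parser, using the alt attribute of any img element in place of text. The class and method names (AltTextAnchorSketch, anchorText) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not code from NUTCH_1703.patch.
class AltTextAnchorSketch {

  // Appends text nodes as-is; for <img> elements, appends the alt attribute instead.
  static void collectAnchorText(Node node, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      out.append(node.getNodeValue());
    } else if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      // getAttribute returns an empty string when the attribute is absent.
      out.append(((Element) node).getAttribute("alt"));
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectAnchorText(children.item(i), out);
    }
  }

  // Parses a single well-formed <a>...</a> fragment and returns its anchor text.
  static String anchorText(String xhtmlLink) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtmlLink)));
      StringBuilder sb = new StringBuilder();
      collectAnchorText(doc.getDocumentElement(), sb);
      return sb.toString().trim();
    } catch (Exception e) {
      throw new IllegalArgumentException("Not a well-formed link fragment", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(
        anchorText("<a href=\"/home\"><img src=\"logo.png\" alt=\"Home page\"/></a>"));
  }
}
```

With this approach an image-only link yields its alt text as the anchor, while a link containing an image without an alt attribute still yields an empty anchor.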
[jira] [Updated] (NUTCH-1703) Nutch ignores alt text of images
[ https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Canan Girgin updated NUTCH-1703: Attachment: NUTCH_1703.patch Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3 Attachments: NUTCH_1703.patch When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1703) Nutch ignores alt text of images
[ https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1703: - Fix Version/s: 1.8 Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3, 1.8 Attachments: NUTCH_1703.patch When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1703) Nutch ignores alt text of images
[ https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871904#comment-13871904 ] Markus Jelsma commented on NUTCH-1703: -- Can you provide a test for TestDOMContentUtils as well? Would be splendid. Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3, 1.8 Attachments: NUTCH_1703.patch When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Resolved] (NUTCH-1568) port pluggable indexing architecture to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1568. - Resolution: Fixed Committed @revision 1558349 in 2.x [~talat], thank you for your work on this one. [~jnioche], thanks to you + others for original patch. Now I'll move on to NUTCH-1655 port pluggable indexing architecture to 2.x --- Key: NUTCH-1568 URL: https://issues.apache.org/jira/browse/NUTCH-1568 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.2 Reporter: Lewis John McGibbney Fix For: 2.3 Attachments: NUTCH-1568-v2.patch, NUTCH-1568-v3.path, NUTCH-1568-v4.patch, NUTCH-1568.patch I would like to port the work done by Julien on NUTCH-1047 to 2.x. This issue should track that. It would be nice to do the upgrade in NUTCH-1486 before we do the upgrade so that people can start using Solr 4.x ASAP. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1655) Indexer Plugin for Elastic Search
[ https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1655: Attachment: NUTCH-1655-v3.patch Updated patch to correct formatting in config files; it also adds a license header to the new elastic.conf file. I would like to commit this in 24hrs unless there are objections. Indexer Plugin for Elastic Search - Key: NUTCH-1655 URL: https://issues.apache.org/jira/browse/NUTCH-1655 Project: Nutch Issue Type: Sub-task Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.3 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch We should rewrite the ElasticSearch indexer to be compatible with the new indexing plugin architecture. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1655) Indexer Plugin for Elastic Search
[ https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871998#comment-13871998 ] Markus Jelsma commented on NUTCH-1655: -- Hi, I haven't read the code, but incorporating NUTCH-1598 could be very useful to some users. Indexer Plugin for Elastic Search - Key: NUTCH-1655 URL: https://issues.apache.org/jira/browse/NUTCH-1655 Project: Nutch Issue Type: Sub-task Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.3 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch We should rewrite the ElasticSearch indexer to be compatible with the new indexing plugin architecture. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1655) Indexer Plugin for Elastic Search
[ https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872009#comment-13872009 ] Talat UYARER commented on NUTCH-1655: - Hi [~markus17], I have already included NUTCH-1598 in this patch. Indexer Plugin for Elastic Search - Key: NUTCH-1655 URL: https://issues.apache.org/jira/browse/NUTCH-1655 Project: Nutch Issue Type: Sub-task Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.3 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch We should rewrite the ElasticSearch indexer to be compatible with the new indexing plugin architecture. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1568) port pluggable indexing architecture to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13872010#comment-13872010 ] Hudson commented on NUTCH-1568: --- SUCCESS: Integrated in Nutch-nutchgora #887 (See [https://builds.apache.org/job/Nutch-nutchgora/887/]) NUTCH-1568 port pluggable indexing architecture to 2.x (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=revrev=1558349) * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/build.xml * /nutch/branches/2.x/conf/log4j.properties * /nutch/branches/2.x/conf/nutch-default.xml * /nutch/branches/2.x/conf/schema-solr4.xml * /nutch/branches/2.x/conf/schema.xml * /nutch/branches/2.x/default.properties * /nutch/branches/2.x/ivy/ivy.xml * /nutch/branches/2.x/pom.xml * /nutch/branches/2.x/src/bin/nutch * /nutch/branches/2.x/src/java/org/apache/nutch/api/impl/RAMJobManager.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/CleaningJob.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexCleanerJob.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexWriter.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexWriters.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexerJob.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingJob.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/NutchIndexWriter.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticConstants.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticIndexerJob.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrClean.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java * 
/nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrIndexerJob.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrUtils.java * /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrWriter.java * /nutch/branches/2.x/src/plugin/build.xml * /nutch/branches/2.x/src/plugin/indexer-solr * /nutch/branches/2.x/src/plugin/indexer-solr/build.xml * /nutch/branches/2.x/src/plugin/indexer-solr/ivy.xml * /nutch/branches/2.x/src/plugin/indexer-solr/plugin.xml * /nutch/branches/2.x/src/plugin/indexer-solr/src * /nutch/branches/2.x/src/plugin/indexer-solr/src/java * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java * /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java * /nutch/branches/2.x/src/plugin/nutch-extensionpoints/plugin.xml port pluggable indexing architecture to 2.x --- Key: NUTCH-1568 URL: https://issues.apache.org/jira/browse/NUTCH-1568 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.2 Reporter: Lewis John McGibbney Fix For: 2.3 Attachments: NUTCH-1568-v2.patch, NUTCH-1568-v3.path, NUTCH-1568-v4.patch, NUTCH-1568.patch I would like to port the work done by Julien on NUTCH-1047 to 2.x. This issue should track that. 
It would be nice to do the upgrade in NUTCH-1486 before we do the upgrade so that people can start using Solr 4.x ASAP. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1655) Indexer Plugin for Elastic Search
[ https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872015#comment-13872015 ] Markus Jelsma commented on NUTCH-1655: -- Nice :) Indexer Plugin for Elastic Search - Key: NUTCH-1655 URL: https://issues.apache.org/jira/browse/NUTCH-1655 Project: Nutch Issue Type: Sub-task Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.3 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch We should rewrite the ElasticSearch indexer to be compatible with the new indexing plugin architecture. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Comment Edited] (NUTCH-1703) Nutch ignores alt text of images
[ https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872106#comment-13872106 ] Canan Girgin edited comment on NUTCH-1703 at 1/15/14 2:18 PM: -- OK. A new patch has been added that contains the TestDOMContentUtils class. (NUTCH_1703.patch_v2) was (Author: dandelion): OK. A new patch has been added that contains the TestDOMContentUtils class. (NUTCH_1703.patch_v1) Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3, 1.8 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1703) Nutch ignores alt text of images
[ https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872106#comment-13872106 ] Canan Girgin commented on NUTCH-1703: - OK. A new patch has been added that contains the TestDOMContentUtils class. (NUTCH_1703.patch_v1) Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3, 1.8 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1703) Nutch ignores alt text of images
[ https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Canan Girgin updated NUTCH-1703: Attachment: NUTCH_1703_v2.patch Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3, 1.8 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1703) Nutch ignores alt text of images
[ https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872116#comment-13872116 ] Markus Jelsma commented on NUTCH-1703: -- How is this patch made? I cannot patch the sources with patch -p0 ... Nutch ignores alt text of images Key: NUTCH-1703 URL: https://issues.apache.org/jira/browse/NUTCH-1703 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 2.2.1 Reporter: Canan Girgin Fix For: 2.3, 1.8 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch When an image is used as a link, the alt text of that image is equivalent to the anchor text of a text link. During content parsing, Nutch does not return the image alt text, so the anchor text for that link is empty. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1701) Make Solr Document Boost as an option
[ https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872137#comment-13872137 ] Lewis John McGibbney commented on NUTCH-1701: - Configurable sounds good. I'm +1 Make Solr Document Boost as an option - Key: NUTCH-1701 URL: https://issues.apache.org/jira/browse/NUTCH-1701 Project: Nutch Issue Type: Improvement Components: indexer Reporter: Tien Nguyen Manh Priority: Minor Fix For: 2.3, 1.8 Attachments: NUTCH-1701-2x.patch The Nutch SolrIndexer uses the Nutch score as the document boost by default. We should make this optional, because the Nutch score can be used for boosting in other ways, such as boosting at query time via a function query. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
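The proposal amounts to guarding the boost behind a configuration switch. The sketch below is only an illustration under stated assumptions: it uses java.util.Properties instead of Nutch's configuration classes, and the property name example.indexer.score.as.boost is made up for this example rather than taken from NUTCH-1701-2x.patch.

```java
import java.util.Properties;

// Hypothetical illustration only; not code from NUTCH-1701-2x.patch.
class BoostOptionSketch {

  // Made-up property name used only for this sketch.
  static final String USE_SCORE_AS_BOOST = "example.indexer.score.as.boost";

  // Returns the score as the boost when the option is on (the default here),
  // otherwise a neutral boost of 1.0 so ranking can be handled at query time.
  static float documentBoost(Properties conf, float nutchScore) {
    boolean useScore =
        Boolean.parseBoolean(conf.getProperty(USE_SCORE_AS_BOOST, "true"));
    return useScore ? nutchScore : 1.0f;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(documentBoost(conf, 2.5f));
    conf.setProperty(USE_SCORE_AS_BOOST, "false");
    System.out.println(documentBoost(conf, 2.5f));
  }
}
```

Defaulting the option to true keeps the existing behaviour for anyone who does not set it, which is the usual choice when turning a hard-coded behaviour into a setting.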
[jira] [Updated] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1704: Attachment: NUTCH-1704.patch Port DomainBlacklist urlfilter to 2.x - Key: NUTCH-1704 URL: https://issues.apache.org/jira/browse/NUTCH-1704 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Attachments: NUTCH-1704.patch Port NUTCH-1210 to 2.x -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1699) Tika Parser - Image Parse Bug
[ https://issues.apache.org/jira/browse/NUTCH-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1699: Attachment: NUTCH-1699v2-2.x.patch Patch for 2.x Tika Parser - Image Parse Bug - Key: NUTCH-1699 URL: https://issues.apache.org/jira/browse/NUTCH-1699 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.7, 2.3 Reporter: Mehmet Zahid Yüzügüldü Labels: ImageMetadataExtractor Fix For: 2.3, 1.8 Attachments: NUTCH-1699-trunk-junit.patch, NUTCH-1699-trunk.patch, NUTCH-1699v2-2.x.patch, NUTCH_1699.patch TikaParser does not extract metadata from images with the png, gif, and bmp MIME types. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1662) Indexer Plugin for Solr Cloud
[ https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yasin Kılınç updated NUTCH-1662: Attachment: NUTCH-1662.patch I created an indexer plugin for SolrCloud. This patch can be applied after NUTCH-1655. Indexer Plugin for Solr Cloud - Key: NUTCH-1662 URL: https://issues.apache.org/jira/browse/NUTCH-1662 Project: Nutch Issue Type: Sub-task Components: indexer Affects Versions: 2.3 Reporter: Talat UYARER Fix For: 2.3 Attachments: NUTCH-1662.patch The main issue's patch uses a Solr HTTP connection, which does not support Solr Cloud. This plugin supports Solr Cloud. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series
[ https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1478: Attachment: NUTCH-1478-parse-v2.patch I ported parse-metatags to 2.x; this patch supports multiple values in metatags. Parse-metatags and index-metadata plugin for Nutch 2.x series -- Key: NUTCH-1478 URL: https://issues.apache.org/jira/browse/NUTCH-1478 Project: Nutch Issue Type: Improvement Components: parser Affects Versions: 2.1 Reporter: kiran Fix For: 2.3 Attachments: NUTCH-1478-parse-v2.patch, Nutch1478.patch, Nutch1478.zip, metadata_parseChecker_sites.png I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x series. They take multiple values of the same tag and index them in Solr, as I patched before (https://issues.apache.org/jira/browse/NUTCH-1467). The usage is the same as described here (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is no need to put the 'metatag' keyword before metatag names. For example, my configuration looks like this (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml). This is only the first version and does not include the JUnit test. I will upload a new version soon. This will parse the tags and index them in Solr. Make sure the fields listed under 'index.parse.md' in nutch-site.xml are also created in Solr's schema.xml. Please let me know if you have any suggestions. This is supported by DLA (Digital Library and Archives) of Virginia Tech. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1699) Tika Parser - Image Parse Bug
[ https://issues.apache.org/jira/browse/NUTCH-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872198#comment-13872198 ] Hudson commented on NUTCH-1699: --- SUCCESS: Integrated in Nutch-nutchgora #888 (See [https://builds.apache.org/job/Nutch-nutchgora/888/]) NUTCH-1699 Tika Parser - Image Parse Bug (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1558418) * /nutch/branches/2.x/CHANGES.txt * /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseUtil.java * /nutch/branches/2.x/src/plugin/microformats-reltag/src/test/org/apache/nutch/microformats/reltag/TestRelTagParser.java * /nutch/branches/2.x/src/plugin/parse-tika/build.xml * /nutch/branches/2.x/src/plugin/parse-tika/sample/nutch_logo_tm.gif * /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java * /nutch/branches/2.x/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestImageMetadata.java Tika Parser - Image Parse Bug - Key: NUTCH-1699 URL: https://issues.apache.org/jira/browse/NUTCH-1699 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.7, 2.3 Reporter: Mehmet Zahid Yüzügüldü Labels: ImageMetadataExtractor Fix For: 2.3, 1.8 Attachments: NUTCH-1699-trunk-junit.patch, NUTCH-1699-trunk.patch, NUTCH-1699v2-2.x.patch, NUTCH_1699.patch TikaParser does not extract metadata from images with the png, gif, and bmp MIME types. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
[ https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872199#comment-13872199 ] Alparslan Avcı commented on NUTCH-1674: --- Hi [~memnoh], the patch is prepared for the 2.x branch; however, you can try applying it to 2.2.1 and use it if it does not throw any exceptions. :) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index - Key: NUTCH-1674 URL: https://issues.apache.org/jira/browse/NUTCH-1674 Project: Nutch Issue Type: Improvement Affects Versions: 2.3 Reporter: Tien Nguyen Manh Fix For: 2.3 Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch, NUTCH-1674_3.patch Nutch always scans the whole crawldb in each phase (generate, fetch, parse, update, index). When the crawldb is big, the time to scan is longer than the actual processing time. We really need to skip records while scanning, using GORA-119; for example, we can fetch only the records that belong to a specified batchId. In my crawl the filter reduced the scan time from 90 min to 30 min. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
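The idea of handing a job only the records of one batch can be sketched as follows. This is an in-memory illustration only, not the attached patch and not the Gora filter API: in the real proposal the filter would be applied by the datastore during the scan, which is where the time saving comes from. The Row class, the sample data, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not code from the NUTCH-1674 patches.
class BatchIdFilterSketch {

  // Minimal stand-in for a crawldb row: a URL plus the batch it was generated in.
  static final class Row {
    final String url;
    final String batchId;

    Row(String url, String batchId) {
      this.url = url;
      this.batchId = batchId;
    }
  }

  // Made-up sample data for the example.
  static List<Row> sampleRows() {
    return Arrays.asList(
        new Row("http://a.example/", "batch-2"),
        new Row("http://b.example/", "batch-1"),
        new Row("http://c.example/", "batch-2"));
  }

  // Returns only the URLs whose row carries the requested batchId,
  // so later phases never see rows from other batches.
  static List<String> urlsForBatch(List<Row> rows, String batchId) {
    List<String> urls = new ArrayList<>();
    for (Row row : rows) {
      if (batchId.equals(row.batchId)) {
        urls.add(row.url);
      }
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(urlsForBatch(sampleRows(), "batch-2"));
  }
}
```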
[jira] [Commented] (NUTCH-1699) Tika Parser - Image Parse Bug
[ https://issues.apache.org/jira/browse/NUTCH-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872205#comment-13872205 ] Hudson commented on NUTCH-1699: --- SUCCESS: Integrated in Nutch-trunk #2491 (See [https://builds.apache.org/job/Nutch-trunk/2491/]) NUTCH-1699 Tika Parser - Image Parse Bug (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1558420) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/parse-tika/build.xml * /nutch/trunk/src/plugin/parse-tika/sample/nutch_logo_tm.gif * /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java * /nutch/trunk/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestImageMetadata.java Tika Parser - Image Parse Bug - Key: NUTCH-1699 URL: https://issues.apache.org/jira/browse/NUTCH-1699 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.7, 2.3 Reporter: Mehmet Zahid Yüzügüldü Labels: ImageMetadataExtractor Fix For: 2.3, 1.8 Attachments: NUTCH-1699-trunk-junit.patch, NUTCH-1699-trunk.patch, NUTCH-1699v2-2.x.patch, NUTCH_1699.patch TikaParser does not extract metadata from images with the png, gif, and bmp MIME types. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (NUTCH-1705) Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page
[ https://issues.apache.org/jira/browse/NUTCH-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tien Nguyen Manh updated NUTCH-1705: Attachment: NUTCH-1705.patch Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page --- Key: NUTCH-1705 URL: https://issues.apache.org/jira/browse/NUTCH-1705 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Priority: Minor Attachments: NUTCH-1705.patch Currently HtmlParser and TikaParser always skip extracting text and title for noIndex pages, that is, pages which have a noIndex robots metatag. But some parse filters may still be interested in the text and title, as in NUTCH-1661, where we may decide whether to follow a page based on its language. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (NUTCH-1705) Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page
Tien Nguyen Manh created NUTCH-1705: --- Summary: Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page Key: NUTCH-1705 URL: https://issues.apache.org/jira/browse/NUTCH-1705 Project: Nutch Issue Type: Improvement Reporter: Tien Nguyen Manh Priority: Minor Currently HtmlParser and TikaParser always skip extracting text and title for noIndex pages, that is, pages which have a noIndex robots metatag. But some parse filters may still be interested in the text and title, as in NUTCH-1661, where we may decide whether to follow a page based on its language. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
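The requested option can be pictured as one extra flag in the parser's decision about whether to keep a page's text. The sketch below is an assumption-laden illustration, not the logic of NUTCH-1705.patch: the class name, method name, and the extractForNoIndex flag are hypothetical, and the real parsers work on parsed documents rather than plain strings.

```java
// Hypothetical illustration only; not code from NUTCH-1705.patch.
class NoIndexOptionSketch {

  // Returns the text a parser would hand to parse filters.
  // noIndex: the page carries a noIndex robots metatag.
  // extractForNoIndex: made-up option that keeps text even for noIndex pages.
  static String textForFilters(boolean noIndex, boolean extractForNoIndex, String pageText) {
    if (noIndex && !extractForNoIndex) {
      return ""; // behaviour described in the issue: text is skipped
    }
    return pageText;
  }

  public static void main(String[] args) {
    // With the option on, a language-detecting filter still receives the text.
    System.out.println(textForFilters(true, true, "Bonjour le monde"));
  }
}
```

Keeping the text available to parse filters does not by itself index the page; whether a noIndex page is indexed remains a separate decision.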