[jira] [Created] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1702:
---

 Summary: Port HostNormalizer to 2.x
 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh


Port NUTCH-1319 to 2.x



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: NUTCH-1702.patch

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Fix Version/s: 2.3

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: NUTCH-1702.patch

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Updated] (NUTCH-1702) Port HostNormalizer to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1702:


Attachment: (was: NUTCH-1702.patch)

 Port HostNormalizer to 2.x
 --

 Key: NUTCH-1702
 URL: https://issues.apache.org/jira/browse/NUTCH-1702
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1702.patch


 Port NUTCH-1319 to 2.x





[jira] [Created] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Canan Girgin (JIRA)
Canan Girgin created NUTCH-1703:
---

 Summary: Nutch ignores alt text of images
 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3


If an image is used as a link, the alt text of that image is equivalent to the 
anchor text of a text link. During content parsing, Nutch does not extract the 
image alt text, so the anchor text for that link is empty.
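
The expected behaviour can be illustrated with a minimal sketch. It is not Nutch code and not the attached patch; it uses Python's standard html.parser, and the names AnchorCollector and extract_links are hypothetical. It only shows the idea of falling back to the alt text of an image when collecting the anchor text of a link.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor_text) pairs; <img alt> counts as anchor text."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._parts = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._parts = []
        elif tag == "img" and self._href is not None:
            # Image used as a link: treat its alt text like link text.
            alt = attrs.get("alt")
            if alt:
                self._parts.append(alt)

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join fragments and collapse runs of whitespace.
            anchor = " ".join(" ".join(self._parts).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return all (href, anchor_text) pairs in html."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

With this sketch, a link that wraps only an image yields the image's alt text as its anchor instead of an empty string.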





[jira] [Updated] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Canan Girgin (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Canan Girgin updated NUTCH-1703:


Attachment: NUTCH_1703.patch

 Nutch ignores alt text of images
 

 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3

 Attachments: NUTCH_1703.patch


 If an image is used as a link, the alt text of that image is equivalent to the 
 anchor text of a text link. During content parsing, Nutch does not extract the 
 image alt text, so the anchor text for that link is empty.





[jira] [Updated] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1703:
-

Fix Version/s: 1.8

 Nutch ignores alt text of images
 

 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3, 1.8

 Attachments: NUTCH_1703.patch


 If an image is used as a link, the alt text of that image is equivalent to the 
 anchor text of a text link. During content parsing, Nutch does not extract the 
 image alt text, so the anchor text for that link is empty.





[jira] [Commented] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871904#comment-13871904
 ] 

Markus Jelsma commented on NUTCH-1703:
--

Can you provide a test for TestDOMContentUtils as well? Would be splendid.

 Nutch ignores alt text of images
 

 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3, 1.8

 Attachments: NUTCH_1703.patch


 If an image is used as a link, the alt text of that image is equivalent to the 
 anchor text of a text link. During content parsing, Nutch does not extract the 
 image alt text, so the anchor text for that link is empty.





[jira] [Resolved] (NUTCH-1568) port pluggable indexing architecture to 2.x

2014-01-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1568.
-

Resolution: Fixed

Committed @revision 1558349 in 2.x
[~talat], thank you for your work on this one.
[~jnioche], thanks to you + others for original patch.
Now I'll move on to NUTCH-1655

 port pluggable indexing architecture to 2.x
 ---

 Key: NUTCH-1568
 URL: https://issues.apache.org/jira/browse/NUTCH-1568
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.2
Reporter: Lewis John McGibbney
 Fix For: 2.3

 Attachments: NUTCH-1568-v2.patch, NUTCH-1568-v3.path, 
 NUTCH-1568-v4.patch, NUTCH-1568.patch


 I would like to port the work done by Julien on NUTCH-1047 to 2.x. This issue 
 should track that. It would be nice to do the upgrade in NUTCH-1486 before we 
 do the upgrade so that people can start using Solr 4.x ASAP.





[jira] [Updated] (NUTCH-1655) Indexer Plugin for Elastic Search

2014-01-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1655:


Attachment: NUTCH-1655-v3.patch

Updated patch to correct formatting in config files; it also adds a license 
header to the new elastic.conf file.
I would like to commit this in 24hrs unless there are objections.

 Indexer Plugin for Elastic Search
 -

 Key: NUTCH-1655
 URL: https://issues.apache.org/jira/browse/NUTCH-1655
 Project: Nutch
  Issue Type: Sub-task
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch


 We should rewrite the ElasticSearch indexer to be compatible with the new 
 indexing plugin architecture.





[jira] [Commented] (NUTCH-1655) Indexer Plugin for Elastic Search

2014-01-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871998#comment-13871998
 ] 

Markus Jelsma commented on NUTCH-1655:
--

Hi, I haven't read the code, but incorporating NUTCH-1598 could be very useful 
to some users.

 Indexer Plugin for Elastic Search
 -

 Key: NUTCH-1655
 URL: https://issues.apache.org/jira/browse/NUTCH-1655
 Project: Nutch
  Issue Type: Sub-task
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch


 We should rewrite the ElasticSearch indexer to be compatible with the new 
 indexing plugin architecture.





[jira] [Commented] (NUTCH-1655) Indexer Plugin for Elastic Search

2014-01-15 Thread Talat UYARER (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872009#comment-13872009
 ] 

Talat UYARER commented on NUTCH-1655:
-

Hi [~markus17],
I have already included NUTCH-1598 in this patch.

 Indexer Plugin for Elastic Search
 -

 Key: NUTCH-1655
 URL: https://issues.apache.org/jira/browse/NUTCH-1655
 Project: Nutch
  Issue Type: Sub-task
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch


 We should rewrite the ElasticSearch indexer to be compatible with the new 
 indexing plugin architecture.





[jira] [Commented] (NUTCH-1568) port pluggable indexing architecture to 2.x

2014-01-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872010#comment-13872010
 ] 

Hudson commented on NUTCH-1568:
---

SUCCESS: Integrated in Nutch-nutchgora #887 (See 
[https://builds.apache.org/job/Nutch-nutchgora/887/])
NUTCH-1568 port pluggable indexing architecture to 2.x (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1558349)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/build.xml
* /nutch/branches/2.x/conf/log4j.properties
* /nutch/branches/2.x/conf/nutch-default.xml
* /nutch/branches/2.x/conf/schema-solr4.xml
* /nutch/branches/2.x/conf/schema.xml
* /nutch/branches/2.x/default.properties
* /nutch/branches/2.x/ivy/ivy.xml
* /nutch/branches/2.x/pom.xml
* /nutch/branches/2.x/src/bin/nutch
* /nutch/branches/2.x/src/java/org/apache/nutch/api/impl/RAMJobManager.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/CleaningJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexCleanerJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexWriter.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexWriters.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexerJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexingJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/NutchIndexWriter.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/NutchIndexWriterFactory.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticConstants.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticIndexerJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrClean.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrDeleteDuplicates.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrIndexerJob.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrMappingReader.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrUtils.java
* /nutch/branches/2.x/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
* /nutch/branches/2.x/src/plugin/build.xml
* /nutch/branches/2.x/src/plugin/indexer-solr
* /nutch/branches/2.x/src/plugin/indexer-solr/build.xml
* /nutch/branches/2.x/src/plugin/indexer-solr/ivy.xml
* /nutch/branches/2.x/src/plugin/indexer-solr/plugin.xml
* /nutch/branches/2.x/src/plugin/indexer-solr/src
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrConstants.java
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrIndexWriter.java
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrMappingReader.java
* /nutch/branches/2.x/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java
* /nutch/branches/2.x/src/plugin/nutch-extensionpoints/plugin.xml


 port pluggable indexing architecture to 2.x
 ---

 Key: NUTCH-1568
 URL: https://issues.apache.org/jira/browse/NUTCH-1568
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.2
Reporter: Lewis John McGibbney
 Fix For: 2.3

 Attachments: NUTCH-1568-v2.patch, NUTCH-1568-v3.path, 
 NUTCH-1568-v4.patch, NUTCH-1568.patch


 I would like to port the work done by Julien on NUTCH-1047 to 2.x. This issue 
 should track that. It would be nice to do the upgrade in NUTCH-1486 before we 
 do the upgrade so that people can start using Solr 4.x ASAP.





[jira] [Commented] (NUTCH-1655) Indexer Plugin for Elastic Search

2014-01-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872015#comment-13872015
 ] 

Markus Jelsma commented on NUTCH-1655:
--

Nice :)

 Indexer Plugin for Elastic Search
 -

 Key: NUTCH-1655
 URL: https://issues.apache.org/jira/browse/NUTCH-1655
 Project: Nutch
  Issue Type: Sub-task
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1655-v2.path, NUTCH-1655-v3.patch, NUTCH-1655.patch


 We should rewrite the ElasticSearch indexer to be compatible with the new 
 indexing plugin architecture.





[jira] [Comment Edited] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Canan Girgin (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872106#comment-13872106
 ] 

Canan Girgin edited comment on NUTCH-1703 at 1/15/14 2:18 PM:
--

OK. A new patch has been added which contains the TestDOMContentUtils class 
(NUTCH_1703.patch_v2).


was (Author: dandelion):
OK. A new patch has been added which contains the TestDOMContentUtils class 
(NUTCH_1703.patch_v1).

 Nutch ignores alt text of images
 

 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3, 1.8

 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch


 If an image is used as a link, the alt text of that image is equivalent to the 
 anchor text of a text link. During content parsing, Nutch does not extract the 
 image alt text, so the anchor text for that link is empty.





[jira] [Commented] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Canan Girgin (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872106#comment-13872106
 ] 

Canan Girgin commented on NUTCH-1703:
-

OK. A new patch has been added which contains the TestDOMContentUtils class 
(NUTCH_1703.patch_v1).

 Nutch ignores alt text of images
 

 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3, 1.8

 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch


 If an image is used as a link, the alt text of that image is equivalent to the 
 anchor text of a text link. During content parsing, Nutch does not extract the 
 image alt text, so the anchor text for that link is empty.





[jira] [Updated] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Canan Girgin (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Canan Girgin updated NUTCH-1703:


Attachment: NUTCH_1703_v2.patch

 Nutch ignores alt text of images
 

 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3, 1.8

 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch


 If an image is used as a link, the alt text of that image is equivalent to the 
 anchor text of a text link. During content parsing, Nutch does not extract the 
 image alt text, so the anchor text for that link is empty.





[jira] [Commented] (NUTCH-1703) Nutch ignores alt text of images

2014-01-15 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872116#comment-13872116
 ] 

Markus Jelsma commented on NUTCH-1703:
--

How is this patch made? I cannot patch the sources with patch -p0  ...

 Nutch ignores alt text of images
 

 Key: NUTCH-1703
 URL: https://issues.apache.org/jira/browse/NUTCH-1703
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.2.1
Reporter: Canan Girgin
 Fix For: 2.3, 1.8

 Attachments: NUTCH_1703.patch, NUTCH_1703_v2.patch


 If an image is used as a link, the alt text of that image is equivalent to the 
 anchor text of a text link. During content parsing, Nutch does not extract the 
 image alt text, so the anchor text for that link is empty.





[jira] [Commented] (NUTCH-1701) Make Solr Document Boost as an option

2014-01-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872137#comment-13872137
 ] 

Lewis John McGibbney commented on NUTCH-1701:
-

Configurable sounds good. I'm +1

 Make Solr Document Boost as an option
 -

 Key: NUTCH-1701
 URL: https://issues.apache.org/jira/browse/NUTCH-1701
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Reporter: Tien Nguyen Manh
Priority: Minor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1701-2x.patch


 Nutch SolrIndexer uses the Nutch score as the document boost by default. We 
 should make this an option, because the Nutch score can be used to boost in 
 other ways, such as at query time via a function query.
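
A minimal sketch of what such an option could look like, written in Python for brevity rather than as Nutch code. The function name build_solr_doc, the flag boost_with_score and the field name nutch_score are all hypothetical; the attached patch may work differently. The assumption here is that when index-time boosting is switched off, the score is kept as an ordinary field so a function query can still use it.

```python
def build_solr_doc(fields, score, boost_with_score=True):
    """Build a document dict; optionally use the crawl score as its boost.

    boost_with_score is a hypothetical option name standing in for whatever
    configuration property the real patch introduces.
    """
    doc = {"fields": dict(fields), "boost": 1.0}  # 1.0 = neutral boost
    if boost_with_score:
        # Current default behaviour described in the issue.
        doc["boost"] = score
    else:
        # Assumption: keep the score as a plain field ("nutch_score" is a
        # made-up name) so it can be used for boosting at query time.
        doc["fields"]["nutch_score"] = score
    return doc
```

Keeping the default at True would preserve the existing behaviour for users who do not set the option.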





[jira] [Updated] (NUTCH-1704) Port DomainBlacklist urlfilter to 2.x

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1704:


Attachment: NUTCH-1704.patch

 Port DomainBlacklist urlfilter to 2.x
 -

 Key: NUTCH-1704
 URL: https://issues.apache.org/jira/browse/NUTCH-1704
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
 Attachments: NUTCH-1704.patch


 Port NUTCH-1210 to 2.x





[jira] [Updated] (NUTCH-1699) Tika Parser - Image Parse Bug

2014-01-15 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1699:


Attachment: NUTCH-1699v2-2.x.patch

Patch for 2.x

 Tika Parser - Image Parse Bug
 -

 Key: NUTCH-1699
 URL: https://issues.apache.org/jira/browse/NUTCH-1699
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7, 2.3
Reporter: Mehmet Zahid Yüzügüldü
  Labels: ImageMetadataExtractor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1699-trunk-junit.patch, NUTCH-1699-trunk.patch, 
 NUTCH-1699v2-2.x.patch, NUTCH_1699.patch


 TikaParser does not extract metadata from images of MIME type png, gif or bmp.





[jira] [Updated] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yasin Kılınç updated NUTCH-1662:


Attachment: NUTCH-1662.patch

I created an indexer plugin for SolrCloud. This patch can be applied after NUTCH-1655.

 Indexer Plugin for Solr Cloud
 -

 Key: NUTCH-1662
 URL: https://issues.apache.org/jira/browse/NUTCH-1662
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 2.3
Reporter: Talat UYARER
 Fix For: 2.3

 Attachments: NUTCH-1662.patch


 The main issue's patch uses a Solr HTTP connection, which doesn't support Solr 
 Cloud. This plugin supports Solr Cloud.





[jira] [Updated] (NUTCH-1478) Parse-metatags and index-metadata plugin for Nutch 2.x series

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1478:


Attachment: NUTCH-1478-parse-v2.patch

I ported parse-metatags to 2.x; this patch supports multi-valued metatags.

 Parse-metatags and index-metadata plugin for Nutch 2.x series 
 --

 Key: NUTCH-1478
 URL: https://issues.apache.org/jira/browse/NUTCH-1478
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 2.1
Reporter: kiran
 Fix For: 2.3

 Attachments: NUTCH-1478-parse-v2.patch, Nutch1478.patch, 
 Nutch1478.zip, metadata_parseChecker_sites.png


 I have ported the parse-metatags and index-metadata plugins to the Nutch 2.x 
 series. They take multiple values of the same tag and index them in Solr, as I 
 patched before (https://issues.apache.org/jira/browse/NUTCH-1467).
 The usage is the same as described here 
 (http://wiki.apache.org/nutch/IndexMetatags), with one change: there is 
 no need to give the 'metatag' keyword before metatag names. For example, my 
 configuration looks like this 
 (https://github.com/salvager/NutchDev/blob/master/runtime/local/conf/nutch-site.xml).
 This is only the first version and does not include the JUnit test. I will 
 provide an updated version soon.
 This will parse the tags and index them in Solr. Make sure the fields listed in 
 'index.parse.md' in nutch-site.xml are also created in Solr's schema.xml.
 Please let me know if you have any suggestions.
 This is supported by DLA (Digital Library and Archives) of Virginia Tech.
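
To illustrate what "multiple values of the same tag" means, here is a small sketch using Python's standard html.parser. It is not the plugin code; MetaTagCollector and collect_metatags are hypothetical names, and the real plugin reads its tag names from the Nutch configuration rather than from a function argument.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collects every content value of the requested <meta name=...> tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = {name.lower() for name in wanted}
        self.values = defaultdict(list)  # tag name -> list of content values

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = attrs.get("content")
        if name in self.wanted and content is not None:
            # Append rather than overwrite, so repeated tags stay multi-valued.
            self.values[name].append(content)


def collect_metatags(html, wanted):
    """Hypothetical helper: map each wanted metatag name to all its values."""
    collector = MetaTagCollector(wanted)
    collector.feed(html)
    collector.close()
    return dict(collector.values)
```

A page that repeats the same metatag then produces a list with one entry per occurrence, which corresponds to a multi-valued field on the Solr side.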





[jira] [Commented] (NUTCH-1699) Tika Parser - Image Parse Bug

2014-01-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872198#comment-13872198
 ] 

Hudson commented on NUTCH-1699:
---

SUCCESS: Integrated in Nutch-nutchgora #888 (See 
[https://builds.apache.org/job/Nutch-nutchgora/888/])
NUTCH-1699 Tika Parser - Image Parse Bug (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1558418)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseUtil.java
* /nutch/branches/2.x/src/plugin/microformats-reltag/src/test/org/apache/nutch/microformats/reltag/TestRelTagParser.java
* /nutch/branches/2.x/src/plugin/parse-tika/build.xml
* /nutch/branches/2.x/src/plugin/parse-tika/sample/nutch_logo_tm.gif
* /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* /nutch/branches/2.x/src/plugin/parse-tika/src/test/org/apache/nutch/parse/tika/TestImageMetadata.java


 Tika Parser - Image Parse Bug
 -

 Key: NUTCH-1699
 URL: https://issues.apache.org/jira/browse/NUTCH-1699
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7, 2.3
Reporter: Mehmet Zahid Yüzügüldü
  Labels: ImageMetadataExtractor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1699-trunk-junit.patch, NUTCH-1699-trunk.patch, 
 NUTCH-1699v2-2.x.patch, NUTCH_1699.patch


 TikaParser does not extract metadata from images of MIME type png, gif or bmp.





[jira] [Commented] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2014-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872199#comment-13872199
 ] 

Alparslan Avcı commented on NUTCH-1674:
---

Hi [~memnoh], the patch was prepared for the 2.x branch; however, you can try 
applying it to 2.2.1 and use it if it does not give any exceptions. :)

 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
 -

 Key: NUTCH-1674
 URL: https://issues.apache.org/jira/browse/NUTCH-1674
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch, NUTCH-1674_3.patch


 Nutch always scans the whole crawldb in each phase (generate, fetch, parse, 
 update, index). When the crawldb is big, the time to scan is greater than the 
 actual processing time.
 We really need to skip records while scanning using GORA-119; for example, we 
 can fetch only the records belonging to a specified batchId.
 In my crawl the filter reduced the scan time from 90 min to 30 min.
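
The idea can be sketched as follows. This is an illustration in Python, not Gora or Nutch code: the store is modelled as a plain dict, and scan_all and scan_batch are hypothetical names. In a real datastore the batchId predicate would be evaluated by the store itself, so non-matching rows are never handed to the job; the in-memory loop below only models which records a phase gets to see.

```python
def scan_all(store):
    """Baseline: a phase receives every record in the store."""
    return list(store.items())


def scan_batch(store, batch_id):
    """Filtered scan: a phase receives only records marked with batch_id.

    The "batchId" key is an assumed column name used for illustration.
    """
    return [
        (url, row)
        for url, row in store.items()
        if row.get("batchId") == batch_id
    ]
```

With a large crawldb and small batches, the filtered scan hands each phase only a fraction of the rows, which is where the reported time saving would come from.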





[jira] [Commented] (NUTCH-1699) Tika Parser - Image Parse Bug

2014-01-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872205#comment-13872205
 ] 

Hudson commented on NUTCH-1699:
---

SUCCESS: Integrated in Nutch-trunk #2491 (See 
[https://builds.apache.org/job/Nutch-trunk/2491/])
NUTCH-1699 Tika Parser - Image Parse Bug (lewismc: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1558420)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/parse-tika/build.xml
* /nutch/trunk/src/plugin/parse-tika/sample/nutch_logo_tm.gif
* /nutch/trunk/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* /nutch/trunk/src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestImageMetadata.java


 Tika Parser - Image Parse Bug
 -

 Key: NUTCH-1699
 URL: https://issues.apache.org/jira/browse/NUTCH-1699
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7, 2.3
Reporter: Mehmet Zahid Yüzügüldü
  Labels: ImageMetadataExtractor
 Fix For: 2.3, 1.8

 Attachments: NUTCH-1699-trunk-junit.patch, NUTCH-1699-trunk.patch, 
 NUTCH-1699v2-2.x.patch, NUTCH_1699.patch


 TikaParser does not extract metadata from images of MIME type png, gif or bmp.





[jira] [Updated] (NUTCH-1705) Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-1705:


Attachment: NUTCH-1705.patch

 Make configuration option for HtmlParser & TikaParser to extract text or 
 title for noIndex page
 ---

 Key: NUTCH-1705
 URL: https://issues.apache.org/jira/browse/NUTCH-1705
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
Priority: Minor
 Attachments: NUTCH-1705.patch


 Currently HtmlParser and TikaParser always skip extracting text and title for 
 noIndex pages, i.e. pages which have a noIndex robots metatag.
 But some parse filters may still be interested in the text and title, such as 
 NUTCH-1661, where we may decide whether to follow a page based on its language.
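
A minimal sketch of the proposed switch, in Python rather than Nutch code. The function parse_page and the flag extract_for_noindex are hypothetical; the actual property name is defined by the attached patch. The default mirrors the current behaviour, where text and title are dropped for noIndex pages.

```python
def parse_page(title, text, robots_meta, extract_for_noindex=False):
    """Return parse output, honouring a robots "noindex" directive.

    robots_meta is the content of the robots metatag, e.g. "noindex,follow".
    extract_for_noindex is a made-up name for the proposed option.
    """
    directives = {d.strip().lower() for d in (robots_meta or "").split(",")}
    no_index = "noindex" in directives
    if no_index and not extract_for_noindex:
        # Current behaviour: skip text and title for noIndex pages.
        return {"title": "", "text": "", "noindex": True}
    # Text and title stay available to parse filters (language detection, for
    # example); the noindex flag still tells the indexer to skip the page.
    return {"title": title, "text": text, "noindex": no_index}
```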





[jira] [Created] (NUTCH-1705) Make configuration option for HtmlParser TikaParser to extract text or title for noIndex page

2014-01-15 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-1705:
---

 Summary: Make configuration option for HtmlParser & TikaParser to 
extract text or title for noIndex page
 Key: NUTCH-1705
 URL: https://issues.apache.org/jira/browse/NUTCH-1705
 Project: Nutch
  Issue Type: Improvement
Reporter: Tien Nguyen Manh
Priority: Minor


Currently HtmlParser and TikaParser always skip extracting text and title for 
noIndex pages, i.e. pages which have a noIndex robots metatag.
But some parse filters may still be interested in the text and title, such as 
NUTCH-1661, where we may decide whether to follow a page based on its language.


