[jira] [Updated] (NUTCH-1570) Add filtering capability to Datastore Queries
[ https://issues.apache.org/jira/browse/NUTCH-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1570: Fix Version/s: (was: 2.4) 2.3 Add filtering capability to Datastore Queries - Key: NUTCH-1570 URL: https://issues.apache.org/jira/browse/NUTCH-1570 Project: Nutch Issue Type: Bug Components: storage Affects Versions: 2.2 Reporter: Lewis John McGibbney Fix For: 2.3
For some time this issue has been discussed on various lists. When upgrading the Gora dependencies in NUTCH-1569, I stumbled across a comment within o.a.n.api.DbReader#iterator
{code}
public Iterator<Map<String,Object>> iterator(String[] fields, String startKey, String endKey, String batchId) throws Exception {
  Query<String,WebPage> q = store.newQuery();
  String[] qFields = fields;
  if (fields != null) {
    HashSet<String> flds = new HashSet<String>(Arrays.asList(fields));
    // remove url
    flds.remove("url");
    if (flds.size() > 0) {
      qFields = flds.toArray(new String[flds.size()]);
    } else {
      qFields = null;
    }
  }
  q.setFields(qFields);
  if (startKey != null) {
    q.setStartKey(startKey);
    if (endKey != null) {
      q.setEndKey(endKey);
    }
  }
  Result<String,WebPage> res = store.execute(q);
  // XXX we should add the filtering capability to Query
  return new DbIterator(res, fields, batchId);
}
{code}
I will link this issue to something over on Gora once we get around to the implementation.
-- This message was sent by Atlassian JIRA (v6.2#6252)
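For readers who want to see the field handling of the quoted iterator() in isolation, here is a minimal sketch in plain Java. The class QueryFields and the method forQuery are hypothetical names made up for illustration; they are not part of Nutch or Gora, and the sketch leaves out the store and the query entirely.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper (not Nutch code) mirroring the field handling quoted above. */
final class QueryFields {
  private QueryFields() {}

  /**
   * Returns the caller's fields without "url", or null when the input is
   * null or nothing is left after removing "url", as the quoted code does
   * before calling setFields().
   */
  static String[] forQuery(String[] fields) {
    if (fields == null) {
      return null;
    }
    // LinkedHashSet drops duplicates and keeps the caller's order.
    Set<String> flds = new LinkedHashSet<>(Arrays.asList(fields));
    flds.remove("url");
    return flds.isEmpty() ? null : flds.toArray(new String[0]);
  }
}
```

Called with {"url", "title"} this yields {"title"}; called with only {"url"} it yields null. The filtering the XXX comment asks for would be an additional step on the query itself, which this sketch does not attempt.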
[jira] [Updated] (NUTCH-1410) impact of a map-reduce problem
[ https://issues.apache.org/jira/browse/NUTCH-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1410: Fix Version/s: (was: 2.4) 2.3 impact of a map-reduce problem -- Key: NUTCH-1410 URL: https://issues.apache.org/jira/browse/NUTCH-1410 Project: Nutch Issue Type: Bug Components: fetcher, generator Reporter: behnam nikbakht Fix For: 2.3 With a simple test, I found that each mapper or reducer has its own local view of variables. In Nutch there are multiple places that share a variable between mappers or reducers: for example, generate has a shared variable, hostCounts, and in the fetcher the last request time of each mapper (fetcherThread) differs from that of the others. This causes critical problems, such as sending multiple requests to the same host, which can get the crawler blocked. -- This message was sent by Atlassian JIRA (v6.2#6252)
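The "local view" described in this report can be sketched in a few lines of plain Java. CountingTask is a hypothetical stand-in for one map or reduce task and is not Nutch code; the only point is that a per-task counter cannot enforce a limit across tasks.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for one map or reduce task; not Nutch code. */
final class CountingTask {
  // Each task object has its own map; nothing merges it with other tasks.
  private final Map<String, Integer> hostCounts = new HashMap<>();
  private final int maxPerHost;

  CountingTask(int maxPerHost) {
    this.maxPerHost = maxPerHost;
  }

  /** Accepts a host until this task alone has counted maxPerHost entries for it. */
  boolean accept(String host) {
    int seen = hostCounts.merge(host, 1, Integer::sum);
    return seen <= maxPerHost;
  }
}
```

With a limit of one per host, two separate CountingTask instances each accept the same host once, so two entries get through in total even though each task believes it respected the limit.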
[jira] [Closed] (NUTCH-1410) impact of a map-reduce problem
[ https://issues.apache.org/jira/browse/NUTCH-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-1410. --- impact of a map-reduce problem -- Key: NUTCH-1410 URL: https://issues.apache.org/jira/browse/NUTCH-1410 Project: Nutch Issue Type: Bug Components: fetcher, generator Reporter: behnam nikbakht Fix For: 2.3 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-1490) Data Truncation exceptions when using mysql
[ https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1490. - Resolution: Won't Fix gora-sql not in use right now Data Truncation exceptions when using mysql --- Key: NUTCH-1490 URL: https://issues.apache.org/jira/browse/NUTCH-1490 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: Nathan Gass Fix For: 2.3 Attachments: patch
Nutch does not ensure the set (or implicit) maximal length for the following columns:
- title
- urls (id, baseUrl, reprUrl)
- typ (contentType)
- inlinks
- outlinks
Trying to store too much data in one of these columns results in an exception similar to this (copied from GORA-24, I will be able to add a newer stack trace later today):
java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340)
    at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185)
    at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1
    at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2018)
    at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1449)
    at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328)
    ... 5 more
I'll add my current fixes in later comments.
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (NUTCH-1490) Data Truncation exceptions when using mysql
[ https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-1490. --- Data Truncation exceptions when using mysql --- Key: NUTCH-1490 URL: https://issues.apache.org/jira/browse/NUTCH-1490 Project: Nutch Issue Type: Bug Affects Versions: 2.1 Reporter: Nathan Gass Fix For: 2.3 Attachments: patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL
[ https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1497. - Resolution: Won't Fix gora-sql not in use right now Better default gora-sql-mapping.xml with larger field sizes for MySQL - Key: NUTCH-1497 URL: https://issues.apache.org/jira/browse/NUTCH-1497 Project: Nutch Issue Type: Improvement Components: storage Affects Versions: 2.2 Environment: MySQL Backend Reporter: James Sullivan Priority: Minor Labels: MySQL Fix For: 2.3 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, gora-mysql-mapping.xml The current generic default gora-sql-mapping.xml has field sizes that are too small in almost all situations when used with MySQL. I have included a mapping which will work better for MySQL (takes slightly more space but will be able to handle larger fields necessary for real world use). Includes patch from Nutch-1490 and resolves the non-Unicode part of Nutch-1473. I believe it is not possible to use the same gora-sql-mapping for both hsqldb and MySQL without a significantly degraded lowest common denominator resulting. Should the user manually rename the attached file to gora-sql-mapping.xml or is there a way to have Nutch automatically use it when MySQL is selected in other configurations (Ivy.xml or gora.properties)? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
[ https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1674: Fix Version/s: (was: 2.4) 2.3 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index - Key: NUTCH-1674 URL: https://issues.apache.org/jira/browse/NUTCH-1674 Project: Nutch Issue Type: Improvement Affects Versions: 2.3 Reporter: Tien Nguyen Manh Fix For: 2.3 Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch, NUTCH-1674_3.patch, NUTCH-1674_final.patch Nutch always scans the whole crawldb in each phase (generate, fetch, parse, update, index). When the crawldb is big, the time to scan is bigger than the actual processing time. We really need to skip records while scanning, using GORA-119; for example, we can get only the records that belong to a specified batchId. In my crawl the filter reduced the time to scan from 90 min to 30 min. -- This message was sent by Atlassian JIRA (v6.2#6252)
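To make the idea of skipping records by batchId concrete, here is a minimal in-memory sketch in plain Java. BatchScan, Row, batchIdEquals and scan are hypothetical names chosen for illustration; the sketch does not use the Gora filter API from GORA-119, and it applies the predicate in the client, whereas the gain reported above presumably comes from the store skipping rows during the scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical in-memory illustration of a batchId filter; not the Gora API. */
final class BatchScan {
  private BatchScan() {}

  /** A stored row reduced to its key and batchId mark (which may be null). */
  static final class Row {
    final String key;
    final String batchId;

    Row(String key, String batchId) {
      this.key = key;
      this.batchId = batchId;
    }
  }

  /** Matches rows marked with exactly the given batchId. */
  static Predicate<Row> batchIdEquals(String batchId) {
    return row -> batchId.equals(row.batchId);
  }

  /** Returns the keys of the rows that pass the filter, in table order. */
  static List<String> scan(List<Row> table, Predicate<Row> filter) {
    List<String> keys = new ArrayList<>();
    for (Row row : table) {
      if (filter.test(row)) {
        keys.add(row.key);
      }
    }
    return keys;
  }
}
```

A phase such as fetch or parse would then process only the keys returned for its own batchId instead of every row in the table.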
[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1714: Summary: Nutch 2.x upgrade to Gora 0.4 (was: Nutch 2.x upgrade to use GORA_94 branch) Nutch 2.x upgrade to Gora 0.4 - Key: NUTCH-1714 URL: https://issues.apache.org/jira/browse/NUTCH-1714 Project: Nutch Issue Type: Improvement Reporter: Alparslan Avcı Assignee: Alparslan Avcı Fix For: 2.3 Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch, NUTCH-1714v4.patch Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the details in this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1301) Index job resume switch to resume a failed job
[ https://issues.apache.org/jira/browse/NUTCH-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1301: Fix Version/s: (was: 2.3) 2.4 Index job resume switch to resume a failed job -- Key: NUTCH-1301 URL: https://issues.apache.org/jira/browse/NUTCH-1301 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: nutchgora Reporter: Dan Rosher Priority: Minor Fix For: 2.4 Attachments: NUTCH-1301-v2.patch, NUTCH-1301.patch This is also useful in nutchgora to allow for continuous indexing with -all -resume, as it is for fetching, cron scripts can then be independent without having to know the batchid. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: [DISCUSS] Roadmap for 2.3 Release
Hi Alparslan, Folks,
OK so you can see the road map's here: http://s.apache.org/Xqk
As you can see in 2.3 development drive we've addressed 66 of 71 issues. The remainders being as follows
NUTCH-1741 Support of Sitemaps in Nutch 2.x: https://issues.apache.org/jira/browse/NUTCH-1741
NUTCH-1714 Nutch 2.x upgrade to Gora 0.4: https://issues.apache.org/jira/browse/NUTCH-1714
NUTCH-1709 Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc: https://issues.apache.org/jira/browse/NUTCH-1709
NUTCH-1674 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index: https://issues.apache.org/jira/browse/NUTCH-1674
NUTCH-1570 Add filtering capability to Datastore Queries: https://issues.apache.org/jira/browse/NUTCH-1570
I think if we addressed the above then we could push an RC. Any comments? I'll be able to crack on with this final push relatively soon.
On Tue, Apr 29, 2014 at 1:09 PM, dev-digest-h...@nutch.apache.org wrote:
> I think we can also add https://issues.apache.org/jira/browse/NUTCH-1674. This issue was waiting the stable release of gora-0.4. And IMHO, we can add https://issues.apache.org/jira/browse/NUTCH-1741, if anyone could review and test it.
> Thanks, Alparslan
[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986402#comment-13986402 ] Julien Nioche commented on NUTCH-1714: --
Hi [~lewismc]
Re-progression update : I suspect a GORA issue. Would be good to try and reproduce it on a non-Nutch example. [NUTCH-1674] seems to do the filtering but not just on the batch ID as its title suggests.
{quote}
OK so when we read XML mappings (e.g. gora-hbase-mapping.xml) and initialize a Gora datastore the table is created no matter if data is written or read. Are you expecting to see Records? Or are you just surprised that the table is there and no Records?
{quote}
The latter. What I meant was that the crawl is working fine with that crawlID, the underlying table exists but I don't get any results from the readdb command
Nutch 2.x upgrade to Gora 0.4 - Key: NUTCH-1714 URL: https://issues.apache.org/jira/browse/NUTCH-1714 Project: Nutch Issue Type: Improvement Reporter: Alparslan Avcı Assignee: Alparslan Avcı Fix For: 2.3 Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch, NUTCH-1714v4.patch Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the details in this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986404#comment-13986404 ] Lewis John McGibbney commented on NUTCH-1714: - Looks like we have a couple of issues then. This is good as it means we are getting them prior to this getting anywhere near an RC ;) I will look into this ASAP Julien. Thanks Lewis Nutch 2.x upgrade to Gora 0.4 - Key: NUTCH-1714 URL: https://issues.apache.org/jira/browse/NUTCH-1714 Project: Nutch Issue Type: Improvement Reporter: Alparslan Avcı Assignee: Alparslan Avcı Fix For: 2.3 Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch, NUTCH-1714v4.patch Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the details in this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: [DISCUSS] Roadmap for 2.3 Release
I'd exclude NUTCH-1741 for now and focus on the core updates (GORA, filters, etc...). See comments on NUTCH-1714: https://issues.apache.org/jira/browse/NUTCH-1714
On 1 May 2014 07:27, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
> Hi Alparslan, Folks,
> OK so you can see the road map's here: http://s.apache.org/Xqk
> As you can see in 2.3 development drive we've addressed 66 of 71 issues. The remainders being as follows
> NUTCH-1741 Support of Sitemaps in Nutch 2.x: https://issues.apache.org/jira/browse/NUTCH-1741
> NUTCH-1714 Nutch 2.x upgrade to Gora 0.4: https://issues.apache.org/jira/browse/NUTCH-1714
> NUTCH-1709 Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc: https://issues.apache.org/jira/browse/NUTCH-1709
> NUTCH-1674 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index: https://issues.apache.org/jira/browse/NUTCH-1674
> NUTCH-1570 Add filtering capability to Datastore Queries: https://issues.apache.org/jira/browse/NUTCH-1570
> I think if we addressed the above then we could push an RC. Any comments? I'll be able to crack on with this final push relatively soon.
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: [DISCUSS] Roadmap for 2.3 Release
I agree with you, Julien. Today Lewis changed the fix version of some issues from 2.3 to 2.4. Most of them are my issues :) May I ask: if I update these issues, can I change their fix version back to 2.3? I need them.
Thanks
Talat
2014-05-01 9:47 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:
> I'd exclude NUTCH-1741 for now and focus on the core updates (GORA, filters, etc...). See comments on NUTCH-1714: https://issues.apache.org/jira/browse/NUTCH-1714
--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
Re: [DISCUSS] Roadmap for 2.3 Release
Hi Talat
It is not clear what you mean here. "I need them" is not really an explanation as to why they should be part of the next release. [If you want your own repository then open an account on GitHub (or somewhere else) and clone the 2.x branch to add the patches of your choice.] Lewis suggested a roadmap for the next release and the changes he made reflect his suggestions. If you think some of the issues should be part of the 2.3 release then please explain why. BTW I don't think you agree with me, as I was suggesting we stick to the ones already listed minus 1741.
Thanks
Julien
On 1 May 2014 08:40, Talat Uyarer ta...@uyarer.com wrote:
> I agree with you, Julien. Today Lewis changed the fix version of some issues from 2.3 to 2.4. Most of them are my issues :) May I ask: if I update these issues, can I change their fix version back to 2.3? I need them.
> Thanks
> Talat
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: [DISCUSS] Roadmap for 2.3 Release
Hi Julien,
Sorry, you are right. I guess I did not express myself well. I meant that some of the issues which are assigned to 2.4 should be part of 2.3. The issues:
NUTCH-1753 Eclipse dependecy problem for 2.x
NUTCH-1748 urlfilter-validator to allow .. (two dots) inside file names (path elements)
NUTCH-1740 BatchId parameter is not set in DbUpdaterJob
NUTCH-1728 indexer-solr plugin is not delete docs from solr
NUTCH-1725 CleaningJob's reducer does not commit deleted docs.
NUTCH-1662 NUTCH-1568 Indexer Plugin for Solr Cloud
NUTCH-1661 Language based crawling
NUTCH-1660 Index filter for Page's latitude and longitude
NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser
NUTCH-1643 Unnecessary fetching with http.content.limit when using protocol-http
NUTCH-1618 Fetches some websites multiple times for long lasting queues
Wdyt?
Talat
2014-05-01 11:32 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:
> Hi Talat
> Not clear what you mean here. "I need them" is not really an explanation as to why they should be part of the next release. [If you want your own repository then open an account on GitHub (or somewhere else) and clone the 2.x branch to add the patches of your choice]. Lewis suggested a roadmap for the next release and the changes he made reflect his suggestions. If you think some of the issues should be part of the 2.3 release then please explain why. BTW I don't think you agree with me as I was suggesting we stick to the ones already listed minus 1741.
> Thanks
> Julien
--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
[jira] [Commented] (NUTCH-1753) Eclipse dependecy problem for 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986698#comment-13986698 ] Julien Nioche commented on NUTCH-1753: -- It won't do any harm to do it the way you are suggesting. +1 to use your brand new committer skills. Reminder: don't forget to add a short description to CHANGES.txt and show the commit number when marking this issue as resolved. Thanks! Eclipse dependecy problem for 2.x - Key: NUTCH-1753 URL: https://issues.apache.org/jira/browse/NUTCH-1753 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Talat UYARER Assignee: Talat UYARER Priority: Minor Fix For: 2.4 Attachments: NUTCH-1753.patch When running Nutch 2.x on eclipse if dependencies is not added in eclipse target of build.xml some plugins do not work correctly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (NUTCH-1740) BatchId parameter is not set in DbUpdaterJob
[ https://issues.apache.org/jira/browse/NUTCH-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche resolved NUTCH-1740. -- Resolution: Duplicate BatchId parameter is not set in DbUpdaterJob Key: NUTCH-1740 URL: https://issues.apache.org/jira/browse/NUTCH-1740 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Alparslan Avcı Priority: Minor Attachments: NUTCH-1556-batchId.patch BatchId is not set in DbUpdaterJob since batchId is set to configuration after creating currentJob. -- This message was sent by Atlassian JIRA (v6.2#6252)
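The ordering problem described in this report (a value set on the configuration after the job has been created never reaches the job) can be illustrated with a small plain-Java sketch. SnapshotJob and the key "batch.id" are hypothetical and only mimic a job object that copies its configuration at construction time; the sketch does not use the Hadoop classes.

```java
import java.util.Properties;

/** Hypothetical job that snapshots its configuration when created; not Hadoop code. */
final class SnapshotJob {
  private final Properties conf = new Properties();

  SnapshotJob(Properties source) {
    // Copy the entries now; later changes to 'source' are not seen by this job.
    conf.putAll(source);
  }

  /** Returns the value the job saw at creation time, or null if it was not set yet. */
  String get(String key) {
    return conf.getProperty(key);
  }
}
```

Setting "batch.id" on the source Properties before constructing SnapshotJob makes the value visible to the job; setting it afterwards leaves get("batch.id") returning null, which is the shape of the bug reported here.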
[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1679: - Fix Version/s: (was: 2.4) 2.3 UpdateDb using batchId, link may override crawled page. --- Key: NUTCH-1679 URL: https://issues.apache.org/jira/browse/NUTCH-1679 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Tien Nguyen Manh Priority: Critical Fix For: 2.3 Attachments: NUTCH-1679.patch
The problem is in the HBase store; I am not sure about other stores. Suppose in the first crawl cycle we crawl link A and get an outlink B. In the second cycle we crawl link B, which also has a link pointing to A. In the second updatedb we load only page B from the store, and A is added as a new link, because updatedb doesn't know that A already exists in the store, and this overrides A. UpdateDb must be run without batchId, or we must set additionsAllowed=false. Here is the code for a new page:
    page = new WebPage();
    schedule.initializeSchedule(url, page);
    page.setStatus(CrawlStatus.STATUS_UNFETCHED);
    try {
      scoringFilters.initialScore(url, page);
    } catch (ScoringFilterException e) {
      page.setScore(0.0f);
    }
The new page will override the old page's status, score, fetchTime, fetchInterval, retries, metadata[CASH_KEY].
- I think we can change something here so that a new page only updates one column, for example 'link', and if it really is a new page, we can initialize all the above fields in the generator
- or we add a checkAndPut operator to the store, so that when adding a new page we first check whether it already exists
-- This message was sent by Atlassian JIRA (v6.2#6252)
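The second proposal above (check before put) can be sketched with a plain in-memory map. PageStore, Page and putIfNew are hypothetical names for illustration only; this is not the Gora or HBase API, and a real store would need the check and the write to happen atomically on the server side.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory store contrasting a blind put with a check-then-put. */
final class PageStore {
  /** A page reduced to its status for the purpose of the example. */
  static final class Page {
    final String status;

    Page(String status) {
      this.status = status;
    }
  }

  private final Map<String, Page> rows = new HashMap<>();

  /** Unconditional write: replaces whatever is stored under the URL. */
  void put(String url, Page page) {
    rows.put(url, page);
  }

  /** Writes only when no row exists for the URL; returns true if it wrote. */
  boolean putIfNew(String url, Page page) {
    return rows.putIfAbsent(url, page) == null;
  }

  Page get(String url) {
    return rows.get(url);
  }
}
```

In the scenario from the report, page A is already stored as fetched; adding A again as a newly discovered unfetched page through putIfNew leaves the fetched row untouched, while a plain put would reset it.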
[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1679: - Affects Version/s: (was: 2.3) 2.2.1 UpdateDb using batchId, link may override crawled page. --- Key: NUTCH-1679 URL: https://issues.apache.org/jira/browse/NUTCH-1679 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Tien Nguyen Manh Priority: Critical Fix For: 2.3 Attachments: NUTCH-1679.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1728) indexer-solr plugin is not delete docs from solr
[ https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1728: - Fix Version/s: 2.3 indexer-solr plugin is not delete docs from solr Key: NUTCH-1728 URL: https://issues.apache.org/jira/browse/NUTCH-1728 Project: Nutch Issue Type: Bug Reporter: İlhami KALKAN Fix For: 2.3 Attachments: NUTCH-1728.patch Missing delete variable used in delete(String key) method setting. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1728) indexer-solr plugin is not delete docs from solr
[ https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986713#comment-13986713 ] Julien Nioche commented on NUTCH-1728: -- +1 to commit indexer-solr plugin is not delete docs from solr Key: NUTCH-1728 URL: https://issues.apache.org/jira/browse/NUTCH-1728 Project: Nutch Issue Type: Bug Reporter: İlhami KALKAN Fix For: 2.3 Attachments: NUTCH-1728.patch Missing delete variable used in delete(String key) method setting. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.
[ https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1725: - Fix Version/s: 2.3 CleaningJob's reducer does not commit deleted docs. Key: NUTCH-1725 URL: https://issues.apache.org/jira/browse/NUTCH-1725 Project: Nutch Issue Type: Bug Reporter: İlhami KALKAN Fix For: 2.3 Attachments: NUTCH-1725.patch In the cleanup(Context context) method, the if condition has a logical problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.
[ https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986717#comment-13986717 ] Julien Nioche commented on NUTCH-1725: -- +1 to commit CleaningJob's reducer does not commit deleted docs. Key: NUTCH-1725 URL: https://issues.apache.org/jira/browse/NUTCH-1725 Project: Nutch Issue Type: Bug Reporter: İlhami KALKAN Fix For: 2.3 Attachments: NUTCH-1725.patch In the cleanup(Context context) method, the if condition has a logical problem. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1662) Indexer Plugin for Solr Cloud
[ https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1662: - Affects Version/s: (was: 2.3) 2.2.1 Indexer Plugin for Solr Cloud - Key: NUTCH-1662 URL: https://issues.apache.org/jira/browse/NUTCH-1662 Project: Nutch Issue Type: Sub-task Components: indexer Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.4 Attachments: NUTCH-1662.patch The main issue's patch uses a Solr HTTP connection, which doesn't support Solr Cloud. This plugin supports Solr Cloud. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1662) Indexer Plugin for Solr Cloud
[ https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986721#comment-13986721 ] Julien Nioche commented on NUTCH-1662: -- I think we did something pretty similar in 1.x and would like to make sure that both versions are as similar as possible. Will have a look at it later. Indexer Plugin for Solr Cloud - Key: NUTCH-1662 URL: https://issues.apache.org/jira/browse/NUTCH-1662 Project: Nutch Issue Type: Sub-task Components: indexer Affects Versions: 2.2.1 Reporter: Talat UYARER Fix For: 2.4 Attachments: NUTCH-1662.patch The main issue's patch uses a Solr HTTP connection, which doesn't support Solr Cloud. This plugin supports Solr Cloud. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: [DISCUSS] Roadmap for 2.3 Release
Hi Talat,

Comments below:

NUTCH-1753 Eclipse dependecy problem for 2.x = trivial, please see my comments on it
NUTCH-1748 urlfilter-validator to allow .. (two dots) inside file names (path elements) = still under discussion - leave it for 2.4
NUTCH-1740 BatchId parameter is not set in DbUpdaterJob = duplicate
NUTCH-1728 indexer-solr plugin is not delete docs from solr = trivial enough to be committed for 2.3
NUTCH-1725 CleaningJob's reducer does not commit deleted docs. = trivial enough to be committed for 2.3
NUTCH-1662 NUTCH-1568 Indexer Plugin for Solr Cloud = I think we did something pretty similar in 1.x and would like to make sure that both versions are as similar as possible.
NUTCH-1661 Language based crawling = This is definitely not being committed. You haven't replied to Otis's questions and this has to be properly reviewed first and discussed.
NUTCH-1660 Index filter for Page's latitude and longitude = same. You haven't replied to the comments on this one.
NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser = trivial indeed, +1 thanks
NUTCH-1643 Unnecessary fetching with http.content.limit when using protocol-http = needs reviewing first, let's leave it for later
NUTCH-1618 Fetches some websites multiple times for long lasting queues = trivial indeed, please change the title to something more explicit like "Turn speculative execution off for Fetching"

I have added NUTCH-1679 https://issues.apache.org/jira/browse/NUTCH-1679 (UpdateDb using batchId, link may override crawled page.) to 2.3 as it must be fixed ASAP. Thanks for pointing out these issues. I think the focus for 2.3 should be to get everything as robust as possible, we can always add new functionalities in another release after that (release often etc...).
One thing we should definitely have though is to leverage the brand new GORA filtering so that we get only the entries marked for a given job - see discussion on NUTCH-1714 https://issues.apache.org/jira/browse/NUTCH-1714. This should make Nutch 2.x a lot faster. We haven't released 2.x for some time and loads of interesting stuff has been done to it. It will be an exciting release! Thanks for your contributions and pushing things forward! Julien

2014-05-01 11:32 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com: Hi Talat Not clear what you mean here. "I need them" is not really an explanation as to why they should be part of the next release. [If you want your own repository then open an account on GitHub (or somewhere else) and clone the 2.x branch to add the patches of your choice]. Lewis suggested a roadmap for the next release and the changes he made reflect his suggestions. If you think some of the issues should be part of the 2.3 release then please explain why. BTW I don't think you agree with me as I was suggesting we stick to the ones already listed minus 1741. Thanks Julien

On 1 May 2014 08:40, Talat Uyarer ta...@uyarer.com wrote: I agree with you Julien. Today Lewis changed the fix version of some issues from 2.3 to 2.4. Most of them are my issues :) May I ask, if I update these issues, can I change the fix version back to 2.3? I need them. Thanks Talat

2014-05-01 9:47 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com: I'd exclude NUTCH-1741 for now and focus on the core updates (GORA, filters, etc...). See comments on NUTCH-1714 https://issues.apache.org/jira/browse/NUTCH-1714

On 1 May 2014 07:27, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote: Hi Alparslan, Folks, OK so you can see the roadmap here: http://s.apache.org/Xqk As you can see, in the 2.3 development drive we've addressed 66 of 71 issues.
The remainder are as follows:

NUTCH-1741 https://issues.apache.org/jira/browse/NUTCH-1741 Support of Sitemaps in Nutch 2.x
NUTCH-1714 https://issues.apache.org/jira/browse/NUTCH-1714 Nutch 2.x upgrade to Gora 0.4
NUTCH-1709 https://issues.apache.org/jira/browse/NUTCH-1709 Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus contain methods not defined in source .avsc
NUTCH-1674 https://issues.apache.org/jira/browse/NUTCH-1674 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
NUTCH-1570 https://issues.apache.org/jira/browse/NUTCH-1570 Add filtering capability to Datastore Queries

I think if we addressed the above then we could push an RC. Any comments? I'll be able to crack on with this final push relatively soon.

On Tue, Apr 29, 2014 at 1:09 PM, dev-digest-h...@nutch.apache.org wrote: I think we can also add https://issues.apache.org/jira/browse/NUTCH-1674. This issue was waiting the stable release of
[jira] [Commented] (NUTCH-1657) ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser
[ https://issues.apache.org/jira/browse/NUTCH-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986726#comment-13986726 ] Julien Nioche commented on NUTCH-1657: -- +1 thanks! ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser --- Key: NUTCH-1657 URL: https://issues.apache.org/jira/browse/NUTCH-1657 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Talat UYARER Priority: Minor Fix For: 2.3 Attachments: NUTCH-1657.patch ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are never set in HTMLParser.java. In 2.x, we don't set these values in any field. Actually, we never use them in 2.x, so I thought of deleting them. But Feng Lu guided me, and I will set the metadata field. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1618) Fetches some websites multiple times for long lasting queues
[ https://issues.apache.org/jira/browse/NUTCH-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1618: - Fix Version/s: (was: 2.4) 2.3 Fetches some websites multiple times for long lasting queues Key: NUTCH-1618 URL: https://issues.apache.org/jira/browse/NUTCH-1618 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 2.1, 2.2, 2.3, 2.4 Reporter: Talat UYARER Priority: Minor Fix For: 2.3 Attachments: NUTCH-1618.patch We are using Nutch for high-volume crawls. We noticed that the FetcherJob ReduceTask fetches some websites multiple times for long-lasting queues. I have discovered that the reason for this is the mapred.reduce.tasks.speculative.execution setting in Hadoop. 1.x has speculative execution turned off. I have created a patch for 2.x. -- This message was sent by Atlassian JIRA (v6.2#6252)
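To make the effect described above concrete, here is a minimal, self-contained simulation. It does not use Hadoop or Nutch; SpeculativeFetchSketch, runTask and the example URLs are hypothetical names invented for this sketch. It rests on one assumption about speculative execution: the framework may start a second attempt of a slow task, and although only one attempt's output is kept, side effects such as HTTP requests happen in every attempt.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical simulation of a fetch task that is run by one or more attempts. */
final class SpeculativeFetchSketch {

    /** Counts how often each URL is requested when the same task runs `attempts` times. */
    static Map<String, Integer> runTask(List<String> queue, int attempts) {
        Map<String, Integer> fetchCounts = new HashMap<>();
        for (int attempt = 0; attempt < attempts; attempt++) {
            for (String url : queue) {
                // The request is a side effect: it is issued by every attempt,
                // even if the output of that attempt is later discarded.
                fetchCounts.merge(url, 1, Integer::sum);
            }
        }
        return fetchCounts;
    }

    public static void main(String[] args) {
        List<String> queue = Arrays.asList("http://a.example/1", "http://a.example/2");
        // Speculation off: a single attempt per task.
        System.out.println("one attempt:  " + runTask(queue, 1));
        // Speculation on: a slow task may get a second, parallel attempt.
        System.out.println("two attempts: " + runTask(queue, 2));
    }
}
```

In this model a long-lasting queue is exactly the kind of slow task that would attract a second attempt, which matches the report; setting mapred.reduce.tasks.speculative.execution to false for the fetch job corresponds to the single-attempt case.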
[jira] [Updated] (NUTCH-1657) ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser
[ https://issues.apache.org/jira/browse/NUTCH-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1657: - Fix Version/s: (was: 2.4) 2.3 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser --- Key: NUTCH-1657 URL: https://issues.apache.org/jira/browse/NUTCH-1657 Project: Nutch Issue Type: Bug Affects Versions: 2.2.1 Reporter: Talat UYARER Priority: Minor Fix For: 2.3 Attachments: NUTCH-1657.patch ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are never set in HTMLParser.java. In 2.x, we don't set these values in any field. Actually, we never use them in 2.x, so I thought of deleting them. But Feng Lu guided me, and I will set the metadata field. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)
[ https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986943#comment-13986943 ] Rogério Pereira Araújo commented on NUTCH-1768: --- I tried to apply this patch to a fresh copy of the 2.2.1 sources, but patching the first file, ivy/ivy.xml, failed with the following output: Patching file ivy/ivy.xml using Plan A... Hunk #1 failed at 32. 1 out of 1 hunks failed--saving rejects to ivy/ivy.xml.rej Any hints? port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0) -- Key: NUTCH-1768 URL: https://issues.apache.org/jira/browse/NUTCH-1768 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.2.1 Reporter: Julien Nioche Labels: elasticsearch Fix For: 2.4 Attachments: NUTCH-1768.patch See [https://issues.apache.org/jira/browse/NUTCH-1745] ElasticSearch is currently at version 1.1.0. The patch attached upgrades the dependencies, fixes a couple of changes required by 1.1.0 and also: removes the need for having ES in the main ivy dependency - it is now only required at the plugin level; improves the logic around using the cluster name or an explicit host:port to connect to ES: the clustername is not required nor set when using host:port; uses a more sensible default value for the port. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4
[ https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alparslan Avcı updated NUTCH-1714: -- Attachment: NUTCH-1714v5.patch Hi [~jnioche], I have uploaded a new patch that also fixes the problem in _./nutch readdb -crawlId MYCRAWLIDHERE -stats_ command. Would you please test it again? Thanks! Nutch 2.x upgrade to Gora 0.4 - Key: NUTCH-1714 URL: https://issues.apache.org/jira/browse/NUTCH-1714 Project: Nutch Issue Type: Improvement Reporter: Alparslan Avcı Assignee: Alparslan Avcı Fix For: 2.3 Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the details in this issue. -- This message was sent by Atlassian JIRA (v6.2#6252)