[jira] [Updated] (NUTCH-1570) Add filtering capability to Datastore Queries

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1570:


Fix Version/s: (was: 2.4)
   2.3

 Add filtering capability to Datastore Queries
 -

 Key: NUTCH-1570
 URL: https://issues.apache.org/jira/browse/NUTCH-1570
 Project: Nutch
  Issue Type: Bug
  Components: storage
Affects Versions: 2.2
Reporter: Lewis John McGibbney
 Fix For: 2.3


 For some time this issue has been discussed on various lists.
 When upgrading the Gora dependencies in NUTCH-1569, I stumbled 
 across a comment within o.a.n.api.DbReader#iterator:
 {code}
   public Iterator<Map<String,Object>> iterator(String[] fields, String startKey,
       String endKey, String batchId) throws Exception {
     Query<String,WebPage> q = store.newQuery();
     String[] qFields = fields;
     if (fields != null) {
       HashSet<String> flds = new HashSet<String>(Arrays.asList(fields));
       // remove "url"
       flds.remove("url");
       if (flds.size() > 0) {
         qFields = flds.toArray(new String[flds.size()]);
       } else {
         qFields = null;
       }
     }
     q.setFields(qFields);
     if (startKey != null) {
       q.setStartKey(startKey);
       if (endKey != null) {
         q.setEndKey(endKey);
       }
     }
     Result<String,WebPage> res = store.execute(q);
     // XXX we should add the filtering capability to Query
     return new DbIterator(res, fields, batchId);
   }
 {code} 
 I will link this issue to something over on Gora once we get around to the 
 implementation.
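
For readers who want a picture of what "filtering" means here, the following is a minimal sketch of filtering applied on the client side, after the store has returned its results. It is not the Nutch or Gora API: the class FilteringIteratorSketch, the Row type and the helper urlsInBatch are hypothetical names invented for illustration. Pushing the same predicate into the Query itself, as the XXX comment suggests, would let the store skip rows before they are ever returned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

/** Hypothetical sketch: none of these names are Nutch or Gora API. */
class FilteringIteratorSketch {

  /** Minimal stand-in for a stored row; the real WebPage has many more fields. */
  static final class Row {
    final String url;
    final String batchId;

    Row(String url, String batchId) {
      this.url = url;
      this.batchId = batchId;
    }
  }

  /** Wraps a result iterator and skips every element the predicate rejects. */
  static <T> Iterator<T> filter(final Iterator<T> source, final Predicate<T> keep) {
    return new Iterator<T>() {
      private T pending;
      private boolean ready;

      @Override
      public boolean hasNext() {
        // Advance the underlying iterator until an accepted element is found.
        while (!ready && source.hasNext()) {
          T candidate = source.next();
          if (keep.test(candidate)) {
            pending = candidate;
            ready = true;
          }
        }
        return ready;
      }

      @Override
      public T next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        ready = false;
        return pending;
      }
    };
  }

  /** Returns the URLs of all rows that carry the given batch id. */
  static List<String> urlsInBatch(List<Row> rows, final String batchId) {
    List<String> urls = new ArrayList<String>();
    Iterator<Row> it = filter(rows.iterator(), r -> batchId.equals(r.batchId));
    while (it.hasNext()) {
      urls.add(it.next().url);
    }
    return urls;
  }

  public static void main(String[] args) {
    List<Row> rows = Arrays.asList(
        new Row("http://a.example/", "batch-1"),
        new Row("http://b.example/", "batch-2"),
        new Row("http://c.example/", "batch-1"));
    // prints [http://a.example/, http://c.example/]
    System.out.println(urlsInBatch(rows, "batch-1"));
  }
}
```

Note that with this client-side approach every row is still read from the store; only a filter evaluated inside the store avoids that cost.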



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-1410) impact of a map-reduce problem

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1410:


Fix Version/s: (was: 2.4)
   2.3

 impact of a map-reduce problem
 --

 Key: NUTCH-1410
 URL: https://issues.apache.org/jira/browse/NUTCH-1410
 Project: Nutch
  Issue Type: Bug
  Components: fetcher, generator
Reporter: behnam nikbakht
 Fix For: 2.3


 A simple test shows that each mapper or reducer has its own local view of 
 variables. Nutch shares a variable between mappers or reducers in several 
 places: for example, the generator has a shared variable, hostCounts, and in 
 the fetcher the last request time differs from one mapper (fetcherThread) to 
 another.
 This causes critical problems, such as sending multiple requests to the same 
 host, which can get the crawler blocked.
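
The behaviour the reporter describes can be sketched without Hadoop. In the sketch below, plain Java objects stand in for map or reduce tasks, and it assumes that URLs of one host reach more than one task. The class LocalCounterSketch, the Task type and the host name are hypothetical and are not Nutch code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; plain Java objects stand in for Hadoop tasks. */
class LocalCounterSketch {

  /** Each task instance owns its counters, as a task in its own JVM would. */
  static final class Task {
    private final Map<String, Integer> hostCounts = new HashMap<String, Integer>();

    /** Counts the host and returns true while the task-local limit allows it. */
    boolean accept(String host, int maxPerHost) {
      int seen = hostCounts.merge(host, 1, Integer::sum);
      return seen <= maxPerHost;
    }
  }

  /** Feeds URLs of a single host to several tasks and returns the total accepted. */
  static int acceptedAcrossTasks(int tasks, int urlsPerTask, int maxPerHost) {
    int accepted = 0;
    for (int t = 0; t < tasks; t++) {
      Task task = new Task(); // fresh, unshared state per task
      for (int i = 0; i < urlsPerTask; i++) {
        if (task.accept("example.org", maxPerHost)) {
          accepted++;
        }
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    // The limit is 2 per host, but 3 tasks each apply it independently.
    System.out.println(acceptedAcrossTasks(3, 5, 2)); // prints 6
  }
}
```

A per-host limit enforced this way only holds globally if all URLs of a host are routed to the same task.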





[jira] [Closed] (NUTCH-1410) impact of a map-reduce problem

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1410.
---




[jira] [Resolved] (NUTCH-1490) Data Truncation exceptions when using mysql

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1490.
-

Resolution: Won't Fix

gora-sql not in use right now

 Data Truncation exceptions when using mysql
 ---

 Key: NUTCH-1490
 URL: https://issues.apache.org/jira/browse/NUTCH-1490
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.1
Reporter: Nathan Gass
 Fix For: 2.3

 Attachments: patch


 Nutch does not enforce the set (or implicit) maximum length for the following 
 columns:
 title
 urls (id, baseUrl, reprUrl)
 typ (contentType)
 inlinks
 outlinks
 Trying to store too much data in one of these columns results in an exception 
 similar to this (copied from GORA-24; I will be able to add a newer stack 
 trace later today):
 java.io.IOException: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1 
 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:340) 
 at org.apache.gora.sql.store.SqlStore.close(SqlStore.java:185) 
 at org.apache.gora.mapreduce.GoraRecordWriter.close(GoraRecordWriter.java:55) 
 at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567) 
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408) 
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216) 
 Caused by: java.sql.BatchUpdateException: Data truncation: Data too long for column 'inlinks' at row 1 
 at com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2018) 
 at com.mysql.jdbc.PreparedStatement.executeBatch(PreparedStatement.java:1449) 
 at org.apache.gora.sql.store.SqlStore.flush(SqlStore.java:328) 
 ... 5 more
 I'll add my current fixes in later comments.
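
One way to avoid this kind of exception is to cut text values to the column limit before they are written. The sketch below illustrates that idea only; it is not the attached patch, and the class ColumnLengthSketch, the method fit and the limits 255 and 32 are assumptions made up for the example. Real limits come from the SQL mapping in use, and cutting is only reasonable for plain-text columns such as title, not for serialized columns such as inlinks.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the patch attached to this issue. */
class ColumnLengthSketch {

  /** Illustrative limits only; real limits come from the SQL mapping in use. */
  static final Map<String, Integer> MAX_LENGTH = new HashMap<String, Integer>();

  static {
    MAX_LENGTH.put("title", 255);
    MAX_LENGTH.put("typ", 32);
  }

  /** Returns the value cut to the column's limit, or unchanged if no limit is known. */
  static String fit(String column, String value) {
    Integer max = MAX_LENGTH.get(column);
    if (value == null || max == null || value.length() <= max) {
      return value;
    }
    // Lengths are counted in Java chars here; a byte-based column limit
    // would need an encoding-aware check instead.
    return value.substring(0, max);
  }

  public static void main(String[] args) {
    String longTitle = new String(new char[300]).replace('\0', 'x');
    System.out.println(fit("title", longTitle).length()); // prints 255
    System.out.println(fit("outlinks", "untouched"));     // prints untouched
  }
}
```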





[jira] [Closed] (NUTCH-1490) Data Truncation exceptions when using mysql

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-1490.
---




[jira] [Resolved] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1497.
-

Resolution: Won't Fix

gora-sql not in use right now

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Fix For: 2.3

 Attachments: gora-mysql-mapping-patch, gora-mysql-mapping.xml, 
 gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have included a 
 mapping which will work better for MySQL (it takes slightly more space but 
 can handle the larger fields necessary for real-world use). It includes the 
 patch from NUTCH-1490 and resolves the non-Unicode part of NUTCH-1473. I 
 believe it is not possible to use the same gora-sql-mapping for both hsqldb 
 and MySQL without falling back to a significantly degraded lowest common 
 denominator. Should the user manually rename the attached file to 
 gora-sql-mapping.xml, or is there a way to have Nutch automatically use it 
 when MySQL is selected in other configurations (Ivy.xml or gora.properties)?





[jira] [Updated] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1674:


Fix Version/s: (was: 2.4)
   2.3

 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
 -

 Key: NUTCH-1674
 URL: https://issues.apache.org/jira/browse/NUTCH-1674
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.3
Reporter: Tien Nguyen Manh
 Fix For: 2.3

 Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch, 
 NUTCH-1674_3.patch, NUTCH-1674_final.patch


 Nutch always scans the whole crawldb in each phase (generate, fetch, parse, 
 update, index). When the crawldb is big, the time to scan is bigger than the 
 actual processing time.
 We really need to skip records while scanning, using GORA-119; for example, we 
 can fetch only the records belonging to a specified batchId.
 In my crawl the filter reduced the scan time from 90 min to 30 min.





[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1714:


Summary: Nutch 2.x upgrade to Gora 0.4  (was: Nutch 2.x upgrade to use 
GORA_94 branch)

 Nutch 2.x upgrade to Gora 0.4
 -

 Key: NUTCH-1714
 URL: https://issues.apache.org/jira/browse/NUTCH-1714
 Project: Nutch
  Issue Type: Improvement
Reporter: Alparslan Avcı
Assignee: Alparslan Avcı
 Fix For: 2.3

 Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
 NUTCH-1714v2.patch, NUTCH-1714v4.patch


 Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
 details in this issue.





[jira] [Updated] (NUTCH-1301) Index job resume switch to resume a failed job

2014-05-01 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1301:


Fix Version/s: (was: 2.3)
   2.4

 Index job resume switch to resume a failed job
 --

 Key: NUTCH-1301
 URL: https://issues.apache.org/jira/browse/NUTCH-1301
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: nutchgora
Reporter: Dan Rosher
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1301-v2.patch, NUTCH-1301.patch


 This is also useful in nutchgora to allow for continuous indexing with -all 
 -resume, as it is for fetching; cron scripts can then run independently 
 without having to know the batchId.





Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Lewis John Mcgibbney
Hi Alparslan & Folks,

OK, so you can see the roadmap here:

http://s.apache.org/Xqk

As you can see, in the 2.3 development drive we've addressed 66 of 71 issues.
The remainder are as follows:

NUTCH-1741 Support of Sitemaps in Nutch 2.x
https://issues.apache.org/jira/browse/NUTCH-1741

NUTCH-1714 Nutch 2.x upgrade to Gora 0.4
https://issues.apache.org/jira/browse/NUTCH-1714

NUTCH-1709 Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus
contain methods not defined in source .avsc
https://issues.apache.org/jira/browse/NUTCH-1709

NUTCH-1674 Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
https://issues.apache.org/jira/browse/NUTCH-1674

NUTCH-1570 Add filtering capability to Datastore Queries
https://issues.apache.org/jira/browse/NUTCH-1570

I think if we addressed the above then we could push an RC.
Any comments?
I'll be able to crack on with this final push relatively soon.

On Tue, Apr 29, 2014 at 1:09 PM, dev-digest-h...@nutch.apache.org wrote:


 I think we can also add https://issues.apache.org/jira/browse/NUTCH-1674.
 This issue was waiting for the stable release of gora-0.4.

 And IMHO, we can add https://issues.apache.org/jira/browse/NUTCH-1741, if
 anyone could review and test it.

 Thanks,
 Alparslan





[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986402#comment-13986402
 ] 

Julien Nioche commented on NUTCH-1714:
--

Hi [~lewismc]

Re-progression update : I suspect a GORA issue. Would be good to try and 
reproduce it on a non-Nutch example.

 [NUTCH-1674] seems to do the filtering but not just on the batch ID as its 
title suggests.

{quote}
OK so when we read XML mappings (e.g. gora-hbase-mapping.xml) and initialize a 
Gora datastore the table is created no matter if data is written or read. Are 
you expecting to see Records? Or are you just surprised that the table is there 
and no Records?
{quote}

The latter. What I meant was that the crawl is working fine with that crawlID 
and the underlying table exists, but I don't get any results from the readdb 
command.






[jira] [Commented] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-01 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986404#comment-13986404
 ] 

Lewis John McGibbney commented on NUTCH-1714:
-

Looks like we have a couple of issues then. This is good as it means we are 
catching them before this gets anywhere near an RC ;)
I will look into this ASAP Julien.
Thanks
Lewis



Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
I'd exclude NUTCH-1741 for now and focus on the core updates (GORA,
filters, etc...). See comments on NUTCH-1714:
https://issues.apache.org/jira/browse/NUTCH-1714



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Talat Uyarer
I agree with you, Julien. Today Lewis changed the fix version of some issues
from 2.3 to 2.4, most of them mine :) May I ask: if I update these issues, can
I change the fix version back to 2.3? I need them.

Thanks
Talat



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
Hi Talat

Not clear what you mean here. "I need them" is not really an explanation as
to why they should be part of the next release. [If you want your own
repository then open an account on GitHub (or somewhere else) and clone the
2.x branch to add the patches of your choice].

Lewis suggested a roadmap for the next release and the changes he made
reflect his suggestions. If you think some of the issues should be part of
the 2.3 release then please explain why. BTW I don't think you agree with
me as I was suggesting we stick to the ones already listed minus 1741.

Thanks

Julien



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Talat Uyarer
Hi Julien,

Sorry, you are right. I guess I did not express myself clearly. I meant that
some of the issues now assigned to 2.4 should be part of 2.3.

The issues:
NUTCH-1753 Eclipse dependecy problem for 2.x
NUTCH-1748 urlfilter-validator to allow .. (two dots) inside file names
(path elements)
NUTCH-1740 BatchId parameter is not set in DbUpdaterJob
NUTCH-1728 indexer-solr plugin is not delete docs from solr
NUTCH-1725 CleaningJob's reducer does not commit deleted docs.
NUTCH-1662 NUTCH-1568 Indexer Plugin for Solr Cloud
NUTCH-1661 Language based crawling
NUTCH-1660 Index filter for Page's latitude and longitude
NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never
set in HTMLParser
NUTCH-1643 Unnecessary fetching with http.content.limit when using
protocol-http
NUTCH-1618 Fetches some websites multiple times for long lasting queues

Wdyt ?

Talat



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304


[jira] [Commented] (NUTCH-1753) Eclipse dependecy problem for 2.x

2014-05-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986698#comment-13986698
 ] 

Julien Nioche commented on NUTCH-1753:
--

It won't do any harm to do it the way you are suggesting. +1 to use your brand 
new committer skills. 
Reminder : don't forget to add a short description on CHANGES.txt and show the 
commit number when marking this issue as resolved
Thanks!

 Eclipse dependecy problem for 2.x
 -

 Key: NUTCH-1753
 URL: https://issues.apache.org/jira/browse/NUTCH-1753
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
Assignee: Talat UYARER
Priority: Minor
 Fix For: 2.4

 Attachments: NUTCH-1753.patch


 When running Nutch 2.x in Eclipse, some plugins do not work correctly if 
 their dependencies are not added to the eclipse target of build.xml. 





[jira] [Resolved] (NUTCH-1740) BatchId parameter is not set in DbUpdaterJob

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1740.
--

Resolution: Duplicate

 BatchId parameter is not set in DbUpdaterJob
 

 Key: NUTCH-1740
 URL: https://issues.apache.org/jira/browse/NUTCH-1740
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Alparslan Avcı
Priority: Minor
 Attachments: NUTCH-1556-batchId.patch


 The batchId is not set in DbUpdaterJob because it is written to the 
 configuration only after currentJob has been created.
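
The ordering problem can be shown with a small stand-in. The sketch below assumes, as is the case for Hadoop's Job, that the job takes a copy of the configuration when it is created, so later changes to the caller's configuration object are not seen by the job. The class ConfigSnapshotSketch, the SnapshotJob type and the key example.batch.id are hypothetical and are not Nutch or Hadoop names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; SnapshotJob stands in for a job that copies its configuration. */
class ConfigSnapshotSketch {

  /** Copies the configuration at construction time and never looks at the original again. */
  static final class SnapshotJob {
    private final Map<String, String> conf;

    SnapshotJob(Map<String, String> conf) {
      this.conf = new HashMap<String, String>(conf);
    }

    String get(String key) {
      return conf.get(key);
    }
  }

  /** Returns the batch id the job sees, depending on when the caller sets it. */
  static String batchIdSeenByJob(boolean setBeforeCreate) {
    Map<String, String> conf = new HashMap<String, String>();
    if (setBeforeCreate) {
      conf.put("example.batch.id", "batch-42");
    }
    SnapshotJob job = new SnapshotJob(conf);
    if (!setBeforeCreate) {
      conf.put("example.batch.id", "batch-42"); // too late: the job already has its copy
    }
    return job.get("example.batch.id");
  }

  public static void main(String[] args) {
    System.out.println(batchIdSeenByJob(true));  // prints batch-42
    System.out.println(batchIdSeenByJob(false)); // prints null
  }
}
```

Under that assumption the fix is simply a matter of ordering: set the value on the configuration first, then create the job.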





[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1679:
-

Fix Version/s: (was: 2.4)
   2.3

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1679.patch


 The problem is in the HBase store; I am not sure about other stores.
 Suppose that in the first crawl cycle we crawl link A and get an outlink B.
 In the second cycle we crawl link B, which also has a link pointing to A.
 In the second updatedb we load only page B from the store and add A as a new 
 link, because updatedb does not know that A already exists in the store, so 
 A is overwritten.
 UpdateDb must be run without batchId, or we must set additionsAllowed=false.
 Here is the code for a new page:
   page = new WebPage();
   schedule.initializeSchedule(url, page);
   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
   try {
     scoringFilters.initialScore(url, page);
   } catch (ScoringFilterException e) {
     page.setScore(0.0f);
   }
 The new page will override the old page's status, score, fetchTime, 
 fetchInterval, retries and metadata[CASH_KEY].
 - I think we can change something here so that a new page only updates one 
 column, for example 'link'; if it really is a new page, we can initialize 
 all of the above fields in the generator.
 - Or we add a checkAndPut operation to the store, so that when adding a new 
 page we first check whether it already exists.
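
The second suggestion can be sketched with an in-memory map standing in for the store. This is an illustration under assumptions, not the attached patch: the class CheckAndPutSketch, the Page type and the method names are hypothetical, and ConcurrentMap.putIfAbsent plays the role that a check-and-put operation would play in a real datastore.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch; an in-memory map stands in for the datastore. */
class CheckAndPutSketch {

  /** Minimal stand-in for a stored page. */
  static final class Page {
    final String status;
    final float score;

    Page(String status, float score) {
      this.status = status;
      this.score = score;
    }
  }

  private final ConcurrentMap<String, Page> store = new ConcurrentHashMap<String, Page>();

  /** Records a page that has actually been crawled. */
  void putFetched(String url, float score) {
    store.put(url, new Page("fetched", score));
  }

  /**
   * Adds an unfetched placeholder only when the URL is not stored yet.
   * Returns true if a new row was created, false if an existing row was kept.
   */
  boolean addIfNew(String url) {
    return store.putIfAbsent(url, new Page("unfetched", 0.0f)) == null;
  }

  /** Returns the stored status, or null if the URL is unknown. */
  String statusOf(String url) {
    Page page = store.get(url);
    return page == null ? null : page.status;
  }

  public static void main(String[] args) {
    CheckAndPutSketch db = new CheckAndPutSketch();
    db.putFetched("A", 1.5f);             // cycle 1: A is crawled
    System.out.println(db.addIfNew("A")); // cycle 2: B links to A, prints false
    System.out.println(db.statusOf("A")); // prints fetched
    System.out.println(db.addIfNew("C")); // a genuinely new link, prints true
  }
}
```

With a plain put in place of putIfAbsent, the second cycle would reset A to unfetched, which is the overwrite described above.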





[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1679:
-

Affects Version/s: (was: 2.3)
   2.2.1

 UpdateDb using batchId, link may override crawled page.
 ---

 Key: NUTCH-1679
 URL: https://issues.apache.org/jira/browse/NUTCH-1679
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Tien Nguyen Manh
Priority: Critical
 Fix For: 2.3

 Attachments: NUTCH-1679.patch


 The problem occurs with the HBase store; I am not sure about other stores.
 Suppose that in the first crawl cycle we crawl link A and get an outlink B.
 In the second cycle we crawl link B, which also has a link pointing to A.
 In the second updatedb we load only page B from the store and add A as a new
 link, because updatedb doesn't know that A already exists in the store, so the
 stored A is overridden.
 To avoid this, UpdateDb must be run without batchId, or we must set
 additionsAllowed=false.
 Here is the code that creates a new page:
   page = new WebPage();
   schedule.initializeSchedule(url, page);
   page.setStatus(CrawlStatus.STATUS_UNFETCHED);
   try {
     scoringFilters.initialScore(url, page);
   } catch (ScoringFilterException e) {
     page.setScore(0.0f);
   }
 The new page overrides the old page's status, score, fetchTime, fetchInterval,
 retries and metadata[CASH_KEY].
  - I think we can change something here so that a new page only updates one
    column, for example 'link'; if it is really a new page, we can initialize
    all of the above fields in the generator.
  - Or we add a checkAndPut operator to the store, so that when adding a new
    page we first check whether it already exists.





[jira] [Updated] (NUTCH-1728) indexer-solr plugin is not delete docs from solr

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1728:
-

Fix Version/s: 2.3

 indexer-solr plugin is not delete docs from solr
 

 Key: NUTCH-1728
 URL: https://issues.apache.org/jira/browse/NUTCH-1728
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN
 Fix For: 2.3

 Attachments: NUTCH-1728.patch


 The setting of the delete variable used in the delete(String key) method is missing.





[jira] [Commented] (NUTCH-1728) indexer-solr plugin is not delete docs from solr

2014-05-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986713#comment-13986713
 ] 

Julien Nioche commented on NUTCH-1728:
--

+1 to commit

 indexer-solr plugin is not delete docs from solr
 

 Key: NUTCH-1728
 URL: https://issues.apache.org/jira/browse/NUTCH-1728
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN
 Fix For: 2.3

 Attachments: NUTCH-1728.patch


 The setting of the delete variable used in the delete(String key) method is missing.





[jira] [Updated] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1725:
-

Fix Version/s: 2.3

 CleaningJob's reducer does not commit deleted docs. 
 

 Key: NUTCH-1725
 URL: https://issues.apache.org/jira/browse/NUTCH-1725
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN
 Fix For: 2.3

 Attachments: NUTCH-1725.patch


 In the cleanup(Context context) method, the if condition has a logic error.





[jira] [Commented] (NUTCH-1725) CleaningJob's reducer does not commit deleted docs.

2014-05-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986717#comment-13986717
 ] 

Julien Nioche commented on NUTCH-1725:
--

+1 to commit

 CleaningJob's reducer does not commit deleted docs. 
 

 Key: NUTCH-1725
 URL: https://issues.apache.org/jira/browse/NUTCH-1725
 Project: Nutch
  Issue Type: Bug
Reporter: İlhami KALKAN
 Fix For: 2.3

 Attachments: NUTCH-1725.patch


 In the cleanup(Context context) method, the if condition has a logic error.





[jira] [Updated] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1662:
-

Affects Version/s: (was: 2.3)
   2.2.1

 Indexer Plugin for Solr Cloud
 -

 Key: NUTCH-1662
 URL: https://issues.apache.org/jira/browse/NUTCH-1662
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.4

 Attachments: NUTCH-1662.patch


 The main issue's patch uses a Solr HTTP connection, which doesn't support Solr Cloud.
 This plugin supports Solr Cloud.





[jira] [Commented] (NUTCH-1662) Indexer Plugin for Solr Cloud

2014-05-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986721#comment-13986721
 ] 

Julien Nioche commented on NUTCH-1662:
--

 I think we did something pretty similar in 1.x and would like to make sure 
that both versions are as similar as possible. Will have a look at it later

 Indexer Plugin for Solr Cloud
 -

 Key: NUTCH-1662
 URL: https://issues.apache.org/jira/browse/NUTCH-1662
 Project: Nutch
  Issue Type: Sub-task
  Components: indexer
Affects Versions: 2.2.1
Reporter: Talat UYARER
 Fix For: 2.4

 Attachments: NUTCH-1662.patch


 The main issue's patch uses a Solr HTTP connection, which doesn't support Solr Cloud.
 This plugin supports Solr Cloud.





Re: [DISCUSS] Roadmap for 2.3 Release

2014-05-01 Thread Julien Nioche
Hi Talat,

Comments below :

> NUTCH-1753 Eclipse dependecy problem for 2.x

=> trivial, please see my comments on it

> NUTCH-1748 urlfilter-validator to allow .. (two dots) inside file names
> (path elements)

=> still under discussion - leave it for 2.4

> NUTCH-1740 BatchId parameter is not set in DbUpdaterJob

=> duplicate

> NUTCH-1728 indexer-solr plugin is not delete docs from solr

=> trivial enough to be committed for 2.3

> NUTCH-1725 CleaningJob's reducer does not commit deleted docs.

=> trivial enough to be committed for 2.3

> NUTCH-1662 NUTCH-1568 Indexer Plugin for Solr Cloud

=> I think we did something pretty similar in 1.x and would like to make
sure that both versions are as similar as possible.

> NUTCH-1661 Language based crawling

=> This is definitely not being committed. You haven't replied to Otis's
questions and this has to be properly reviewed first and discussed.

> NUTCH-1660 Index filter for Page's latitude and longitude

=> same. You haven't replied to the comments on this one.

> NUTCH-1657 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never
> set in HTMLParser

=> trivial indeed, +1 thanks

> NUTCH-1643 Unnecessary fetching with http.content.limit when using
> protocol-http

=> needs reviewing first, let's leave it for later

> NUTCH-1618 Fetches some websites multiple times for long lasting queues

=> trivial indeed, please change the title to something more explicit like
"Turn speculative execution off for Fetching"

I have added NUTCH-1679
<https://issues.apache.org/jira/browse/NUTCH-1679> (UpdateDb
using batchId, link may override crawled page.) to 2.3 as it must be fixed
ASAP.

Thanks for pointing out these issues. I think the focus for 2.3 should be
to get everything as robust as possible; we can always add new
functionality in another release after that (release often etc...). One
thing we should definitely have, though, is to leverage the brand new GORA
filtering so that we get only the entries marked for a given job - see the
discussion on NUTCH-1714 <https://issues.apache.org/jira/browse/NUTCH-1714>.
This should make Nutch 2.x a lot faster.
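
A rough illustration of that idea, for readers who have not followed
NUTCH-1714: instead of loading every row and discarding the unmarked ones in
the job, the predicate is handed to the scan so that only marked rows come
back. This is not the Gora filter API; BatchFilterSketch, scan and markedFor
are hypothetical names, and a map from URL to batch mark stands in for the
datastore.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class BatchFilterSketch {
    // Hypothetical scan: returns the URLs whose batch mark passes the filter.
    // A null mark means the row was not marked for any job.
    static List<String> scan(Map<String, String> rows, Predicate<String> markFilter) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            if (markFilter.test(row.getValue())) {
                urls.add(row.getKey());
            }
        }
        return urls;
    }

    // Filter that keeps only rows marked with the given batchId.
    static Predicate<String> markedFor(String batchId) {
        return mark -> batchId.equals(mark);
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("http://a.example/", "batch-1");
        rows.put("http://b.example/", "batch-2");
        rows.put("http://c.example/", null);

        // Only the entry marked for batch-2 reaches the job.
        System.out.println(scan(rows, markedFor("batch-2")));
    }
}
```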

We haven't released 2.x for some time and loads of interesting stuff has
been done to it. It will be an exciting release!

Thanks for your contributions and pushing things forward!

Julien




 2014-05-01 11:32 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:

 Hi Talat

 Not clear what you mean here. "I need them" is not really an explanation
 as to why they should be part of the next release. [If you want your own
 repository then open an account on GitHub (or somewhere else) and clone the
 2.x branch to add the patches of your choice].

 Lewis suggested a roadmap for the next release and the changes he made
 reflect his suggestions. If you think some of the issues should be part of
 the 2.3 release then please explain why. BTW I don't think you agree with
 me as I was suggesting we stick to the ones already listed minus 1741.

 Thanks

 Julien



 On 1 May 2014 08:40, Talat Uyarer ta...@uyarer.com wrote:

 I agree with you, Julien. Today Lewis changed the fix version of some issues
 from 2.3 to 2.4. Most of them are my issues :) May I ask: if I update these
 issues, can I change their fix version to 2.3? I need them.

 Thanks
 Talat


 2014-05-01 9:47 GMT+03:00 Julien Nioche lists.digitalpeb...@gmail.com:

 I'd exclude NUTCH-1741 for now and focus on the core updates (GORA,
 filters, etc...). See comments on
 NUTCH-1714 https://issues.apache.org/jira/browse/NUTCH-1714


 On 1 May 2014 07:27, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi Alparslan & Folks,

 OK, so you can see the roadmaps here:

 http://s.apache.org/Xqk

 As you can see, in the 2.3 development drive we've addressed 66 of 71
 issues. The remaining ones are as follows:

 NUTCH-1741 https://issues.apache.org/jira/browse/NUTCH-1741
   Support of Sitemaps in Nutch 2.x
 NUTCH-1714 https://issues.apache.org/jira/browse/NUTCH-1714
   Nutch 2.x upgrade to Gora 0.4
 NUTCH-1709 https://issues.apache.org/jira/browse/NUTCH-1709
   Generated classes o.a.n.storage.Host and o.a.n.storage.ProtocolStatus
   contain methods not defined in source .avsc
 NUTCH-1674 https://issues.apache.org/jira/browse/NUTCH-1674
   Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
 NUTCH-1570 https://issues.apache.org/jira/browse/NUTCH-1570
   Add filtering capability to Datastore Queries

 I think if we addressed the above then we could push an RC.
 Any comments?
 I'll be able to crack on with this final push relatively soon.

 On Tue, Apr 29, 2014 at 1:09 PM, dev-digest-h...@nutch.apache.org wrote:


 I think we can also add
 https://issues.apache.org/jira/browse/NUTCH-1674. This issue was
 waiting the stable release of 

[jira] [Commented] (NUTCH-1657) ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser

2014-05-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986726#comment-13986726
 ] 

Julien Nioche commented on NUTCH-1657:
--

+1 thanks!

 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in 
 HTMLParser
 ---

 Key: NUTCH-1657
 URL: https://issues.apache.org/jira/browse/NUTCH-1657
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1657.patch


 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are never set in
 HTMLParser.java.
 In 2.x, we never set these values in any field. In fact, we never use them
 in 2.x, so I thought about deleting them. But Feng Lu guided me, and I will
 set them in the metadata field.





[jira] [Updated] (NUTCH-1618) Fetches some websites multiple times for long lasting queues

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1618:
-

Fix Version/s: (was: 2.4)
   2.3

 Fetches some websites multiple times for long lasting queues
 

 Key: NUTCH-1618
 URL: https://issues.apache.org/jira/browse/NUTCH-1618
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.1, 2.2, 2.3, 2.4
Reporter: Talat UYARER
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1618.patch


 We are using Nutch for high-volume crawls. We noticed that the FetcherJob
 ReduceTask fetches some websites multiple times for long-lasting queues. I
 have discovered that the reason for this is the
 mapred.reduce.tasks.speculative.execution setting in Hadoop. 1.x has
 speculative execution turned off. I have created a patch for 2.x.
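
 For illustration, turning the setting off amounts to something like the
 sketch below. It is not the attached patch: java.util.Properties is used as
 a stand-in for the job configuration so that the example runs on its own,
 the class and method names are hypothetical, and the "true" default for an
 unset property is an assumption.

```java
import java.util.Properties;

final class SpeculativeExecutionSketch {
    // Property name as quoted in the issue description.
    static final String REDUCE_SPECULATIVE = "mapred.reduce.tasks.speculative.execution";

    // Turns speculative execution off for reduce tasks.
    // Properties is only a stand-in for the real job configuration object.
    static Properties disableReduceSpeculation(Properties jobConf) {
        jobConf.setProperty(REDUCE_SPECULATIVE, "false");
        return jobConf;
    }

    // Reads the flag back; an unset property is assumed to mean "enabled".
    static boolean isReduceSpeculationEnabled(Properties jobConf) {
        return Boolean.parseBoolean(jobConf.getProperty(REDUCE_SPECULATIVE, "true"));
    }

    public static void main(String[] args) {
        Properties jobConf = new Properties();
        System.out.println("before: " + isReduceSpeculationEnabled(jobConf));
        disableReduceSpeculation(jobConf);
        System.out.println("after: " + isReduceSpeculationEnabled(jobConf));
    }
}
```

 With speculation on, a second attempt of a slow reduce task can be started
 while the first is still running, so a long-lasting fetch queue may be
 fetched by both attempts.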





[jira] [Updated] (NUTCH-1657) ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in HTMLParser

2014-05-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1657:
-

Fix Version/s: (was: 2.4)
   2.3

 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION never set in 
 HTMLParser
 ---

 Key: NUTCH-1657
 URL: https://issues.apache.org/jira/browse/NUTCH-1657
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Talat UYARER
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1657.patch


 ORIGINAL_CHAR_ENCODING and CHAR_ENCODING_FOR_CONVERSION are never set in
 HTMLParser.java.
 In 2.x, we never set these values in any field. In fact, we never use them
 in 2.x, so I thought about deleting them. But Feng Lu guided me, and I will
 set them in the metadata field.





[jira] [Commented] (NUTCH-1768) port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0)

2014-05-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-1768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13986943#comment-13986943
 ] 

Rogério Pereira Araújo commented on NUTCH-1768:
---

I tried to apply this patch to a fresh copy of the 2.2.1 sources, but patching
the first file, ivy/ivy.xml, failed with the following output:

Patching file ivy/ivy.xml using Plan A...
Hunk #1 failed at 32.
1 out of 1 hunks failed--saving rejects to ivy/ivy.xml.rej

Any hints?

 port NUTCH-1745 to Nutch 2.x (Upgrade to ElasticSearch 1.1.0) 
 --

 Key: NUTCH-1768
 URL: https://issues.apache.org/jira/browse/NUTCH-1768
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 2.2.1
Reporter: Julien Nioche
  Labels: elasticsearch
 Fix For: 2.4

 Attachments: NUTCH-1768.patch


 See [https://issues.apache.org/jira/browse/NUTCH-1745]
 ElasticSearch is currently at version 1.1.0. The attached patch upgrades the
 dependencies, fixes a couple of changes required by 1.1.0 and also:
  - removes the need for having ES in the main ivy dependency - it is now only
    required at the plugin level
  - improves the logic around using the cluster name or an explicit host:port to
    connect to ES: the clustername is not required nor set when using host:port
  - uses a more sensible default value for the port





[jira] [Updated] (NUTCH-1714) Nutch 2.x upgrade to Gora 0.4

2014-05-01 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alparslan Avcı updated NUTCH-1714:
--

Attachment: NUTCH-1714v5.patch

Hi [~jnioche],
I have uploaded a new patch that also fixes the problem in _./nutch readdb 
-crawlId MYCRAWLIDHERE -stats_ command. 
Would you please test it again? Thanks!

 Nutch 2.x upgrade to Gora 0.4
 -

 Key: NUTCH-1714
 URL: https://issues.apache.org/jira/browse/NUTCH-1714
 Project: Nutch
  Issue Type: Improvement
Reporter: Alparslan Avcı
Assignee: Alparslan Avcı
 Fix For: 2.3

 Attachments: NUTCH-1714.patch, NUTCH-1714_NUTCH-1714_v2_v3.patch, 
 NUTCH-1714v2.patch, NUTCH-1714v4.patch, NUTCH-1714v5.patch


 Nutch upgrade for GORA_94 branch has to be implemented. We can discuss the 
 details in this issue.


