[jira] [Updated] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1686: Attachment: NUTCH-1686.patch Optimize UpdateDb to load less field from Store

[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1687: Attachment: NUTCH-1687.patch Pick queue in Round Robin

[jira] [Created] (NUTCH-1688) Port DeleteDuplicate based on crawlDB only to 2.x

2013-12-22 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1688: Summary: Port DeleteDuplicate based on crawlDB only to 2.x Key: NUTCH-1688 URL: https://issues.apache.org/jira/browse/NUTCH-1688 Project: Nutch Issue

[jira] [Updated] (NUTCH-1688) Port DeleteDuplicate based on crawlDB only to 2.x

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1688: Component/s: indexer Port DeleteDuplicate based on crawlDB only to 2.x

[jira] [Updated] (NUTCH-1688) Port DeleteDuplicate based on crawlDB only to 2.x

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1688: Attachment: NUTCH-1688.patch Port DeleteDuplicate based on crawlDB only to 2.x

[jira] [Updated] (NUTCH-1689) Improve CrawlDb stats

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1689: Attachment: NUTCH-1689.patch Improve CrawlDb stats

[jira] [Created] (NUTCH-1689) Improve CrawlDb stats

2013-12-22 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1689: Summary: Improve CrawlDb stats Key: NUTCH-1689 URL: https://issues.apache.org/jira/browse/NUTCH-1689 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-1689) Improve CrawlDb stats

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1689: Fix Version/s: 2.3 Improve CrawlDb stats - Key:

[jira] [Created] (NUTCH-1690) IndexClean: mark url as unindexed after clean to not delete again

2013-12-22 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1690: Summary: IndexClean: mark url as unindexed after clean to not delete again Key: NUTCH-1690 URL: https://issues.apache.org/jira/browse/NUTCH-1690 Project: Nutch

[jira] [Updated] (NUTCH-1690) IndexClean: mark url as unindexed after clean to not delete again

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1690: Fix Version/s: 2.3 IndexClean: mark url as unindexed after clean to not delete again

[jira] [Commented] (NUTCH-1686) Optimize UpdateDb to load less field from Store

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855385#comment-13855385 ] Nguyen Manh Tien commented on NUTCH-1686: no backwards compatibility, because i

[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855386#comment-13855386 ] Nguyen Manh Tien commented on NUTCH-1687: I found one in double linked list

[jira] [Updated] (NUTCH-1689) Improve CrawlDb stats

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1689: Attachment: (was: NUTCH-1690.patch) Improve CrawlDb stats

[jira] [Updated] (NUTCH-1689) Improve CrawlDb stats

2013-12-22 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1689: Attachment: NUTCH-1690.patch Thanks Tejas for reviewing 1)2) I think my change don't

[jira] [Commented] (NUTCH-1314) Impose a limit on the length of outlink target urls

2013-12-20 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854780#comment-13854780 ] Nguyen Manh Tien commented on NUTCH-1314: [~lewismc] We are using

[jira] [Created] (NUTCH-1682) Port optionally maintain custom fetch interval despite AdaptiveFetchSchedule to 2.x

2013-12-10 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1682: Summary: Port optionally maintain custom fetch interval despite AdaptiveFetchSchedule to 2.x Key: NUTCH-1682 URL: https://issues.apache.org/jira/browse/NUTCH-1682

[jira] [Updated] (NUTCH-1682) Port optionally maintain custom fetch interval despite AdaptiveFetchSchedule to 2.x

2013-12-10 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1682: Attachment: NUTCH-1682.patch Port optionally maintain custom fetch interval despite

[jira] [Updated] (NUTCH-1682) Port optionally maintain custom fetch interval despite AdaptiveFetchSchedule to 2.x

2013-12-10 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1682: Affects Version/s: (was: 2.3) Port optionally maintain custom fetch interval despite

[jira] [Updated] (NUTCH-1682) Port optionally maintain custom fetch interval despite AdaptiveFetchSchedule to 2.x

2013-12-10 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1682: Fix Version/s: 2.3 Port optionally maintain custom fetch interval despite

[jira] [Created] (NUTCH-1683) Optionally maintain custom fetch interval despite AbstractFetchSchedule

2013-12-10 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1683: Summary: Optionally maintain custom fetch interval despite AbstractFetchSchedule Key: NUTCH-1683 URL: https://issues.apache.org/jira/browse/NUTCH-1683 Project:

[jira] [Updated] (NUTCH-1683) Optionally maintain custom fetch interval despite AbstractFetchSchedule

2013-12-10 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1683: Description: DefaultFetchSchedule also changes fetch interval so we should also maintain

[jira] [Updated] (NUTCH-1683) Optionally maintain custom fetch interval despite AbstractFetchSchedule

2013-12-10 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1683: Attachment: (was: NUTCH-1683.patch) Optionally maintain custom fetch interval despite

[jira] [Updated] (NUTCH-1683) Optionally maintain custom fetch interval despite AbstractFetchSchedule

2013-12-10 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1683: Attachment: NUTCH-1683.patch Optionally maintain custom fetch interval despite

[jira] [Created] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2013-12-04 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1679: Summary: UpdateDb using batchId, link may override crawled page. Key: NUTCH-1679 URL: https://issues.apache.org/jira/browse/NUTCH-1679 Project: Nutch

[jira] [Updated] (NUTCH-1672) Inlinks are added twice in DbUpdateReducer

2013-11-24 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1672: Attachment: NUTCH-1672.patch Inlinks are added twice in DbUpdateReducer

[jira] [Created] (NUTCH-1672) Inlinks are added twice in DbUpdateReducer

2013-11-24 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1672: Summary: Inlinks are added twice in DbUpdateReducer Key: NUTCH-1672 URL: https://issues.apache.org/jira/browse/NUTCH-1672 Project: Nutch Issue Type:

[jira] [Created] (NUTCH-1673) Title isn't reset in MoreIndexingFilter

2013-11-24 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1673: Summary: Title isn't reset in MoreIndexingFilter Key: NUTCH-1673 URL: https://issues.apache.org/jira/browse/NUTCH-1673 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1673) Title isn't reset in MoreIndexingFilter

2013-11-24 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1673: Attachment: NUTCH-1673.patch Title isn't reset in MoreIndexingFilter

[jira] [Created] (NUTCH-1674) Use batchId filter enable scan (GORA-119) for Fetch,Parse,Update,Index

2013-11-24 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1674: Summary: Use batchId filter enable scan (GORA-119) for Fetch,Parse,Update,Index Key: NUTCH-1674 URL: https://issues.apache.org/jira/browse/NUTCH-1674 Project:

[jira] [Created] (NUTCH-1667) Updatedb always ignore batchId

2013-11-14 Thread Nguyen Manh Tien (JIRA)
Nguyen Manh Tien created NUTCH-1667: Summary: Updatedb always ignore batchId Key: NUTCH-1667 URL: https://issues.apache.org/jira/browse/NUTCH-1667 Project: Nutch Issue Type: Bug

[jira] [Updated] (NUTCH-1667) Updatedb always ignore batchId

2013-11-14 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1667: Attachment: NUTCH-1556-batchId.patch Updatedb always ignore batchId

[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId

2013-11-12 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nguyen Manh Tien updated NUTCH-1556: Attachment: NUTCH-1556-batchId.patch batchId is not set in currentJob because we set

[jira] [Commented] (NUTCH-961) Expose Tika's boilerpipe support

2013-10-07 Thread Nguyen Manh Tien (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788911#comment-13788911 ] Nguyen Manh Tien commented on NUTCH-961: I used patch NUTCH-961-2.1-v2.patch for