RE: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Markus Jelsma
Welcome!! -Original message- From: Sujen Shah Sent: Wednesday 16th September 2015 0:58 To: dev@nutch.apache.org Cc: u...@nutch.apache.org Subject: Re: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah Hi Everyone, I would like to thank the members of the Apache Nutch PMC for bringing m

[jira] [Assigned] (NUTCH-1572) Nutch 2.x should use o.a.g.mem.store.MemStore for testing

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney reassigned NUTCH-1572: --- Assignee: Lewis John McGibbney > Nutch 2.x should use o.a.g.mem.store.MemStor

[jira] [Updated] (NUTCH-1572) Nutch 2.x should use o.a.g.mem.store.MemStore for testing

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1572: Fix Version/s: (was: 2.4) 2.3.1 > Nutch 2.x should use o.a.g.

[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746866#comment-14746866 ] Hudson commented on NUTCH-1679: --- SUCCESS: Integrated in Nutch-nutchgora #1535 (See [https:/

[jira] [Created] (NUTCH-2101) Upgrade Nutch 2.X to Hadoop 2.4.0

2015-09-15 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-2101: --- Summary: Upgrade Nutch 2.X to Hadoop 2.4.0 Key: NUTCH-2101 URL: https://issues.apache.org/jira/browse/NUTCH-2101 Project: Nutch Issue Type: Bug

[jira] [Resolved] (NUTCH-2009) Fetcher does not work with batchID

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2009. - Resolution: Duplicate These MongoDB issues have been resolved in Gora 0.6.1 and on

[jira] [Resolved] (NUTCH-2080) Eclipse compilation issue

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2080. - Resolution: Invalid This has to do with ivy/ivy.xml configuration and should be fi

[jira] [Resolved] (NUTCH-2029) Mark.checkMark returns empty string when null is expected with mongodb storage

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-2029. - Resolution: Fixed This issue has been resolved as it was fixed over in GORA-423. W

[jira] [Resolved] (NUTCH-1922) DbUpdater overwrites fetch status for URLs from previous batches, causes repeated re-fetches

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1922. - Resolution: Duplicate This issue is a clone of NUTCH-1679 for which I just committ

[jira] [Resolved] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney resolved NUTCH-1679. - Resolution: Fixed Committed @revision 1703331 in 2.X HEAD > UpdateDb using batchI

[jira] [Updated] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1679: Attachment: NUTCH-1679_4.patch Patch which sorts out some trivial formatting and als

[jira] [Commented] (NUTCH-1679) UpdateDb using batchId, link may override crawled page.

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746807#comment-14746807 ] Lewis John McGibbney commented on NUTCH-1679: - I've tested this with Nutch 2.X

[jira] [Closed] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Kim Whitehall (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kim Whitehall closed NUTCH-2100. Resolution: Invalid The command was used incorrectly. There is no bug. > Nutch dump command doesnt

Re: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Sujen Shah
Hi Everyone, I would like to thank the members of the Apache Nutch PMC for bringing me on board and giving me the opportunity to become a member and committer. I am a graduate student at the University of Southern California, majoring in Computer Science. I have been working with Chris Mattmann an

[jira] [Commented] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Kim Whitehall (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746210#comment-14746210 ] Kim Whitehall commented on NUTCH-2100: -- LOL! How dumb of me! Yep, it works. Of all t

[jira] [Commented] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746141#comment-14746141 ] Chris A. Mattmann commented on NUTCH-2100: -- Kim I think that the directory expect

[jira] [Assigned] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Chris A. Mattmann (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2100: Assignee: Chris A. Mattmann > Nutch dump command doesnt dump anything > --

[jira] [Created] (NUTCH-2100) Nutch dump command doesnt dump anything

2015-09-15 Thread Kim Whitehall (JIRA)
Kim Whitehall created NUTCH-2100: Summary: Nutch dump command doesnt dump anything Key: NUTCH-2100 URL: https://issues.apache.org/jira/browse/NUTCH-2100 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746062#comment-14746062 ] Sebastian Nagel commented on NUTCH-1932: Correct, it was about 404 pages not about

[ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-15 Thread Sebastian Nagel
Dear all, on behalf of the Nutch PMC it is my pleasure to announce that Sujen Shah has been voted in as committer and member of the Nutch PMC. Sujen, would you mind introducing yourself to the Nutch community and telling us in just a few words about your interests and your plans regarding Nutch? Cong

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746034#comment-14746034 ] Markus Jelsma commented on NUTCH-1932: -- Hello Sebastian. I am not sure about that bei

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-1932: --- Attachment: NUTCH-1932-add.patch > Automatically remove orphaned pages > -

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746004#comment-14746004 ] Sebastian Nagel commented on NUTCH-1932: Hi Markus, understood. - didn't we have t

[GitHub] nutch pull request: 2.x

2015-09-15 Thread prernasatija
GitHub user prernasatija closed the pull request at: https://github.com/apache/nutch/pull/57 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[Nutch Wiki] Update of "AdvancedAjaxInteraction" by MichaelJoyce

2015-09-15 Thread Apache Wiki
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification. The "AdvancedAjaxInteraction" page has been changed by MichaelJoyce: https://wiki.apache.org/nutch/AdvancedAjaxInteraction?action=diff&rev1=4&rev2=5 Comment: Updates regarding available

[jira] [Commented] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator

2015-09-15 Thread Hudson (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745755#comment-14745755 ] Hudson commented on NUTCH-2093: --- SUCCESS: Integrated in Nutch-trunk #3271 (See [https://bui

[jira] [Commented] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-15 Thread ASF GitHub Bot (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745663#comment-14745663 ] ASF GitHub Bot commented on NUTCH-2099: --- GitHub user sujen1412 opened a pull request

[GitHub] nutch pull request: Fix for NUTCH-2099 Contributed by Sujen Shah

2015-09-15 Thread sujen1412
GitHub user sujen1412 opened a pull request: https://github.com/apache/nutch/pull/59 Fix for NUTCH-2099 Contributed by Sujen Shah You can merge this pull request into a Git repository by running: $ git pull https://github.com/sujen1412/nutch NUTCH-2099 Alternatively you can r

[jira] [Created] (NUTCH-2099) Refactoring the REST endpoints for integration with webui

2015-09-15 Thread Sujen Shah (JIRA)
Sujen Shah created NUTCH-2099: - Summary: Refactoring the REST endpoints for integration with webui Key: NUTCH-2099 URL: https://issues.apache.org/jira/browse/NUTCH-2099 Project: Nutch Issue Type:

[jira] [Updated] (NUTCH-2098) Add null SeedUrl constructor

2015-09-15 Thread Aron Ahmadia (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aron Ahmadia updated NUTCH-2098: Attachment: 0001-Default-SeedURL-constructor.patch > Add null SeedUrl constructor >

[jira] [Created] (NUTCH-2098) Add null SeedUrl constructor

2015-09-15 Thread Aron Ahmadia (JIRA)
Aron Ahmadia created NUTCH-2098: --- Summary: Add null SeedUrl constructor Key: NUTCH-2098 URL: https://issues.apache.org/jira/browse/NUTCH-2098 Project: Nutch Issue Type: Bug Components

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jnioche
GitHub user jnioche commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39509421 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import java.io.ByteArr

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jorgelbg
GitHub user jorgelbg commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39509273 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import java.io.ByteAr

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jorgelbg
GitHub user jorgelbg commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39509063 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import java.io.ByteAr

[jira] [Comment Edited] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745330#comment-14745330 ] Nadeem Douba edited comment on NUTCH-2097 at 9/15/15 12:23 PM: -

[jira] [Comment Edited] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745330#comment-14745330 ] Nadeem Douba edited comment on NUTCH-2097 at 9/15/15 12:22 PM: -

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745330#comment-14745330 ] Nadeem Douba commented on NUTCH-2097: - Re: maven migration Would building each tool i

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745322#comment-14745322 ] Markus Jelsma commented on NUTCH-2097: -- Yes, having them as separate mapper and reduc

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Nadeem Douba (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745313#comment-14745313 ] Nadeem Douba commented on NUTCH-2097: - I'm not entirely married to the package structu

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Description: Orphan scoring filter that determines whether a page has become orphaned, e.g. it has

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch First proper working patch. Tests pass > Automatically remove orph

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745216#comment-14745216 ] Markus Jelsma commented on NUTCH-1932: -- Hey Sebastian, I fixed the location, it is al

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch > Automatically remove orphaned pages > -

[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745205#comment-14745205 ] Sebastian Nagel commented on NUTCH-1932: Hi Markus, that looks quite simple - do w

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jnioche
GitHub user jnioche commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39492479 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import java.io.ByteArr

[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper

2015-09-15 Thread jnioche
GitHub user jnioche commented on a diff in the pull request: https://github.com/apache/nutch/pull/55#discussion_r39492460 --- Diff: src/java/org/apache/nutch/tools/CommonCrawlFormatWARC.java --- @@ -0,0 +1,337 @@ +package org.apache.nutch.tools; + +import java.io.ByteArr

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch Eeh, patch with the scoring filter itself. Apparently it is possible

[jira] [Updated] (NUTCH-1932) Automatically remove orphaned pages

2015-09-15 Thread Markus Jelsma (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1932: - Attachment: NUTCH-1932.patch New and much simpler patch. This relies on a scoring filter to mark

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Sebastian Nagel (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745061#comment-14745061 ] Sebastian Nagel commented on NUTCH-2097: Yes, looks promising. - maven could simpl

[jira] [Commented] (NUTCH-2097) Proposal for Nutch 3.x

2015-09-15 Thread Lewis John McGibbney (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744983#comment-14744983 ] Lewis John McGibbney commented on NUTCH-2097: - Hi [~markus17] thanks for initi