[jira] [Commented] (NUTCH-1581) CrawlDB csv output to include metadata

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696865#comment-13696865
 ] 

lufeng commented on NUTCH-1581:
---

I have tested it with Nutch 1.x and it works fine.

+1

> CrawlDB csv output to include metadata
> --
>
> Key: NUTCH-1581
> URL: https://issues.apache.org/jira/browse/NUTCH-1581
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1581-1.8.patch
>
>
> Dumping the CrawlDB to CSV should include the CrawlDatum's metadata.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1327:
-

Attachment: NUTCH-1327-1.8-2.patch

Thanks! I always forget something! Here's a new one plus comment!

> QueryStringNormalizer
> -
>
> Key: NUTCH-1327
> URL: https://issues.apache.org/jira/browse/NUTCH-1327
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1327-1.8-1.patch, NUTCH-1327-1.8-2.patch
>
>
> A normalizer for dealing with query strings. Sorting query strings is helpful 
> in preventing duplicates for some (bad) websites.
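
For illustration, the sorting described above can be sketched as follows. This is a minimal sketch and not the code from the attached patch: the class name QuerySortSketch and the method sortQuery are hypothetical, it uses only the Java standard library, and it assumes parameters are separated by "&" and compared as plain strings.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not taken from the NUTCH-1327 patch. */
final class QuerySortSketch {

  /** Returns the URL with its query parameters sorted lexicographically. */
  static String sortQuery(String url) {
    // Set the fragment aside first so a '?' inside it is not misread as a query.
    String fragment = "";
    int hash = url.indexOf('#');
    if (hash >= 0) {
      fragment = url.substring(hash);
      url = url.substring(0, hash);
    }
    int q = url.indexOf('?');
    if (q < 0 || q == url.length() - 1) {
      return url + fragment; // no query string, nothing to sort
    }
    // Assumption: parameters are separated by '&' and sorted as raw strings.
    String[] params = url.substring(q + 1).split("&");
    Arrays.sort(params);
    return url.substring(0, q + 1) + String.join("&", params) + fragment;
  }
}
```

With this sketch, http://example.com/p?b=2&a=1 and http://example.com/p?a=1&b=2 normalize to the same URL, so they would no longer look like two different pages.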



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696854#comment-13696854
 ] 

lufeng commented on NUTCH-1327:
---

Hi Markus, I tested your patch. Did you forget to add the deploy and test
targets to src/plugin/build.xml?

+1 

> QueryStringNormalizer
> -
>
> Key: NUTCH-1327
> URL: https://issues.apache.org/jira/browse/NUTCH-1327
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1327-1.8-1.patch
>
>
> A normalizer for dealing with query strings. Sorting query strings is helpful 
> in preventing duplicates for some (bad) websites.



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696840#comment-13696840
 ] 

Tejas Patil commented on NUTCH-1327:


Hi Markus,

1. The patch, applied as is, did not compile the plugin. I had to add entries
to src/plugin/build.xml to get it compiled.
2. Could you add some Javadoc comments to the QuerystringURLNormalizer class
so that people can quickly get an idea of what this plugin does?

> QueryStringNormalizer
> -
>
> Key: NUTCH-1327
> URL: https://issues.apache.org/jira/browse/NUTCH-1327
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1327-1.8-1.patch
>
>
> A normalizer for dealing with query strings. Sorting query strings is helpful 
> in preventing duplicates for some (bad) websites.



[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696833#comment-13696833
 ] 

Hudson commented on NUTCH-1594:
---

Integrated in Nutch-nutchgora #669 (See 
[https://builds.apache.org/job/Nutch-nutchgora/669/])
NUTCH-1594 count variable is never changed in ParseUtil class (Revision 
1498437)

 Result = SUCCESS
fenglu : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1498437
Files : 
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParseUtil.java


> count variable is never changed in ParseUtil class
> --
>
> Key: NUTCH-1594
> URL: https://issues.apache.org/jira/browse/NUTCH-1594
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.2
> Reporter: lufeng
> Assignee: lufeng
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1594.patch
>
>
> In the ParseUtil class, the count variable is never changed. The code looks like this:
> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
> so even if you define the "db.max.outlinks.per.page" parameter, it will not
> take effect.
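
To illustrate the problem described above: a loop guarded by count only enforces the limit if count is incremented inside the loop body. The following is a minimal sketch under that assumption and not the committed fix; OutlinkLimitSketch, limit, and the null check are hypothetical stand-ins, and outlinks are plain strings here rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the actual ParseUtil code. */
final class OutlinkLimitSketch {

  /** Keeps at most maxOutlinks non-null entries, in their original order. */
  static List<String> limit(String[] outlinks, int maxOutlinks) {
    List<String> kept = new ArrayList<>();
    int count = 0;
    for (int i = 0; count < maxOutlinks && i < outlinks.length; i++) {
      if (outlinks[i] == null) {
        continue; // skipped entries do not count against the limit
      }
      kept.add(outlinks[i]);
      count++; // without this increment the limit never takes effect
    }
    return kept;
  }
}
```

Removing the count++ line reproduces the reported behaviour: the guard count < maxOutlinks stays true for any positive limit, so every outlink is kept.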



Re: Adding nutch stage

2013-07-01 Thread Tejas Patil
On Mon, Jul 1, 2013 at 5:31 AM, Ahmet Emre Aladağ wrote:

> Hi,
>
> I'd like to add a new stage called "updatescore" after "updatedb" to Nutch
> 2.1.
>
> I tried two ways for this:
> 1) public class ScoreUpdaterJob extends NutchTool implements Tool;
>
> Nutch requires me to define the InputFormat, OutputFormat etc. to perform
> Map-reduce calculations.
>
> I don't want to perform map-reduce but call a Giraph job to run on Hadoop.
> When it's finished, Nutch can go on its way.
>

> 2) public class ScoreUpdaterJob implements Tool;
> or public class ScoreUpdaterJob;
>
> Then I can't use setJarClass of NutchTool, so hadoop job fails:
> Caused by: java.lang.ClassNotFoundException:
> org.apache.giraph.examples.LinkRank.LinkRankComputation
>

Isn't setJarClass a method provided by Hadoop itself (Job.setJarByClass, see
the link below) rather than something provided by NutchTool?
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setJarByClass%28java.lang.Class%29

>
> How can I fix this? What's the best way to add a giraph job as a Nutch
> stage?
>

My feeling is that #2 should work.


> Thanks,
>
>
>


[jira] [Commented] (NUTCH-1594) count variable is never changed in ParseUtil class

2013-07-01 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696798#comment-13696798
 ] 

lufeng commented on NUTCH-1594:
---

Committed @revision 1498437 in 2.x HEAD. Thanks Canan and Lewis.

> count variable is never changed in ParseUtil class
> --
>
> Key: NUTCH-1594
> URL: https://issues.apache.org/jira/browse/NUTCH-1594
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.2
> Reporter: lufeng
> Assignee: lufeng
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1594.patch
>
>
> In the ParseUtil class, the count variable is never changed. The code looks like this:
> for (int i = 0; count < maxOutlinks && i < outlinks.length; i++)
> so even if you define the "db.max.outlinks.per.page" parameter, it will not
> take effect.



Adding nutch stage

2013-07-01 Thread Ahmet Emre Aladağ

Hi,

I'd like to add a new stage called "updatescore" after "updatedb" to 
Nutch 2.1.


I tried two ways for this:
1) public class ScoreUpdaterJob extends NutchTool implements Tool;

Nutch requires me to define the InputFormat, OutputFormat etc. to 
perform Map-reduce calculations.


I don't want to perform map-reduce but call a Giraph job to run on 
Hadoop. When it's finished, Nutch can go on its way.


2) public class ScoreUpdaterJob implements Tool;
or public class ScoreUpdaterJob;

Then I can't use setJarClass of NutchTool, so the Hadoop job fails:
Caused by: java.lang.ClassNotFoundException: 
org.apache.giraph.examples.LinkRank.LinkRankComputation


How can I fix this? What's the best way to add a Giraph job as a Nutch
stage?

Thanks,




Jenkins build is back to normal : Nutch-trunk #2263

2013-07-01 Thread Apache Jenkins Server
See 



[jira] [Commented] (NUTCH-1593) normalize option missing in SegmentMerger's usage

2013-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696746#comment-13696746
 ] 

Hudson commented on NUTCH-1593:
---

Integrated in Nutch-trunk #2263 (See 
[https://builds.apache.org/job/Nutch-trunk/2263/])
NUTCH-1593 Normalize option missing in SegmentMerger's usage (Revision 
1498346)

 Result = SUCCESS
markus : http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1498346
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/segment/SegmentMerger.java


> normalize option missing in SegmentMerger's usage
> -
>
> Key: NUTCH-1593
> URL: https://issues.apache.org/jira/browse/NUTCH-1593
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1593.patch
>
>




[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696705#comment-13696705
 ] 

Markus Jelsma commented on NUTCH-1327:
--

Any comments? Thanks

> QueryStringNormalizer
> -
>
> Key: NUTCH-1327
> URL: https://issues.apache.org/jira/browse/NUTCH-1327
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1327-1.8-1.patch
>
>
> A normalizer for dealing with query strings. Sorting query strings is helpful 
> in preventing duplicates for some (bad) websites.



[jira] [Commented] (NUTCH-1581) CrawlDB csv output to include metadata

2013-07-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696691#comment-13696691
 ] 

Markus Jelsma commented on NUTCH-1581:
--

I'll commit this one unless there are objections. Thanks

> CrawlDB csv output to include metadata
> --
>
> Key: NUTCH-1581
> URL: https://issues.apache.org/jira/browse/NUTCH-1581
> Project: Nutch
> Issue Type: Improvement
> Components: crawldb
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1581-1.8.patch
>
>
> Dumping the CrawlDB to CSV should include the CrawlDatum's metadata.



[jira] [Resolved] (NUTCH-1593) normalize option missing in SegmentMerger's usage

2013-07-01 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1593.
--

Resolution: Fixed

Committed in trunk in rev. 1498346.

> normalize option missing in SegmentMerger's usage
> -
>
> Key: NUTCH-1593
> URL: https://issues.apache.org/jira/browse/NUTCH-1593
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1593.patch
>
>

