What is the correct way to serialize a MapWritable to WebPage's metadata?

2014-01-20 Thread d_k
I'm working on porting NUTCH-1622 to Nutch 2 and the path I took was to add
a MapWritable field to the Outlink class to hold the metadata.

In order to store the metadata in the WebPage so that it is passed along
through the mappers and reducers, I used the metadata field of the WebPage class.

Because the putToMetadata method of WebPage accepts a ByteBuffer, in
order to convert the MapWritable to a ByteBuffer I'm using something along
the lines of:

ByteArrayOutputStream outStream = new ByteArrayOutputStream();
DataOutputStream dataOut = new DataOutputStream(outStream);

MapWritable outlinkMap = new MapWritable();

// ... fill outlinkMap ...

try {
    outlinkMap.write(dataOut);
    dataOut.close();
} catch (IOException e) {
    LOG.warn("...");
}

ByteBuffer byteBuffer = ByteBuffer.wrap(outStream.toByteArray());
page.putToMetadata(new Utf8("outlinks-metadata"), byteBuffer);
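
For what it's worth, reading the map back out on the consumer side would be
the mirror image. This is only a sketch: it assumes the Gora-generated
getFromMetadata accessor on WebPage and the same "outlinks-metadata" key
used above.

    ByteBuffer buffer = page.getFromMetadata(new Utf8("outlinks-metadata"));
    MapWritable outlinkMap = new MapWritable();
    if (buffer != null) {
        byte[] raw = new byte[buffer.remaining()];
        buffer.duplicate().get(raw); // copy out without consuming the stored buffer
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            outlinkMap.readFields(in);
        } catch (IOException e) {
            LOG.warn("Failed to deserialize outlink metadata", e);
        }
    }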

And I would be happy to get some input on:
1) Is this the correct way to convert the MapWritable to a ByteBuffer to be
stored in the WebPage's metadata?
2) Should the metadata be stored in the metadata field as a ByteBuffer, or
is there a better way to pass the metadata along?
3) Did I waste my time working with MapWritable? Could I have used any Java
collection, as long as the target JVM could have deserialized it, considering
that all that is passed is an array of bytes and Outlink is never passed as
is? Outlinks are passed as a map between url and anchor (utf8, utf8).
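
Regarding (3): since all that crosses the job boundary is a byte array, any
serialization that both ends agree on would work; MapWritable is just the
Hadoop-idiomatic choice. A minimal, hypothetical sketch of the same round
trip using only plain Java serialization of a HashMap (class and method
names are mine, not from Nutch):

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class MapSerde {

    // Serialize a map into a ByteBuffer using standard Java serialization.
    static ByteBuffer toByteBuffer(Map<String, String> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(map)); // HashMap is Serializable
        }
        return ByteBuffer.wrap(bytes.toByteArray());
    }

    // Deserialize the map back; the reading JVM only needs the same classes.
    @SuppressWarnings("unchecked")
    static Map<String, String> fromByteBuffer(ByteBuffer buffer)
            throws IOException, ClassNotFoundException {
        byte[] raw = new byte[buffer.remaining()];
        buffer.get(raw);
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (Map<String, String>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> outlinks = new HashMap<>();
        outlinks.put("http://example.com/", "Example anchor");
        ByteBuffer buf = toByteBuffer(outlinks);
        Map<String, String> roundTrip = fromByteBuffer(buf);
        System.out.println(roundTrip.equals(outlinks)); // true
    }
}
```

Java serialization ties the bytes to the classes' serialVersionUIDs, though,
so sticking with Writable is the safer convention inside Hadoop jobs.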

... my next change was to make the Utf8 allocation static... :-P


[jira] [Commented] (NUTCH-1674) Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index

2014-01-20 Thread Alexander Uretsky (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876432#comment-13876432
 ] 

Alexander Uretsky commented on NUTCH-1674:
--

Tried this out and it works great! Thanks for the help! 

> Use batchId filter to enable scan (GORA-119) for Fetch,Parse,Update,Index
> -
>
> Key: NUTCH-1674
> URL: https://issues.apache.org/jira/browse/NUTCH-1674
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3
>Reporter: Tien Nguyen Manh
> Fix For: 2.3
>
> Attachments: NUTCH-1674.patch, NUTCH-1674_2.patch, NUTCH-1674_3.patch
>
>
> Nutch always scans the whole crawldb in each phase (generate, fetch, parse, 
> update, index). When the crawldb is big, the time to scan is bigger than the 
> actual processing time.
> We really need to skip records while scanning; using GORA-119, for example, we 
> can get only the records belonging to a specified batchId.
> In my crawl the filter reduced the scan time from 90 min to 30 min.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876426#comment-13876426
 ] 

Markus Jelsma commented on NUTCH-1113:
--

I have to reindex my control cluster segment by segment in chronological order 
because NUTCH-1706 was not enabled when I reindexed it last Friday. According 
to some test segments, that should decrease the size of the control cluster by 
properly deleting some redirects!

> Merging segments causes URLs to vanish from crawldb/index?
> --
>
> Key: NUTCH-1113
> URL: https://issues.apache.org/jira/browse/NUTCH-1113
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.3
>Reporter: Edward Drapkin
>Priority: Blocker
> Fix For: 1.9
>
> Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, 
> NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene 
> indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up 
> in the index vs. when I crawl without merging the segments.  Somehow the 
> segment merger causes me to lose ~20% of my crawl database!





[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876384#comment-13876384
 ] 

Markus Jelsma commented on NUTCH-1113:
--

I got fewer documents indexed when ignoring fetch_retry (726830) than 
previously, but some control documents made it in. I think this is better.






[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876304#comment-13876304
 ] 

Markus Jelsma commented on NUTCH-1113:
--

OK, I got something! A record that wasn't indexed from the merged segment has a 
fetch status; that is because fetch_retry is also considered a fetch status. 
I'll come up with a new patch that also considers this case.






[jira] [Commented] (NUTCH-1680) CrawldbReader to dump minRetry value

2014-01-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876293#comment-13876293
 ] 

Hudson commented on NUTCH-1680:
---

SUCCESS: Integrated in Nutch-trunk #2498 (See 
[https://builds.apache.org/job/Nutch-trunk/2498/])
NUTCH-1680 CrawlDbReader to dump minRetry value (markus: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1559657)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java


> CrawldbReader to dump minRetry value
> 
>
> Key: NUTCH-1680
> URL: https://issues.apache.org/jira/browse/NUTCH-1680
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1680-trunk.patch
>
>
> CrawlDBReader should be able to dump records based on minimum retry value.





[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-01-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876288#comment-13876288
 ] 

Markus Jelsma commented on NUTCH-1708:
--

Hi Sebastian - we've had issues with that before and tracked it down to the 
representative URL being indexed in index-basic as well. We chose to 
completely remove that from our custom indexing filter; in my opinion a URL 
must be indexed by its real URL, not some representative URL. Indexing 
representative URLs also causes duplicates, which may or may not be removed by 
Nutch's new deduplication code because the signatures are usually not the same.

> use same id when indexing and deleting redirects
> 
>
> Key: NUTCH-1708
> URL: https://issues.apache.org/jira/browse/NUTCH-1708
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.7
>Reporter: Sebastian Nagel
>
> Redirect targets are indexed using "representative URL"
> * in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in 
> CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect 
> pair.
> * NutchField "url" is filled by basic indexing filter with repr URL
> * id field used as unique key is filled from url per solrindex-mapping.xml
> Deletion of redirects is done in IndexerMapReduce.reduce() by key which is 
> the URL of the redirect source. If the source URL is chosen as repr URL a 
> redirect target may get erroneously deleted.
> Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to 
> {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates 
> that same URL is deleted and added:
> {code}
> delete  http://wiki.apache.org/nutch
> add http://wiki.apache.org/nutch
> {code}





[jira] [Commented] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?

2014-01-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876283#comment-13876283
 ] 

Markus Jelsma commented on NUTCH-1113:
--

Yes, I reindexed them segment by segment. The merged segment produced 
726842 documents, and the individual segments, indexed in chronological order, 
produced 726920. I used my patch for merging, the one that skips LINKED 
CrawlDatums. Something is not right here. I also noticed NUTCH-1708, the issue 
of representative URLs. The indexing filter we use does not use 
representative URLs for indexing.






[jira] [Resolved] (NUTCH-1680) CrawldbReader to dump minRetry value

2014-01-20 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1680.
--

Resolution: Fixed

Thanks! Committed to trunk in revision 1559657.







[jira] [Commented] (NUTCH-1697) SegmentMerger to implement Tool

2014-01-20 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876278#comment-13876278
 ] 

Markus Jelsma commented on NUTCH-1697:
--

Hi Tejas, I think it does not matter; most of our scripts use -Dprop=val 
without a space.

> SegmentMerger to implement Tool
> ---
>
> Key: NUTCH-1697
> URL: https://issues.apache.org/jira/browse/NUTCH-1697
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1697-trunk.patch
>
>






[jira] [Updated] (NUTCH-1707) DummyIndexingWriter

2014-01-20 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1707:
-

Attachment: NUTCH-1707-trunk.patch

Updated patch to use URL field.

> DummyIndexingWriter
> ---
>
> Key: NUTCH-1707
> URL: https://issues.apache.org/jira/browse/NUTCH-1707
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: NUTCH-1707-trunk.patch, NUTCH-1707-trunk.patch, 
> NUTCH-1707-trunk.patch
>
>
> DummyIndexingWriter that is supposed to write COMMAND\tID to a plain text 
> file. This plain text file can then be parsed by a unit test to check whether 
> IndexerMapReduce actually behaves as it should.


