[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-03-06 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922724#comment-13922724
 ] 

Tejas Patil commented on NUTCH-1325:


It would take me a few weeks before I can work on this one. The reason: I have 
recently left school and started working at a company, and there is some legal 
paperwork that I have to finish before I can work on open source projects (even 
if it's during my free time).

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>    Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [DISCUSS] Release Trunk

2014-02-13 Thread Tejas Patil
Thanks Lewis. G+ hangout sounds cool. Is this wiki page complete and
updated enough to start off?
http://wiki.apache.org/nutch/Release_HOWTO

Thanks,
Tejas


On Thu, Feb 13, 2014 at 12:23 AM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Folks,
> @Tejasp
>
> On Thu, Feb 13, 2014 at 6:30 AM,  wrote:
>
>> Just saw the commits since the 1.7 release. Apart from trivial bug fixes, we
>> have some significant patches since 1.7.
>> +1 for a new release. I would be happy to volunteer / help.
>>
>>
>>
> If you're game for learning the release manager role then I'm +1 to
> support you in that. We can do G+ hangout whilst you do it so that it all
> goes smoothly.
> If you change your mind just let me know and I'll push an RC today.
> Great work on trunk folks... lots of fixes ;)
> Lewis
>


Re: [DISCUSS] Release Trunk

2014-02-12 Thread Tejas Patil
Just saw the commits since the 1.7 release. Apart from trivial bug fixes, we
have some significant patches since 1.7.
+1 for a new release. I would be happy to volunteer / help.

Thanks,
Tejas



On Wed, Feb 12, 2014 at 7:33 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi guys,
>
> At least 2 of the issues that Seb and I had mentioned have now been
> committed. What about releasing 1.8 from trunk? If so, any volunteers?
>
> Julien
>
>
> On 2 December 2013 21:02, Sebastian Nagel wrote:
>
>> Hi,
>>
>> +1 to release soon (this year, or early next year)
>>
>> > and probably a few others but they could also be done later.
>> At least, these should be done before releasing:
>> NUTCH-1646 IndexerMapReduce to consider DB status
>> NUTCH-1413 Record response time
>>
>> Sebastian
>>
>> On 11/28/2013 05:49 PM, Julien Nioche wrote:
>> > Hi Lewis
>> >
>> > We've done quite a few things in 1.x since the previous release (e.g.
>> generic deduplication,
>> > removing indexer.solr package, etc...)  and the next 2.x release will
>> be after the changes to GORA
>> > have been made, tested and used on the Nutch side so that could be
>> quite a while.
>> >
>> > I am neutral as to whether we should do a 1.x release now. There are
>> some minor issues that we could
>> > do in 1.x before the next release, like:
>> > * https://issues.apache.org/jira/browse/NUTCH-1360
>> > * https://issues.apache.org/jira/browse/NUTCH-1676
>> > and probably a few others but they could also be done later.
>> >
>> > Let's hear what others think.
>> >
>> > Thanks
>> >
>> > Julien
>> >
>> >
>> >
>> >
>> > On 28 November 2013 16:34, Lewis John Mcgibbney <
>> lewis.mcgibb...@gmail.com
>> > > wrote:
>> >
>> > Hi Folks,
>> > Thread says it all.
>> > There are some hot tickets over in Gora right now so I think
>> holding off the next while for a
>> > 2.x release would be wise.
>> > I can spin the RC for trunk tonight/tomorrow/weekend if we get the
>> thumbs up.
>> > Ta
>> > Lewis
>> >
>> > --
>> > /Lewis/
>> >
>> >
>> >
>> >
>> > --
>> > *
>> > *Open Source Solutions for Text Engineering
>> >
>> > http://digitalpebble.blogspot.com/
>> > http://www.digitalpebble.com
>> > http://twitter.com/digitalpebble
>>
>>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>


[jira] [Resolved] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-02-09 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1721.


Resolution: Fixed

Committed to trunk (rev 1566255) and 2.x (rev 1566257)

> Upgrade to Crawler commons 0.3
> --
>
> Key: NUTCH-1721
> URL: https://issues.apache.org/jira/browse/NUTCH-1721
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.7, 2.2, 2.2.1
>    Reporter: Tejas Patil
>    Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-01-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887784#comment-13887784
 ] 

Tejas Patil commented on NUTCH-1721:


Attached patches, all test cases are passing.

> Upgrade to Crawler commons 0.3
> --
>
> Key: NUTCH-1721
> URL: https://issues.apache.org/jira/browse/NUTCH-1721
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.7, 2.2, 2.2.1
>    Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-01-31 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1721:
---

Attachment: NUTCH-1721-2.x.patch
NUTCH-1721-trunk.patch

> Upgrade to Crawler commons 0.3
> --
>
> Key: NUTCH-1721
> URL: https://issues.apache.org/jira/browse/NUTCH-1721
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.7, 2.2, 2.2.1
>    Reporter: Tejas Patil
>    Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-01-31 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1721:
--

 Summary: Upgrade to Crawler commons 0.3
 Key: NUTCH-1721
 URL: https://issues.apache.org/jira/browse/NUTCH-1721
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2.1, 2.2, 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 2.3, 1.8






--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887763#comment-13887763
 ] 

Tejas Patil commented on NUTCH-1465:


Re "filters and normalizers": +1.

Re "fetch intervals" and "reducer overwriting": I have never encountered bogus 
sitemaps but that was for a intranet crawl and it would be better to take care 
of that in this jira. Here is what I conclude from the discussion till now:
(1)  _fetch interval_: For old entries, don't use the value from sitemap. For 
new ones, use the value from sitemap provided 
(db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)
(2) _score_: Never use value from sitemap. For new ones, use scoring filters. 
Keep the value of old entries as it is.
(3) _modified time_: Always use the value from sitemap provided its not a date 
in future.

Did I get it right ?
 
Re "score": I missed that the jar is old. Would file a jira to upgrade CC to 
v0.3 in Nutch.
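
In code form, the rules above would look roughly like this. A sketch only: 
'minInterval', 'maxInterval', 'defaultInterval', 'defaultScore' and 
'scoringFilters' stand for config-derived fields of the reducer, and none of 
this is from an actual patch.

{code}
// 'old' is the existing crawldb entry (null if the url is new to the db);
// 'fromSitemap' is the datum built from the sitemap entry.
private CrawlDatum merge(Text url, CrawlDatum old, CrawlDatum fromSitemap) {
  CrawlDatum result = (old != null) ? old : fromSitemap;
  if (old == null) {
    // (1) fetch interval: accept the sitemap value only within bounds
    int interval = fromSitemap.getFetchInterval();
    if (interval < minInterval || interval > maxInterval) {
      result.setFetchInterval(defaultInterval);
    }
    // (2) score: never taken from the sitemap; scoring filters decide
    try {
      scoringFilters.injectedScore(url, result);
    } catch (ScoringFilterException e) {
      result.setScore(defaultScore);
    }
  }
  // old entries keep their interval and score as they are
  // (3) modified time: always from the sitemap, unless it lies in the future
  long modified = fromSitemap.getModifiedTime();
  if (modified > 0 && modified <= System.currentTimeMillis()) {
    result.setModifiedTime(modified);
  }
  return result;
}
{code}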

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886677#comment-13886677
 ] 

Tejas Patil commented on NUTCH-1465:


Interesting comments [~wastl-nagel].

Re "filters and normalizers" : By default I have kept those ON but can be 
disabled by using "-noFilter" and "-noNormalize".
Re "default content limits" and "fetch timeout": +1. Agree with you.
Re "Processing sitemap indexes fails" : +1. Nice catch.
Re "Fetch intervals of 1 second or 1 hour may cause troubles" : Currently, 
Injector allows users to provide a custom fetch interval with any value eg. 1 
sec. It makes sense not the correct it as user wants Nutch use that custom 
fetch interval. If we view sitemaps as custom seed list given by a content 
owner, then it would make sense to follow the intervals. But as you said that 
sitemaps can be wrongly set or outdated, the intervals might be incorrect. The 
question bolis down to: We are blindly accepting user's custom information in 
inject. Should we blindly assume that sitemaps are correct or not ? I have no 
strong opinion about either side of the argument. 

(PS : Default 'db.fetch.schedule.adaptive.min_interval' is 1 min so would allow 
1 hr as per db.fetch.schedule.adaptive.min_interval <= interval)

Re "SitemapReducer overwriting" : 
>> _"If a sitemap does not specify one of score, modified time, or fetch 
>> interval this values is set to zero. "_
Nope. See 
[SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java]

 (a) score : Crawler commons assigns a default score of 0.5 if there was none 
provided in sitemap. 
We can do this: If an old entry has score other than 0.5, it can be preserved 
else update. For new entry, use scoring plugins for score equal to 0.5, else 
preserve the same. 
Limitation: Its not possible to distinguish if the score of 0.5 is from sitemap 
or the default one if  was absent.
 (b) fetch interval : Crawler commons does NOT set fetch interval if there was 
none provided in sitemap. So we are sure that whatever value is used is coming 
from . Validation might be needed as per comments above.
 (c) modified time : Same as fetch interval, unless parsed from sitemap file, 
modified time is set to NULL. Only possible validation is to drop values 
greater than current time.
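
As a sketch of how (a)-(c) could translate into code (method names as in the 
SiteMapURL source linked above; 'defaultInterval' and the 'toSeconds' 
conversion from change frequency to seconds are hypothetical):

{code}
private CrawlDatum fromSiteMapURL(SiteMapURL smu) {
  CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED, defaultInterval);
  // (a) getPriority() defaults to 0.5 inside crawler commons, so a 0.5 here
  // is indistinguishable from "no priority given in the sitemap"
  datum.setScore((float) smu.getPriority());
  // (b) change frequency stays null unless it was actually parsed
  if (smu.getChangeFrequency() != null) {
    datum.setFetchInterval(toSeconds(smu.getChangeFrequency()));
  }
  // (c) last modified stays null unless parsed; drop dates in the future
  Date lastModified = smu.getLastModified();
  if (lastModified != null
      && lastModified.getTime() <= System.currentTimeMillis()) {
    datum.setModifiedTime(lastModified.getTime());
  }
  return datum;
}
{code}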

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1718) update description of property http.robots.agent

2014-01-29 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885650#comment-13885650
 ] 

Tejas Patil commented on NUTCH-1718:


Hi [~someuser77], yup. I am waiting for folks to comment on whether that 
addition is fine. If it is, I will go ahead and update the description of this 
jira.

> update description of property http.robots.agent
> 
>
> Key: NUTCH-1718
> URL: https://issues.apache.org/jira/browse/NUTCH-1718
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2, 2.2.1
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1718-trunk.v1.patch
>
>
> The description of property http.robots.agent in nutch-default.xml recommends 
> to add a '*' to the list of agent names. This will cause the same problem as 
> described in NUTCH-1715. The description should be updated. Also regarding 
> "order of precedence" which is dictated since NUTCH-1031 only by ordering of 
> user agents in robots.txt.
> {code:xml}
> <property>
>   <name>http.robots.agents</name>
>   <value>*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent

2014-01-28 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1718:
---

Attachment: NUTCH-1718-trunk.v1.patch

Thanks [~wastl-nagel] for bringing this up. I should have updated the 
documentation with NUTCH-1715 but lost track of it.

In addition to updating the documentation, I am proposing this: instead of 
making users put 'http.agent.name' as the first agent in 'http.robots.agents', 
make the program do that automatically. Users would then use 
'http.robots.agents' only to specify additional agents apart from 
'http.agent.name'. Here is a patch for the same.
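
A rough sketch of the intended behavior (illustrative, not the exact patch):

{code}
// Build the robots agent string with http.agent.name always first;
// http.robots.agents only contributes additional names.
String agentName = conf.get("http.agent.name");
StringBuilder agents = new StringBuilder(agentName);
for (String agent : conf.get("http.robots.agents", "").split(",")) {
  agent = agent.trim();
  // skip blanks, '*' (see NUTCH-1715) and duplicates of the primary name
  if (agent.length() > 0 && !agent.equals("*")
      && !agent.equalsIgnoreCase(agentName)) {
    agents.append(',').append(agent);
  }
}
String robotsAgents = agents.toString();
{code}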

> update description of property http.robots.agent
> 
>
> Key: NUTCH-1718
> URL: https://issues.apache.org/jira/browse/NUTCH-1718
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2, 2.2.1
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1718-trunk.v1.patch
>
>
> The description of property http.robots.agent in nutch-default.xml recommends 
> to add a '*' to the list of agent names. This will cause the same problem as 
> described in NUTCH-1715. The description should be updated. Also regarding 
> "order of precedence" which is dictated since NUTCH-1031 only by ordering of 
> user agents in robots.txt.
> {code:xml}
> <property>
>   <name>http.robots.agents</name>
>   <value>*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-28 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v5.patch

Adding a new patch 'v5' with the changes below:
1. Added the Apache license header as per the review comment by [~wastl-nagel]
2. Added counters in the log output as per the review comment by [~wastl-nagel]
3. Implemented the change suggested by [~wastl-nagel] for 'isHost' and 
'filterNormalize'. I could do more refactoring to make it cleaner.
4. Added a new parameter "-noStrict" to control the strict checking done by 
the sitemap parser 

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883204#comment-13883204
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],
Thanks a lot for your comments. The first two were straightforward and I agree 
with them.

Re "hacky way": For hosts from the HostDb, we don't know which protocol they 
belong to. In the code I was checking whether http:// is a match and, if that 
was a bad guess, trying https://. I didn't handle the ftp:// and file:/ 
schemes. By "hacky" I meant this approach of trial and error until a suitable 
match is found and we create a homepage url for the host. I have thought about 
your comment and will have a better (yet still hacky) way in the coming patch.
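
For illustration, the trial-and-error approach amounts to something like the 
hypothetical helper below ('normalizers' and 'filters' stand for the usual 
URLNormalizers/URLFilters instances; the filter-based match test is only one 
possible reading of "a match"):

{code}
// Guess a homepage url for a bare host from the HostDb by trying schemes
// in order until one survives normalization and filtering.
private String guessHomepage(String host) {
  for (String scheme : new String[] { "http://", "https://" }) {
    try {
      String homepage = normalizers.normalize(scheme + host + "/",
          URLNormalizers.SCOPE_DEFAULT);
      if (homepage != null && filters.filter(homepage) != null) {
        return homepage;
      }
    } catch (Exception e) {
      // bad guess; fall through and try the next scheme
    }
  }
  return null; // ftp:// and file:/ hosts are not handled
}
{code}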

Re "concurrency": I had thought of this and had searched over internet for 
internals of MultithreadedMapper. All I could get is that it has an internal 
thread pool and each input record to handed over to a thread in this pool to 
run map() over it. I wrote this code to check if thread safety was ensured in 
MultithreadedMapper:

{noformat}
  private static class SitemapMapper extends Mapper<Text, Writable, Text, Writable> {
    private String myurl = null;

    public void map(Text key, Writable value, Context context) throws
        IOException, InterruptedException {
      if (value instanceof Text) {
        String url = key.toString();
        if (foo(url).compareTo(url) != 0) {
          LOG.warn("Race condition found !!!");
        }
      }
    }

    // Stash the url in an instance field, give other threads a chance to
    // overwrite it, then read it back; a changed value means a race.
    private String foo(String url) {
      myurl = url;
      if (Thread.currentThread().getId() % 2 == 1) {
        try {
          Thread.sleep(1);
        } catch (InterruptedException e) {
          LOG.warn(e.getMessage());
        }
      }
      return myurl;
    }
  }
{noformat}

I ran it multiple times with the thread count set to 10, 100, 1000 and 2000 
but never hit the race condition in the code. Is the code snippet above a good 
way to reveal any race condition? It won't be a formal conclusion, more of an 
experimental one. How do I get a concrete conclusion on whether 
MultithreadedMapper is thread safe or not?
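
A sketch of a more direct check, assuming instance fields are the only shared 
state: record which threads ever touch this particular Mapper instance. If 
MultithreadedMapper hands each worker thread its own Mapper instance (which 
would explain why the snippet above never fires), the set below never grows 
past one entry:

{noformat}
  // Sketch: detect whether a single Mapper instance is shared across threads.
  private final Set<Long> seenThreads =
      Collections.synchronizedSet(new HashSet<Long>());

  public void map(Text key, Writable value, Context context)
      throws IOException, InterruptedException {
    seenThreads.add(Thread.currentThread().getId());
    if (seenThreads.size() > 1) {
      LOG.warn("Mapper instance shared by threads: " + seenThreads);
    }
  }
{noformat}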

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

2014-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882771#comment-13882771
 ] 

Tejas Patil commented on NUTCH-1084:


The issue is reproducible on current trunk. Attaching a test segment:
https://issues.apache.org/jira/secure/attachment/12625275/20140126210858.tgz

The workaround suggested by [~markus17] in the comment above works correctly.

> ReadDB url throws exception
> ---
>
> Key: NUTCH-1084
> URL: https://issues.apache.org/jira/browse/NUTCH-1084
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop version
> 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
> The first problem can be remedied by not allowing the injector or updater to 
> write the _SUCCESS file. Until now that's the solution implemented for 
> similar issues. I've not been successful as to make the Hadoop readers simply 
> skip the file.
> The second issue seems a bit strange and did not happen on a local check out. 
> I'm not yet sure whether this is a Hadoop issue or something being corrupt in 
> the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: 
> org.apache.nutch.protocol.ProtocolStatus because 
> org.apache.nutch.protocol.ProtocolStatus
> at 
> org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
> at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
> at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
> at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
> at 
> org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
> at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
> at 
> org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
> at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882770#comment-13882770
 ] 

Tejas Patil commented on NUTCH-1692:


Hi [~markus17], 
I didn't know about NUTCH-1084 until now; after going through it, I totally 
agree that the exception I faced was due to that issue. With that workaround 
and the patch for this jira, the NPE issue seems fixed. +1 for commit.

> SegmentReader broken in distributed mode
> 
>
> Key: NUTCH-1692
> URL: https://issues.apache.org/jira/browse/NUTCH-1692
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch
>
>
> SegmentReader -list option ignores the -no* options, causing the following 
> exception in distributed mode:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
> at java.util.Arrays.sort(Arrays.java:472)
> at 
> org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-26 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v4.patch

Attaching the v4 patch with suggestions #1 and #2 from [~lewismc].

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>    Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-26 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1692:
---

Attachment: 20140126210858.tgz

Attaching the test segment (20140126210858.tgz)

> SegmentReader broken in distributed mode
> 
>
> Key: NUTCH-1692
> URL: https://issues.apache.org/jira/browse/NUTCH-1692
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch
>
>
> SegmentReader -list option ignores the -no* options, causing the following 
> exception in distributed mode:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
> at java.util.Arrays.sort(Arrays.java:472)
> at 
> org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-26 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882348#comment-13882348
 ] 

Tejas Patil commented on NUTCH-1692:


Hi [~markus17],
I tried out the patch on the latest trunk checkout and it ran fine in local 
mode. In deploy mode, I encountered this:
{noformat}
$ bin/nutch readseg -list 20140126210858/ -nocontent -nogenerate
14/01/26 22:26:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/01/26 22:26:16 INFO zlib.ZlibFactory: Successfully loaded & initialized 
native-zlib library
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.io.IOException: can't find class: 
org.apache.nutch.protocol.ProtocolStatus because 
org.apache.nutch.protocol.ProtocolStatus
at 
org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:280)
at 
org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
at 
org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:485)
at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:597)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{noformat}

> SegmentReader broken in distributed mode
> 
>
> Key: NUTCH-1692
> URL: https://issues.apache.org/jira/browse/NUTCH-1692
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: NUTCH-1692-trunk.patch
>
>
> SegmentReader -list option ignores the -no* options, causing the following 
> exception in distributed mode:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
> at java.util.Arrays.sort(Arrays.java:472)
> at 
> org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1715.


Resolution: Fixed

The change was verified over the nutch-user mailing list. Committed to trunk 
(revision 1561087) and 2.x (revision 1561088).

> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1715.2.x.patch, NUTCH-1715.trunk.patch
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard (i.e. *) at the end. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in the robots file with (User-agent: *) if that comes before any other matching 
> rule, thus resulting in an allowed url being robots denied. 
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1715:
---

Attachment: NUTCH-1715.2.x.patch
NUTCH-1715.trunk.patch

> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1715.2.x.patch, NUTCH-1715.trunk.patch
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard (i.e. *) at the end. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in the robots file with (User-agent: *) if that comes before any other matching 
> rule, thus resulting in an allowed url being robots denied. 
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1715:
---

Description: 
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. Along 
with that it appends a wildcard (i.e. *) at the end. This is sent to 
crawler commons while parsing the rules. The wildcard gets matched first in 
the robots file with (User-agent: *) if that comes before any other matching 
rule, thus resulting in an allowed url being robots denied. 

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E

  was:
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. Along 
with that it appends a wildcard '*' at the end. This is sent to crawler 
commons while parsing the rules. The wildcard '*' added at the end gets matched 
with the first rule in the robots file, and thus results in the url being robots 
denied while the robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E


> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard (i.e. *) at the end. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in the robots file with (User-agent: *) if that comes before any other matching 
> rule, thus resulting in an allowed url being robots denied. 
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (NUTCH-1716) RobotRulesParser adds extra '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1716.


Resolution: Duplicate

Accidentally duplicated NUTCH-1715

> RobotRulesParser adds extra '*' to the robots name
> --
>
> Key: NUTCH-1716
> URL: https://issues.apache.org/jira/browse/NUTCH-1716
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard (i.e. *) at the end. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in the robots file with (User-agent: *) if that comes before any other matching 
> rule, thus resulting in an allowed url being robots denied. 
> This bug was reported by @Markus Jelsma. The discussion over nutch-user can 
> be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
>  



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1715:
---

Description: 
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. Along 
with that it appends a wildcard '*' at the end. This is sent to crawler 
commons while parsing the rules. The wildcard '*' added at the end gets matched 
with the first rule in the robots file, and thus results in the url being robots 
denied while the robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E

  was:
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. Along 
with that it appends a wildcard (*) at the end. This is sent to crawler 
commons while parsing the rules. The wildcard (*) added at the end gets matched 
with the first rule in the robots file, and thus results in the url being robots 
denied while the robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E


> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard '*' at the end. This is sent to 
> crawler commons while parsing the rules. The wildcard '*' added at the end 
> gets matched with the first rule in the robots file and thus results in the url 
> being robots denied while the robots.txt actually allows it.
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (NUTCH-1716) RobotRulesParser adds extra '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1716:
--

 Summary: RobotRulesParser adds extra '*' to the robots name
 Key: NUTCH-1716
 URL: https://issues.apache.org/jira/browse/NUTCH-1716
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.2.1, 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 2.3, 1.8


In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. Along 
with that it appends a wildcard (i.e. *) at the end. This is sent to 
crawler commons while parsing the rules. The wildcard gets matched first in 
the robots file with (User-agent: *) if that comes before any other matching 
rule, thus resulting in an allowed url being robots denied. 

This bug was reported by @Markus Jelsma. The discussion over nutch-user can be 
found here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1715:
--

 Summary: RobotRulesParser adds additional '*' to the robots name
 Key: NUTCH-1715
 URL: https://issues.apache.org/jira/browse/NUTCH-1715
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.2.1, 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 2.3, 1.8


In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. Along 
with that it appends a wildcard (*) at the end. This is sent to crawler 
commons while parsing the rules. The wildcard (*) added at the end gets matched 
with the first rule in the robots file, and thus results in the url being robots 
denied while the robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
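
For illustration (the agent name 'mybot' is made up): with http.agent.name set 
to 'mybot', the combined string sent to crawler commons becomes "mybot,*". 
Against a robots.txt like the one below, the leading wildcard block matches 
first and every url is denied, although the 'mybot' block would allow them:

{noformat}
User-agent: *
Disallow: /

User-agent: mybot
Disallow:
{noformat}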



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1676) Add rudimentary SSL support to protocol-http

2014-01-24 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881143#comment-13881143
 ] 

Tejas Patil commented on NUTCH-1676:


Hi [~markus17],
I tried out the patch with a couple of https urls and it works correctly. A 
few comments on the patch:

(1) In src/plugin/protocol-http/plugin.xml, the same stuff is repeated twice. 
Not sure if that was accidental or meant to be different.

{code:title=plugin.xml|borderStyle=solid}
+  
+  
+   
+  

+  
+   
+  
{code}

(2) In HttpBase.java: the values in this line run out to column 2070, which 
makes the list painful to read. Is there any way to avoid that (maybe using a 
String array)?

{code:title=HttpBase.java|borderStyle=solid}
conf.getStrings("http.tls.supported.cipher.suites", 
"TLS_ECDHE_ECDSA_WITH_AES_256_CBC
{code}
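
For instance (a sketch; suite names abbreviated), Configuration.getStrings 
accepts varargs defaults, so the list could live in a String array with one 
suite per line:

{code}
private static final String[] DEFAULT_CIPHER_SUITES = {
    "TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384",
    "TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384",
    // ... remaining suites, one per line
};

String[] suites = conf.getStrings("http.tls.supported.cipher.suites",
    DEFAULT_CIPHER_SUITES);
{code}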

(3) The class description is empty after the deletion of the author tag. Can 
you please fill that in?

{code:title=HttpBase.java|borderStyle=solid}
/**
 */
public abstract class HttpBase implements Protocol {
{code}

> Add rudimentary SSL support to protocol-http
> 
>
> Key: NUTCH-1676
> URL: https://issues.apache.org/jira/browse/NUTCH-1676
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.7
>Reporter: Julien Nioche
> Fix For: 1.8
>
> Attachments: NUTCH-1676-2x.patch, NUTCH-1676.patch, NUTCH-1676.patch, 
> NUTCH-1676.patch, NUTCH-1676.patch
>
>
> Adding https support to our http protocol would be a good thing even if it 
> does not handle the security. This would save us from having to use the 
> http-client plugin which is buggy in its current form. 
> Patch generated from 
> https://github.com/Aloisius/nutch/commit/d3e15a1db0eb323ccdcf5ad69a3d3a01ec65762c#commitcomment-4720772
> Needs testing...



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Fwd: [jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-24 Thread Tejas Patil
Hi Lewis,
I won't be surprised if some user out there gets a cake for this jira ;) ...
just like someone did over a MySQL bug (
http://www.youtube.com/watch?v=oAiVsbXVP6k)

Cheers !!

-- Forwarded message --
From: Lewis John McGibbney (JIRA) 
Date: Fri, Jan 24, 2014 at 7:01 PM
Subject: [jira] [Commented] (NUTCH-356) Plugin repository cache can lead to
memory leak
To: tej...@apache.org



[
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880956#comment-13880956]

Lewis John McGibbney commented on NUTCH-356:


Indeed Markus. This one is a blast from the past for sure. I don't know if I
was born when Nutch 0.8 was trunk ;)

> Plugin repository cache can lead to memory leak
> ---
>
> Key: NUTCH-356
> URL: https://issues.apache.org/jira/browse/NUTCH-356
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Enrico Triolo
>Assignee: Markus Jelsma
> Fix For: 2.3, 1.8
>
> Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java,
ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch,
cache_classes.patch
>
>
> While I was trying to solve a problem I reported a while ago (see
Nutch-314), I found out that actually the problem was related to the plugin
cache used in class PluginRepository.java.
> As  I said in Nutch-314, I think I somehow 'force' the way nutch is meant
to work, since I need to frequently submit new urls and append their
contents to the index; I don't (and I can't) have an urls.txt file with all
urls I'm going to fetch, but I recreate it each time a new url is submitted.
> Thus, I think in the majority of cases you won't have problems using
nutch as-is, since the problem I found occurs only if nutch is used in a
way similar to the one I use.
> To simplify your test I'm attaching a class that performs something
similar to what I need. It fetches and index some sample urls; to avoid
webmasters complaints I left the sample urls list empty, so you should
modify the source code and add some urls.
> Then you only have to run it and watch your memory consumption with top.
In my experience I get an OutOfMemoryException after a couple of minutes,
but it clearly depends on your heap settings and on the plugins you are
using (I'm using
'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
> The problem is bound to the PluginRepository 'singleton' instance, since
it never gets released. It seems that some class maintains a reference to it
and this class is never released since it is cached somewhere in the
configuration.
> So I modified the PluginRepository's 'get' method so that it never uses
the cache and always returns a new instance (you can find the patch in
attachment). This way the memory consumption is always stable and I get no
OOM anymore.
> Clearly this is not the solution, since I guess there are many
performance issues involved, but for the moment it works.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Right way to run crawl script in deploy mode

2014-01-23 Thread Tejas Patil
Correction: the subject of this message should have read:
"Right way to run crawl script in deploy mode"

~tejas

On Wed, Jan 22, 2014 at 7:56 PM, Tejas Patil wrote:

> Hi nutch-dev,
>
> I was assuming that the commands to run the bin/crawl script in both local
> and deploy mode are the same.
> ie. from $NUTCH_HOME/runtime/local (or runtime/deploy),  use
> > bin/crawl
>
> It turns out that in deploy mode, this does not obtain the segment
> location from HDFS and runs into problems. The reason being this code
> snippet in the crawl script: it tries to locate the job file in the parent
> directory and fails (note that I am running from runtime/deploy):
>
> mode=local
> if [ -f ../*nutch-*.job ]; then
> mode=distributed
> fi
>
> When run from runtime/deploy/bin, it works properly.
> Shouldn't the command be consistent with that of local mode?
>
> Thanks,
> Tejas
>
>


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880295#comment-13880295
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~lewismc],
+1 for the first two suggestions. For #3: I skimmed through the methods inside 
URLUtil.java and noticed nothing that I could use in the Sitemap code you 
pointed to. Can you please confirm?

A big thanks mate for trying out the feature. Hopefully we get this into 1.8 
release.
Cheers !!


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Fix Version/s: (was: 1.9)
   1.8

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>    Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880288#comment-13880288
 ] 

Tejas Patil commented on NUTCH-1712:


The performance gains from this patch won't be phenomenal for a small seeds 
file without metadata and a large crawldb. The only savings with this patch 
are the time spent on:
1. dumping the output of the first job (i.e. datum objects for the seed urls)
2. reading this output as input for the next job
3. job launch and cleanup.
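
For reference, the single-job wiring sketched against the new mapreduce API 
(mapper and reducer names are illustrative, not the patch's):

{code}
// Seed urls and existing crawldb entries are read side by side.
Job job = new Job(getConf(), "inject " + urlDir);
MultipleInputs.addInputPath(job, urlDir,
    TextInputFormat.class, UrlMapper.class);             // parses seed lines
MultipleInputs.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME),
    SequenceFileInputFormat.class, CrawlDbMapper.class); // passes datums through
job.setReducerClass(InjectReducer.class);                // merges old and new
job.waitForCompletion(true);
{code}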

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Resolved] (NUTCH-1164) Write JUnit tests for protocol-http

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1164.


Resolution: Fixed

The patch is better now and all tests pass. It needed a little modification 
(you can't check string equality using the equals sign) and some refactoring. 
Committed to 2.x (rev 1560786). 

Thanks a lot for your contribution [~Sertac Turkel] !!

> Write JUnit tests for protocol-http
> ---
>
> Key: NUTCH-1164
> URL: https://issues.apache.org/jira/browse/NUTCH-1164
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.4
>
> Attachments: NUTCH-1164.patch, 
> TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-23 Thread Tejas Patil
On Thu, Jan 23, 2014 at 1:36 PM, d_k  wrote:

> My main concern with the Nutch2Tutorial was that it didn't stand by
> itself. As a newcomer to nutch I treated the NutchTutorial (for 1.x) with
> suspicion because I didn't know what is relevant for Nutch 2 and what isn't.
> And the Nutch2Tutorial tutorial alone is not enough to get you going.
>
> I think this can be addressed by creating a single page or perhaps several
> pages that together cover everything you need to perform a basic crawl:
>
> [*] Configuring the data store
> [**] HBase
> [**] Cassandra
>
> [*] General nutch 2 client configuration that is relevant to any store

[1] : http://wiki.apache.org/nutch/Nutch2Tutorial
[2] : http://wiki.apache.org/nutch/Nutch2Cassandra


> [**] MySQL
>

MySQL is no longer supported in Gora and newer Nutch versions, so there is no 
wiki page for it.

>
> [*] Crawling
> [**] Crawling step by step (running each step seperatly)
> [**] Performing a full crawl
> [***] using the crawl script
> [***] using the job file
>

The commands are the same as in 1.x. The only change needed would be to the
arguments, which can be traced by looking at the command usage.

The notion of having everything in one place would make things neat. AFAIK,
the reason this was not done before was maintenance overhead. If you want to
create such a page, feel free to add it. You would need to create a login for
the nutch wiki. If there are issues with that, just share the document in
text format and I will add it to the nutch wiki.

~tejas

>
>
>
>
> On Wed, Jan 22, 2014 at 1:53 PM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
>> Thanks Tejas!
>>
>>
>> On 22 January 2014 11:51, Tejas Patil  wrote:
>>
>>> Moved the old nutchhadooptutorial page from Nutch wiki "Front page" to
>>> "Archive and Legacy".
>>>
>>> ~tejas
>>>
>>>
>>> On Wed, Jan 22, 2014 at 5:09 PM, Tejas Patil 
>>> wrote:
>>>
>>>> Thanks *Julien* for pointing me to new "NutchHadoopSingleNodeTutorial"
>>>> wiki page [0]. I would soon remove the old nutchhadooptutorial page
>>>> from wiki.
>>>>
>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
>>>>
>>>> *@d_k*, there are already tutorials for running Nutch 2.x. See [1] and
>>>> [2]. Those are not as extensive as the tutorial for 1.x [3] but carry the
>>>> steps which are different for 2.x. The remaining steps after datastore setup are
>>>> similar - the only difference being in the command params which can be
>>>> figured out from the usage and so they were not duplicated in those 2.x
>>>> tutorials to avoid maintenance overhead. Do you think that the 2.x
>>>> tutorials are inadequate in some regards?
>>>>
>>>> [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
>>>> [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
>>>> [3] : http://wiki.apache.org/nutch/NutchTutorial
>>>>
>>>> Thanks,
>>>> Tejas
>>>>
>>>>
>>>> On Wed, Jan 22, 2014 at 2:47 AM, d_k  wrote:
>>>>
>>>>> Actually what I would like to see is a Nutch 2.x tutorial at the same
>>>>> level of detail as the
>>>>> http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>> What is the process of contributing to that wiki page?
>>>>>
>>>>>
>>>>> On Tue, Jan 21, 2014 at 9:33 PM, Julien Nioche <
>>>>> lists.digitalpeb...@gmail.com> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> The whole thing has been replaced with
>>>>>>  
>>>>>> http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial<http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial>which
>>>>>>  does exactly what you described. +1 to remove the old
>>>>>> nutchhadooptutorial page
>>>>>>
>>>>>> J.
>>>>>>
>>>>>>
>>>>>> On 21 January 2014 17:44, Tejas Patil wrote:
>>>>>>
>>>>>>> Hi nutch-dev,
>>>>>>>
>>>>>>> I was looking at [0] and realized that with the massive number of
>>>>>>> Hadoop setup tutorials out there on internet, we need not repeat the 
>>>>>>> same
>>>>>>> on nutch wiki page and instead assume that user has already done Hadoop
>>>>>>> setup. For convenience, we could direct users to the Hadoop wiki page
>>>>>>> which has Hadoop setup details.

[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1712:
---

Attachment: NUTCH-1712-trunk.v1.patch

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>    Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1712:
---

Description: 
Currently Injector creates two mapreduce jobs:
1. sort job: get the urls from seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both crawldb and output of sort job. 
Merge and emit final CrawlDatum objects.

Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from 
seeds file simultaneously and perform inject in a single map-reduce job.

Also, here are additional things covered with this jira:
1. Pushed filtering and normalization above metadata extraction so that the 
unwanted records are ruled out quickly.
2. Migrated to new mapreduce API
3. Improved documentation 
4. New junits with better coverage

Relevant discussion over nutch-dev can be found here:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E

  was:
Currently Injector creates two mapreduce jobs:
1. sort job: get the urls from seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both crawldb and output of sort job. 
Merge and emit final CrawlDatum objects.

Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from 
seeds file simultaneously and perform inject in a single map-reduce job.

Also, there are a few other things addressed in this patch:
1. Pushed filtering and normalization above metadata extraction so that the 
unwanted records are ruled out quickly.
2. Migrated to new mapreduce API
3. Improved documentation 
4. New junits with better coverage

Relevant discussion over nutch-dev can be found here:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E


> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>    Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Created] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1712:
--

 Summary: Use MultipleInputs in Injector to make it a single 
mapreduce job
 Key: NUTCH-1712
 URL: https://issues.apache.org/jira/browse/NUTCH-1712
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 1.8


Currently Injector creates two mapreduce jobs:
1. sort job: get the urls from seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both crawldb and output of sort job. 
Merge and emit final CrawlDatum objects.

Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from 
seeds file simultaneously and perform inject in a single map-reduce job (a 
minimal sketch follows at the end of this description).

Also, there are a few other things addressed in this patch:
1. Pushed filtering and normalization above metadata extraction so that the 
unwanted records are ruled out quickly.
2. Migrated to new mapreduce API
3. Improved documentation 
4. New junits with better coverage

Relevant discussion over nutch-dev can be found here:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E
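
For illustration, here is a minimal, self-contained sketch of the single-job
wiring described above (hypothetical class and member names; the attached patch
is the authoritative implementation):
{noformat}
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class SingleJobInjectSketch {

  /** Seeds file: one url per line, emitted as a freshly injected datum. */
  public static class SeedMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, CrawlDatum> {
    public void map(LongWritable key, Text line,
        OutputCollector<Text, CrawlDatum> out, Reporter rep) throws IOException {
      String url = line.toString().trim();  // real code also filters/normalizes
      if (url.length() > 0 && !url.startsWith("#"))
        out.collect(new Text(url),
            new CrawlDatum(CrawlDatum.STATUS_INJECTED, 2592000));
    }
  }

  /** Existing crawldb entries pass straight through. */
  public static class CrawlDbMapper extends MapReduceBase
      implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {
    public void map(Text url, CrawlDatum datum,
        OutputCollector<Text, CrawlDatum> out, Reporter rep) throws IOException {
      out.collect(url, datum);
    }
  }

  /** An existing datum wins over a newly injected one for the same url. */
  public static class MergeReducer extends MapReduceBase
      implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {
    public void reduce(Text url, Iterator<CrawlDatum> values,
        OutputCollector<Text, CrawlDatum> out, Reporter rep) throws IOException {
      CrawlDatum existing = null, injected = null;
      while (values.hasNext()) {
        CrawlDatum copy = new CrawlDatum();
        copy.set(values.next());  // Hadoop reuses instances, so copy first
        if (copy.getStatus() == CrawlDatum.STATUS_INJECTED) injected = copy;
        else existing = copy;
      }
      if (existing == null)
        injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
      out.collect(url, existing != null ? existing : injected);
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(NutchConfiguration.create());
    job.setJobName("inject-sketch");
    // two inputs, one job: existing crawldb (SequenceFiles) and seeds (text)
    MultipleInputs.addInputPath(job, new Path(args[0], "current"),
        SequenceFileInputFormat.class, CrawlDbMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        TextInputFormat.class, SeedMapper.class);
    job.setReducerClass(MergeReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0], "new"));
    JobClient.runJob(job);
  }
}
{noformat}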



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Right way to run crawl script in deploy mode

2014-01-22 Thread Tejas Patil
Hi nutch-dev,

I was assuming that the commands to run the bin/crawl script in both local
and deploy mode are the same.
i.e. from $NUTCH_HOME/runtime/local (or runtime/deploy), use
> bin/crawl

It turns out that in deploy mode, this does not obtain the segment location
from HDFS and runs into problems. The reason is this code snippet in the
crawl script: it tries to locate the job file in the parent directory and
fails (note that I am running from runtime/deploy):

mode=local
# looks for the job file relative to the current working directory
if [ -f ../*nutch-*.job ]; then
mode=distributed
fi

When run from runtime/deploy/bin, it works properly.
Shouldn't the command be consistent with that of local mode?

Thanks,
Tejas


[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v3.patch

Now that HostDb (NUTCH-1325) is in trunk, updated the patch (v3). 
Also,
- included job counters
- more documentation
- added sitemap references in log4j.properties and bin/nutch script.

For usage, see https://wiki.apache.org/nutch/SitemapFeature 

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>    Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-01-22 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878623#comment-13878623
 ] 

Tejas Patil commented on NUTCH-1325:


Hi [~markus17], 
Thanks for the correction. This feature would not have existed without you in the 
first place. Apart from being a good addition to Nutch, HostDb has also helped 
in getting a simple design for the Sitemap feature (NUTCH-1465). 

Cheers !!! 

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>    Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1164) Write JUnit tests for protocol-http

2014-01-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1164:
---

Attachment: TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt

Hi [~Sertac Turkel],
I tried out your patch and encountered a test case failure:
{noformat}
test:
 [echo] Testing plugin: protocol-http
[junit] Running org.apache.nutch.protocol.http.TestProtocolHttp
[junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 1.244 sec
[junit] Test org.apache.nutch.protocol.http.TestProtocolHttp FAILED
{noformat}

I have attached the test case failure log for reference.

> Write JUnit tests for protocol-http
> ---
>
> Key: NUTCH-1164
> URL: https://issues.apache.org/jira/browse/NUTCH-1164
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.4
>
> Attachments: NUTCH-1158.patch, 
> TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-22 Thread Tejas Patil
Moved the old nutchhadooptutorial page from Nutch wiki "Front page" to
"Archive and Legacy".

~tejas


On Wed, Jan 22, 2014 at 5:09 PM, Tejas Patil wrote:

> Thanks *Julien* for pointing me to new "NutchHadoopSingleNodeTutorial"
> wiki page [0]. I would soon remove the old nutchhadooptutorial page from
> wiki.
>
> [0] : http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
>
> *@d_k*, there are already tutorials for running Nutch 2.x. See [1] and
> [2]. Those are not as extensive as the tutorial for 1.x [3] but carry the
> steps which are different for 2.x. The remaining steps after datastore setup are
> similar - the only difference being in the command params which can be
> figured out from the usage and so they were not duplicated in those 2.x
> tutorials to avoid maintenance overhead. Do you think that the 2.x
> tutorials are inadequate in some regards?
>
> [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
> [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
> [3] : http://wiki.apache.org/nutch/NutchTutorial
>
> Thanks,
> Tejas
>
>
> On Wed, Jan 22, 2014 at 2:47 AM, d_k  wrote:
>
>> Actually what I would like to see is a Nutch 2.x tutorial at the same
>> level of detail as the http://wiki.apache.org/nutch/NutchHadoopTutorial
>> What is the process of contributing to that wiki page?
>>
>>
>> On Tue, Jan 21, 2014 at 9:33 PM, Julien Nioche <
>> lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> The whole thing has been replaced with
>>>  
>>> http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial<http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial>which
>>>  does exactly what you described. +1 to remove the old
>>> nutchhadooptutorial page
>>>
>>> J.
>>>
>>>
>>> On 21 January 2014 17:44, Tejas Patil  wrote:
>>>
>>>> Hi nutch-dev,
>>>>
>>>> I was looking at [0] and realized that with the massive number of
>>>> Hadoop setup tutorials out there on internet, we need not repeat the same
>>>> on nutch wiki page and instead assume that user has already done Hadoop
>>>> setup. For convenience, we could direct users to the Hadoop wiki page which
>>>> has Hadoop setup details.
>>>> Plus, I propose following:
>>>>
>>>> - Section "Downloading Hadoop and Nutch" : Remove the Hadoop portions
>>>> and let the Nutch stuff stay.
>>>> - Section "Setting Up The Deployment Architecture" must be removed.
>>>> - Section "Deploy Nutch to Single Machine" and "Deploy Nutch to
>>>> Multiple Machines" can be merged together.
>>>> - Section "Performing a Nutch Crawl", "Testing the Crawl" and
>>>> "Performing a Search" must be merged, its contents must be updated.
>>>> - Section "Rsyncing Code to Slaves" and "Updates" can be completely
>>>> removed.
>>>>
>>>> Any comments ?
>>>>
>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>
>>>> Thanks,
>>>> Tejas
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>
>>
>


Re: Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-22 Thread Tejas Patil
Thanks *Julien* for pointing me to new "NutchHadoopSingleNodeTutorial" wiki
page [0]. I would soon remove the old nutchhadooptutorial page from wiki.

[0] : http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial

*@d_k*, there are already tutorials for running Nutch 2.x. See [1] and [2].
Those are not as extensive as the tutorial for 1.x [3] but carry the steps
which are different for 2.x. The remaining steps after datastore setup are
similar - the only difference being in the command params which can be
figured out from the usage and so they were not duplicated in those 2.x
tutorials to avoid maintenance overhead. Do you think that the 2.x
tutorials are inadequate in some regards?

[1] : http://wiki.apache.org/nutch/Nutch2Tutorial
[2] : http://wiki.apache.org/nutch/Nutch2Cassandra
[3] : http://wiki.apache.org/nutch/NutchTutorial

Thanks,
Tejas

On Wed, Jan 22, 2014 at 2:47 AM, d_k  wrote:

> Actually what I would like to see is a Nutch 2.x tutorial at the same
> level of detail as the http://wiki.apache.org/nutch/NutchHadoopTutorial
> What is the process of contributing to that wiki page?
>
>
> On Tue, Jan 21, 2014 at 9:33 PM, Julien Nioche <
> lists.digitalpeb...@gmail.com> wrote:
>
>> Hi
>>
>> The whole thing has been replaced with
>>  
>> http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial<http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial>which
>>  does exactly what you described. +1 to remove the old
>> nutchhadooptutorial page
>>
>> J.
>>
>>
>> On 21 January 2014 17:44, Tejas Patil  wrote:
>>
>>> Hi nutch-dev,
>>>
>>> I was looking at [0] and realized that with the massive number of Hadoop
>>> setup tutorials out there on internet, we need not repeat the same on nutch
>>> wiki page and instead assume that user has already done Hadoop setup. For
>>> convenience, we could direct users to the Hadoop wiki page which has Hadoop
>>> setup details.
>>> Plus, I propose following:
>>>
>>> - Section "Downloading Hadoop and Nutch" : Remove the Hadoop portions
>>> and let the Nutch stuff stay.
>>> - Section "Setting Up The Deployment Architecture" must be removed.
>>> - Section "Deploy Nutch to Single Machine" and "Deploy Nutch to Multiple
>>> Machines" can be merged together.
>>> - Section "Performing a Nutch Crawl", "Testing the Crawl" and
>>> "Performing a Search" must be merged, its contents must be updated.
>>> - Section "Rsyncing Code to Slaves" and "Updates" can be completely
>>> removed.
>>>
>>> Any comments ?
>>>
>>> [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>
>>> Thanks,
>>> Tejas
>>>
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>


[jira] [Resolved] (NUTCH-1325) HostDB for Nutch

2014-01-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1325.


   Resolution: Fixed
Fix Version/s: (was: 1.9)
   1.8

Thanks [~markus17] for the heads up :) I have committed the patch to trunk (rev 
1560316).

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>    Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil reassigned NUTCH-1325:
--

Assignee: Tejas Patil  (was: Markus Jelsma)

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>    Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Request for reviewing HostDb and Sitemap features

2014-01-21 Thread Tejas Patil
Hi,

Is anyone interested in reviewing or trying out the patches for these new
features? I have recently updated [0] and [1] and would like to hear
comments on the same.

[0] : https://issues.apache.org/jira/browse/NUTCH-1325
[1] : https://issues.apache.org/jira/browse/NUTCH-1465

Thanks,
Tejas


[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v2.patch

Attaching NUTCH-1465-trunk.v2.patch, which has an implementation of *option (B)*: 
_Have a separate job for the sitemap stuff and merge its output into the crawldb_

+I have covered both the cases in this patch:+
1. users with a targeted crawl who want to get sitemaps injected from a list of 
sitemap urls - the use case which [~wastl-nagel] had pointed out.
2. large open web crawls where users cannot afford to generate sitemap seeds 
for all the hosts and want nutch to inject sitemaps automatically. 

+To try out this patch:+
1. Apply the patch for HostDb feature 
(https://issues.apache.org/jira/secure/attachment/12624178/NUTCH-1325-trunk-v4.patch)
2. Apply this patch (NUTCH-1465-trunk.v2.patch)
3. (optional) Add this to conf/log4j.properties at line 11:
{noformat}
log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
{noformat}
4. Run using 
{noformat}
bin/nutch org.apache.nutch.util.SitemapProcessor
{noformat}

I have started working on a *wiki page* describing this feature: 
https://wiki.apache.org/nutch/SitemapFeature 

Any suggestion and comments are welcome.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>    Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: NUTCH-1325-trunk-v4.patch

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: (was: NUTCH-1325-trunk-v4.patch)

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: NUTCH-1325-trunk-v4.patch

Attaching NUTCH-1325-trunk-v4.patch with the following changes:
- Fixed filterNormalize() to prevent it from incorrectly pre-pending "http://" to 
normal urls.
- Migrated HostDb to the new map-reduce API

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Renovating "Nutch Hadoop Tutorial" wiki page

2014-01-21 Thread Tejas Patil
Hi nutch-dev,

I was looking at [0] and realized that, with the massive number of Hadoop
setup tutorials out there on the internet, we need not repeat the same on the
nutch wiki page and can instead assume that the user has already done the
Hadoop setup. For convenience, we could direct users to the Hadoop wiki page
which has the Hadoop setup details.
Plus, I propose following:

- Section "Downloading Hadoop and Nutch" : Remove the Hadoop portions and
let the Nutch stuff stay.
- Section "Setting Up The Deployment Architecture" must be removed.
- Section "Deploy Nutch to Single Machine" and "Deploy Nutch to Multiple
Machines" can be merged together.
- Section "Performing a Nutch Crawl", "Testing the Crawl" and "Performing a
Search" must be merged and their contents updated.
- Section "Rsyncing Code to Slaves" and "Updates" can be completely
removed.

Any comments ?

[0] : http://wiki.apache.org/nutch/NutchHadoopTutorial

Thanks,
Tejas


[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875981#comment-13875981
 ] 

Tejas Patil commented on NUTCH-1630:


Hi [~talat],
I didn't know about NUTCH-1413. That's the perfect way of getting the average 
response time. For larger crawls which span several days, this would give a good 
approximation of the response time. With that, the points in the first two 
paragraphs of my earlier comment are resolved. For the third paragraph, as you 
have made it configurable, crawl owners would have to make this choice. 

The concept behind the patch is good and would be a valuable addition to Nutch. 
As [~jnioche] suggested, it would be super awesome if this could be a plugin or 
made less tangled with the Generate and Fetch code, so that it doesn't 
accidentally introduce any bugs.
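
For readers skimming the description quoted below, here is a minimal sketch of 
the queue-length computation it proposes (hypothetical names; not the attached 
patch):
{noformat}
import java.util.HashMap;
import java.util.Map;

public class AdaptiveQueueLengthSketch {

  /**
   * queues maps a host to {averageResponseMs, urlCount} for the current depth.
   * Per queue: FW = avg * count; AW = harmonic mean of all FWs;
   * queue length = AW / avg.
   */
  public static Map<String, Long> queueLengths(Map<String, double[]> queues) {
    double sumOfInverses = 0.0;
    for (double[] s : queues.values()) {
      sumOfInverses += 1.0 / (s[0] * s[1]);     // 1 / FW
    }
    double aw = queues.size() / sumOfInverses;  // harmonic mean of workloads

    Map<String, Long> lengths = new HashMap<String, Long>();
    for (Map.Entry<String, double[]> e : queues.entrySet()) {
      // slower hosts get shorter queues, so all finish around the same time
      lengths.put(e.getKey(), Math.round(aw / e.getValue()[0]));
    }
    return lengths;
  }
}
{noformat}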

> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> -
>
> Key: NUTCH-1630
> URL: https://issues.apache.org/jira/browse/NUTCH-1630
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1, 2.2, 2.2.1
>Reporter: Talat UYARER
>  Labels: improvement
> Fix For: 2.3
>
> Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, due to the disproportionate size of queues, fetching needs to wait 
> for a long time for long lasting queues when shorter ones are finished. That 
> means you may have to wait for a couple of days for some of queues.
> Normally we define max queue size with generate.max.count but that's a static 
> value. However number of URLs to be fetched increases with each depth. 
> Defining same length for all queues does not mean all queues will finish 
> around the same time. This problem has been addressed by some other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). Our 
> solution can be applicable to all three modes.
> 1-Define a "fetch workload of current queue" (FW) value for each queue based 
> on the previous fetches of that queue.
> We calculate this by:
> FW=average response time of previous depth * number of urls in current 
> queue
> 2- Calculate the harmonic mean [2] of all FW's to get the average workload of 
> current depth (AW)
> 3- Get the length for a queue by dividing AW by previously known average 
> response time of that queue:
> Queue Length=AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish up around 
> the same time.
> As soon as possible I will send my patch. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion; harmonic mean is best in our case because our data has a 
> few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875947#comment-13875947
 ] 

Tejas Patil commented on NUTCH-1630:


Hi [~talat],
So from the 2nd depth onwards, you would ping the host in the generate phase and 
get the response time. 
For large scale crawl setups, the Generator itself might run for a few hours, and 
at the time when you ping the host it might be loaded or there might be network 
traffic. When the actual fetch phase runs, the response time might be different 
depending upon the load on the server. As I mentioned in an earlier comment, I 
thought you were doing a cumulative sum of the response timings for several urls 
of a host and then getting an average from it... which would give better 
response time numbers. This would be harder to code in the existing codebase 
and might look ugly, as the fetcher needs to pass on this information to the 
generator.
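
A minimal sketch of the per-host averaging I had in mind (hypothetical names, 
not actual Nutch code):
{noformat}
import java.util.HashMap;
import java.util.Map;

/** Accumulates response times per host during fetch; averaged per depth. */
public class HostResponseStats {
  private final Map<String, long[]> stats = new HashMap<String, long[]>();

  public synchronized void record(String host, long responseMs) {
    long[] s = stats.get(host);
    if (s == null) { s = new long[2]; stats.put(host, s); }
    s[0] += responseMs;  // cumulative sum over all urls fetched from this host
    s[1]++;              // number of fetches
  }

  /** Average response time over the whole depth, or -1 if the host is unseen. */
  public synchronized double averageMs(String host) {
    long[] s = stats.get(host);
    return s == null ? -1 : (double) s[0] / s[1];
  }
}
{noformat}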

+A broader concern for crawls which run for days+
Server response timings themselves change as the local time changes. For example, 
during the day (say 8:00 - 11:00 am) there might be a decent number of requests 
from users to the server, compared to night time (say 1:00 - 4:00 am) when there 
are very few users requesting the server. Pinging the server at some single point 
in the 24-hour day would not give a good approximation of the response time for 
long running crawls. 

+Effect on the crawlspace of slow servers+
If a server is genuinely slow (say due to low-end hardware), then it would always 
have a slower response time compared to other servers. Effectively, we would 
end up having a smaller fetch queue for that host, thus creating a huge backlog 
of its urls which would end up sitting in the crawldb, not being generated, over 
and over again. I would take your side on this: try to fetch as much as we can. 
But some crawl owners might be unhappy with this.

> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> -
>
> Key: NUTCH-1630
> URL: https://issues.apache.org/jira/browse/NUTCH-1630
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1, 2.2, 2.2.1
>Reporter: Talat UYARER
>  Labels: improvement
> Fix For: 2.3
>
> Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, due to the disproportionate size of queues, fetching needs to wait 
> for a long time for long lasting queues when shorter ones are finished. That 
> means you may have to wait for a couple of days for some of queues.
> Normally we define max queue size with generate.max.count but that's a static 
> value. However number of URLs to be fetched increases with each depth. 
> Defining same length for all queues does not mean all queues will finish 
> around the same time. This problem has been addressed by some other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). Our 
> solution can be applicable to all three modes.
> 1-Define a "fetch workload of current queue" (FW) value for each queue based 
> on the previous fetches of that queue.
> We calculate this by:
> FW=average response time of previous depth * number of urls in current 
> queue
> 2- Calculate the harmonic mean [2] of all FW's to get the average workload of 
> current depth (AW)
> 3- Get the length for a queue by dividing AW by previously known average 
> response time of that queue:
> Queue Length=AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish up around 
> the same time.
> As soon as possible I will send my patch. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion; harmonic mean is best in our case because our data has a 
> few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1697) SegmentMerger to implement Tool

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875910#comment-13875910
 ] 

Tejas Patil commented on NUTCH-1697:


Hi [~markus17],
Correct me if I am wrong: Hadoop properties should be passed as *-D 
property=value* (note the space after -D). The way you were passing them, i.e. 
(*-Dproperty=value*), is applicable to JVM system properties and won't be 
picked up by the Tool.
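
A tiny Tool can be used to verify how a property comes through (hypothetical 
snippet, not part of any patch):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/** Prints a property so one can test how -D arguments come through. */
public class PropProbe extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // ToolRunner/GenericOptionsParser consumes generic options such as
    // "-D probe.prop=42" before run() is invoked and folds them into getConf()
    System.out.println("probe.prop = " + getConf().get("probe.prop"));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new PropProbe(), args));
  }
}
{noformat}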

> SegmentMerger to implement Tool
> ---
>
> Key: NUTCH-1697
> URL: https://issues.apache.org/jira/browse/NUTCH-1697
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1697-trunk.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875902#comment-13875902
 ] 

Tejas Patil commented on NUTCH-1630:


Hi [~icebergx5],
How do you obtain the average response time of the previous depth? I was hoping 
that it would be somewhere in the fetch phase, where you somehow stored the 
response timings for each host and then passed on that information to the 
generate phase.

> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> -
>
> Key: NUTCH-1630
> URL: https://issues.apache.org/jira/browse/NUTCH-1630
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1, 2.2, 2.2.1
>Reporter: Talat UYARER
>  Labels: improvement
> Fix For: 2.3
>
> Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, due to the disproportionate size of queues, fetching needs to wait 
> for a long time for long lasting queues when shorter ones are finished. That 
> means you may have to wait for a couple of days for some of queues.
> Normally we define max queue size with generate.max.count but that's a static 
> value. However number of URLs to be fetched increases with each depth. 
> Defining same length for all queues does not mean all queues will finish 
> around the same time. This problem has been addressed by some other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). Our 
> solution can be applicable to all three modes.
> 1-Define a "fetch workload of current queue" (FW) value for each queue based 
> on the previous fetches of that queue.
> We calculate this by:
> FW=average response time of previous depth * number of urls in current 
> queue
> 2- Calculate the harmonic mean [2] of all FW's to get the average workload of 
> current depth (AW)
> 3- Get the length for a queue by dividing AW by previously known average 
> response time of that queue:
> Queue Length=AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish up around 
> the same time.
> As soon as possible I will send my patch. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion; harmonic mean is best in our case because our data has a 
> few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1680) CrawldbReader to dump minRetry value

2014-01-18 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875687#comment-13875687
 ] 

Tejas Patil commented on NUTCH-1680:


+1 

> CrawldbReader to dump minRetry value
> 
>
> Key: NUTCH-1680
> URL: https://issues.apache.org/jira/browse/NUTCH-1680
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1680-trunk.patch
>
>
> CrawlDBReader should be able to dump records based on minimum retry value.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Inject operation: can't it be done in a single map-reduce job ?

2014-01-06 Thread Tejas Patil
Thanks Lewis and Markus.

@Lewis: I don't have a dedicated cluster (I am currently not a student nor
working anywhere) so I would be running in pseudo-distributed mode on my
laptop. I don't think that it would be a perfect setup to get some stats.
Does ASF have any cluster which could be used?

Thanks,
Tejas


On Mon, Jan 6, 2014 at 6:54 AM, Markus Jelsma wrote:

> Hi - Yes, MultipleInputs works very well, i did that too when coding the
> HostDB. The MultipleInputs class was not available when the injector was
> originally written, it was introduced around 0.19 or 0.20. I see no reason
> not to replace this so +1 for a new ticket. If unit tests pass, we're good
> to go.
>
> -Original message-
> From: Lewis John Mcgibbney
> Sent: Monday 6th January 2014 15:40
> To: dev@nutch.apache.org
> Subject: Re: Inject operation: can't it be done in a single map-reduce job
> ?
>
> Hi Tejas,
>
> On Sat, Jan 4, 2014 at 8:01 AM, <dev-digest-h...@nutch.apache.org> wrote:
>
> I realized that by using MultipleInputs, we can read CrawlDatum objects
> from crawldb and urls from seeds file simultaneously and perform inject in
> a single map-reduce job. PFA Injector2.java which is an implementation of
> this approach. I did some basic testing on it and so far I have not
> encountered any problems.
>
> Dynamite Tejas. I would kindly ask that you open an issue and apply your
> patch against trunk :)
>
> I am not sure why Injector was not written this way which is more
> efficient than the one currently in trunk (maybe MultipleInputs was later
> added in Hadoop).
>
> As far as I have discovered, joins have been available in Hadoop's mapred
> package and subsequently in the mapreduce package, so it may not be a case of
> them not being available... however this goes to no length to explain why
> the Injector was not written in this way.
>
> Wondering if I am wrong somewhere in my understanding. Any comments about
> this ?
>
> I am curious to discover how much more efficient using the MultipleInputs
> class is over the sequential MR jobs as currently implemented. Do you
> have any comparison on the size of the dataset being used?
>
> There is a script [0] I keep on my github which we can test this against
> (1M URLs). This would provide a reasonable input dataset which we could use
> to base some efficiency tests on.
>
> Great observations Tejas.
>
> Lewis
>
> [0] https://github.com/lewismc/nipt 
>
>
>


Re: Independent Map Reduce to parse Nutch content (Cont.)

2014-01-04 Thread Tejas Patil
*>> It will finish all the mappers without problem but still.. errored out
after all the mappers*
*>> Exception in thread "main" java.io.IOException: Job failed!*

As I mentioned in the earlier mail, did you check the logs to find out the
root cause of the exception?

*>> I can see Nutch constantly uses Hadoop API without hadoop
pre-installed.. why can't my code work*

The way you are running in local mode is using:
*>> java -jar example.jar localinput/ localoutput/*

which does not add the hadoop jars and their dependency jars to the classpath.
You could set Hadoop's configuration for local mode and then run it using
the same command you used for the distributed mode, i.e.
*>> hadoop -jar example.jar hdfsinput/ hdfsoutput/*

The advantage of using this command is that it sets the classpath and
environment variables for you and then invokes the relevant java class. When
your config is tuned for the local mode of hadoop, it would run locally just
like Nutch's local mode.

Thanks,
Tejas


On Sat, Jan 4, 2014 at 12:47 PM, Bin Wang  wrote:

> Hi Tejas,
>
> I started an AWS instance and run hadoop in single node mode.
>
> When I do..
> hadoop -jar example.jar hdfsinput/ hdfsoutput/
>
> Everything works perfect as I expected: a bunch of staff got printed to
> the screen and both mappers and reducers got finished without question. In
> the end, the expected output sits in the hdfs output directory.
>
> However, when I tried to run the jar file without hadoop:
> java -jar example.jar localinput/ localoutput/
> It will finish all the mappers without problem but still.. errored out
> after all the mappers
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
> at arrow.ParseMapred.run(ParseMapred.java:70)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at arrow.ParseMapred.main(ParseMapred.java:18)
>
> I am so confused now why my code doesn't work locally...
> Based on my understanding, I can see Nutch constantly uses Hadoop API
> without hadoop pre-installed.. why can't my code work..
>
> Well, any hint or directional guidance will be appreciated, many thanks!
>
> /usr/bin
>
>
>
>
> On Sat, Jan 4, 2014 at 12:38 AM, Tejas Patil wrote:
>
>> Hi Bin Wang,
>> I would suggest you NOT use Eclipse and instead run your code over the
>> command line. Use logger statements and see the logs for full stack traces of the
>> failure. In my personal experience, logs are the best way to debug hadoop
>> code compared to Eclipse debugger.
>>
>> Thanks,
>> Tejas
>>
>>
>> On Fri, Jan 3, 2014 at 8:56 PM, Bin Wang  wrote:
>>
>>> Hi,
>>>
>>> I tried to modify the code here to parse the nutch content data...
>>>
>>> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
>>> And in the end of this email is a prototype that I have written to run
>>> map reduce to calculate the HTML content length of each URL that I have
>>> scraped.
>>>
>>> The mapper part runs perfectly fine as expected, however, the whole
>>> program stops after all the mappers finished and the reducer did not get a
>>> chance to run: (I am sure there are certain number of pages got scraped and
>>> in the Eclipse console, there are same number of Mapper.. so I assume all
>>> the mapper finished.)
>>>
>>> Can anyone, who is pretty into writing java map reduce job take a look
>>> at my code and see what the error might be... I am not a Java developer at
>>> all so any debug trick or common sense will be appreciated!
>>> (I heard that it is fairly hard to debug code written using hadoop
>>> API... is that true?)
>>>
>>> Many thanks!
>>>
>>> /usr/bin
>>>
>>> _
>>>
>>> Eclipse Console Info
>>>
>>> Starting Mapper ...
>>>
>>> Key: http://url1
>>>
>>> Result: 134943
>>>
>>> Starting Mapper ...
>>>
>>> Key: http://url2
>>>
>>> Result: 258588
>>>
>>> Exception in thread "main" java.io.IOException: Job failed!
>>>
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
>>>
>>> at arrow.ParseMapred.run(ParseMapred.java:68)
>>>
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>>
>>> at arrow.ParseMapred.main(ParseMapred.java:18)
>>>
>>> 

[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-01-04 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862240#comment-13862240
 ] 

Tejas Patil commented on NUTCH-1325:


Could anyone please look at the patch and let us know if there are any flaws or 
improvements that must be addressed?

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Inject operation: can't it be done in a single map-reduce job ?

2014-01-04 Thread Tejas Patil
Hi nutch-dev,

I am looking at Injector code in trunk and I see that currently we are
launching two map-reduce jobs for the same:
1. sort job: get the urls from seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both crawldb and output of sort
job. Merge and emit final CrawlDatum objects.

I realized that by using MultipleInputs, we can read CrawlDatum objects
from crawldb and urls from seeds file simultaneously and perform inject in
a single map-reduce job. PFA Injector2.java which is an implementation of
this approach. I did some basic testing on it and so far I have not
encountered any problems.

I am not sure why Injector was not written this way, which is more efficient
than the one currently in trunk (maybe MultipleInputs was added later in
Hadoop). Wondering if I am wrong somewhere in my understanding. Any
comments about this?

Thanks,
Tejas
/**
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.nutch.crawl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.net.URLFilters;
import org.apache.nutch.net.URLNormalizers;
import org.apache.nutch.scoring.ScoringFilterException;
import org.apache.nutch.scoring.ScoringFilters;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.util.TimingUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Iterator;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** This class takes a flat file of URLs and adds them to the list of pages to be
 * crawled.  Useful for bootstrapping the system. 
 * The URL files contain one URL per line, optionally followed by custom metadata 
 * separated by tabs with the metadata key separated from the corresponding value by '='. 
 * Note that some metadata keys are reserved : 
 * - nutch.score : allows to set a custom score for a specific URL 
 * - nutch.fetchInterval : allows to set a custom fetch interval for a specific URL 
 * - nutch.fetchInterval.fixed : allows to set a custom fetch interval for a specific URL that is not changed by AdaptiveFetchSchedule 
 * e.g. http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
 **/
public class Injector2 extends Configured implements Tool {
  public static final Logger LOG = LoggerFactory.getLogger(Injector2.class);

  /** metadata key reserved for setting a custom score for a specific URL */
  public static String nutchScoreMDName = "nutch.score";
  /** metadata key reserved for setting a custom fetchInterval for a specific URL */
  public static String nutchFetchIntervalMDName = "nutch.fetchInterval";
  /** metadata key reserved for setting a fixed custom fetchInterval for a specific URL */
  public static String nutchFixedFetchIntervalMDName = "nutch.fetchInterval.fixed";

  /** Normalize and filter injected urls. */
  public static class InjectMapper implements Mapper<WritableComparable<?>, Text, Text, CrawlDatum> {
public static final String URL_NORMALIZING_SCOPE = "crawldb.url.normalizers.scope";
public static final String TAB_CHARACTER = "\t";
public static final String EQUAL_CHARACTER = "=";

private URLNormalizers urlNormalizers;
private int interval;
private float scoreInjected;
private JobConf jobConf;
private URLFilters filters;
private ScoringFilters scfilters;
private long curTime;
private boolean url404Purging;
private String scope;

public void configure(JobConf job) {
  this.jobConf = job;
  scope = job.get(URL_NORMALIZING_SCOPE, URLNormalizers.SCOPE_CRAWLDB);
  urlNormalizers = new URLNormalizers(job, scope);
  interval = jobConf.getInt("db.fetch.interval.default", 2592000);
  

[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862237#comment-13862237
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],
Yes. I think that it should be there too. I will be working on the patch this 
weekend and will update on the same. Thanks for your inputs and suggestions till 
now; they were super helpful in chalking out the right specs for this feature.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Independent Map Reduce to parse Nutch content (Cont.)

2014-01-03 Thread Tejas Patil
Hi Bin Wang,
I would suggest you NOT use Eclipse and instead run your code over the command line.
Use logger statements and see the logs for full stack traces of the
failure. In my personal experience, logs are the best way to debug hadoop
code compared to Eclipse debugger.

Thanks,
Tejas


On Fri, Jan 3, 2014 at 8:56 PM, Bin Wang  wrote:

> Hi,
>
> I tried to modify the code here to parse the nutch content data...
>
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
> And in the end of this email is a prototype that I have written to run map
> reduce to calculate the HTML content length of each URL that I have scraped.
>
> The mapper part runs perfectly fine as expected, however, the whole
> program stops after all the mappers finished and the reducer did not get a
> chance to run: (I am sure there are certain number of pages got scraped and
> in the Eclipse console, there are same number of Mapper.. so I assume all
> the mapper finished.)
>
> Can anyone, who is pretty into writing java map reduce job take a look at
> my code and see what the error might be... I am not a Java developer at all
> so any debug trick or common sense will be appreciated!
> (I heard that it is fairly hard to debug code written using hadoop API...
> is that true?)
>
> Many thanks!
>
> /usr/bin
>
> _
>
> Eclipse Console Info
>
> Starting Mapper ...
>
> Key: http://url1
>
> Result: 134943
>
> Starting Mapper ...
>
> Key: http://url2
>
> Result: 258588
>
> Exception in thread "main" java.io.IOException: Job failed!
>
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:784)
>
> at arrow.ParseMapred.run(ParseMapred.java:68)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>
> at arrow.ParseMapred.main(ParseMapred.java:18)
>
> _
>
> // my code
>
> package example;
>
> import ...;
>
> public class ParseMapred extends Configured implements Tool,
>
>  Mapper<WritableComparable, Content, Text, IntWritable>,
>
>  Reducer<Text, IntWritable, Text, IntWritable> {
>
>
>  public static void main(String[] args) throws Exception {
>
>  int res = ToolRunner.run(NutchConfiguration.create(),
>
>   new ParseMapred(), args);
>
>  System.exit(res);
>
> }
>
>
>  public void configure(JobConf job) {
>
>  setConf(job);
>
> }
>
>
>  public void close() throws IOException {}
>
>
>  public void reduce(Text key, Iterator<IntWritable> values,
>
>  OutputCollector<Text, IntWritable> output, Reporter reporter)
>
>  throws IOException {
>
>  System.out.println("Starting Reducer ...");
>
>  System.out.println("Reducer: " + "key" + key);
>
> output.collect(key, values.next()); // collect first value
>
> }
>
>
>  public void map(WritableComparable key, Content content,
>
>  OutputCollector<Text, IntWritable> output, Reporter reporter)
>
>  throws IOException {
>
>  Text url = new Text();
>
>  IntWritable result = new IntWritable();
>
>  url.set("fail");
>
>  result = new IntWritable(1);
>
>  try {
>
>  System.out.println("Starting Mapper ...");
>
>  url.set(key.toString());
>
>  result = new IntWritable(content.getContent().length);
>
>  System.out.println("Key: " + url);
>
>  System.out.println("Result: " + result);
>
>  output.collect(url, result);
>
>  } catch (Exception e) {
>
>  // TODO Auto-generated catch block
>
>  output.collect(url, result);
>
>  }
>
> }
>
>
>  public int run(String[] args) throws Exception {
>
> JobConf job = new NutchJob(getConf());
>
> job.setJobName("ParseData");
>
> FileInputFormat.addInputPath(job, new Path("/Users/.../data/"));
>
> FileOutputFormat.setOutputPath(job, new Path("/Users/.../result"));
>
> job.setInputFormat(SequenceFileInputFormat.class);
>
> job.setOutputFormat(TextOutputFormat.class);
>
> job.setOutputKeyClass(Text.class);
>
> job.setOutputValueClass(IntWritable.class);
>
> job.setMapperClass(ParseMapred.class);
>
> job.setReducerClass(ParseMapred.class);
>
> JobClient.runJob(job);
>
>  return 0;
>
> }
>
> }
>


Re: How Map Reduce code in Nutch run in local mode vs distributed mode?

2014-01-03 Thread Tejas Patil
In local mode, the hadoop jars are in the classpath (see
runtime/local/lib/hadoop-core-1.2.0.jar) of nutch jobs. From Hadoop's
FileSystem class (see line 132 in [0]), the default value of 'fs.default.name'
is picked up by the code.

[0] :
http://svn.apache.org/viewvc/hadoop/common/branches/branch-1.2/src/core/org/apache/hadoop/fs/FileSystem.java?view=markup
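
A quick way to see this in action (a hypothetical snippet, assuming the
hadoop-core jar is on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DefaultFsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // with no core-site.xml on the classpath this prints "file:///",
    // i.e. the local filesystem
    System.out.println(conf.get("fs.default.name"));
    System.out.println(FileSystem.get(conf).getUri());
  }
}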

On Fri, Jan 3, 2014 at 10:22 AM, Bin Wang  wrote:

> Hi Tejas,
>
> Thanks a lot for your response, now I completely understand how the WordCount
> example reads the path as an HDFS path: because you use the `hadoop` command to
> call WordCount.jar. And the `hadoop` configuration says:
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://localhost:9000</value>
>   </property>
> </configuration>
> ...
>
> However, Nutch 1.7 can be installed without Hadoop preinstalled. Where
> does Nutch read the filesystem configuration from? There is no core-site.xml
> for Nutch, is there? So it defaults to local?
>
> /usr/bin
>
>
>
>
> On Thu, Jan 2, 2014 at 10:02 PM, Tejas Patil wrote:
>
>> The config 'fs.default.name' of core-site.xml is what makes this happen.
>> Its default value is "file:///" which corresponds to local mode of Hadoop.
>> In local mode Hadoop looks for paths on the local file system. In
>> distributed mode of Hadoop, 'fs.default.name' would be
>> "hdfs://IP_OF_NAMENODE/" and it will look for those paths in HDFS.
>>
>> Thanks,
>> Tejas
>>
>>
>> On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang  wrote:
>>
>>> Hi there,
>>>
>>> When I went through the source code of Nutch - the ParseSegment class,
>>> which is the class to "parse content in a segment". Here is its map reduce
>>> job configuration part.
>>>
>>> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
>>>   (Line
>>> 199 - 213)
>>>
>>> 199  JobConf job = new NutchJob(getConf());
>>> 200  job.setJobName("parse " + segment);
>>> 201
>>> 202  FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
>>> 203  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
>>> 204  job.setInputFormat(SequenceFileInputFormat.class);
>>> 205  job.setMapperClass(ParseSegment.class);
>>> 206  job.setReducerClass(ParseSegment.class);
>>> 207
>>> 208  FileOutputFormat.setOutputPath(job, segment);
>>> 209  job.setOutputFormat(ParseOutputFormat.class);
>>> 210  job.setOutputKeyClass(Text.class);
>>> 211  job.setOutputValueClass(ParseImpl.class);
>>> 212
>>> 213  JobClient.runJob(job);
>>>
>>> Here, in lines 202 and 208, the map-reduce input/output paths have been
>>> configured by calling the methods addInputPath/setOutputPath from
>>> FileInputFormat/FileOutputFormat.
>>> And these are absolute paths on the Linux OS instead of HDFS virtual
>>> paths.
>>>
>>> And on the other hand, when I look at the WordCount example in the
>>> hadoop homepage.
>>> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Line 39 -
>>> 55)
>>>
>>> 39.  JobConf conf = new JobConf(WordCount.class);
>>> 40.  conf.setJobName("wordcount");
>>> 41.
>>> 42.  conf.setOutputKeyClass(Text.class);
>>> 43.  conf.setOutputValueClass(IntWritable.class);
>>> 44.
>>> 45.  conf.setMapperClass(Map.class);
>>> 46.  conf.setCombinerClass(Reduce.class);
>>> 47.  conf.setReducerClass(Reduce.class);
>>> 48.
>>> 49.  conf.setInputFormat(TextInputFormat.class);
>>> 50.  conf.setOutputFormat(TextOutputFormat.class);
>>> 51.
>>> 52.  FileInputFormat.setInputPaths(conf, new Path(args[0]));
>>> 53.  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>>> 54.
>>> 55.  JobClient.runJob(conf);
>>> Here, the input/output paths are configured in the same way as in Nutch,
>>> but the paths are actually passed in as command-line arguments:
>>> bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
>>> /usr/joe/wordcount/input /usr/joe/wordcount/output
>>> And we can see the paths passed to the program are actually HDFS paths,
>>> not Linux OS paths.
>>> I am confused: is there some other configuration that I missed which leads
>>> to the runtime environment difference? In which case should I pass an
>>> absolute path or an HDFS path?
>>>
>>> Thanks a lot!
>>>
>>> /usr/bin
>>>
>>>
>>
>


[jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861217#comment-13861217
 ] 

Tejas Patil commented on NUTCH-356:
---

+1 for commit.

> Plugin repository cache can lead to memory leak
> ---
>
> Key: NUTCH-356
> URL: https://issues.apache.org/jira/browse/NUTCH-356
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Enrico Triolo
> Fix For: 2.3, 1.8
>
> Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, 
> ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch, cache_classes.patch
>
>
> While I was trying to solve a problem I reported a while ago (see Nutch-314), 
> I found out that actually the problem was related to the plugin cache used in 
> class PluginRepository.java.
> As  I said in Nutch-314, I think I somehow 'force' the way nutch is meant to 
> work, since I need to frequently submit new urls and append their contents to 
> the index; I don't (and I can't) have an urls.txt file with all urls I'm 
> going to fetch, but I recreate it each time a new url is submitted.
> Thus, I think in the majority of cases you won't have problems using nutch 
> as-is, since the problem I found occurs only if nutch is used in a way 
> similar to the one I use.
> To simplify your test I'm attaching a class that performs something similar 
> to what I need. It fetches and indexes some sample urls; to avoid webmasters' 
> complaints I left the sample urls list empty, so you should modify the source 
> code and add some urls.
> Then you only have to run it and watch your memory consumption with top. In 
> my experience I get an OutOfMemoryException after a couple of minutes, but it 
> clearly depends on your heap settings and on the plugins you are using (I'm 
> using 
> 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
> The problem is bound to the PluginRepository 'singleton' instance, since it 
> never gets released. It seems that some class maintains a reference to it and 
> this class is never released since it is cached somewhere in the 
> configuration.
> So I modified the PluginRepository's 'get' method so that it never uses the 
> cache and always returns a new instance (you can find the patch in 
> attachment). This way the memory consumption is always stable and I get no 
> OOM anymore.
> Clearly this is not the solution, since I guess there are many performance 
> issues involved, but for the moment it works.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: How Map Reduce code in Nutch run in local mode vs distributed mode?

2014-01-02 Thread Tejas Patil
The config 'fs.default.name' of core-site.xml is what makes this happen.
Its default value is "file:///" which corresponds to local mode of Hadoop.
In local mode Hadoop looks for paths on the local file system. In
distributed mode of Hadoop, 'fs.default.name' would be
"hdfs://IP_OF_NAMENODE/" and it will look for those paths in HDFS.

Thanks,
Tejas


On Thu, Jan 2, 2014 at 7:28 PM, Bin Wang  wrote:

> Hi there,
>
> When I went through the source code of Nutch - the ParseSegment class,
> which is the class to "parse content in a segment". Here is its map reduce
> job configuration part.
>
> http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/parse/ParseSegment.java?view=markup
>   (Line
> 199 - 213)
>
> 199  JobConf job = new NutchJob(getConf());
> 200  job.setJobName("parse " + segment);
> 201
> 202  FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
> 203  job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
> 204  job.setInputFormat(SequenceFileInputFormat.class);
> 205  job.setMapperClass(ParseSegment.class);
> 206  job.setReducerClass(ParseSegment.class);
> 207
> 208  FileOutputFormat.setOutputPath(job, segment);
> 209  job.setOutputFormat(ParseOutputFormat.class);
> 210  job.setOutputKeyClass(Text.class);
> 211  job.setOutputValueClass(ParseImpl.class);
> 212
> 213  JobClient.runJob(job);
> Here, in lines 202 and 208, the map-reduce input/output paths have been
> configured by calling the methods addInputPath/setOutputPath from
> FileInputFormat/FileOutputFormat.
> And these are absolute paths on the Linux OS instead of HDFS virtual paths.
>
> And on the other hand, when I look at the WordCount example in the hadoop
> homepage.
> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html (Line 39 - 55)
>
> 39.  JobConf conf = new JobConf(WordCount.class);
> 40.  conf.setJobName("wordcount");
> 41.
> 42.  conf.setOutputKeyClass(Text.class);
> 43.  conf.setOutputValueClass(IntWritable.class);
> 44.
> 45.  conf.setMapperClass(Map.class);
> 46.  conf.setCombinerClass(Reduce.class);
> 47.  conf.setReducerClass(Reduce.class);
> 48.
> 49.  conf.setInputFormat(TextInputFormat.class);
> 50.  conf.setOutputFormat(TextOutputFormat.class);
> 51.
> 52.  FileInputFormat.setInputPaths(conf, new Path(args[0]));
> 53.  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
> 54.
> 55.  JobClient.runJob(conf);
> Here, the input/output paths are configured in the same way as in Nutch,
> but the paths are actually passed in as command-line arguments:
> bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount
> /usr/joe/wordcount/input /usr/joe/wordcount/output
> And we can see the paths passed to the program are actually HDFS paths,
> not Linux OS paths.
> I am confused: is there some other configuration that I missed which leads
> to the runtime environment difference? In which case should I pass an
> absolute path or an HDFS path?
>
> Thanks a lot!
>
> /usr/bin
>
>


Re: use to parse big Nutch/Content file

2014-01-02 Thread Tejas Patil
Here is what I would do:
If you are running a crawl, let it run with the default parser. Write a Nutch
plugin with your customized parse implementation to evaluate your parse
logic. Now get some real segments (with a subset of those million pages),
run only the 'bin/nutch parse' command, and see how good it is. That
command will run your parser over the segment. Do this till you get a
satisfactory parser implementation.
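
For example, the evaluation loop could look like this (segment path, output
directory and test url are placeholders):

# check your parse logic against a single page first
bin/nutch parsechecker http://www.example.com/some-page.html
# then re-parse a real segment with your plugin enabled in plugin.includes
bin/nutch parse crawl/segments/20140102123456
# and inspect the parse output
bin/nutch readseg -dump crawl/segments/20140102123456 parse_dump \
  -nocontent -nofetch -nogenerate -noparsetext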

~tejas


On Thu, Jan 2, 2014 at 2:48 PM, Bin Wang  wrote:

> Hi,
>
> I have a robot that scrapes a website daily and stores the HTML locally
> (in nutch binary format in the segment/content folder).
>
> The size of the scraping is fairly big. Million pages per day.
> One thing about the HTML pages themselves is that they follow exactly the
> same format.. so I can write a parser in Java to parse out the info I want
> (say unit price, part number...etc) for one page, and that parser will work
> for most of the pages..
>
> I am wondering whether there is some map reduce template already written so
> I can just replace the parser with my customized one and easily start a
> hadoop mapreduce job. (Actually, there doesn't have to be any reduce job...
> in this case, we map every page to the parsed result and that is it...)
>
> I was looking at the map reduce example here:
> https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
> But I have some problem translating that into my real-world nutch problem.
>
> I know running map reduce against the Nutch binary files will be a bit
> different from word count. I looked at the source code of Nutch and, to me,
> it looks like the files are sequence files of records where each record is a
> key/value pair, where the key is of Text type and the value is of
> org.apache.nutch.protocol.Content type. Then how should I configure the map
> job so it can read in the big raw content binary file, do the InputSplit
> correctly, and run the map job?
>
> Thanks a lot!
>
> /usr/bin
>
>
> ( Some explanations of why I decided not to write Java plugin ):
> I was thinking about writing a Nutch Plugin so it will be handy to parse
> the scraped data using Nutch command. However, the problem here is "it is
> hard to write a perfect parser" in one go. It probably makes a lot of sense
> for the people who deal with parsers a lot. You locate your HTML tag by
> some specific features that you think will be general... css class type,
> id...etc...even combining with regular expression. However, when you apply
> your logic to all the pages, it won't stand true for all the pages. Then
> you need to write many different parsers to run against the whole dataset
> (Million pages) in one go and see which one has the best performance. Then
> you run your parser against all your snapshots days * million pages.. to
> get the new dataset.. )
>


[jira] [Commented] (NUTCH-1454) parsing chm failed

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803
 ] 

Tejas Patil commented on NUTCH-1454:


TIKA-1122 is fixed and I have verified that 'parsechecker' works fine with the 
same. Upgrading to Tika 1.5 (yet to be released) should fix this for Nutch.

> parsing chm failed
> --
>
> Key: NUTCH-1454
> URL: https://issues.apache.org/jira/browse/NUTCH-1454
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5.1
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.9
>
>
> (reported by Jan Riewe, see 
> http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
> Nutch fails to parse chm files with
> {quote}
>  ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
> application/vnd.ms-htmlhelp
> {quote}
> Tested with chm test files from Tika:
> {code}
>  % bin/nutch parsechecker 
> file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
> {code}
> Tika parses this document (but does not extract any content).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860678#comment-13860678
 ] 

Tejas Patil commented on NUTCH-1691:


Hi [~markus17],
Its a good solution. +1 from me. 
I would like to know the way you are invoking the plugin. I tried to use 
"bin/nutch plugin urlfilter-domainblacklist" but that didn't work as it doesn't 
have a main().

> DomainBlacklist url filter does not allow -D filter file override
> -
>
> Key: NUTCH-1691
> URL: https://issues.apache.org/jira/browse/NUTCH-1691
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8, 2.4
>
> Attachments: NUTCH-1691-trunk.patch
>
>
> This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
> plugin's file attribute is always used.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Closed] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-02 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil closed NUTCH-1670.
--

Resolution: Fixed

Committed the patch by [~amuseme] to trunk (rev 1554883).

> set same crawldb directory in mergedb parameter
> ---
>
> Key: NUTCH-1670
> URL: https://issues.apache.org/jira/browse/NUTCH-1670
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
>  Labels: PatchAvailable
> Fix For: 1.8
>
> Attachments: NUTCH-1670.patch
>
>
> when merging two crawldbs using the same crawldb directory in the bin/nutch 
> mergedb parameters, it will throw a data-not-found exception. 
> bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
> bin/nutch generate crawldb_t1 segment



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1080:
---

Fix Version/s: 1.8

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860643#comment-13860643
 ] 

Tejas Patil commented on NUTCH-1080:


Committed to trunk (rev 1554881). Will port the same to 2.x

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil reassigned NUTCH-1080:
--

Assignee: Tejas Patil

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
>    Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-01 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1080:
---

Attachment: NUTCH-1080-tejasp-trunk-v2.patch

Attaching a patch for trunk. Uploaded the same over review board: 
https://reviews.apache.org/r/16563/

Comments are welcome !!!

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
> Fix For: 2.3
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-01 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: NUTCH-1325-trunk-v3.patch

A final patch (NUTCH-1325-trunk-v3.patch) to complete this feature.
Uploaded the patch over review board too: https://reviews.apache.org/r/16555/

Comments are welcome !!!

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-01 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859987#comment-13859987
 ] 

Tejas Patil commented on NUTCH-1670:


Hi [~amuseme.lu],
The patch looks good to me. +1 from me for commit.

> set same crawldb directory in mergedb parameter
> ---
>
> Key: NUTCH-1670
> URL: https://issues.apache.org/jira/browse/NUTCH-1670
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
>  Labels: PatchAvailable
> Fix For: 1.8
>
> Attachments: NUTCH-1670.patch
>
>
> when merging two crawldbs using the same crawldb directory in the bin/nutch 
> mergedb parameters, it will throw a data-not-found exception. 
> bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
> bin/nutch generate crawldb_t1 segment



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859358#comment-13859358
 ] 

Tejas Patil commented on NUTCH-1687:


Created a review request: https://reviews.apache.org/r/16535/

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a url from starting at the head of the 
> queues list, so queues at the start of the list have a greater chance of being 
> picked first. That can cause a long-tail problem, where only a few queues 
> (holding many urls) remain available at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to scan queues from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin: that reduces the time 
> to find an available queue, makes all queues get picked in round robin, and, 
> if we use topN during the generate phase, there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1687:
---

Attachment: NUTCH-1687.tejasp.v1.patch

I feel that there is no need for creating a separate class for Circular linked 
list and maintaining the circular list along with the original map. 

Uploading "NUTCH-1687.tejasp.v1.patch" : Uses 
[LinkedHashMap|http://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html]
 along with a [Guava cyclic 
iterator|http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#cycle(java.lang.Iterable)]
 to iterate the map of queues in a circular fashion. With that no separate list 
needs to be maintained. 
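
A minimal, self-contained sketch of the idea (Guava on the classpath; the
String/Integer map below just stands in for the real map of fetch queues):

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import com.google.common.collect.Iterables;

public class RoundRobinSketch {
  public static void main(String[] args) {
    // insertion-ordered map, like FetchItemQueues' map of queues
    Map<String, Integer> queues = new LinkedHashMap<String, Integer>();
    queues.put("host-a", 3);
    queues.put("host-b", 1);
    queues.put("host-c", 2);

    // Iterables.cycle() restarts at the head once the tail is reached,
    // so no separate circular list has to be maintained
    Iterator<Map.Entry<String, Integer>> it =
        Iterables.cycle(queues.entrySet()).iterator();
    for (int i = 0; i < 6; i++) {
      System.out.println("picked queue: " + it.next().getKey());
    }
  }
}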

Comments are welcome.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a url from starting at the head of the 
> queues list, so queues at the start of the list have a greater chance of being 
> picked first. That can cause a long-tail problem, where only a few queues 
> (holding many urls) remain available at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to scan queues from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin: that reduces the time 
> to find an available queue, makes all queues get picked in round robin, and, 
> if we use topN during the generate phase, there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859275#comment-13859275
 ] 

Tejas Patil commented on NUTCH-1687:


This is one good point by [~tiennm].  Although this might not give significant 
performance improvement, it would fairly distribute requests across all fetch 
queues.

Some comments wrt the patch:
1. Do you really need to make the methods of the CircularLinkedList class 
thread-safe? The methods in "FetchItemQueues" which interact with the 
CircularLinkedList (i.e. getFetchItemQueue and getFetchItem) are all 
synchronized. So, it's ensured that only one thread accesses the list at a time.
2. Why is 'id' needed in FetchItemQueue ?

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a url from starting at the head of the 
> queues list, so queues at the start of the list have a greater chance of being 
> picked first. That can cause a long-tail problem, where only a few queues 
> (holding many urls) remain available at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to scan queues from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin: that reduces the time 
> to find an available queue, makes all queues get picked in round robin, and, 
> if we use topN during the generate phase, there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1687:
---

Fix Version/s: 1.8

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a url from starting at the head of the 
> queues list, so queues at the start of the list have a greater chance of being 
> picked first. That can cause a long-tail problem, where only a few queues 
> (holding many urls) remain available at the end.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
>       queues.entrySet().iterator(); // ==> always resets to scan queues from the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin: that reduces the time 
> to find an available queue, makes all queues get picked in round robin, and, 
> if we use topN during the generate phase, there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Nutch Crawl a Specific List Of URLs (150K)

2013-12-29 Thread Tejas Patil
Hi Bin Wang,

>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 &
You were creating a new crawldb or reusing some old one ?

Were you running this on a cluster or in local mode ?
Was there any failure due to which the fetch round got aborted ? (see logs
for this).

I would like to reproduce this issue. Will it be possible for you to share
your config files and subset of urls ?

Thanks,
Tejas


On Sat, Dec 28, 2013 at 2:10 AM, Talat Uyarer  wrote:

> Hi Bin,
>
> You have an interesting error. I don't use 1.7 but I can try with the screen
> command. I believe you will not get the same error.
>
> Talat
>
>
> 2013/12/27 Bin Wang 
>
>> Hi,
>>
>> I have a very specific list of URLs, which is about 140K URLs.
>>
>> I switch off the `db.update.additions.allowed` so it will not update the
>> crawldb... and I was assuming I can feed all the URLs to Nutch, and after
>> one round of fetching, it will finish and leave all the raw HTML files in
>> the segment folder.
>>
>> However, after I run this command:
>> nohup bin/nutch crawl urls -dir result -depth 1 -topN 20 &
>>
>> It ended up with a small number of URLs..
>> TOTAL urls: 872
>> retry 0: 872
>> min score: 1.0
>> avg score: 1.0
>> max score: 1.0
>>
>> And I double check the log to make sure that every url can pass the
>> filter and normalization. And here is the log:
>>
>> 2013-12-27 17:55:25,068 INFO  crawl.Injector - Injector: total number of
>> urls rejected by filters: 0
>> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: total number of
>> urls injected after normalization and filtering: 139058
>> 2013-12-27 17:55:25,069 INFO  crawl.Injector - Injector: Merging injected
>> urls into crawl db.
>>
>> I don't know how 140K URLs ended up being 872 in the end...
>>
>> /usr/bin
>>
>> --
>> AWS ubuntu instance
>> Nutch 1.7
>> java version "1.6.0_27"
>> OpenJDK Runtime Environment (IcedTea6 1.12.6)
>> (6b27-1.12.6-1ubuntu0.12.04.4)
>> OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
>>
>
>
>
> --
> Talat UYARER
> Websitesi: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>


Re: Nutch Several Segment Folders Containing Duplicate Key/URLs

2013-12-24 Thread Tejas Patil
You ran 3 rounds of nutch crawl ("-depth 3") and those 3 folders are the 3
segments created, one for each round of the crawl.
About the 520 URLs, I don't see any obvious reason for that happening. You
should look at a few of the new URLs that were added, find their parent URLs,
and then run a small crawl using those parents as seeds.
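
A rough way to spot the newly added urls (paths below are placeholders) is to
diff a crawldb dump against your seed list:

# dump the crawldb and compare its URL column against the seeds
bin/nutch readdb result/crawldb -dump crawldb_dump
grep "^http" crawldb_dump/part-00000 | cut -f1 | sort -u > all_urls.txt
sort -u urls/seed.txt > seeds.txt
comm -13 seeds.txt all_urls.txt > new_urls.txt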

Thanks,
Tejas


On Tue, Dec 24, 2013 at 8:06 AM, Bin Wang  wrote:

> Hi,
>
> I have a very specific list of URLs to crawl and I implemented it by
> turning off this property:
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>false</value>
>   <description>If true, updatedb will add newly discovered URLs, if false
>   only already existing URLs in the CrawlDb will be updated and no new
>   URLs will be added.
>   </description>
> </property>
>
> So it will not add the parsed new URLs / outbound links into the crawldb.
>
> I tried to feed one link to Nutch and it works exactly the way I want, and
> I can read the raw HTML by going to the deserialize the
> segment/content/part-000/data file.
>
> However, when I feed 520 URLs to Nutch, the result is confusing me.
> It created 3 separate folders and each one has the same structure as the
> folder I just mentioned. When I check the data files in each folder...
> folder 1 contains:
> 400 URLs and their HTML
> folder 2 contains:
> 487 URLs ..
> folder 3 contains:
> 520 URLs ..
>
> And they add up to about 1400! There are many duplicates when you add them
> up and there are 900 distinct URLs in total which is even more than the
> URLs that I fed Nutch.
>
> Here is the research that I have done:
> I have read the source code for Injector and am working on Fetcher..
> Somehow it mentioned in the Fetcher that "the number of Queues is based on
> the number of hosts..."  And I am wondering does that have anything to do
> with how that three folders come.
>
> Can anyone help me understand how those three folders came into existence and
> why the URL numbers are so weird.
>
> Any hint is appreciated or point me to the right class so I can do some
> homework myself.
> --
> Extra Info:
> I am using AWS EC2 Ubuntu and Nutch 1.7
> the command to run the crawl:
> nohup bin/nutch crawl urls -dir result -depth 3 -topN 1 &
> ---
>
> /usr/bin
>
>
>
>
>
>


Re: Step Through Nutch 1.7 Inside Eclipse Missing Argument

2013-12-23 Thread Tejas Patil
Hi Bin Wang,
You are welcome to edit the wiki and add your observations to it. Thanks
for your contribution.

~tejas


On Mon, Dec 23, 2013 at 8:19 AM, Bin Wang  wrote:

> Hi Tejas,
>
> Thanks a lot for your confirmation! And it is working for me now!
>
> I will take you as the tutorial author, Tejas, and correct me if I am wrong.
> The tutorial you have written is very helpful; most of it mentions
> how to work with Nutch 1.7 (trunk) even though it is targeted at
> 2.X.
> I am wondering, should I (or can I) go to the Wiki and add this solution
> so your tutorial is consistent for both Nutch 2.X and 1.X?
>
> (thought I should contribute back when I got the help from the community.)
>
> Thanks,
> /usr/bin
>
>
>
> On Sun, Dec 22, 2013 at 10:44 PM, Tejas Patil wrote:
>
>> You are asking the right question at the right place.
>> The example shown in the tutorial was for the Nutch 2.X series. The 1.X
>> Injector needs an extra param as input, which is the location of the crawldb
>> to inject the urls into. (For the first time, it would create a new one at
>> the location given in the command.)
>>
>> Thanks,
>> Tejas
>>
>>
>> On Sun, Dec 22, 2013 at 8:05 PM, Bin Wang  wrote:
>>
>>> Hi there,
>>>
>>> I was following the RunNutchInEclipse tutorial (1.7 Nutch / trunk
>>> example).
>>>
>>> After I configured the java run configurations as the tutorial showed..
>>> and clicked run.  It did not show the injector process as shown in the
>>> tutorial, and instead, it showed error:
>>>
>>> Usage: Injector  
>>>
>>> I took a look at the source code of the injector class and obviously, it
>>> is somehow expecting two arguments.
>>>
>>> public void inject(Path crawlDb, Path urlDir) throws IOException..
>>>
>>> I don't know how everyone got through this step only passing the URL
>>> path to eclipse.
>>>
>>> Should I add something more in the run configuration to also pass the
>>> crawldb path as another argument? If so, where is the crawldb supposed
>>> to be located?
>>>
>>> (I am new to this mail list and assuming, question at the source code
>>> level might be a better fit for the developer one. So let me know if I am
>>> asking the wrong question in the wrong place..)
>>>
>>> /usr/bin
>>>
>>> --- here is some basically information of the platform that I am working
>>> on 
>>> Mac OS X 10.8.4
>>> Apache Ant(TM) version 1.8.2 compiled on June 16 2012
>>>  svn, version 1.6.18 (r1303927) compiled Feb  6 2013, 14:18:52
>>> Juno Eclipse SDK Version: 4.2.2 Build id: M20130204-1200
>>>
>>> System JAVA:
>>> java version "1.7.0_25"
>>> Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
>>> Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
>>>
>>> Eclipse JAVA:
>>> JVM 1.6.0.jdk
>>>
>>>
>>>
>>
>


[jira] [Commented] (NUTCH-1689) Improve CrawlDb stats

2013-12-22 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855416#comment-13855416
 ] 

Tejas Patil commented on NUTCH-1689:


Some concerns:
1. While you are removing fields from the output, there can be people relying 
on the existing output (grepping or awking to get required fields). It isn't 
wise to simply remove all the fields directly. Keep things backward 
compatible.
2. You can make the command configurable so that users get to select which 
fields they want in the output.
3. While submitting a patch, commenting out the older code is not the best way. 
Remove those lines instead of commenting them out.

> Improve CrawlDb stats
> -
>
> Key: NUTCH-1689
> URL: https://issues.apache.org/jira/browse/NUTCH-1689
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Nguyen Manh Tien
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1689.patch
>
>
> Crawldb stats now is slow due to it load all fields from store, I change to 
> load only necessary fields.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: Step Through Nutch 1.7 Inside Eclipse Missing Argument

2013-12-22 Thread Tejas Patil
You are asking the right question at the right place.
The example shown in the tutorial was for the Nutch 2.X series. The 1.X
Injector needs an extra param as input, which is the location of the crawldb
to inject the urls into. (For the first time, it would create a new one at
the location given in the command.)
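
For example, the program arguments in the Eclipse run configuration for
org.apache.nutch.crawl.Injector could be (paths are placeholders):

crawl/crawldb urls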

Thanks,
Tejas


On Sun, Dec 22, 2013 at 8:05 PM, Bin Wang  wrote:

> Hi there,
>
> I was following the RunNutchInEclipse tutorial (1.7 Nutch / trunk
> example).
>
> After I configured the java run configurations as the tutorial showed..
> and clicked run.  It did not show the injector process as shown in the
> tutorial, and instead, it showed error:
>
> Usage: Injector  
>
> I took a look at the source code of the injector class and obviously, it
> is somehow expecting two arguments.
>
> public void inject(Path crawlDb, Path urlDir) throws IOException..
>
> I don't know how everyone got through this step only passing the URL path
> to eclipse.
>
> Should I add something more in the run configuration to also pass the
> crawldb path as another argument? If so, where is the crawldb supposed
> to be located?
>
> (I am new to this mail list and assuming, question at the source code
> level might be a better fit for the developer one. So let me know if I am
> asking the wrong question in the wrong place..)
>
> /usr/bin
>
> --- here is some basically information of the platform that I am working
> on 
> Mac OS X 10.8.4
> Apache Ant(TM) version 1.8.2 compiled on June 16 2012
> svn, version 1.6.18 (r1303927) compiled Feb  6 2013, 14:18:52
> Juno Eclipse SDK Version: 4.2.2 Build id: M20130204-1200
>
> System JAVA:
> java version "1.7.0_25"
> Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
>
> Eclipse JAVA:
> JVM 1.6.0.jdk
>
>
>


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have 
to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. 
It would fit well where users are performing targeted crawling.
For a large scale, open web crawl use case:
(i) the number of initial hosts can be large : one time burden for users
(ii) crawler discovers new hosts with time : constant pain for users to look 
out for the new hosts discovered and then get sitemaps from robots.txt 
manually. With HostDB from NUTCH-1325 and B, users won't suffer here.

> do we really need an extra DB?
I should have been clear with the explanation. "sitemapDB" is some temporary 
location where all crawl datums of sitemap entries would be written. This can 
be deleted after merge with the main crawlDB. Quite analogous to what inject 
operation does.

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would go in favor of B as it is clean, and A would 
involve messing around with the existing codebase in several places.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>    Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723
 ] 

Tejas Patil edited comment on NUTCH-1465 at 12/16/13 12:09 AM:
---

Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have 
to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. 
It would fit well where users are performing targeted crawling.
For a large scale, open web crawl use case:
i) the number of initial hosts can be large : one time burden for users
ii) crawler discovers new hosts with time : constant pain for users to look out 
for the new hosts discovered and then get sitemaps from robots.txt manually. 
With HostDB from NUTCH-1325 and B, users won't suffer here.

> do we really need an extra DB?
I should have been clear with the explanation. "sitemapDB" is some temporary 
location where all crawl datums of sitemap entries would be written. This can 
be deleted after merge with the main crawlDB. Quite analogous to what inject 
operation does.

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would go in favor of B as it is clean, and A would 
involve messing around with the existing codebase in several places.


was (Author: tejasp):
Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have 
to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. 
It would fit well where users are performing targeted crawling.
For a large scale, open web crawl use case:
(i) the number of initial hosts can be large : one time burden for users
(ii) crawler discovers new hosts with time : constant pain for users to look 
out for the new hosts discovered and then get sitemaps from robots.txt 
manually. With HostDB from NUTCH-1325 and B, users won't suffer here.

> do we really need an extra DB?
I should have been clear with the explanation. "sitemapDB" is some temporary 
location where all crawl datums of sitemap entries would be written. This can 
be deleted after merge with the main crawlDB. Quite analogous to what inject 
operation does.

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would go in favor of B as it is clean, and A would 
involve messing around with the existing codebase in several places.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561
 ] 

Tejas Patil commented on NUTCH-1465:


Revisited this Jira after a long time and gave some thought to how this can be 
done cleanly. Two ways of implementing this:

*(A) Do the sitemap stuff in the fetch phase of nutch cycle.*
This was my original approach which the (in-progress) patch addresses. This 
would involve tweaking core nutch classes at several locations.

Pros:
- Sitemaps are nothing but normal pages with several outlinks. Fits well in the 
'fetch' cycle.

Cons:
- Sitemaps can be huge in size. Fetching them needs large size and time 
limits. The fetch code must have a special case to take into account that the 
url is a sitemap url and use custom limits => leads to a hacky coding style.
- The Outlink class cannot hold the extra information contained in sitemaps 
(like lastmod, changefreq). Modifying it to hold this information too would be 
specific to sitemaps only, yet we would end up making all outlinks hold this 
info. We could create a special type of outlink and take care of this.

*(B) Have separate job for the sitemap stuff and merge its output into the 
crawldb.*
i. User populates a list of hosts (or uses HostDB from NUTCH-1325). Now we have 
all the hosts to be processed.
ii. Run a map-reduce job: for each host, 
  - get the robots page and extract the sitemap urls, 
  - get the xml content of these sitemap pages,
  - create crawl datums with the required info and write them to a sitemapDB 
(a parsing sketch follows at the end of this comment)

iii. Use the CrawlDbMerger utility to merge the sitemapDB and crawldb

Pros:
- Cleaner code. 
- Users have control over when to perform sitemap extraction. This is better 
than (A), wherein sitemap urls sit in the crawldb and get fetched along with 
normal pages (thus eating up fetch time in every fetch phase). We can have a 
sitemap_frequency used inside the crawl script so that users can say: after 'x' 
nutch cycles, run sitemap processing.

Cons:
- Additional map-reduce jobs are needed. I think this should be reasonable. 
Running the sitemap job 1-5 times a month on a production-level crawl would work 
out well.

I am inclined towards implementing (B)
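
As an illustration of step ii, a sketch of the per-host sitemap parsing
(assuming the crawler-commons sitemap parser; API names are from memory and
the inline xml is a toy example):

import java.net.URL;
import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapSketch {
  public static void main(String[] args) throws Exception {
    String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
        + "<url><loc>http://www.example.com/</loc>"
        + "<lastmod>2013-12-01</lastmod><changefreq>daily</changefreq>"
        + "</url></urlset>";
    URL url = new URL("http://www.example.com/sitemap.xml");
    AbstractSiteMap sm = new SiteMapParser()
        .parseSiteMap("text/xml", xml.getBytes("UTF-8"), url);
    if (!sm.isIndex()) {
      for (SiteMapURL su : ((SiteMap) sm).getSiteMapUrls()) {
        // each entry carries the extras a plain Outlink cannot hold;
        // these would go into the crawl datum written to the sitemapDB
        System.out.println(su.getUrl() + " lastmod=" + su.getLastModified()
            + " changefreq=" + su.getChangeFrequency());
      }
    }
  }
}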

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2013-12-14 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848517#comment-13848517
 ] 

Tejas Patil commented on NUTCH-1325:


Hi [~markus17],
I stopped by this Jira (after a long time !!!) with an intention of getting it 
to a stage where we could have it inside trunk. 
You had replied to my two concerns.

For (1): 
{noformat}host_a.example.org, host_b.example.org ==> example.org{noformat}

This might *NOT* be a good idea. 
(a) The websites for say "cs.uci.edu" and "bio.uci.edu" might be hosted 
independently. It can be argued to consider them as different hosts.
(b) I am not sure about the standards, but if something like "uci.cs.edu" is 
valid (subdomain is suffix of domain) then there would be a problem when we 
resolve "uci.cs.edu" and "ucla.cs.edu" to "cs.edu".

For (2): "I use the HTTP:// scheme but not all hosts may allow that scheme. We 
have a modified domain filter that optionally takes a scheme so we can force 
HTTPS for specific domains. Those domains are filtered out because HTTP is not 
allowed."
Do you have any suggestion to work this out ?

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (NUTCH-1577) Add target for creating eclipse project

2013-12-14 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848493#comment-13848493
 ] 

Tejas Patil commented on NUTCH-1577:


There were some check-ins in the past few months which led to one jar 
(solr-solrj-3.4.0.jar) being required on the Eclipse classpath and 'ant 
eclipse' not building the project smoothly. Fixed the same. Committed at 
revision 1550987.

> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves lot of manual steps as given over 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding a ant target to do that would remove burden off from 
> developers.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


Re: Nutch with YARN (aka Hadoop 2.0)

2013-12-09 Thread Tejas Patil
Hi Julien,
In Hadoop 2.0, major components have been re-written to support the new
design. Hence, performance differences are likely to be observed.

Thanks,
Tejas


On Mon, Dec 9, 2013 at 12:54 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> I don't think Nutch has been fully ported to the new mapreduce API which
> is a prerequisite for running it on Hadoop 2.
> I can't think of a reason why the performance would be any different
> with YARN.
>
> Julien
>
>
> On 9 December 2013 06:42, Tejas Patil  wrote:
>
>> Has anyone tried out running Nutch over YARN ? If so, were there any
>> performance gains with the same ?
>>
>> Thanks,
>> Tejas
>>
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>


Nutch with YARN (aka Hadoop 2.0)

2013-12-08 Thread Tejas Patil
Has anyone tried out running Nutch over YARN ? If so, were there any
performance gains with the same ?

Thanks,
Tejas


[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2013-08-11 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736459#comment-13736459
 ] 

Tejas Patil commented on NUTCH-1325:


Hi [~markus17], 

> I think I've got a slightly newer version of the tools but don't know what 
> actually changed in the past year. I'll try to diff and upload it.

Could you kindly upload the newer version ?

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1599) Obtain consensus on new description of Nutch

2013-07-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699115#comment-13699115
 ] 

Tejas Patil commented on NUTCH-1599:


I agree with Julien: Nutch should be described as a web-crawler. Markus took it 
to the next level by adding more technicality :) So "Highly extensible and 
scalable web crawler software" it is !!

> Obtain consensus on new description of Nutch
> 
>
> Key: NUTCH-1599
> URL: https://issues.apache.org/jira/browse/NUTCH-1599
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3, 1.8
>
>
> As we seem to be sustaining pushes and maintenance (touch wood) of two 
> branches, I think it is about time we agreed on a more accurate description 
> of what Nutch actually is.
> We currently have (taken directly from our site)
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from 
> Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a 
> crawler, a link-graph database and parsing support handled by Apache Tika for 
> HTML and and array other document formats.
> Nutch can run on a single machine, but gains a lot of its strength from 
> running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a 
> highly flexible, easily extensible and thoroughly maintained plugin 
> infrastructure.
> {code}
> I suggest/propose something along the lines of
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from 
> Apache Lucene, the community now develops and maintains two branches:
> * 1.x; description of 1.x here
> * 2.x; description of 2.x here
> Both branches add web-specifics, such as a crawler, a link-graph database and 
> parsing support handled by Apache Tika for HTML and an array of other document 
> formats.
> Nutch can run on a single machine, but gains a lot of its strength from 
> running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a 
> highly flexible, easily extensible and thoroughly maintained plugin 
> infrastructure.
> {code}
> Any thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699096#comment-13699096
 ] 

Tejas Patil commented on NUTCH-1602:


Hi Lufeng, 
+1 from me too. One minor suggestion: You could add space in between "=" and 
";" to make it even better.

> improve the readability of metadata in readdb dump normal 
> --
>
> Key: NUTCH-1602
> URL: https://issues.apache.org/jira/browse/NUTCH-1602
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1602.patch
>
>
> the dumped metadata format is not readable.
> {code:xml}
> $bin/nutch readdb crawldb/ -dump dir
> http://www.baidu.com/ Version: 7
> Status: 3 (db_gone)
> Fetch time: Sat Aug 17 22:35:37 CST 2013
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
> lastModified=0m6: v6
> {code}
> so I improve the Metadata format to this
> {code:xml}
> Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
> lastModified=0;m6=v6;
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696840#comment-13696840
 ] 

Tejas Patil commented on NUTCH-1327:


Hi Markus,

1. The patch, when applied as-is, didn't compile the plugin. I had to add 
entries into src/plugin/build.xml to get it compiled. 
2. Can you kindly add some javadoc comments in QuerystringURLNormalizer class 
so that people can quickly get an idea about what this plugin would do ?

> QueryStringNormalizer
> -
>
> Key: NUTCH-1327
> URL: https://issues.apache.org/jira/browse/NUTCH-1327
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1327-1.8-1.patch
>
>
> A normalizer for dealing with query strings. Sorting query strings is helpful 
> in preventing duplicates for some (bad) websites.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Adding nutch stage

2013-07-01 Thread Tejas Patil
On Mon, Jul 1, 2013 at 5:31 AM, Ahmet Emre Aladağ wrote:

> Hi,
>
> I'd like to add a new stage called "updatescore" after "updatedb" to Nutch
> 2.1.
>
> I tried two ways for this:
> 1) public class ScoreUpdaterJob extends NutchTool implements Tool;
>
> Nutch requires me to define the InputFormat, OutputFormat etc. to perform
> Map-reduce calculations.
>
> I don't want to perform map-reduce but call a Giraph job to run on Hadoop.
> When it's finished, Nutch can go on its way.
>

> 2) public class ScoreUpdaterJob implements Tool;
> or public class ScoreUpdaterJob;
>
> Then I can't use setJarByClass of NutchTool, so the hadoop job fails:
> Caused by: java.lang.ClassNotFoundException:
> org.apache.giraph.examples.LinkRank.LinkRankComputation
>

Isn't setJarByClass a method provided by Hadoop's Job itself, rather than
something provided by NutchTool?
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setJarByClass%28java.lang.Class%29

>
> How can I fix this? What's the best way to add a giraph job as a Nutch
> stage?
>

My feeling is that #2 should work.
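
With option #2 you can still ship the jar explicitly by calling setJarByClass
on the Job the tool builds. A rough sketch (the class name and job body are
placeholders, not working Giraph integration code):

{code:java}
// Rough sketch of option #2: a plain Tool that sets the jar explicitly.
// ScoreUpdaterJob and the job body are placeholders only.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ScoreUpdaterJob extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "updatescore");
    // Ship the jar containing the computation classes to the cluster;
    // leaving this out is a classic cause of ClassNotFoundException.
    job.setJarByClass(ScoreUpdaterJob.class);
    // ... configure inputs/outputs and the Giraph computation here ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new ScoreUpdaterJob(), args));
  }
}
{code}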


> Thanks,
>
>
>


Re: Updating the documentation for crawl via 2.x

2013-06-30 Thread Tejas Patil
I think that the wiki page was written with the assumption that users already
knew 1.x and would now be switching to 2.x, so it had only the Gora and
datastore setup steps. I agree with you that it should contain the complete
set of steps.

*@dev:* Unless there is any objection or a better suggestion, I will get
this done in the coming days.

On Sun, Jun 30, 2013 at 4:14 AM, Sznajder ForMailingList <
bs4mailingl...@gmail.com> wrote:

> Hi
>
> I think we may update the documentation of crawl instructions
>
> Currently, the instructions stop at the inject step.
>
> And we are supposed to follow the instructions in Nutch 1.x
>
> However, in these instructions the syntax is quite different.
> For example:
>
> bin/nutch generate
>
> does not expect the crawldb and segments paths etc...
>
> I think an update would be very useful.
>
> Benjamin
>


Re: [VOTE] Apache Nutch 2.2.1 RC#1

2013-06-27 Thread Tejas Patil
+1 from me too


On Thu, Jun 27, 2013 at 12:00 PM, Markus Jelsma wrote:

> Looks fine Lewis! +1
>
> -Original message-
> From: Lewis John Mcgibbney
> Sent: Thursday 27th June 2013 20:00
> To: dev@nutch.apache.org; u...@nutch.apache.org
> Subject: [VOTE] Apache Nutch 2.2.1 RC#1
>
> Hi,
>
> It would be greatly appreciated if you could take some time to VOTE on the
> release candidate for the Apache Nutch 2.2.1 artifacts. This candidate is
> (amongst other things) a bug fix for NUTCH-1591 - Incorrect conversion of
> ByteBuffer to String.
>
> The bug fix release solved 8 issues:
> http://s.apache.org/PGa 
>
> SVN source tag:
> http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1 <
> http://svn.apache.org/repos/asf/nutch/tags/release-2.2.1>
>
> Staging repo:
> https://repository.apache.org/content/repositories/orgapachenutch-082/ <
> https://repository.apache.org/content/repositories/orgapachenutch-082/>
>
> Release artifacts:
> http://people.apache.org/~lewismc/nutch/nutch2.2.1 <
> http://people.apache.org/~lewismc/nutch/nutch2.2.1>
>
> PGP release keys (signed using 4E21557F):
> http://nutch.apache.org/dist/KEYS 
>
> Vote will be open for at least 72 hours.
>
> As RM, I would like to apologize for the flurry of releases recently. I
> understand that reviewing these candidates is not a trivial task; however,
> it is really encouraging to see us, as a vibrant community, able to push
> releases of this caliber on a regular basis.
>
> Best, have a great weekend when it comes around.
>
> Lewis
>
> [ ] +1, let's get it released!!!
> [ ] +/-0, fine, but consider fixing a few issues first...
> [ ] -1, nope, because... (and please explain why)
>
> p.s. heres my +1
>
> --
> Lewis
>
>
>


[jira] [Commented] (NUTCH-1126) JUnit test for urlfilter-prefix

2013-06-24 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692069#comment-13692069
 ] 

Tejas Patil commented on NUTCH-1126:


Thanks Talat and Cihad :)

One small thing: @author tags should not be used in Apache projects - see
http://mail-archives.apache.org/mod_mbox/www-community/200402.mbox/%3c403a144a.5040...@apache.org%3E
Please remove them when submitting patches.
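
For whoever tackles the remaining plugins: the shape of such a test is small.
A hedged sketch (JUnit 4; the filter() helper below is a local stand-in,
since wiring up the real plugin needs a prefix file in the test config):

{code:java}
// Hedged sketch of the intended kind of test (JUnit 4). The filter()
// helper is a local stand-in; a real test must instantiate the plugin's
// filter with a prefix file from the test configuration.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public class TestPrefixFilterSketch {

  // Mimics URLFilter.filter(): return the URL if it passes, else null.
  static String filter(String url, List<String> prefixes) {
    for (String prefix : prefixes) {
      if (url.startsWith(prefix)) {
        return url;
      }
    }
    return null;
  }

  @Test
  public void testPrefixFiltering() {
    List<String> prefixes = Arrays.asList("http://nutch.apache.org/");
    assertEquals("http://nutch.apache.org/bot.html",
        filter("http://nutch.apache.org/bot.html", prefixes));
    assertNull(filter("http://example.com/", prefixes));
  }
}
{code}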

> JUnit test for urlfilter-prefix
> ---
>
> Key: NUTCH-1126
> URL: https://issues.apache.org/jira/browse/NUTCH-1126
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: test_case_for_urlfilter-prefix.patch
>
>
> This issue is part of the larger attempt to provide a Junit test case for 
> every Nutch plugin.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [VOTE] Apache Nutch 1.7 Release Candidate

2013-06-22 Thread Tejas Patil
+1 from me


On Fri, Jun 21, 2013 at 11:51 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Apologies. I had seen only that the status had changed to closed but
> removing the fix version definitely did the trick.
>
> +1 for releasing
>
> Thanks Lewis
>
>
> On 21 June 2013 21:46, Lewis John Mcgibbney wrote:
>
>> I merely removed the fix version numbers from the issues.
>> Anyone is free to amend the various issues and do this now. I am on
>> mobile and Jira does not render well in my browser.
>> Can you please see to this?
>> Thank you
>>
>>
>> On Fri, Jun 21, 2013 at 12:54 PM, Julien Nioche <
>> lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi Lewis
>>>
>>> I don't think you got my comment. I was saying that "won't fix" is the
>>> right resolution for these issues, but that the report should not include
>>> them. In the case of these issues, people might get the wrong idea by
>>> looking at the report and think that we included the mongodb-related
>>> stuff in the next release, which we didn't.
>>>
>>> J.
>>>
>>>
>>> On 21 June 2013 19:41, Lewis John Mcgibbney wrote:
>>>
>>> > Hi Julien,
>>> > Done, thanks for the attention to detail.
>>> > I wonder if you got to check sigs as well? I have been dancing between
>>> > machines and it would be excellent to verify.
>>> > Thank you v much.
>>> > Lewis
>>> >
>>> >
>>> > On Fri, Jun 21, 2013 at 1:47 AM, Julien Nioche <
>>> > lists.digitalpeb...@gmail.com> wrote:
>>> >
>>> > > Hi Lewis
>>> > >
>>> > > The release notes [
>>> > >
>>> > >
>>> >
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12323281
>>> > > ]
>>> > > list issues marked as won't fix which is probably not a great idea.
>>> For
>>> > > instance it lists
>>> > > *- Port nutch-mongodb-indexer to Nutch*
>>> > > which is a won't fix but people could get the impression that it has
>>> been
>>> > > committed. Would be good to fix it if we can.
>>> > >
>>> > > The code compiles and passes the test. +1 to release
>>> > >
>>> > > Thanks
>>> > >
>>> > > Julien
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > > On 20 June 2013 22:48, lewis john mcgibbney wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > Please VOTE on the release of the Apache Nutch 1.7 artifacts.
>>> > > >
>>> > > > As always, we solved a bunch of issues:
>>> > > > http://s.apache.org/1zE
>>> > > >
>>> > > > SVN source tag:
>>> > > > http://svn.apache.org/repos/asf/nutch/tags/release-1.7/
>>> > > >
>>> > > > Staging repo:
>>> > > >
>>> https://repository.apache.org/content/repositories/orgapachenutch-044/
>>> > > >
>>> > > > Release artifacts:
>>> > > > http://people.apache.org/~lewismc/nutch/nutch1.7/
>>> > > >
>>> > > > PGP release keys (signed using 4E21557F):
>>> > > > http://nutch.apache.org/dist/KEYS
>>> > > >
>>> > > > Vote will be open for at least 72 hours.
>>> > > >
>>> > > > I would like to say a huge thanks to all contributors and
>>> > > > committers from far and wide who helped with this release. Another
>>> > > > string to add to the Nutch bow.
>>> > > >
>>> > > > Best
>>> > > >
>>> > > > Lewis
>>> > > >
>>> > > > [ ] +1, let's get it released!!!
>>> > > > [ ] +/-0, fine, but consider fixing a few issues first...
>>> > > > [ ] -1, nope, because... (and please explain why)
>>> > > >
>>> > > > p.s. here's my +1
>>> > > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Open Source Solutions for Text Engineering
>>> > >
>>> > > http://digitalpebble.blogspot.com/
>>> > > http://www.digitalpebble.com
>>> > > http://twitter.com/digitalpebble
>>> > >
>>> >
>>> >
>>> >
>>> > --
>>> > *Lewis*
>>> >
>>>
>>>
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>


Re: right place to put wiki images

2013-06-11 Thread Tejas Patil
As per Seb's suggestion, I have corrected the wiki in several places.

The images on the Admin UI proposal page are lost, as they were hosted
somewhere else and that site is down now :(
http://wiki.apache.org/nutch/NutchAdministrationUserInterface


On Tue, Jun 11, 2013 at 11:14 AM, Tejas Patil  wrote:

> Currently, we don't have many images in the Nutch wiki. Here are a few
> places where I could find images:
>
> Nutch logo on main wiki page is from external server:
> http://www.interadvertising.co.uk/files/nutch_logo_medium.gif
>
> The images on the Admin UI proposal page are lost, as they were hosted
> somewhere else and the site is down now... those images are gone :(
> http://wiki.apache.org/nutch/NutchAdministrationUserInterface
>
> I have uploaded these images over imageshack server:
> http://wiki.apache.org/nutch/RunNutchInEclipse
>
> Should we have these images in our SVN repo alongside forrest [0] so that
> we have control over them?
>
> [0] : https://svn.apache.org/repos/asf/nutch/site
>
> Thanks,
> Tejas Patil
>

