[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-03-06 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13922724#comment-13922724
 ] 

Tejas Patil commented on NUTCH-1325:


It will take me a few weeks before I can work on this one. The reason: I 
have recently left school and started working at a company. There is some legal 
paperwork that I have to finish before I can work on open source projects (even 
if it's during my free time).

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-02-09 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1721.


Resolution: Fixed

Committed to trunk (rev 1566255) and 2.x (rev 1566257)

> Upgrade to Crawler commons 0.3
> --
>
> Key: NUTCH-1721
> URL: https://issues.apache.org/jira/browse/NUTCH-1721
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.7, 2.2, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-01-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887784#comment-13887784
 ] 

Tejas Patil commented on NUTCH-1721:


Attached patches, all test cases are passing.

> Upgrade to Crawler commons 0.3
> --
>
> Key: NUTCH-1721
> URL: https://issues.apache.org/jira/browse/NUTCH-1721
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.7, 2.2, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch
>
>






[jira] [Updated] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-01-31 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1721:
---

Attachment: NUTCH-1721-2.x.patch
NUTCH-1721-trunk.patch

> Upgrade to Crawler commons 0.3
> --
>
> Key: NUTCH-1721
> URL: https://issues.apache.org/jira/browse/NUTCH-1721
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.7, 2.2, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1721-2.x.patch, NUTCH-1721-trunk.patch
>
>






[jira] [Created] (NUTCH-1721) Upgrade to Crawler commons 0.3

2014-01-31 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1721:
--

 Summary: Upgrade to Crawler commons 0.3
 Key: NUTCH-1721
 URL: https://issues.apache.org/jira/browse/NUTCH-1721
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2.1, 2.2, 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 2.3, 1.8








[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887763#comment-13887763
 ] 

Tejas Patil commented on NUTCH-1465:


Re "filters and normalizers": +1.

Re "fetch intervals" and "reducer overwriting": I have never encountered bogus 
sitemaps, but that was for an intranet crawl, and it would be better to take care 
of that in this jira. Here is what I conclude from the discussion so far:
(1)  _fetch interval_: For old entries, don't use the value from the sitemap. For 
new ones, use the value from the sitemap provided 
(db.fetch.schedule.adaptive.min_interval <= interval <= db.fetch.interval.max)
(2) _score_: Never use the value from the sitemap. For new entries, use scoring 
filters. Keep the value of old entries as it is.
(3) _modified time_: Always use the value from the sitemap, provided it's not a 
date in the future.

Did I get it right ?
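
The three rules above could be sketched roughly as follows. This is a minimal illustration, not the patch: `Entry` is a hypothetical stand-in for a CrawlDb record (it is not Nutch's CrawlDatum), all parameter names are made up for the example, and falling back to a default interval when the sitemap value is missing or out of range is an assumption that the discussion does not settle.

```java
/** Toy model of the three merge rules; names and types are illustrative only. */
final class SitemapMergeSketch {

  /** Hypothetical, simplified record of what is stored per URL. */
  static final class Entry {
    final int fetchInterval;   // seconds
    final float score;
    final long modifiedTime;   // epoch millis, 0 if unknown

    Entry(int fetchInterval, float score, long modifiedTime) {
      this.fetchInterval = fetchInterval;
      this.score = score;
      this.modifiedTime = modifiedTime;
    }
  }

  /**
   * @param old             existing entry, or null if the URL is new
   * @param smInterval      fetch interval from the sitemap (seconds), or null if absent
   * @param smModified      modified time from the sitemap (epoch millis), or null if absent
   * @param minInterval     stands in for db.fetch.schedule.adaptive.min_interval
   * @param maxInterval     stands in for db.fetch.interval.max
   * @param defaultInterval assumed fallback interval for new entries
   * @param filterScore     score the scoring filters would give a new entry (assumed input)
   * @param now             current time (epoch millis)
   */
  static Entry merge(Entry old, Integer smInterval, Long smModified,
                     int minInterval, int maxInterval, int defaultInterval,
                     float filterScore, long now) {
    // (1) fetch interval: old entries keep theirs; new ones take the sitemap
    //     value only if it lies within [minInterval, maxInterval].
    int interval;
    if (old != null) {
      interval = old.fetchInterval;
    } else if (smInterval != null && smInterval >= minInterval && smInterval <= maxInterval) {
      interval = smInterval;
    } else {
      interval = defaultInterval; // assumption: fall back to a default
    }

    // (2) score: never taken from the sitemap.
    float score = (old != null) ? old.score : filterScore;

    // (3) modified time: take the sitemap value unless it is absent or in the future.
    long modified = (old != null) ? old.modifiedTime : 0L;
    if (smModified != null && smModified <= now) {
      modified = smModified;
    }

    return new Entry(interval, score, modified);
  }
}
```

For example, a new URL with a sitemap interval of 3600 seconds would keep that interval, while an existing entry would keep its stored interval and score regardless of what the sitemap says.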
 
Re "score": I missed that the jar is old. I will file a jira to upgrade CC to 
v0.3 in Nutch.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886677#comment-13886677
 ] 

Tejas Patil commented on NUTCH-1465:


Interesting comments [~wastl-nagel].

Re "filters and normalizers" : By default I have kept those ON, but they can be 
disabled by using "-noFilter" and "-noNormalize".
Re "default content limits" and "fetch timeout": +1. Agree with you.
Re "Processing sitemap indexes fails" : +1. Nice catch.
Re "Fetch intervals of 1 second or 1 hour may cause troubles" : Currently, 
Injector allows users to provide a custom fetch interval with any value, e.g. 1 
sec. It makes sense not to correct it, as the user wants Nutch to use that custom 
fetch interval. If we view sitemaps as a custom seed list given by a content 
owner, then it would make sense to follow the intervals. But as you said, 
sitemaps can be wrongly set or outdated, so the intervals might be incorrect. The 
question boils down to: We are blindly accepting the user's custom information in 
inject. Should we blindly assume that sitemaps are correct or not ? I have no 
strong opinion about either side of the argument. 

(PS : Default 'db.fetch.schedule.adaptive.min_interval' is 1 min so would allow 
1 hr as per db.fetch.schedule.adaptive.min_interval <= interval)

Re "SitemapReducer overwriting" : 
>> _"If a sitemap does not specify one of score, modified time, or fetch 
>> interval this values is set to zero. "_
Nope. See 
[SiteMapURL.java|https://code.google.com/p/crawler-commons/source/browse/trunk/src/main/java/crawlercommons/sitemaps/SiteMapURL.java]

 (a) score : Crawler commons assigns a default score of 0.5 if there was none 
provided in the sitemap. 
We can do this: If an old entry has a score other than 0.5, it can be preserved, 
else update. For a new entry, use scoring plugins for a score equal to 0.5, else 
preserve the same. 
Limitation: It's not possible to distinguish whether a score of 0.5 came from the 
sitemap or is the default assigned because none was provided.
 (b) fetch interval : Crawler commons does NOT set a fetch interval if there was 
none provided in the sitemap. So we are sure that whatever value is used is coming 
from the sitemap. Validation might be needed as per the comments above.
 (c) modified time : Same as fetch interval: unless parsed from the sitemap file, 
modified time is set to NULL. The only possible validation is to drop values 
greater than the current time.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1718) update description of property http.robots.agent

2014-01-29 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885650#comment-13885650
 ] 

Tejas Patil commented on NUTCH-1718:


Hi [~someuser77], Yup. I am waiting for folks to comment if that addition is 
fine. If it is, then I would go ahead and update the description of this jira.

> update description of property http.robots.agent
> 
>
> Key: NUTCH-1718
> URL: https://issues.apache.org/jira/browse/NUTCH-1718
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2, 2.2.1
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1718-trunk.v1.patch
>
>
> The description of property http.robots.agent in nutch-default.xml recommends 
> to add a '*' to the list of agent names. This will cause the same problem as 
> described in NUTCH-1715. The description should be updated. Also regarding 
> "order of precedence" which is dictated since NUTCH-1031 only by ordering of 
> user agents in robots.txt.
> {code:xml}
> <property>
>   <name>http.robots.agents</name>
>   <value>*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> {code}





[jira] [Updated] (NUTCH-1718) update description of property http.robots.agent

2014-01-28 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1718:
---

Attachment: NUTCH-1718-trunk.v1.patch

Thanks [~wastl-nagel] for bringing this up. I should have updated the 
documentation with NUTCH-1715 but lost track of it.

In addition to updating the documentation, I am proposing this: 
Instead of making users put 'http.agent.name' first in 
'http.robots.agents', make the program do that automatically. Users would then 
use 'http.robots.agents' only to specify any additional agents apart from 
'http.agent.name'. Here is a patch for the same.
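
A rough sketch of the proposed behaviour, for illustration only and not taken from the attached patch: the class and method names are hypothetical, and dropping a literal '*' from the list is an assumption based on the problem described in NUTCH-1715.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative helper; not the code from NUTCH-1718-trunk.v1.patch. */
final class RobotAgentNamesSketch {

  /**
   * Builds the ordered list of agent names: the value of http.agent.name
   * always comes first, followed by any extra names from http.robots.agents.
   */
  static List<String> agentNames(String agentName, String robotsAgents) {
    // LinkedHashSet keeps insertion order and silently drops duplicates.
    Set<String> names = new LinkedHashSet<>();
    names.add(agentName.trim());
    if (robotsAgents != null) {
      for (String extra : robotsAgents.split(",")) {
        String name = extra.trim();
        // Assumption: skip blanks and the '*' wildcard (see NUTCH-1715).
        if (!name.isEmpty() && !name.equals("*")) {
          names.add(name);
        }
      }
    }
    return new ArrayList<>(names);
  }
}
```

With http.agent.name set to BlurflDev and http.robots.agents set to "Blurfl, BlurflDev,*", this sketch would yield BlurflDev followed by Blurfl.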

> update description of property http.robots.agent
> 
>
> Key: NUTCH-1718
> URL: https://issues.apache.org/jira/browse/NUTCH-1718
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2, 2.2.1
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1718-trunk.v1.patch
>
>
> The description of property http.robots.agent in nutch-default.xml recommends 
> to add a '*' to the list of agent names. This will cause the same problem as 
> described in NUTCH-1715. The description should be updated. Also regarding 
> "order of precedence" which is dictated since NUTCH-1031 only by ordering of 
> user agents in robots.txt.
> {code:xml}
> <property>
>   <name>http.robots.agents</name>
>   <value>*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> {code}





[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-28 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v5.patch

Adding new patch 'v5' with the changes below:
1. Added Apache license header as per review comment by [~wastl-nagel]
2. Added counters in log output as per review comment by [~wastl-nagel]
3. Implemented the change suggested by [~wastl-nagel] for 'isHost' and 
'filterNormalize'. I could do more refactoring to make it cleaner.
4. Added a new parameter "-noStrict" to control the checking done by the sitemap 
parser 

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883204#comment-13883204
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],
Thanks a lot for your comments. The first two were straightforward and I agree 
with those.

Re "hacky way" : For hosts from the HostDb, we don't know which protocol they 
belong to. In the code I was checking if http:// is a match and, if that was a 
bad guess, then trying with https://. I didn't handle the ftp:// and file:/ 
schemes. By "hacky" I meant this approach of trial and error until a suitable 
match is found and we create a homepage url for the host. I have thought about 
your comment and will have a better (yet hacky) way in the coming patch.
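
The trial-and-error approach described above can be pictured with a small sketch. It is not the code from the patch: the class name, the trailing "/" on the homepage url, and the `matches` predicate (standing in for whatever check decides that a guess is good) are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Illustrative only: guesses a homepage url for a host with an unknown scheme. */
final class HomepageGuessSketch {

  // Schemes are tried in this order; ftp:// and file:/ are deliberately not handled.
  private static final List<String> SCHEMES = Arrays.asList("http://", "https://");

  /**
   * @param host    bare host name, e.g. "example.org"
   * @param matches hypothetical check that tells whether a candidate url is a good guess
   * @return the first candidate accepted by the check, or empty if none is
   */
  static Optional<String> guessHomepage(String host, Predicate<String> matches) {
    for (String scheme : SCHEMES) {
      String candidate = scheme + host + "/";
      if (matches.test(candidate)) {
        return Optional.of(candidate);
      }
    }
    return Optional.empty();
  }
}
```

A host for which only the https:// guess is accepted would come back as "https://" + host + "/", and a host for which no guess is accepted would produce an empty result.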

Re "concurrency": I had thought of this and had searched the internet for the 
internals of MultithreadedMapper. All I could find is that it has an internal 
thread pool and each input record is handed over to a thread in this pool to 
run map() over it. I wrote this code to check if thread safety was ensured in 
MultithreadedMapper:

{noformat}
  private static class SitemapMapper extends Mapper {
    // Instance field written and read back by foo(); a mismatch would mean
    // another thread touched this mapper instance in between.
    private String myurl = null;

    public void map(Text key, Writable value, Context context)
        throws IOException, InterruptedException {
      if (value instanceof Text) {
        String url = key.toString();
        if (!foo(url).equals(url)) {
          LOG.warn("Race condition found !!!");
        }
      }
    }

    private String foo(String url) {
      myurl = url;
      // Pause on odd-numbered threads to widen the window for interference.
      if (Thread.currentThread().getId() % 2 == 1) {
        try {
          Thread.sleep(1);
        } catch (InterruptedException e) {
          LOG.warn(e.getMessage());
        }
      }
      return myurl;
    }
  }
{noformat}

I ran it multiple times with threads set to 10, 100, 1000 and 2000 but never 
hit the race condition in the code. Is the code snippet above a good way to 
reveal any race condition in the code ? It won't be a formal conclusion, 
more of an experimental one. How do I get a concrete conclusion on whether 
MultithreadedMapper is thread safe or not ?
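
One way to sanity-check the probe itself, outside Hadoop, is to run the same write, pause, read-back pattern from plain Java threads, once against a single shared object and once with a separate object per thread. The sketch below does that; `Probe`, `countMismatches` and the thread counts are made up for illustration and nothing here uses the Hadoop API. If the probe reports mismatches only in the shared case, then a clean run inside MultithreadedMapper would be consistent with each worker thread running its own mapper instance, which is worth confirming against the Hadoop source for the version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Plain-Java illustration of the probe; it does not use or model Hadoop classes. */
final class RaceProbeSketch {

  /** Same idea as foo() above: write a field, pause, then read it back. */
  static final class Probe {
    private volatile String myurl;

    boolean roundTripIntact(String url) throws InterruptedException {
      myurl = url;
      Thread.sleep(1); // widen the window in which another thread may overwrite myurl
      return url.equals(myurl);
    }
  }

  /** Runs the probe from several threads and counts calls that read back a foreign value. */
  static int countMismatches(int threads, int callsPerThread, boolean shareInstance) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      Probe shared = new Probe();
      List<Future<Integer>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        final int id = t;
        // Either every task uses the same object, or each task gets its own.
        final Probe probe = shareInstance ? shared : new Probe();
        Callable<Integer> task = () -> {
          int bad = 0;
          for (int i = 0; i < callsPerThread; i++) {
            if (!probe.roundTripIntact("url-" + id + "-" + i)) {
              bad++;
            }
          }
          return bad;
        };
        results.add(pool.submit(task));
      }
      int total = 0;
      for (Future<Integer> result : results) {
        total += result.get();
      }
      return total;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

With `shareInstance` set to false the count is always zero, because no two threads ever touch the same field. With it set to true, mismatches are very likely but not guaranteed on any single run, which is also why an experiment of this kind can show that a race exists but cannot prove that one does not.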

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1084) ReadDB url throws exception

2014-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882771#comment-13882771
 ] 

Tejas Patil commented on NUTCH-1084:


The issue gets reproduced on current trunk. Attaching a test segment :
https://issues.apache.org/jira/secure/attachment/12625275/20140126210858.tgz

The workaround suggested by [~markus17] in comment above works correctly.

> ReadDB url throws exception
> ---
>
> Key: NUTCH-1084
> URL: https://issues.apache.org/jira/browse/NUTCH-1084
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.3
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop version
> 2. throws can't find class: org.apache.nutch.protocol.ProtocolStatus (???)
> The first problem can be remedied by not allowing the injector or updater to 
> write the _SUCCESS file. Until now that's the solution implemented for 
> similar issues. I've not been successful as to make the Hadoop readers simply 
> skip the file.
> The second issue seems a bit strange and did not happen on a local check out. 
> I'm not yet sure whether this is a Hadoop issue or something being corrupt in 
> the CrawlDB. Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class: 
> org.apache.nutch.protocol.ProtocolStatus because 
> org.apache.nutch.protocol.ProtocolStatus
> at 
> org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
> at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
> at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
> at 
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
> at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
> at 
> org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
> at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
> at 
> org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
> at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}





[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882770#comment-13882770
 ] 

Tejas Patil commented on NUTCH-1692:


Hi [~markus17], 
I didn't know about NUTCH-1084 until now, and after going through it I totally 
agree that the exception I faced was due to that issue. With that workaround 
and the patch for this jira, the NPE issue seems fixed. +1 for commit.

> SegmentReader broken in distributed mode
> 
>
> Key: NUTCH-1692
> URL: https://issues.apache.org/jira/browse/NUTCH-1692
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch
>
>
> SegmentReader -list option ignores the -no* options, causing the following 
> exception in distributed mode:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
> at java.util.Arrays.sort(Arrays.java:472)
> at 
> org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
> {code}





[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-26 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v4.patch

Attaching v4 patch with the suggestions #1 and #2 from [~lewismc].

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Updated] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-26 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1692:
---

Attachment: 20140126210858.tgz

Attaching the test segment (20140126210858.tgz)

> SegmentReader broken in distributed mode
> 
>
> Key: NUTCH-1692
> URL: https://issues.apache.org/jira/browse/NUTCH-1692
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: 20140126210858.tgz, NUTCH-1692-trunk.patch
>
>
> SegmentReader -list option ignores the -no* options, causing the following 
> exception in distributed mode:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
> at java.util.Arrays.sort(Arrays.java:472)
> at 
> org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
> {code}





[jira] [Commented] (NUTCH-1692) SegmentReader broken in distributed mode

2014-01-26 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882348#comment-13882348
 ] 

Tejas Patil commented on NUTCH-1692:


Hi [~markus17],
I tried out the patch on the latest trunk checkout and it ran fine in local 
mode. In deploy mode, I encountered this:
{noformat}
$ bin/nutch readseg -list 20140126210858/ -nocontent -nogenerate
14/01/26 22:26:16 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/01/26 22:26:16 INFO zlib.ZlibFactory: Successfully loaded & initialized 
native-zlib library
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
14/01/26 22:26:16 INFO compress.CodecPool: Got brand-new decompressor
Exception in thread "main" java.io.IOException: can't find class: 
org.apache.nutch.protocol.ProtocolStatus because 
org.apache.nutch.protocol.ProtocolStatus
at 
org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:280)
at 
org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1813)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1941)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:517)
at 
org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:485)
at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:597)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
{noformat}

> SegmentReader broken in distributed mode
> 
>
> Key: NUTCH-1692
> URL: https://issues.apache.org/jira/browse/NUTCH-1692
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8
>
> Attachments: NUTCH-1692-trunk.patch
>
>
> SegmentReader -list option ignores the -no* options, causing the following 
> exception in distributed mode:
> {code}
> Exception in thread "main" java.lang.NullPointerException
> at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
> at java.util.Arrays.sort(Arrays.java:472)
> at 
> org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
> at 
> org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
> at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
> at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
> {code}





[jira] [Resolved] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1715.


Resolution: Fixed

The change was verified over nutch-user mailing list. Committed to trunk 
(revision 1561087) and 2.x (revision 1561088).

> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1715.2.x.patch, NUTCH-1715.trunk.patch
>
>
> In RobotRulesParser, when Nutch creates a agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard (ie. *) to it in the end. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in robots file with (User-agent: *) if that comes before any other matching 
> rule thus resulting in a allowed url being robots denied. 
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E





[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1715:
---

Attachment: NUTCH-1715.2.x.patch
NUTCH-1715.trunk.patch

> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1715.2.x.patch, NUTCH-1715.trunk.patch
>
>
> In RobotRulesParser, when Nutch creates a agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> Along with that it appends a wildcard (ie. *) to it in the end. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in robots file with (User-agent: *) if that comes before any other matching 
> rule thus resulting in a allowed url being robots denied. 
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E





[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1715:
---

Description: 
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. Along 
with that, it appends a wildcard (i.e. *) at the end. This is sent to 
crawler commons while parsing the rules. The wildcard gets matched first in 
the robots file with (User-agent: *) if that comes before any other matching rule, 
thus resulting in an allowed url being robots denied. 
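
The effect described above can be pictured with a toy model. This is not crawler commons code and makes no claim about its internals: it simply assumes a parser that walks the User-agent groups in file order and stops at the first one whose name appears in the crawler's list of names. All class and method names are hypothetical.

```java
import java.util.List;

/** Toy model of first-match-wins group selection; not the crawler commons parser. */
final class WildcardMatchSketch {

  /**
   * @param fileAgents   User-agent values of the robots.txt groups, in file order
   * @param crawlerNames names the crawler passes to the parser
   * @return index of the first group whose name is in crawlerNames, or -1 if none
   */
  static int firstMatchingGroup(List<String> fileAgents, List<String> crawlerNames) {
    for (int i = 0; i < fileAgents.size(); i++) {
      if (crawlerNames.contains(fileAgents.get(i))) {
        return i;
      }
    }
    return -1;
  }

  /** Prefers a group that names the crawler; uses "User-agent: *" only as a fallback. */
  static int namedGroupThenWildcard(List<String> fileAgents, List<String> crawlerNames) {
    int named = firstMatchingGroup(fileAgents, crawlerNames);
    return named >= 0 ? named : fileAgents.indexOf("*");
  }
}
```

Under this model, a robots file whose groups are "*" followed by "mybot" selects the wildcard group (index 0) when the crawler's names are "mybot" and "*", even though a more specific group exists. When the crawler passes only "mybot", the specific group (index 1) is selected and the wildcard group serves purely as a fallback.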

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E

  was:
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. It also 
appends a wildcard '*' to the end of that string. This is sent to crawler 
commons while parsing the rules. The wildcard '*' added to the end gets matched 
with the first rule in the robots file, which results in the URL being denied 
by robots even though robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E


> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> It also appends a wildcard (i.e. *) to the end of that string. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in the robots file with (User-agent: *) if that comes before any other 
> matching rule, which results in an allowed URL being denied by robots. 
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E





[jira] [Resolved] (NUTCH-1716) RobotRulesParser adds extra '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1716.


Resolution: Duplicate

Accidentally duplicated NUTCH-1715

> RobotRulesParser adds extra '*' to the robots name
> --
>
> Key: NUTCH-1716
> URL: https://issues.apache.org/jira/browse/NUTCH-1716
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> It also appends a wildcard (i.e. *) to the end of that string. This is sent 
> to crawler commons while parsing the rules. The wildcard gets matched first 
> in the robots file with (User-agent: *) if that comes before any other 
> matching rule, which results in an allowed URL being denied by robots. 
> This bug was reported by @Markus Jelsma. The discussion over nutch-user can 
> be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
>  





[jira] [Updated] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1715:
---

Description: 
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. It also 
appends a wildcard '*' to the end of that string. This is sent to crawler 
commons while parsing the rules. The wildcard '*' added to the end gets matched 
with the first rule in the robots file, which results in the URL being denied 
by robots even though robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E

  was:
In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. It also 
appends a wildcard (*) to the end of that string. This is sent to crawler 
commons while parsing the rules. The wildcard (*) added to the end gets matched 
with the first rule in the robots file, which results in the URL being denied 
by robots even though robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E


> RobotRulesParser adds additional '*' to the robots name
> ---
>
> Key: NUTCH-1715
> URL: https://issues.apache.org/jira/browse/NUTCH-1715
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7, 2.2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
>
> In RobotRulesParser, when Nutch creates an agent string from multiple agents, 
> it combines agents from both 'http.agent.name' and 'http.robots.agents'. 
> It also appends a wildcard '*' to the end of that string. This is sent to 
> crawler commons while parsing the rules. The wildcard '*' added to the end 
> gets matched with the first rule in the robots file, which results in the URL 
> being denied by robots even though robots.txt actually allows it.
> This issue was reported by [~markus17]. The discussion over nutch-user is 
> here:
> http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E





[jira] [Created] (NUTCH-1716) RobotRulesParser adds extra '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1716:
--

 Summary: RobotRulesParser adds extra '*' to the robots name
 Key: NUTCH-1716
 URL: https://issues.apache.org/jira/browse/NUTCH-1716
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.2.1, 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 2.3, 1.8


In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. It also 
appends a wildcard (i.e. *) to the end of that string. This is sent to 
crawler commons while parsing the rules. The wildcard gets matched first in 
the robots file with (User-agent: *) if that comes before any other matching 
rule, which results in an allowed URL being denied by robots. 

This bug was reported by @Markus Jelsma. The discussion over nutch-user can be 
found here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
 





[jira] [Created] (NUTCH-1715) RobotRulesParser adds additional '*' to the robots name

2014-01-24 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1715:
--

 Summary: RobotRulesParser adds additional '*' to the robots name
 Key: NUTCH-1715
 URL: https://issues.apache.org/jira/browse/NUTCH-1715
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.2.1, 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 2.3, 1.8


In RobotRulesParser, when Nutch creates an agent string from multiple agents, it 
combines agents from both 'http.agent.name' and 'http.robots.agents'. It also 
appends a wildcard (*) to the end of that string. This is sent to crawler 
commons while parsing the rules. The wildcard (*) added to the end gets matched 
with the first rule in the robots file, which results in the URL being denied 
by robots even though robots.txt actually allows it.

This issue was reported by [~markus17]. The discussion over nutch-user is here:
http://mail-archives.apache.org/mod_mbox/nutch-user/201401.mbox/%3CCAFKhtFzBRpVv4MULSxw8RDRR_wbivOt%3DnhFX-w621BR8q%2BxVDQ%40mail.gmail.com%3E
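
The effect described above can be illustrated with a minimal sketch. It assumes 
a toy first-match evaluator written only for this note: the class 
WildcardAgentSketch and the method isAllowed are hypothetical and are not the 
crawler-commons or Nutch code. It only shows why including '*' in the list of 
agent names may let an earlier (User-agent: *) group win over the group for 
the crawler's own name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Toy first-match robots.txt evaluator (hypothetical; not crawler-commons).
class WildcardAgentSketch {

    // Returns true if 'path' is allowed under the first group whose
    // User-agent value equals one of 'agentNames' (case-insensitive).
    // Simplification: the usual fallback to '*' is deliberately left out.
    static boolean isAllowed(String robotsTxt, List<String> agentNames, String path) {
        boolean inMatchedGroup = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:")) {
                if (inMatchedGroup) {
                    break; // the first matching group has ended
                }
                String agent = lower.substring("user-agent:".length()).trim();
                for (String name : agentNames) {
                    if (name.toLowerCase(Locale.ROOT).equals(agent)) {
                        inMatchedGroup = true;
                    }
                }
            } else if (inMatchedGroup && lower.startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                if (!prefix.isEmpty() && path.startsWith(prefix)) {
                    return false; // a Disallow rule of the matched group applies
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /\n\n"
                      + "User-agent: mybot\nDisallow: /private\n";
        // With the appended wildcard, the '*' group is matched first: prints false.
        System.out.println(isAllowed(robots, Arrays.asList("mybot", "*"), "/page.html"));
        // Without the wildcard, the 'mybot' group is used: prints true.
        System.out.println(isAllowed(robots, Arrays.asList("mybot"), "/page.html"));
    }
}
```

In this sketch the only difference between the two calls is the extra '*' in 
the agent list, which is the behaviour this issue proposes to remove.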





[jira] [Commented] (NUTCH-1676) Add rudimentary SSL support to protocol-http

2014-01-24 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881143#comment-13881143
 ] 

Tejas Patil commented on NUTCH-1676:


Hi [~markus17],
I tried out the patch with a couple of https URLs and it works correctly. A few 
comments on the patch:

(1) In src/plugin/protocol-http/plugin.xml, the same block is repeated twice. 
I am not sure whether that was accidental or whether the two were meant to differ.

{code:title=plugin.xml|borderStyle=solid}
+  
+  
+   
+  

+  
+   
+  
{code}

(2) In HttpBase.java: the values on this line run to column 2070, which makes 
the list painful to read. Is there any way to avoid that (maybe using a 
String array)?

{code:title=HttpBase.java|borderStyle=solid}
conf.getStrings("http.tls.supported.cipher.suites", 
"TLS_ECDHE_ECDSA_WITH_AES_256_CBC
{code}

(3) The class description is empty after the deletion of the author tag. Can 
you please fill it in?

{code:title=HttpBase.java|borderStyle=solid}
/**
 */
public abstract class HttpBase implements Protocol {
{code}
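
For point (2), one possible shape is sketched below. It is only an 
illustration: the class CipherSuiteDefaults is hypothetical, and the three 
cipher suite names are example entries, not the list from the patch (which is 
truncated above). The idea is to keep one name per line in a String array and 
join the array only where a single comma-separated value is needed.

```java
// Hypothetical holder for the default cipher suites, one name per line.
class CipherSuiteDefaults {

    // Example entries only; the actual list in the patch is much longer.
    static final String[] DEFAULT_CIPHER_SUITES = {
        "TLS_RSA_WITH_AES_128_CBC_SHA",
        "TLS_RSA_WITH_AES_256_CBC_SHA",
        "TLS_DHE_RSA_WITH_AES_128_CBC_SHA",
    };

    // Joins the array into the comma-separated form used in config values.
    static String asCommaSeparated() {
        return String.join(",", DEFAULT_CIPHER_SUITES);
    }

    public static void main(String[] args) {
        System.out.println(asCommaSeparated());
    }
}
```

Since Hadoop's Configuration.getStrings(String, String...) takes its default 
as varargs, an array like this could probably be passed to it directly, but 
that has not been tried against the patch.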

> Add rudimentary SSL support to protocol-http
> 
>
> Key: NUTCH-1676
> URL: https://issues.apache.org/jira/browse/NUTCH-1676
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.7
>Reporter: Julien Nioche
> Fix For: 1.8
>
> Attachments: NUTCH-1676-2x.patch, NUTCH-1676.patch, NUTCH-1676.patch, 
> NUTCH-1676.patch, NUTCH-1676.patch
>
>
> Adding https support to our http protocol would be a good thing even if it 
> does not handle the security. This would save us from having to use the 
> http-client plugin which is buggy in its current form. 
> Patch generated from 
> https://github.com/Aloisius/nutch/commit/d3e15a1db0eb323ccdcf5ad69a3d3a01ec65762c#commitcomment-4720772
> Needs testing...





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880295#comment-13880295
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~lewismc],
+1 for the first two suggestions. For #3: I skimmed through the methods inside 
URLUtil.java and did not notice anything that I could use in the Sitemap code 
you pointed to. Can you please confirm?

A big thanks, mate, for trying out the feature. Hopefully we get this into the 
1.8 release.
Cheers !!


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Fix Version/s: (was: 1.9)
   1.8

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880288#comment-13880288
 ] 

Tejas Patil commented on NUTCH-1712:


The performance gains from this patch won't be phenomenal for a small seeds 
file without any metadata and a large crawldb. The only savings with this patch 
are the time spent on:
1. dumping the output of the first job (i.e. datum objects for the seed urls)
2. reading this output as input for the next job
3. job launch and cleanup.

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E





[jira] [Resolved] (NUTCH-1164) Write JUnit tests for protocol-http

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1164.


Resolution: Fixed

The patch is better now and all tests pass. It needed a little modification: a 
fix for a string equality check (you can't compare strings using the equals 
sign) and some refactoring. Committed to 2.x (rev 1560786). 
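
The string equality point can be shown with a minimal, self-contained sketch. 
The class and method names are made up for this note and are not taken from 
the patch. In Java, == on two String references compares object identity, 
while equals() compares the characters.

```java
// Hypothetical demo class; not part of the NUTCH-1164 patch.
class StringEqualityDemo {

    // Identity comparison: true only if both references point to the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison: true if both strings hold the same characters.
    static boolean sameContent(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String expected = "text/html";
        String actual = new String("text/html"); // distinct object, same characters
        System.out.println(sameReference(expected, actual)); // prints false
        System.out.println(sameContent(expected, actual));   // prints true
    }
}
```

In a JUnit test, assertEquals(expected, actual) uses equals() and is therefore 
the safer check for string values.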

Thanks a lot for your contribution [~Sertac Turkel] !!

> Write JUnit tests for protocol-http
> ---
>
> Key: NUTCH-1164
> URL: https://issues.apache.org/jira/browse/NUTCH-1164
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.4
>
> Attachments: NUTCH-1164.patch, 
> TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1712:
---

Attachment: NUTCH-1712-trunk.v1.patch

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E





[jira] [Updated] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1712:
---

Description: 
Currently Injector creates two mapreduce jobs:
1. sort job: get the urls from seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both crawldb and output of sort job. 
Merge and emit final CrawlDatum objects.

Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from 
seeds file simultaneously and perform inject in a single map-reduce job.

Also, here are additional things covered with this jira:
1. Pushed filtering and normalization above metadata extraction so that the 
unwanted records are ruled out quickly.
2. Migrated to new mapreduce API
3. Improved documentation 
4. New junits with better coverage

Relevant discussion over nutch-dev can be found here:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E

  was:
Currently Injector creates two mapreduce jobs:
1. sort job: get the urls from seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both crawldb and output of sort job. 
Merge and emit final CrawlDatum objects.

Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from 
seeds file simultaneously and perform inject in a single map-reduce job.

Also, there are a few other things addressed in this patch:
1. Pushed filtering and normalization above metadata extraction so that the 
unwanted records are ruled out quickly.
2. Migrated to new mapreduce API
3. Improved documentation 
4. New junits with better coverage

Relevant discussion over nutch-dev can be found here:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E


> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>Reporter: Tejas Patil
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E





[jira] [Created] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2014-01-23 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1712:
--

 Summary: Use MultipleInputs in Injector to make it a single 
mapreduce job
 Key: NUTCH-1712
 URL: https://issues.apache.org/jira/browse/NUTCH-1712
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: 1.7
Reporter: Tejas Patil
Assignee: Tejas Patil
 Fix For: 1.8


Currently Injector creates two mapreduce jobs:
1. sort job: get the urls from seeds file, emit CrawlDatum objects.
2. merge job: read CrawlDatum objects from both crawldb and output of sort job. 
Merge and emit final CrawlDatum objects.

Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls from 
seeds file simultaneously and perform inject in a single map-reduce job.

Also, there are a few other things addressed in this patch:
1. Pushed filtering and normalization above metadata extraction so that the 
unwanted records are ruled out quickly.
2. Migrated to new mapreduce API
3. Improved documentation 
4. New junits with better coverage

Relevant discussion over nutch-dev can be found here:
http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E
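
The single-job idea can be sketched in plain Java without Hadoop. This is a 
simulation under stated assumptions, not the patch code: the class 
SingleJobInjectSketch, the string statuses and the rule that an existing 
crawldb entry wins over a newly injected one are all chosen for illustration. 
Both inputs emit (url, status) pairs into one grouping step, and one merge 
step then produces the final record per URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a one-pass inject (hypothetical; no Hadoop API used).
class SingleJobInjectSketch {

    static final String INJECTED = "injected";

    static Map<String, String> inject(List<String> seedUrls, Map<String, String> crawlDb) {
        // "Map" side: both inputs feed the same grouping, keyed by URL.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String url : seedUrls) {
            grouped.computeIfAbsent(url, k -> new ArrayList<>()).add(INJECTED);
        }
        for (Map.Entry<String, String> e : crawlDb.entrySet()) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        // "Reduce" side: one output per URL. Assumption for this sketch:
        // an existing crawldb status is kept in preference to "injected".
        Map<String, String> merged = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String chosen = INJECTED;
            for (String status : e.getValue()) {
                if (!INJECTED.equals(status)) {
                    chosen = status;
                }
            }
            merged.put(e.getKey(), chosen);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new TreeMap<>();
        crawlDb.put("http://b.example/", "fetched");
        crawlDb.put("http://c.example/", "unfetched");
        List<String> seeds = Arrays.asList("http://a.example/", "http://b.example/");
        System.out.println(inject(seeds, crawlDb));
    }
}
```

In the real job the two inputs would be registered with Hadoop's 
MultipleInputs, each with its own mapper, which is what removes the 
intermediate output between the sort job and the merge job.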





[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v3.patch

Now that HostDb (NUTCH-1325) is in trunk, updated the patch (v3). 
Also,
- included job counters
- more documentation
- added sitemap references in log4j.properties and bin/nutch script.

For usage, see https://wiki.apache.org/nutch/SitemapFeature 

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-01-22 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878623#comment-13878623
 ] 

Tejas Patil commented on NUTCH-1325:


Hi [~markus17], 
Thanks for the correction. This feature would not have existed without you in 
the first place. Apart from being a good addition to Nutch, HostDb has also 
helped in getting a simple design for the Sitemap feature (NUTCH-1465). 

Cheers !!! 

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Updated] (NUTCH-1164) Write JUnit tests for protocol-http

2014-01-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1164:
---

Attachment: TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt

Hi [~Sertac Turkel],
I tried out your patch and encountered a test case failure:
{noformat}
test:
 [echo] Testing plugin: protocol-http
[junit] Running org.apache.nutch.protocol.http.TestProtocolHttp
[junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 1.244 sec
[junit] Test org.apache.nutch.protocol.http.TestProtocolHttp FAILED
{noformat}

I have attached the test case failure log for reference.

> Write JUnit tests for protocol-http
> ---
>
> Key: NUTCH-1164
> URL: https://issues.apache.org/jira/browse/NUTCH-1164
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>  Labels: test
> Fix For: 2.4
>
> Attachments: NUTCH-1158.patch, 
> TEST-org.apache.nutch.protocol.http.TestProtocolHttp.txt
>
>
> This issue should provide a single Junit test as part of an effort to provide 
> JUnit tests for all nutchgora plugins





[jira] [Resolved] (NUTCH-1325) HostDB for Nutch

2014-01-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1325.


   Resolution: Fixed
Fix Version/s: (was: 1.9)
   1.8

Thanks [~markus17] for the heads up :) I have committed the patch to trunk (rev 
1560316).

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Assigned] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil reassigned NUTCH-1325:
--

Assignee: Tejas Patil  (was: Markus Jelsma)

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
---

Attachment: NUTCH-1465-trunk.v2.patch

Attaching NUTCH-1465-trunk.v2.patch, which has an implementation of *option (B)* 
_Have separate job for the sitemap stuff and merge its output into the crawldb_

+I have covered both cases in this patch:+
1. users with targeted crawl who want to get sitemaps injected from a list of 
sitemap urls - the use case which [~wastl-nagel] had pointed out.
2. large open web crawls where users cannot afford to generate sitemap seeds 
for all the hosts and want nutch to inject sitemaps automatically. 

+To try out this patch:+
1. Apply the patch for HostDb feature 
(https://issues.apache.org/jira/secure/attachment/12624178/NUTCH-1325-trunk-v4.patch)
2. Apply this patch (NUTCH-1465-trunk.v2.patch)
3. (optional) Add this to conf/log4j.properties at line 11:
{noformat}
log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
{noformat}
4. Run using 
{noformat}
bin/nutch org.apache.nutch.util.SitemapProcessor
{noformat}

I have started working on a *wiki page* describing this feature: 
https://wiki.apache.org/nutch/SitemapFeature 

Any suggestions and comments are welcome.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: NUTCH-1325-trunk-v4.patch

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: (was: NUTCH-1325-trunk-v4.patch)

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: NUTCH-1325-trunk-v4.patch

Attaching NUTCH-1325-trunk-v4.patch with following changes:
- Fixed filterNormalize() so that it no longer incorrectly prepends "http://" to 
normal urls.
- Migrated HostDb to new map-reduce API
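
The kind of check behind the filterNormalize() fix might look like the sketch 
below. This is a guess at the general shape, not the code from the patch: the 
class SchemeSketch and the method withScheme are hypothetical, and testing for 
"://" is just one simple way to tell a bare host name from a full URL.

```java
// Hypothetical helper; not taken from NUTCH-1325-trunk-v4.patch.
class SchemeSketch {

    // Prepends "http://" only when the input has no scheme separator,
    // so that full URLs pass through unchanged.
    static String withScheme(String hostOrUrl) {
        return hostOrUrl.contains("://") ? hostOrUrl : "http://" + hostOrUrl;
    }

    public static void main(String[] args) {
        System.out.println(withScheme("example.org"));           // http://example.org
        System.out.println(withScheme("https://example.org/a")); // https://example.org/a
    }
}
```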

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325-trunk-v4.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875981#comment-13875981
 ] 

Tejas Patil commented on NUTCH-1630:


Hi [~talat],
I didn't know about NUTCH-1413. That is the perfect way of getting the average 
response time. For larger crawls which span several days, this would give a 
good approximation of the response time. With that, the points in the first two 
paragraphs of my earlier comment are resolved. For the third paragraph, as you 
have made it configurable, crawl owners would have to make this choice. 

The concept behind the patch is good and would be a valuable addition to Nutch. As 
[~jnioche] suggested, it would be super awesome if this could be a plugin or 
made less tangled with the Generate and Fetch code so that it doesn't 
accidentally introduce any bugs.

> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> -
>
> Key: NUTCH-1630
> URL: https://issues.apache.org/jira/browse/NUTCH-1630
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1, 2.2, 2.2.1
>Reporter: Talat UYARER
>  Labels: improvement
> Fix For: 2.3
>
> Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, queues of disproportionate size mean that fetching has to wait 
> a long time for long-lasting queues after the shorter ones have finished. That 
> means you may have to wait a couple of days for some of the queues.
> Normally we define max queue size with generate.max.count but that's a static 
> value. However number of URLs to be fetched increases with each depth. 
> Defining same length for all queues does not mean all queues will finish 
> around the same time. This problem has been addressed by some other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). Our 
> solution is applicable to all three modes.
> 1-Define a "fetch workload of current queue" (FW) value for each queue based 
> on the previous fetches of that queue.
> We calculate this by:
> FW=average response time of previous depth * number of urls in current 
> queue
> 2- Calculate the harmonic mean [2] of all FW's to get the average workload of 
> current depth (AW)
> 3- Get the length for a queue by dividing AW by previously known average 
> response time of that queue:
> Queue Length=AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish at around 
> the same time.
> I will send my patch as soon as possible. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion; harmonic mean is best in our case because our data has a 
> few points that are much higher than the rest. 
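
The three steps in the description above can be sketched as follows. This is 
only an illustration of the arithmetic, not the code in the attached patches: 
the class name AdaptiveQueueSize and its method names are hypothetical, and the 
sketch assumes every queue has a strictly positive average response time and at 
least one URL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; not part of Nutch or of the NUTCH-1630 patches. */
final class AdaptiveQueueSize {

  /** Harmonic mean of strictly positive values: n / sum(1 / x_i). */
  static double harmonicMean(double[] values) {
    double reciprocalSum = 0.0;
    for (double v : values) {
      reciprocalSum += 1.0 / v;
    }
    return values.length / reciprocalSum;
  }

  /**
   * @param avgResponseTime average response time per queue from the previous depth
   * @param urlCount        number of URLs currently waiting in each queue
   * @return suggested queue length per queue, in the iteration order of avgResponseTime
   */
  static Map<String, Long> queueLengths(Map<String, Double> avgResponseTime,
                                        Map<String, Integer> urlCount) {
    // Step 1: FW = average response time * number of URLs in the queue.
    double[] workloads = new double[avgResponseTime.size()];
    int i = 0;
    for (Map.Entry<String, Double> e : avgResponseTime.entrySet()) {
      workloads[i++] = e.getValue() * urlCount.get(e.getKey());
    }

    // Step 2: AW = harmonic mean of all FW values.
    double averageWorkload = harmonicMean(workloads);

    // Step 3: queue length = AW / average response time of that queue.
    Map<String, Long> lengths = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : avgResponseTime.entrySet()) {
      lengths.put(e.getKey(), Math.round(averageWorkload / e.getValue()));
    }
    return lengths;
  }
}
```

With two queues of 100 URLs each, one answering in 1 time unit and one in 4, 
the workloads are 100 and 400, their harmonic mean is 160, and the resulting 
lengths are 160 and 40, so the slower host gets the shorter queue.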



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875947#comment-13875947
 ] 

Tejas Patil commented on NUTCH-1630:


Hi [~talat],
So from the 2nd depth onwards, you would ping the host in the generate phase to 
get the response time. 
For large-scale crawl setups, the Generator itself might run for a few hours, 
and at the time you ping the host it might be loaded or there might be network 
traffic. When the actual fetch phase runs, the response time might be different 
depending on the load on the server. As I mentioned in my earlier comment, I 
thought you were keeping a cumulative sum of response timings for several URLs 
of a host and then taking an average of it, which would give better 
response time numbers. This would be harder to code in the existing codebase 
and might look ugly, as the fetcher needs to pass this information on to the 
generator.
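
To make the "cumulative sum, then average" idea concrete, here is a minimal 
sketch of a per-host accumulator. It is an assumption about how such 
bookkeeping could look, not code from Nutch or from the patch: the class 
HostResponseStats and its methods are hypothetical, and how the fetcher would 
hand the numbers over to the generator is left open.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host accumulator; not part of Nutch. */
final class HostResponseStats {

  // For each host: element 0 is the summed response time in ms, element 1 the sample count.
  private final Map<String, long[]> totals = new HashMap<>();

  /** Record the response time of one fetched URL of the given host. */
  void record(String host, long responseMillis) {
    long[] t = totals.computeIfAbsent(host, h -> new long[2]);
    t[0] += responseMillis;
    t[1] += 1;
  }

  /** Average response time in ms for the host, or -1 if nothing was recorded. */
  double average(String host) {
    long[] t = totals.get(host);
    return (t == null) ? -1.0 : (double) t[0] / t[1];
  }
}
```

Averaging over many fetches of a host smooths out a single unlucky 
measurement, which is the advantage over one ping taken during generate.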

+A broader concern for crawls which run for days+
Server response timings themselves change with the local time of day. For 
example, during the day (say 8:00 - 11:00 am) there might be a decent number of 
requests from users to the server, compared to night time (say 1:00 - 4:00 am) 
when very few users are requesting the server. Pinging the server at 
a single point in the 24-hour day would not give a good approximation of the 
response time for long-running crawls. 

+Effect on crawlspace of slow servers+
If a server is genuinely slow (say due to low-end hardware), then it would 
always have a slower response time than other servers. Effectively, we would 
end up with a smaller fetch queue for that host, creating a huge backlog 
of its URLs which would sit in the crawldb without being generated, over 
and over again. I would take your side on this: try to fetch as much as we can. 
But some crawl owners might be unhappy with this.

> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> -
>
> Key: NUTCH-1630
> URL: https://issues.apache.org/jira/browse/NUTCH-1630
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1, 2.2, 2.2.1
>Reporter: Talat UYARER
>  Labels: improvement
> Fix For: 2.3
>
> Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, queues of disproportionate size mean that fetching has to wait 
> a long time for long-lasting queues after the shorter ones have finished. That 
> means you may have to wait a couple of days for some of the queues.
> Normally we define max queue size with generate.max.count but that's a static 
> value. However number of URLs to be fetched increases with each depth. 
> Defining same length for all queues does not mean all queues will finish 
> around the same time. This problem has been addressed by some other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). Our 
> solution is applicable to all three modes.
> 1-Define a "fetch workload of current queue" (FW) value for each queue based 
> on the previous fetches of that queue.
> We calculate this by:
> FW=average response time of previous depth * number of urls in current 
> queue
> 2- Calculate the harmonic mean [2] of all FW's to get the average workload of 
> current depth (AW)
> 3- Get the length for a queue by dividing AW by previously known average 
> response time of that queue:
> Queue Length=AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish at around 
> the same time.
> I will send my patch as soon as possible. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion; harmonic mean is best in our case because our data has a 
> few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1697) SegmentMerger to implement Tool

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875910#comment-13875910
 ] 

Tejas Patil commented on NUTCH-1697:


Hi [~markus17],
Correct me if I am wrong: Hadoop properties should be passed as *-D 
property=value* (note the space after -D). The way you were passing them, i.e. 
*-Dproperty=value*, is applicable to JVM system properties and won't be 
picked up by Tool.

> SegmentMerger to implement Tool
> ---
>
> Key: NUTCH-1697
> URL: https://issues.apache.org/jira/browse/NUTCH-1697
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1697-trunk.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1630) How to achieve finishing fetch approximately at the same time for each queue (a.k.a adaptive queue size)

2014-01-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875902#comment-13875902
 ] 

Tejas Patil commented on NUTCH-1630:


Hi [~icebergx5],
How do you obtain the average response time of the previous depth? I was hoping 
that it would be somewhere in the fetch phase where you somehow stored the 
response timings for each host and then passed that information on to the 
generate phase.

> How to achieve finishing fetch approximately at the same time for each queue 
> (a.k.a adaptive queue size) 
> -
>
> Key: NUTCH-1630
> URL: https://issues.apache.org/jira/browse/NUTCH-1630
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.1, 2.2, 2.2.1
>Reporter: Talat UYARER
>  Labels: improvement
> Fix For: 2.3
>
> Attachments: NUTCH-1630.patch, NUTCH-1630v2.patch
>
>
> Problem Definition:
> When crawling, queues of disproportionate size mean that fetching has to wait 
> a long time for long-lasting queues after the shorter ones have finished. That 
> means you may have to wait a couple of days for some of the queues.
> Normally we define max queue size with generate.max.count but that's a static 
> value. However number of URLs to be fetched increases with each depth. 
> Defining same length for all queues does not mean all queues will finish 
> around the same time. This problem has been addressed by some other users 
> before [1]. So we came up with a different approach to this issue.
> Solution:
> Nutch has three modes for creating fetch queues (byHost, byDomain, ByIp). Our 
> solution is applicable to all three modes.
> 1-Define a "fetch workload of current queue" (FW) value for each queue based 
> on the previous fetches of that queue.
> We calculate this by:
> FW=average response time of previous depth * number of urls in current 
> queue
> 2- Calculate the harmonic mean [2] of all FW's to get the average workload of 
> current depth (AW)
> 3- Get the length for a queue by dividing AW by previously known average 
> response time of that queue:
> Queue Length=AW / average response time
> Using this algorithm leads to a fetch phase where all queues finish at around 
> the same time.
> I will send my patch as soon as possible. Do you have any comments? 
> [1] http://osdir.com/ml/dev.nutch.apache.org/2011-11/msg00101.html
> [2] In our opinion; harmonic mean is best in our case because our data has a 
> few points that are much higher than the rest. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1680) CrawldbReader to dump minRetry value

2014-01-18 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875687#comment-13875687
 ] 

Tejas Patil commented on NUTCH-1680:


+1 

> CrawldbReader to dump minRetry value
> 
>
> Key: NUTCH-1680
> URL: https://issues.apache.org/jira/browse/NUTCH-1680
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.8
>
> Attachments: NUTCH-1680-trunk.patch
>
>
> CrawlDBReader should be able to dump records based on minimum retry value.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2014-01-04 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862240#comment-13862240
 ] 

Tejas Patil commented on NUTCH-1325:


Could anyone please look at the patch and let us know if there are any flaws or 
improvements that must be addressed ?

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2014-01-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862237#comment-13862237
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],
Yes. I think that it should be there too. I will be working on the patch this 
weekend and will post an update. Thanks for your input and suggestions so far; 
they were super helpful in chalking out the right specs for this feature.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-356) Plugin repository cache can lead to memory leak

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861217#comment-13861217
 ] 

Tejas Patil commented on NUTCH-356:
---

+1 for commit.

> Plugin repository cache can lead to memory leak
> ---
>
> Key: NUTCH-356
> URL: https://issues.apache.org/jira/browse/NUTCH-356
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Enrico Triolo
> Fix For: 2.3, 1.8
>
> Attachments: ASF.LICENSE.NOT.GRANTED--NutchTest.java, 
> ASF.LICENSE.NOT.GRANTED--patch.txt, NUTCH-356-trunk.patch, cache_classes.patch
>
>
> While I was trying to solve a problem I reported a while ago (see Nutch-314), 
> I found out that actually the problem was related to the plugin cache used in 
> class PluginRepository.java.
> As  I said in Nutch-314, I think I somehow 'force' the way nutch is meant to 
> work, since I need to frequently submit new urls and append their contents to 
> the index; I don't (and I can't) have an urls.txt file with all urls I'm 
> going to fetch, but I recreate it each time a new url is submitted.
> Thus, I think in the majority of cases you won't have problems using nutch 
> as-is, since the problem I found occurs only if nutch is used in a way 
> similar to the one I use.
> To simplify your test I'm attaching a class that performs something similar 
> to what I need. It fetches and index some sample urls; to avoid webmasters 
> complaints I left the sample urls list empty, so you should modify the source 
> code and add some urls.
> Then you only have to run it and watch your memory consumption with top. In 
> my experience I get an OutOfMemoryException after a couple of minutes, but it 
> clearly depends on your heap settings and on the plugins you are using (I'm 
> using 
> 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
> The problem is bound to the PluginRepository 'singleton' instance, since it 
> never get released. It seems that some class maintains a reference to it and 
> this class is never released since it is cached somewhere in the 
> configuration.
> So I modified the PluginRepository's 'get' method so that it never uses the 
> cache and always returns a new instance (you can find the patch in 
> attachment). This way the memory consumption is always stable and I get no 
> OOM anymore.
> Clearly this is not the solution, since I guess there are many performance 
> issues involved, but for the moment it works.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1454) parsing chm failed

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860803#comment-13860803
 ] 

Tejas Patil commented on NUTCH-1454:


TIKA-1122 is fixed and I have verified that 'parsechecker' works fine with the 
same. Upgrading to Tika 1.5 (yet to be released) should fix this for Nutch.

> parsing chm failed
> --
>
> Key: NUTCH-1454
> URL: https://issues.apache.org/jira/browse/NUTCH-1454
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5.1
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.9
>
>
> (reported by Jan Riewe, see 
> http://lucene.472066.n3.nabble.com/CHM-Files-and-Tika-td3999735.html)
> Nutch fails to parse chm files with
> {quote}
>  ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type 
> application/vnd.ms-htmlhelp
> {quote}
> Tested with chm test files from Tika:
> {code}
>  % bin/nutch parsechecker 
> file:/.../tika/trunk/tika-parsers/src/test/resources/test-documents/testChm.chm
> {code}
> Tika parses this document (but does not extract any content).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1691) DomainBlacklist url filter does not allow -D filter file override

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860678#comment-13860678
 ] 

Tejas Patil commented on NUTCH-1691:


Hi [~markus17],
It's a good solution. +1 from me. 
I would like to know how you are invoking the plugin. I tried to use 
"bin/nutch plugin urlfilter-domainblacklist" but that didn't work as it doesn't 
have a main().

> DomainBlacklist url filter does not allow -D filter file override
> -
>
> Key: NUTCH-1691
> URL: https://issues.apache.org/jira/browse/NUTCH-1691
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.8, 2.4
>
> Attachments: NUTCH-1691-trunk.patch
>
>
> This filter does not accept -Durlfilter.domainblacklist.file= overrides. The 
> plugin's file attribute is always used.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Closed] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-02 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil closed NUTCH-1670.
--

Resolution: Fixed

Committed the patch by [~amuseme] to trunk (rev 1554883).

> set same crawldb directory in mergedb parameter
> ---
>
> Key: NUTCH-1670
> URL: https://issues.apache.org/jira/browse/NUTCH-1670
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
>  Labels: PatchAvailable
> Fix For: 1.8
>
> Attachments: NUTCH-1670.patch
>
>
> When merging two crawldbs with the same crawldb directory given in the 
> bin/nutch merge parameters, a data-not-found exception is thrown. 
> bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
> bin/nutch generate crawldb_t1 segment



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1080:
---

Fix Version/s: 1.8

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860643#comment-13860643
 ] 

Tejas Patil commented on NUTCH-1080:


Committed to trunk (rev 1554881). Will port the same to 2.x

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Assigned] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-02 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil reassigned NUTCH-1080:
--

Assignee: Tejas Patil

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
>Assignee: Tejas Patil
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1080) Type safe members , arguments for better readability

2014-01-01 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1080:
---

Attachment: NUTCH-1080-tejasp-trunk-v2.patch

Attaching a patch for trunk. Uploaded the same over review board: 
https://reviews.apache.org/r/16563/

Comments are welcome !!!

> Type safe members , arguments for better readability 
> -
>
> Key: NUTCH-1080
> URL: https://issues.apache.org/jira/browse/NUTCH-1080
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Karthik K
> Fix For: 2.3
>
> Attachments: NUTCH-1080-tejasp-trunk-v2.patch, NUTCH-1080.patch, 
> NUTCH-rel_14-1080.patch
>
>
> Enable generics for some of the API, for better type safety and readability, 
> in the process. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1325) HostDB for Nutch

2014-01-01 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1325:
---

Attachment: NUTCH-1325-trunk-v3.patch

A final patch (NUTCH-1325-trunk-v3.patch) to complete this feature.
Uploaded the patch over review board too: https://reviews.apache.org/r/16555/

Comments are welcome !!!

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325-trunk-v3.patch, 
> NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1670) set same crawldb directory in mergedb parameter

2014-01-01 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859987#comment-13859987
 ] 

Tejas Patil commented on NUTCH-1670:


Hi [~amuseme.lu],
The patch looks good to me. +1 from me for commit.

> set same crawldb directory in mergedb parameter
> ---
>
> Key: NUTCH-1670
> URL: https://issues.apache.org/jira/browse/NUTCH-1670
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
>  Labels: PatchAvailable
> Fix For: 1.8
>
> Attachments: NUTCH-1670.patch
>
>
> When merging two crawldbs with the same crawldb directory given in the 
> bin/nutch merge parameters, a data-not-found exception is thrown. 
> bin/nutch mergedb crawldb_t1 crawldb_t1 crawldb_2
> bin/nutch generate crawldb_t1 segment



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859358#comment-13859358
 ] 

Tejas Patil commented on NUTCH-1687:


Created a review request: https://reviews.apache.org/r/16535/

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of 
> the queue list, so queues at the start of the list have more chance of being 
> picked first. That can cause the problem of long-tail queues: only a few 
> queues remain available at the end, and they hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
> queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> needed to find an available queue and ensures all queues are picked in turn, 
> and if we use TopN during generation there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1687:
---

Attachment: NUTCH-1687.tejasp.v1.patch

I feel that there is no need for creating a separate class for Circular linked 
list and maintaining the circular list along with the original map. 

Uploading "NUTCH-1687.tejasp.v1.patch" : Uses 
[LinkedHashMap|http://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html]
 along with a [Guava cyclic 
iterator|http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Iterables.html#cycle(java.lang.Iterable)]
 to iterate the map of queues in a circular fashion. With that no separate list 
needs to be maintained. 
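
For readers without the patch at hand, the idea can be sketched with the JDK 
alone. This is not the content of NUTCH-1687.tejasp.v1.patch: the class 
RoundRobinQueues is hypothetical, and the wrap-around below is a hand-written 
stand-in for what Guava's Iterables.cycle provides. The sketch assumes that 
restarting the cycle whenever the map changes is acceptable.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of round-robin selection over an insertion-ordered map. */
final class RoundRobinQueues<V> {

  // LinkedHashMap keeps insertion order, so every pass visits queues in the same order.
  private final Map<String, V> queues = new LinkedHashMap<>();
  private Iterator<Map.Entry<String, V>> cursor = queues.entrySet().iterator();

  synchronized void put(String id, V queue) {
    queues.put(id, queue);
    // A structural change invalidates the old iterator, so start a fresh pass.
    cursor = queues.entrySet().iterator();
  }

  /** Returns the next queue in circular order, or null if there are no queues. */
  synchronized V next() {
    if (queues.isEmpty()) {
      return null;
    }
    if (!cursor.hasNext()) {
      // End of the map reached: wrap around instead of stopping.
      cursor = queues.entrySet().iterator();
    }
    return cursor.next().getValue();
  }
}
```

Because the cursor is kept between calls, the search continues from where the 
previous call stopped rather than always restarting at the first queue, and 
the map itself remains the only data structure to maintain.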

Comments are welcome.

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch, NUTCH-1687.tejasp.v1.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of 
> the queue list, so queues at the start of the list have more chance of being 
> picked first. That can cause the problem of long-tail queues: only a few 
> queues remain available at the end, and they hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
> queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> needed to find an available queue and ensures all queues are picked in turn, 
> and if we use TopN during generation there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859275#comment-13859275
 ] 

Tejas Patil commented on NUTCH-1687:


This is a good point by [~tiennm]. Although this might not give a significant 
performance improvement, it would distribute requests fairly across all fetch 
queues.

Some comments on the patch:
1. Do you really need to make the methods of the CircularLinkedList class 
thread-safe? The methods in "FetchItemQueues" which interact with the 
CircularLinkedList (i.e. getFetchItemQueue and getFetchItem) are all 
synchronized, so it is ensured that only one thread accesses the list at a time.
2. Why is 'id' needed in FetchItemQueue?

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of 
> the queue list, so queues at the start of the list have more chance of being 
> picked first. That can cause the problem of long-tail queues: only a few 
> queues remain available at the end, and they hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
> queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> needed to find an available queue and ensures all queues are picked in turn, 
> and if we use TopN during generation there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (NUTCH-1687) Pick queue in Round Robin

2013-12-30 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1687:
---

Fix Version/s: 1.8

> Pick queue in Round Robin
> -
>
> Key: NUTCH-1687
> URL: https://issues.apache.org/jira/browse/NUTCH-1687
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Tien Nguyen Manh
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1687.patch
>
>
> Currently we choose the queue to pick a URL from by starting at the head of 
> the queue list, so queues at the start of the list have more chance of being 
> picked first. That can cause the problem of long-tail queues: only a few 
> queues remain available at the end, and they hold many URLs.
> public synchronized FetchItem getFetchItem() {
>   final Iterator<Map.Entry<String, FetchItemQueue>> it =
> queues.entrySet().iterator(); ==> always resets to search for a queue from 
> the start
>   while (it.hasNext()) {
> 
> I think it is better to pick queues in round robin. That can reduce the time 
> needed to find an available queue and ensures all queues are picked in turn, 
> and if we use TopN during generation there is no long-tail queue at the end.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (NUTCH-1689) Improve CrawlDb stats

2013-12-22 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855416#comment-13855416
 ] 

Tejas Patil commented on NUTCH-1689:


Some concerns:
1. You are removing fields from the output, but people may be relying 
on the existing output (grepping or awking to get the required fields). It is 
not wise to simply remove all of those fields. Keep things backward 
compatible.
2. You can make the command configurable so that users get to select which 
fields they want in the output.
3. When submitting a patch, commenting out the older code is not the best way. 
Remove those lines instead of commenting them out.

> Improve CrawlDb stats
> -
>
> Key: NUTCH-1689
> URL: https://issues.apache.org/jira/browse/NUTCH-1689
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Nguyen Manh Tien
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1689.patch
>
>
> CrawlDb stats is currently slow because it loads all fields from the store. I 
> changed it to load only the necessary fields.





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723
 ] 

Tejas Patil commented on NUTCH-1465:


Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have 
to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. 
It would fit well where users are performing targeted crawling.
For a large scale, open web crawl use case:
(i) the number of initial hosts can be large : one time burden for users
(ii) crawler discovers new hosts with time : constant pain for users to look 
out for the new hosts discovered and then get sitemaps from robots.txt 
manually. With HostDB from NUTCH-1325 and B, users won't suffer here.

> do we really need an extra DB?
I should have been clearer in my explanation. "sitemapDB" is a temporary 
location where all crawl datums of sitemap entries would be written. It can 
be deleted after the merge with the main crawlDB. This is quite analogous to 
what the inject operation does.

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would go in favor of B as it is clean, whereas A 
would involve messing around with the existing codebase in several places.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723
 ] 

Tejas Patil edited comment on NUTCH-1465 at 12/16/13 12:09 AM:
---

Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have 
to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. 
It would fit well where users are performing targeted crawling.
For a large scale, open web crawl use case:
i) the number of initial hosts can be large : one time burden for users
ii) crawler discovers new hosts with time : constant pain for users to look out 
for the new hosts discovered and then get sitemaps from robots.txt manually. 
With HostDB from NUTCH-1325 and B, users won't suffer here.

> do we really need an extra DB?
I should have been clearer in my explanation. "sitemapDB" is a temporary 
location where all crawl datums of sitemap entries would be written. It can 
be deleted after the merge with the main crawlDB. This is quite analogous to 
what the inject operation does.

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would go in favor of B as it is clean, whereas A 
would involve messing around with the existing codebase in several places.


was (Author: tejasp):
Hi [~wastl-nagel],

Nice share. The only grudge I have with that approach is that users will have 
to pick up sitemap urls for hosts *manually* and feed to the sitemap injector. 
It would fit well where users are performing targeted crawling.
For a large scale, open web crawl use case:
(i) the number of initial hosts can be large : one time burden for users
(ii) crawler discovers new hosts with time : constant pain for users to look 
out for the new hosts discovered and then get sitemaps from robots.txt 
manually. With HostDB from NUTCH-1325 and B, users won't suffer here.

> do we really need an extra DB?
I should have been clearer in my explanation. "sitemapDB" is a temporary 
location where all crawl datums of sitemap entries would be written. It can 
be deleted after the merge with the main crawlDB. This is quite analogous to 
what the inject operation does.

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would go in favor of B as it is clean, whereas A 
would involve messing around with the existing codebase in several places.

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2013-12-15 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561
 ] 

Tejas Patil commented on NUTCH-1465:


Revisited this Jira after a long time and gave some thought to how this can be 
done cleanly. There are two ways to implement it:

*(A) Do the sitemap stuff in the fetch phase of nutch cycle.*
This was my original approach which the (in-progress) patch addresses. This 
would involve tweaking core nutch classes at several locations.

Pros:
- Sitemaps are nothing but normal pages with several outlinks. Fits well in the 
'fetch' cycle.

Cons:
- Sitemaps can be very large. Fetching them needs larger size and time 
limits. The fetch code must special-case sitemap urls and apply custom 
limits => leads to a hacky coding style.
- The Outlink class cannot hold the extra information contained in sitemaps 
(like lastmod, changefreq). It would have to be modified to hold this 
information too. This is specific to sitemaps, yet we would end up making all 
outlinks hold this info. We could create a special type of outlink to take 
care of this.

*(B) Have separate job for the sitemap stuff and merge its output into the 
crawldb.*
i. User populates a list of hosts (or uses HostDB from NUTCH-1325). Now we have 
all the hosts to be processed.
ii. Run a map-reduce job: for each host, 
  - get the robots page, extract sitemap urls, 
  - get xml content of these sitemap pages
  - create crawl datums with the required info and write this to a 
sitemapDB

iii. Use CrawlDbMerger utility to merge the sitemapDB and crawldb

Pros:
- Cleaner code. 
- Users control when to perform sitemap extraction. This is better than 
(A), wherein sitemap urls sit in the crawldb and get fetched along with 
normal pages (thus eating up fetch time in every fetch phase). We can have a 
sitemap_frequency setting used inside the crawl script so that users can say 
that after 'x' nutch cycles, sitemap processing should run.

Cons:
- Additional map-reduce jobs are needed. I think this is reasonable. 
Running the sitemap job 1-5 times a month on a production-level crawl would 
work out well.

I am inclined towards implementing (B)
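
To make the "get the robots page, extract sitemap urls" step of (B) concrete, 
here is a rough sketch. It is only an illustration under my own assumptions: 
the class RobotsSitemapExtractor and its method are hypothetical, the 
robots.txt content is passed in as a string, and a real job would more likely 
rely on an existing robots parser than on hand-rolled line matching.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: pulls "Sitemap:" directives out of robots.txt text.
class RobotsSitemapExtractor {
    private static final String DIRECTIVE = "sitemap:";

    static List<String> extractSitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            // Simplification: everything after '#' is treated as a comment.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            // The directive name is matched case-insensitively.
            if (line.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String url = line.substring(DIRECTIVE.length()).trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\n"
            + "Disallow: /private\n"
            + "Sitemap: http://example.org/sitemap.xml\n";
        System.out.println(extractSitemapUrls(robots));
    }
}
```

Each url returned this way would then be fetched and parsed, and its entries 
turned into crawl datums for the sitemapDB.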

> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
> Fix For: 1.9
>
> Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html





[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2013-12-14 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848517#comment-13848517
 ] 

Tejas Patil commented on NUTCH-1325:


Hi [~markus17],
I stopped by this Jira (after a long time !!!) with an intention of getting it 
to a stage where we could have it inside trunk. 
You had replied to my two concerns.

For (1): 
{noformat}host_a.example.org, host_b.example.org ==> example.org{noformat}

This might *NOT* be a good idea. 
(a) The websites for say "cs.uci.edu" and "bio.uci.edu" might be hosted 
independently. It can be argued to consider them as different hosts.
(b) I am not sure about the standards, but if something like "uci.cs.edu" is 
valid (subdomain is suffix of domain) then there would be a problem when we 
resolve "uci.cs.edu" and "ucla.cs.edu" to "cs.edu".
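
To illustrate the concern, here is a small sketch of a naive "keep the last two 
labels" resolution. This is an assumption about how such a collapse might be 
written, not what the patch actually does; the class NaiveHostResolver is made 
up for the example.

```java
// Hypothetical illustration of collapsing hosts to their last two labels.
class NaiveHostResolver {
    static String toDomain(String host) {
        String[] labels = host.split("\\.");
        if (labels.length <= 2) {
            return host; // already a bare domain
        }
        return labels[labels.length - 2] + "." + labels[labels.length - 1];
    }

    public static void main(String[] args) {
        // Independently hosted sites end up under the same key.
        System.out.println(toDomain("cs.uci.edu"));
        System.out.println(toDomain("bio.uci.edu"));
        // Unrelated hosts of the form in (b) collapse together as well.
        System.out.println(toDomain("uci.cs.edu"));
        System.out.println(toDomain("ucla.cs.edu"));
    }
}
```

Both hosts in (a) map to "uci.edu" and both hosts in (b) map to "cs.edu", so 
any per-host information stored under the resolved key would get mixed.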

For (2): "I use the HTTP:// scheme but not all hosts may allow that scheme. We 
have a modified domain filter that optionally takes a scheme so we can force 
HTTPS for specific domains. Those domains are filtered out because HTTP is not 
allowed."
Do you have any suggestion to work this out ?

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.





[jira] [Commented] (NUTCH-1577) Add target for creating eclipse project

2013-12-14 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848493#comment-13848493
 ] 

Tejas Patil commented on NUTCH-1577:


Some check-in(s) in the past few months led to one jar 
(solr-solrj-3.4.0.jar) being required on the eclipse classpath, and to 'ant 
eclipse' not building the project smoothly. Fixed this. Committed at 
revision 1550987.

> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.





[jira] [Commented] (NUTCH-1325) HostDB for Nutch

2013-08-11 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736459#comment-13736459
 ] 

Tejas Patil commented on NUTCH-1325:


Hi [~markus17], 

> I think I've got a slightly newer version of the tools but don't know what 
> actually changed in the past year. I'll try to diff and upload it.

Could you kindly upload the newer version ?

> HostDB for Nutch
> 
>
> Key: NUTCH-1325
> URL: https://issues.apache.org/jira/browse/NUTCH-1325
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1325-1.6-1.patch, NUTCH-1325.trunk.v2.path
>
>
> A HostDB for Nutch and associated tools to create and read a database 
> containing information on hosts.



[jira] [Commented] (NUTCH-1599) Obtain consensus on new description of Nutch

2013-07-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699115#comment-13699115
 ] 

Tejas Patil commented on NUTCH-1599:


I agree with Julien: Nutch should be described as a web-crawler. Markus took it 
to the next level by adding more technicality :) So "Highly extensible and 
scalable web crawler software" it is !!

> Obtain consensus on new description of Nutch
> 
>
> Key: NUTCH-1599
> URL: https://issues.apache.org/jira/browse/NUTCH-1599
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3, 1.8
>
>
> As we seem to be sustaining pushes and maintenance (touch wood) of two 
> branches, I think it is about time we agreed on a more accurate description 
> of what Nutch actually is.
> We currently have (taken directly from our site)
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from 
> Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a 
> crawler, a link-graph database and parsing support handled by Apache Tika for 
> HTML and and array other document formats.
> Nutch can run on a single machine, but gains a lot of its strength from 
> running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a 
> highly flexible, easily extensible and thoroughly maintained plugin 
> infrastructure.
> {code}
> I suggest/propose something along the lines of
> {code:xml}
> Apache Nutch is an open source web-search software project. Stemming from 
> Apache Lucene, the community now develops and maintains two branches:
> * 1.x; description of 1.x here
> * 2.x; description of 2.x here
> Both branches add web-specifics, such as a crawler, a link-graph database and 
> parsing support handled by Apache Tika for HTML and an array of other document 
> formats.
> Nutch can run on a single machine, but gains a lot of its strength from 
> running in a Hadoop cluster
> The system can be enhanced (eg other document formats can be parsed) using a 
> highly flexible, easily extensible and thoroughly maintained plugin 
> infrastructure.
> {code}
> Any thoughts?



[jira] [Commented] (NUTCH-1602) improve the readability of metadata in readdb dump normal

2013-07-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699096#comment-13699096
 ] 

Tejas Patil commented on NUTCH-1602:


Hi Lufeng, 
+1 from me too. One minor suggestion: you could add a space between "=" and 
";" to make it even better.

> improve the readability of metadata in readdb dump normal 
> --
>
> Key: NUTCH-1602
> URL: https://issues.apache.org/jira/browse/NUTCH-1602
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.7
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 1.8
>
> Attachments: NUTCH-1602.patch
>
>
> the dumped metadata format is not readable.
> {code:xml}
> $bin/nutch readdb crawldb/ -dump dir
> http://www.baidu.com/ Version: 7
> Status: 3 (db_gone)
> Fetch time: Sat Aug 17 22:35:37 CST 2013
> Modified time: Thu Jan 01 08:00:00 CST 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 1.0
> Signature: null
> Metadata: m1: v22m3: v3m2: v2m5: v5m4: m4_pst_: robots_denied(18), 
> lastModified=0m6: v6
> {code}
> So I improved the metadata format to this:
> {code:xml}
> Metadata: m1=v22;m3=v3;m2=v2;m5=v5;m4=m4;_pst_=robots_denied(18), 
> lastModified=0;m6=v6;
> {code}
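
For reference, the "key=value;" layout shown above could be produced along these 
lines. This is only a sketch and not the attached patch: the class 
MetadataFormatter is hypothetical, and a plain Map<String, String> stands in 
for the metadata that the real CrawlDatum carries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical formatter for the proposed "key=value;" metadata layout.
class MetadataFormatter {
    static String format(Map<String, String> metadata) {
        StringBuilder sb = new StringBuilder("Metadata: ");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            // An explicit '=' and ';' keep neighbouring pairs from running
            // into each other, which is what made the old dump unreadable.
            sb.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("m1", "v22");
        metadata.put("m3", "v3");
        System.out.println(format(metadata));
    }
}
```

A LinkedHashMap is used here only so that the example output has a stable 
order.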



[jira] [Commented] (NUTCH-1327) QueryStringNormalizer

2013-07-01 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696840#comment-13696840
 ] 

Tejas Patil commented on NUTCH-1327:


Hi Markus,

1. The patch, when applied as is, did not compile the plugin. I had to add 
entries into src/plugin/build.xml to get it compiled. 
2. Can you kindly add some javadoc comments to the QuerystringURLNormalizer 
class so that people can quickly get an idea of what this plugin does?

> QueryStringNormalizer
> -
>
> Key: NUTCH-1327
> URL: https://issues.apache.org/jira/browse/NUTCH-1327
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.9
>
> Attachments: NUTCH-1327-1.8-1.patch
>
>
> A normalizer for dealing with query strings. Sorting query strings is helpful 
> in preventing duplicates for some (bad) websites.
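
A rough sketch of what "sorting query strings" can mean in practice is given 
below. It is an illustration under assumptions and not the code from the 
attached patch: the class QuerySorter is made up, parameters are sorted as 
plain strings, and no decoding or de-duplication is attempted.

```java
import java.util.Arrays;

// Hypothetical normalizer: orders query parameters alphabetically.
class QuerySorter {
    static String sortQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return url; // no query string to sort
        }
        String base = url.substring(0, q);
        String query = url.substring(q + 1);
        // Keep a trailing fragment out of the sort and re-attach it later.
        String fragment = "";
        int h = query.indexOf('#');
        if (h >= 0) {
            fragment = query.substring(h);
            query = query.substring(0, h);
        }
        String[] params = query.split("&");
        Arrays.sort(params);
        return base + "?" + String.join("&", params) + fragment;
    }

    public static void main(String[] args) {
        System.out.println(sortQuery("http://example.org/page?b=2&a=1"));
    }
}
```

With this, "page?b=2&a=1" and "page?a=1&b=2" normalize to the same url, so a 
site that shuffles its parameters no longer produces two crawl entries.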



[jira] [Commented] (NUTCH-1126) JUnit test for urlfilter-prefix

2013-06-24 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692069#comment-13692069
 ] 

Tejas Patil commented on NUTCH-1126:


Thanks Talat and Cihad :)

One small thing: @author tags should not be used in Apache projects - see 
http://mail-archives.apache.org/mod_mbox/www-community/200402.mbox/%3c403a144a.5040...@apache.org%3E
 Please remove those while submitting patches.

> JUnit test for urlfilter-prefix
> ---
>
> Key: NUTCH-1126
> URL: https://issues.apache.org/jira/browse/NUTCH-1126
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: 1.4
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 2.3, 1.8
>
> Attachments: test_case_for_urlfilter-prefix.patch
>
>
> This issue is part of the larger attempt to provide a Junit test case for 
> every Nutch plugin.



[jira] [Commented] (NUTCH-1578) Upgrade to Hadoop 1.2.0

2013-06-03 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672868#comment-13672868
 ] 

Tejas Patil commented on NUTCH-1578:


+1. We should go for this.

> Upgrade to Hadoop 1.2.0
> ---
>
> Key: NUTCH-1578
> URL: https://issues.apache.org/jira/browse/NUTCH-1578
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.7, 2.3
>
>
> Hadoop 1.2.0 finally has the ability to run mappers in parallel when running 
> in local mode. In trunk at least the generator seems to run slightly faster.



[jira] [Commented] (NUTCH-1577) Add target for creating eclipse project

2013-06-02 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672650#comment-13672650
 ] 

Tejas Patil commented on NUTCH-1577:


Hi [~wastl-nagel], 
+1 for the suggestion. I have sorted the plugin packages now. Committed to 
trunk (r1488768) and 2.x (r1488770).

> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.



[jira] [Resolved] (NUTCH-1577) Add target for creating eclipse project

2013-05-31 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1577.


Resolution: Fixed

Updated the documentation page 
[RunNutchInEclipse|http://wiki.apache.org/nutch/RunNutchInEclipse] to reflect 
the new steps.

> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.



[jira] [Commented] (NUTCH-1577) Add target for creating eclipse project

2013-05-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671823#comment-13671823
 ] 

Tejas Patil commented on NUTCH-1577:


Committed to trunk at rev1488396. 
My next task is to update the wiki page with the new steps and then close this 
jira.

> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1577) Add target for creating eclipse project

2013-05-31 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1577:
---

Attachment: NUTCH-1577.2.x.patch

Patch for 2.x

> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.2.x.patch, NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.



[jira] [Comment Edited] (NUTCH-1577) Add target for creating eclipse project

2013-05-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671305#comment-13671305
 ] 

Tejas Patil edited comment on NUTCH-1577 at 5/31/13 10:22 AM:
--

Here is a patch for trunk. How to use it:
* on a SVN checkout of trunk, apply the patch
* run "ant eclipse"
* In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give 
the path of the trunk directory.

Initially it would show some errors (red dots) but those will go away after it 
builds the workspace.

  was (Author: tejasp):
Here is a patch for trunk. How to use it:
* on a SVN checkout of trunk, apply the patch
* run "ant eclipse"
* In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give 
the path of the trunk directory.

Initially it would show some errors (red dots) but those will go away after it 
auto-compiles the newly imported project.
  
> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.



[jira] [Updated] (NUTCH-1577) Add target for creating eclipse project

2013-05-31 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1577:
---

Attachment: NUTCH-1577.trunk.patch

Here is a patch for trunk. How to use it:
* on a SVN checkout of trunk, apply the patch
* run "ant eclipse"
* In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give 
the patch of the trunk directory.

Initially it would show some errors (red dots) but those will go away after it 
auto-compiles the newly imported project.

> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.



[jira] [Comment Edited] (NUTCH-1577) Add target for creating eclipse project

2013-05-31 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13671305#comment-13671305
 ] 

Tejas Patil edited comment on NUTCH-1577 at 5/31/13 10:19 AM:
--

Here is a patch for trunk. How to use it:
* on a SVN checkout of trunk, apply the patch
* run "ant eclipse"
* In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give 
the path of the trunk directory.

Initially it would show some errors (red dots) but those will go away after it 
auto-compiles the newly imported project.

  was (Author: tejasp):
Here is a patch for trunk. How to use it:
* on a SVN checkout of trunk, apply the patch
* run "ant eclipse"
* In eclipse: "File" -> "Import" -> "Existing projects into workspace". Give 
the patch of the trunk directory.

Initially it would show some errors (red dots) but those will go away after it 
auto-compiles the newly imported project.
  
> Add target for creating eclipse project
> ---
>
> Key: NUTCH-1577
> URL: https://issues.apache.org/jira/browse/NUTCH-1577
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.6, 2.1
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: build, eclipse
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1577.trunk.patch
>
>
> Currently, loading Nutch source code in Eclipse as a project is cumbersome 
> and involves a lot of manual steps, as described on the 
> [wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
> automate this. Adding an ant target to do that would take this burden off 
> developers.



[jira] [Created] (NUTCH-1577) Add target for creating eclipse project

2013-05-31 Thread Tejas Patil (JIRA)
Tejas Patil created NUTCH-1577:
--

 Summary: Add target for creating eclipse project
 Key: NUTCH-1577
 URL: https://issues.apache.org/jira/browse/NUTCH-1577
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.1, 1.6
Reporter: Tejas Patil
Assignee: Tejas Patil
Priority: Minor
 Fix For: 1.7, 2.2


Currently, loading Nutch source code in Eclipse as a project is cumbersome and 
involves a lot of manual steps, as described on the 
[wiki|http://wiki.apache.org/nutch/RunNutchInEclipse]. It would be great to 
automate this. Adding an ant target to do that would take this burden off 
developers.



[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-23 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665267#comment-13665267
 ] 

Tejas Patil commented on NUTCH-1563:


You pushed it at the right place [~amuseme] :) If there is nothing left to be 
done for this Jira, please close it off.

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, it always returns 
> null.



[jira] [Commented] (NUTCH-1563) FetchSchedule#getFields is never used by GeneraterJob

2013-05-22 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664408#comment-13664408
 ] 

Tejas Patil commented on NUTCH-1563:


I think this is relevant only to 2.x, and [~amuseme.lu] has pushed the patch to 
svn. Is there any work left here?

> FetchSchedule#getFields is never used by GeneraterJob
> -
>
> Key: NUTCH-1563
> URL: https://issues.apache.org/jira/browse/NUTCH-1563
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.1
>Reporter: lufeng
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-1563.patch
>
>
> The getFields method in FetchSchedule is never used, so if a user extends 
> FetchSchedule and wants to get some fields of WebPage, it always returns 
> null.



[jira] [Resolved] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2013-05-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1249.


   Resolution: Fixed
Fix Version/s: 2.2
 Assignee: Tejas Patil  (was: Lewis John McGibbney)

Ported the patch for trunk to 2.x. All the tests are passing (verified on Java 
1.7.0_10 and 1.6.0_38). Committed to svn at rev 1485125.

> Resolve all issues flagged up by adding javac -Xlint arguement
> --
>
> Key: NUTCH-1249
> URL: https://issues.apache.org/jira/browse/NUTCH-1249
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1249.trunk.patch
>
>
> There is a heap of issues flagged up by NUTCH-1237; I think over time it 
> would be great to get these addressed and resolved.
> What is interesting is that adding the same arguments to 
> /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail.
> Some of this stuff is documented in the link below
> http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options



[jira] [Resolved] (NUTCH-1275) Fix [unchecked] javac warnings

2013-05-22 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1275.


   Resolution: Fixed
Fix Version/s: 2.2

Got resolved with NUTCH-1249

> Fix [unchecked] javac warnings
> --
>
> Key: NUTCH-1275
> URL: https://issues.apache.org/jira/browse/NUTCH-1275
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 1.7, 2.2
>
>
> We can simply suppress these warnings using  
> {code}
> @SuppressWarnings("unchecked")
> {code}
> However, if there is another way of resolving these warnings, it should be 
> implemented if deemed beneficial to code quality.
> Some resources 
> http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772
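
The two options mentioned in the description can be sketched as follows. This is 
only an illustration, not code from the NUTCH-1249 patch: the class and method 
names are hypothetical. The first method avoids the warning by parameterizing 
the types; the second suppresses it on a single local variable where a cast 
from an untyped value cannot be avoided.

```java
import java.util.ArrayList;
import java.util.List;

public class UncheckedDemo {

  // Preferred: parameterize the collection so the compiler can check
  // the element type and no [unchecked] warning is produced.
  public static List<String> typedHosts() {
    List<String> hosts = new ArrayList<String>();
    hosts.add("example.org");
    return hosts;
  }

  // When a cast from an untyped value cannot be avoided, keep the
  // suppression on the smallest possible scope: one local variable.
  public static List<String> fromUntyped(Object untyped) {
    @SuppressWarnings("unchecked")
    List<String> hosts = (List<String>) untyped;
    return hosts;
  }

  public static void main(String[] args) {
    List<String> hosts = fromUntyped(typedHosts());
    System.out.println(hosts.get(0));
  }
}
```

Annotating the local variable rather than the whole method or class keeps other 
unchecked operations in the same class visible to -Xlint.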



[jira] [Commented] (NUTCH-1275) Fix [unchecked] javac warnings

2013-05-21 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13663792#comment-13663792
 ] 

Tejas Patil commented on NUTCH-1275:


Hi [~lewismc],
I am working on a patch for 2.x. 

> Fix [unchecked] javac warnings
> --
>
> Key: NUTCH-1275
> URL: https://issues.apache.org/jira/browse/NUTCH-1275
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 1.7
>
>
> We can simply suppress these warnings using  
> {code}
> @SuppressWarnings("unchecked")
> {code}
> However, if there is another way of resolving these warnings, it should be 
> implemented if deemed beneficial to code quality.
> Some resources 
> http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772



[jira] [Assigned] (NUTCH-1275) Fix [unchecked] javac warnings

2013-05-21 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil reassigned NUTCH-1275:
--

Assignee: Tejas Patil

> Fix [unchecked] javac warnings
> --
>
> Key: NUTCH-1275
> URL: https://issues.apache.org/jira/browse/NUTCH-1275
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 1.7
>
>
> We can simply suppress these warnings using  
> {code}
> @SuppressWarnings("unchecked")
> {code}
> However, if there is another way of resolving these warnings, it should be 
> implemented if deemed beneficial to code quality.
> Some resources 
> http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772



[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662571#comment-13662571
 ] 

Tejas Patil commented on NUTCH-1569:


If using some other backend would be overkill, then let's stick to MemStore.
+1 for me too. 

> Upgrade 2.x to Gora 0.3
> ---
>
> Key: NUTCH-1569
> URL: https://issues.apache.org/jira/browse/NUTCH-1569
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, storage
>Affects Versions: 2.2
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.2
>
> Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch
>
>
> We just released the Maven artifacts and I would like to upgrade before we 
> push the RC for 2.2 :)
> Patch coming up



[jira] [Resolved] (NUTCH-1513) Support Robots.txt for Ftp urls

2013-05-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1513.


Resolution: Fixed

Committed to trunk (rev 1484638) and 2.x (rev 1484637)

> Support Robots.txt for Ftp urls
> ---
>
> Key: NUTCH-1513
> URL: https://issues.apache.org/jira/browse/NUTCH-1513
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.7, 2.2
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: robots.txt
> Fix For: 2.3, 1.8
>
> Attachments: NUTCH-1513.2.x.v2.patch, NUTCH-1513.trunk.patch, 
> NUTCH-1513.trunk.v2.patch
>
>
> As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, 
> the Ftp plugin does not parse the robots file and accepts all URLs.
> In "_src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_"
> {noformat}   public RobotRules getRobotRules(Text url, CrawlDatum datum) {
> return EmptyRobotRules.RULES;
>   }{noformat} 
> It's not clear if this was part of the design or if it's a bug. 
> [0] : 
> https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
> [1] : ftp://example.com/robots.txt
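
As an illustration of [1], the robots.txt location for an FTP URL can be derived 
the same way as for HTTP: keep the scheme and authority and replace the path. 
The helper below is a hypothetical sketch using only java.net.URI; it is not 
the code from the attached patches and does not fetch or parse the file.

```java
import java.net.URI;

public class RobotsLocation {

  // Hypothetical helper (not part of Nutch): resolves the absolute path
  // /robots.txt against the given URL, which keeps the scheme and
  // authority (host and optional port) and drops the original path.
  public static URI robotsUriFor(String url) {
    return URI.create(url).resolve("/robots.txt");
  }

  public static void main(String[] args) {
    System.out.println(robotsUriFor("ftp://example.com/pub/docs/readme.txt"));
  }
}
```

A protocol plugin would still need to download that file and hand its content 
to a robots parser instead of returning empty rules.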



[jira] [Commented] (NUTCH-1275) Fix [unchecked] javac warnings

2013-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662554#comment-13662554
 ] 

Tejas Patil commented on NUTCH-1275:


Committed to trunk @ revision 1484634. For patch see NUTCH-1249

> Fix [unchecked] javac warnings
> --
>
> Key: NUTCH-1275
> URL: https://issues.apache.org/jira/browse/NUTCH-1275
> Project: Nutch
>  Issue Type: Sub-task
>  Components: build
>Affects Versions: nutchgora, 1.5
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.7
>
>
> We can simply suppress these warnings using  
> {code}
> @SuppressWarnings("unchecked")
> {code}
> However, if there is another way of resolving these warnings, it should be 
> implemented if deemed beneficial to code quality.
> Some resources 
> http://java.sun.com/docs/books/jls/third_edition/html/conversions.html#190772



[jira] [Commented] (NUTCH-1249) Resolve all issues flagged up by adding javac -Xlint arguement

2013-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662553#comment-13662553
 ] 

Tejas Patil commented on NUTCH-1249:


Committed to trunk @ revision 1484634

> Resolve all issues flagged up by adding javac -Xlint arguement
> --
>
> Key: NUTCH-1249
> URL: https://issues.apache.org/jira/browse/NUTCH-1249
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: nutchgora
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1249.trunk.patch
>
>
> There is a heap of issues flagged up by NUTCH-1237; I think over time it 
> would be great to get these addressed and resolved.
> What is interesting is that adding the same arguments to 
> /src/plugin/plugin-build.xml actually breaks my build as tests begin to fail.
> Some of this stuff is documented in the link below
> http://docs.oracle.com/javase/1.5.0/docs/tooldocs/windows/javac.html#options



[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662540#comment-13662540
 ] 

Tejas Patil commented on NUTCH-1569:


Hey Lewis,
I took a fresh checkout of 2.x and applied that patch. I am using HBase for 
storage. 
About disabling the JUnit tests: is there anything else apart from 'MemStore' 
that can be used? If not, then what we are currently doing should be fine.

> Upgrade 2.x to Gora 0.3
> ---
>
> Key: NUTCH-1569
> URL: https://issues.apache.org/jira/browse/NUTCH-1569
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, storage
>Affects Versions: 2.2
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.2
>
> Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch
>
>
> We just released the Maven artifacts and I would like to upgrade before we 
> push the RC for 2.2 :)
> Patch coming up



[jira] [Resolved] (NUTCH-1053) Parsing of RSS feeds fails

2013-05-20 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil resolved NUTCH-1053.


Resolution: Fixed

Committed to trunk (rev 1484628) and 2.x (rev 1484627). 
NOTE: Currently the feed parser is not supported (and hence disabled) in 2.x.

> Parsing of RSS feeds fails 
> ---
>
> Key: NUTCH-1053
> URL: https://issues.apache.org/jira/browse/NUTCH-1053
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.7
>
> Attachments: nutch-1053.patch, NUTCH-1053.trunk.patch, seed.txt
>
>
> See discussion on 
> http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html
> Have been able to reproduce the problem and will look into it



[jira] [Commented] (NUTCH-1569) Upgrade 2.x to Gora 0.3

2013-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662497#comment-13662497
 ] 

Tejas Patil commented on NUTCH-1569:


I have been running 2.x with this patch for the past few hours and so far have 
not found anything breaking. The crawl is running smoothly. Will keep you posted.

> Upgrade 2.x to Gora 0.3
> ---
>
> Key: NUTCH-1569
> URL: https://issues.apache.org/jira/browse/NUTCH-1569
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, storage
>Affects Versions: 2.2
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.2
>
> Attachments: NUTCH-1569.patch, NUTCH-1569.v2.patch
>
>
> We just released the Maven artifacts and I would like to upgrade before we 
> push the RC for 2.2 :)
> Patch coming up

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (NUTCH-1545) capture batchId and remove references to segments in 2.x crawl script.

2013-05-20 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13662230#comment-13662230
 ] 

Tejas Patil commented on NUTCH-1545:


+1 for commit.

> capture batchId and remove references to segments in 2.x crawl script.
> --
>
> Key: NUTCH-1545
> URL: https://issues.apache.org/jira/browse/NUTCH-1545
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 2.1
>Reporter: Lewis John McGibbney
>Assignee: lufeng
>Priority: Minor
> Fix For: 2.2
>
> Attachments: NUTCH-1545.patch, NUTCH-1545-v2.patch
>
>
> The concept of segment is replaced by batchId in 2.x
> I'm currently getting rid of segments references in 2.x
> This issue was flagged up and separate from NUTCH-1532 which I am working on.



[jira] [Commented] (NUTCH-1573) Upgrade to most recent JUnit 4.x to improve test flexibility

2013-05-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661663#comment-13661663
 ] 

Tejas Patil commented on NUTCH-1573:


Oh... just saw your comment that you have committed it

> Upgrade to most recent JUnit 4.x to improve test flexibility
> 
>
> Key: NUTCH-1573
> URL: https://issues.apache.org/jira/browse/NUTCH-1573
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1573.2.x.v1.patch, NUTCH-1573.2.x.v2.patch
>
>
> I wanted to try using the @Ignore functionality within JUnit, however I don't 
> think it is available in the current JUnit version we use in Nutch. We should 
> upgrade.



[jira] [Commented] (NUTCH-1573) Upgrade to most recent JUnit 4.x to improve test flexibility

2013-05-19 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661662#comment-13661662
 ] 

Tejas Patil commented on NUTCH-1573:


[~lewismc] great !! 
If only there were no homework, my life would be awesome and I could work on 
ASF projects whenever I wanted :(
Anyway, I will verify the patch on my system and update you soon. Let's get 
this change into the repo today !!

> Upgrade to most recent JUnit 4.x to improve test flexibility
> 
>
> Key: NUTCH-1573
> URL: https://issues.apache.org/jira/browse/NUTCH-1573
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1573.2.x.v1.patch, NUTCH-1573.2.x.v2.patch
>
>
> I wanted to try using the @Ignore functionality within JUnit, however I don't 
> think it is available in the current JUnit version we use in Nutch. We should 
> upgrade.



[jira] [Commented] (NUTCH-1573) Upgrade to most recent JUnit 4.x to improve test flexibility

2013-05-18 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661450#comment-13661450
 ] 

Tejas Patil commented on NUTCH-1573:


Hi Lewis,

Quick question: besides modifying the ivy dependency (and then adding the 
@Ignore annotation for NUTCH-1569), is there anything else that needs to be done? 

> Upgrade to most recent JUnit 4.x to improve test flexibility
> 
>
> Key: NUTCH-1573
> URL: https://issues.apache.org/jira/browse/NUTCH-1573
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, test
>Affects Versions: 1.6, 2.1
>Reporter: Lewis John McGibbney
> Fix For: 1.7, 2.2
>
>
> I wanted to try using the @Ignore functionality within JUnit, however I don't 
> think it is available in the current JUnit version we use in Nutch. We should 
> upgrade.



[jira] [Comment Edited] (NUTCH-1566) bin/nutch to allow whitespace in paths

2013-05-17 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661264#comment-13661264
 ] 

Tejas Patil edited comment on NUTCH-1566 at 5/18/13 4:05 AM:
-

Hi Seb,
I tried the patch on a Windows machine with Cygwin and it worked (I have not 
run all possible scenarios exhaustively...just tried a few).

One minor suggestion:
With the current patch, I see this error message (on cygwin console) while 
running nutch in local mode: 
{noformat}cygpath: can't convert empty path{noformat}

I figured out the responsible place (line 115) in the nutch script:
{noformat}NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`{noformat}

As the NUTCH_JOB value is empty while running in local mode, it gave that error 
message. The if block adjusting NUTCH_JOB at lines 113-116 in the [nutch 
script|http://svn.apache.org/viewvc/nutch/trunk/src/bin/nutch?view=markup] 
could be moved into the block just above it to address that. What do you think?

  was (Author: tejasp):
Hi Seb,
I tried the patch on a Windows machine with Cygwin and it worked (I have not 
run all possible scenarios exhaustively...just tried a few).

One minor suggestion:
With the current patch, I see this 
{noformat}cygpath: can't convert empty path{noformat}

I figured out the responsible place (line 115) in the nutch script:
{noformat}NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`{noformat}

As the NUTCH_JOB value is empty while running in local mode, it gave that error 
message. The if block adjusting NUTCH_JOB at lines 113-116 in the [nutch 
script|http://svn.apache.org/viewvc/nutch/trunk/src/bin/nutch?view=markup] 
could be moved into the block just above it to address that. What do you think?
  
> bin/nutch to allow whitespace in paths
> --
>
> Key: NUTCH-1566
> URL: https://issues.apache.org/jira/browse/NUTCH-1566
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6, 2.1
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.7, 2.3
>
> Attachments: NUTCH-1566-trunk.patch
>
>
> bin/nutch and bin/crawl choke if a path contains white space, eg, if 
> JAVA_HOME is "{{C:\Program Files\jdk}}". If you don't have the permission to 
> change the path it is impossible to run Nutch. This has been reported 
> frequently 
> ([1|http://stackoverflow.com/questions/9345629/nutch-cygwin-how-to-set-java-home],
>  
> [2|http://lucene.472066.n3.nabble.com/Problem-running-Nutch-on-Win-7-Cygwin-td3487163.html],
>  and 
> [3|http://nutchinstall.blogspot.de/2007/07/setting-up-cygwin-and-nutch.html]),
>  see also NUTCH-19.



[jira] [Commented] (NUTCH-1566) bin/nutch to allow whitespace in paths

2013-05-17 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13661264#comment-13661264
 ] 

Tejas Patil commented on NUTCH-1566:


Hi Seb,
I tried the patch on a Windows machine with Cygwin and it worked (I have not 
run all possible scenarios exhaustively...just tried a few).

One minor suggestion:
With the current patch, I see this 
{noformat}cygpath: can't convert empty path{noformat}

I figured out the responsible place (line 115) in the nutch script:
{noformat}NUTCH_JOB=`cygpath -p -w "$NUTCH_JOB"`{noformat}

As the NUTCH_JOB value is empty while running in local mode, it gave that error 
message. The if block adjusting NUTCH_JOB at lines 113-116 in the [nutch 
script|http://svn.apache.org/viewvc/nutch/trunk/src/bin/nutch?view=markup] 
could be moved into the block just above it to address that. What do you think?

> bin/nutch to allow whitespace in paths
> --
>
> Key: NUTCH-1566
> URL: https://issues.apache.org/jira/browse/NUTCH-1566
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.6, 2.1
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.7, 2.3
>
> Attachments: NUTCH-1566-trunk.patch
>
>
> bin/nutch and bin/crawl choke if a path contains white space, eg, if 
> JAVA_HOME is "{{C:\Program Files\jdk}}". If you don't have the permission to 
> change the path it is impossible to run Nutch. This has been reported 
> frequently 
> ([1|http://stackoverflow.com/questions/9345629/nutch-cygwin-how-to-set-java-home],
>  
> [2|http://lucene.472066.n3.nabble.com/Problem-running-Nutch-on-Win-7-Cygwin-td3487163.html],
>  and 
> [3|http://nutchinstall.blogspot.de/2007/07/setting-up-cygwin-and-nutch.html]),
>  see also NUTCH-19.


