Compilation errors at revision 638548

2008-03-18 Thread All day coders
Hi there. Following the instructions at this link:
http://wiki.apache.org/nutch/RunNutchInEclipse0.9

I checked out Nutch and configured it within Eclipse. Then I noticed there were
some compilation errors, mainly caused by methods whose signatures had changed.
I believe they are now fixed, and I'm attaching a patch with the fixes.
Index: /home/data/software/java/nutch/nutch-svn/contrib/web2/src/main/java/org/apache/nutch/webapp/common/WebAppModule.java
===
--- /home/data/software/java/nutch/nutch-svn/contrib/web2/src/main/java/org/apache/nutch/webapp/common/WebAppModule.java	(revision 638548)
+++ /home/data/software/java/nutch/nutch-svn/contrib/web2/src/main/java/org/apache/nutch/webapp/common/WebAppModule.java	(working copy)
@@ -158,8 +158,8 @@
   Element pattern = (Element) mapping.getElementsByTagName("url-pattern")
   .item(0);
 
-  String servletName = servlet.getTextContent().trim();
-  String urlPattern = pattern.getTextContent().trim();
+  String servletName = servlet.getNodeValue().trim();
+  String urlPattern = pattern.getNodeValue().trim();
 
   servlets.put(urlPattern, servletName);
   urlPatterns.add(urlPattern);
Index: /home/data/software/java/nutch/nutch-svn/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java
===
--- /home/data/software/java/nutch/nutch-svn/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java	(revision 638548)
+++ /home/data/software/java/nutch/nutch-svn/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java	(working copy)
@@ -51,7 +51,7 @@
   setArtist(value);
 
 if (name.indexOf("URL Link") > -1) {
-  links.add(new Outlink(value, "", this.conf));
+  links.add(new Outlink(value, ""));
 } else if (name.indexOf("Text") > -1) {
   text += value + "\n";
 }
Index: /home/data/software/java/nutch/nutch-svn/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java
===
--- /home/data/software/java/nutch/nutch-svn/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java	(revision 638548)
+++ /home/data/software/java/nutch/nutch-svn/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java	(working copy)
@@ -24,26 +24,21 @@
 import java.net.MalformedURLException;
 import java.util.Iterator;
 
-// Java ID3 Tag imports
-import org.farng.mp3.MP3File;
-import org.farng.mp3.TagException;
-import org.farng.mp3.id3.AbstractID3v2;
-import org.farng.mp3.id3.AbstractID3v2Frame;
-import org.farng.mp3.id3.ID3v1;
-import org.farng.mp3.object.AbstractMP3Object;
-
-// Hadoop imports
 import org.apache.hadoop.conf.Configuration;
-
-// Nutch imports
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.parse.Parse;
 import org.apache.nutch.parse.ParseData;
-import org.apache.nutch.parse.ParseException;
 import org.apache.nutch.parse.ParseImpl;
+import org.apache.nutch.parse.ParseResult;
 import org.apache.nutch.parse.ParseStatus;
 import org.apache.nutch.parse.Parser;
 import org.apache.nutch.protocol.Content;
+import org.farng.mp3.MP3File;
+import org.farng.mp3.TagException;
+import org.farng.mp3.id3.AbstractID3v2;
+import org.farng.mp3.id3.AbstractID3v2Frame;
+import org.farng.mp3.id3.ID3v1;
+import org.farng.mp3.object.AbstractMP3Object;
 
 
 /**
@@ -55,7 +50,7 @@
   private MetadataCollector metadataCollector;
   private Configuration conf;
 
-  public Parse getParse(Content content) {
+  public ParseResult getParse(Content content) {
 
 Parse parse = null;
 byte[] raw = content.getContent();
@@ -73,22 +68,25 @@
   } else if (mp3.hasID3v1Tag()) {
 parse = getID3v1Parse(mp3, content.getMetadata());
   } else {
-return new ParseStatus(ParseStatus.FAILED,
+parse = new ParseStatus(ParseStatus.FAILED,
ParseStatus.FAILED_MISSING_CONTENT,
"No textual content available").getEmptyParse(conf);
+return ParseResult.createParseResult(content.getUrl(), parse);
   }
 } catch (IOException e) {
-  return new ParseStatus(ParseStatus.FAILED,
+  parse = new ParseStatus(ParseStatus.FAILED,
  ParseStatus.FAILED_EXCEPTION,
  "Couldn't create temporary file:" + e).getEmptyParse(conf);
+  return ParseResult.createParseResult(content.getUrl(), parse);
 } catch (TagException e) {
-  return new ParseStatus(ParseStatus.FAILED,
+  parse = new ParseStatus(ParseStatus.FAILED,
  ParseStatus.FAILED_EXCEPTION,
  "ID3 Tags could not be parsed:" + e).getEmptyParse(conf);
+  return ParseResult.createParseResult(content.getUrl(), parse);
 } finally{
   tmp.delete();
 }
-ret
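
A caution on the WebAppModule.java hunk, offered as a sketch rather than a verified fix: in the W3C DOM, getNodeValue() is defined to return null for Element nodes, so calling .trim() on its result would compile but would likely throw a NullPointerException at runtime. getTextContent() is a DOM Level 3 method, so the compile error may point to an older DOM API on the Eclipse build path (an assumption; the thread does not say). If only DOM Level 2 calls are available, the text can be collected from the element's child nodes instead. The names ElementText, textOf and firstElement below are hypothetical and not part of Nutch:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
class ElementText {

    /** Concatenates the direct text and CDATA children of an element (DOM Level 2 calls only). */
    static String textOf(Element element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                // For text nodes, getNodeValue() does return the character data.
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    /** Parses an XML string and returns the first element with the given tag name. */
    static Element firstElement(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return (Element) doc.getElementsByTagName(tag).item(0);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element pattern = firstElement(
                "<servlet-mapping><url-pattern> /search/* </url-pattern></servlet-mapping>",
                "url-pattern");
        System.out.println(pattern.getNodeValue()); // null for an Element node
        System.out.println(textOf(pattern));        // the trimmed text of the element
    }
}
```

Unlike getTextContent(), this helper does not descend into nested elements, which should be enough for leaf elements such as servlet-name and url-pattern.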

Re: Why is Nutch not involved in Google Summer of Code - 2008?

2008-03-23 Thread All day coders
Well, Susam, I agree with you. I can dedicate some time to the POST-based
authentication (something I've been working on).

Also, I've noticed there's no book about Nutch, which makes things
extremely hard if you want to dive in. I know it takes time to
do such a thing, but maybe we can pool our efforts to create something
close to it.

So, here are the things I miss the most:

- Supported Solr Integration
- POST based authentication

Regards,
   Yoanis







On 3/22/08, Susam Pal <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was wondering why Nutch project is not involved in Google SoC:
> http://code.google.com/soc/2008/ Many Apache projects including
> Commons, Hadoop and Mahout have put up the ideas here:
> http://wiki.apache.org/general/SummerOfCode2008
>
> Wouldn't it be great to have students helping the project out with
> some of the work which noone has found time for? For example, many
> people have requested for a POST based authentication support in
> Nutch. I personally wanted to do it after adding HTTP Authentication
> Schemes, but unfortunately I could never manage my time well to do it
> since it would require a good deal of effort. I am sure, there are
> many such ideas which have not been done because the contributors did
> not get time. IMHO, it would be great if students are given
> opportunity to contribute through GSoC 2008. The mentors can guide
> them through the work for a few hours every week and some valuable
> work can be done. What do you say?
>
> Regards,
> Susam Pal
>


Re: Why is Nutch not involved in Google Summer of Code - 2008?

2008-03-24 Thread All day coders
Sishen:
I'm not very good at organizing things, but I'm looking forward to doing it.
Are you a student?

Susam, would it be asking too much for you to share your experience of
how you came up with the HTTP authentication for Nutch? I spent a
couple of days struggling with the code, but I didn't make much progress. I
guess I'm missing the big picture (something that happens quite often when
trying to extend Nutch, at least for me).



On Mon, Mar 24, 2008 at 4:04 AM, sishen <[EMAIL PROTECTED]> wrote:

> I'm also looking forward to solr integration to nutch.
>
> On Mon, Mar 24, 2008 at 2:39 AM, All day coders <[EMAIL PROTECTED]>
> wrote:
>
> > Well Susam I agree with you. I can dedicate some time to the POST
> > based authentication(something i've been working on).
> >
> > Also, i've noticed there's no book about nutch, which makes things
> > extremely hard  if you want to dive in.  Well, I know it takes time to
> > do such a thing but maybe we can put our efforts to create something
> > closer to it.
> >
> > So, here are the things I miss the most:
> >
> > - Supported Solr Integration
> > - POST based authentication
> >
> > Regards,
> >   Yoanis
> >
> >
> >
> >
> >
> >
> >
> > On 3/22/08, Susam Pal <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > I was wondering why Nutch project is not involved in Google SoC:
> > > http://code.google.com/soc/2008/ Many Apache projects including
> > > Commons, Hadoop and Mahout have put up the ideas here:
> > > http://wiki.apache.org/general/SummerOfCode2008
> > >
> > > Wouldn't it be great to have students helping the project out with
> > > some of the work which noone has found time for? For example, many
> > > people have requested for a POST based authentication support in
> > > Nutch. I personally wanted to do it after adding HTTP Authentication
> > > Schemes, but unfortunately I could never manage my time well to do it
> > > since it would require a good deal of effort. I am sure, there are
> > > many such ideas which have not been done because the contributors did
> > > not get time. IMHO, it would be great if students are given
> > > opportunity to contribute through GSoC 2008. The mentors can guide
> > > them through the work for a few hours every week and some valuable
> > > work can be done. What do you say?
> > >
> > > Regards,
> > > Susam Pal
> > >
> >
>


Re: Nutch Crawling - Failed for internet crawling

2008-05-24 Thread All day coders
Do you mind attaching the configuration files? They are easier to read
that way. The hadoop.log file will be useful too (if it is too big, please
compress it).

On Wed, May 21, 2008 at 1:27 AM, Sivakumar_NCS <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
> I am a newbie to crawling and am exploring the possibilities of crawling
> internet websites from my work PC. My work environment uses a proxy to
> access the web.
> So I have configured the proxy information under /conf/ by
> overriding nutch-site.xml. The XML is attached for reference.
>
> <configuration>
> <property>
>  <name>http.agent.name</name>
>  <value>ABC</value>
>  <description>ABC</description>
> </property>
> <property>
>  <name>http.agent.description</name>
>  <value>Acompany</value>
>  <description>A company</description>
> </property>
> <property>
>  <name>http.agent.url</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
>  <name>http.agent.email</name>
>  <value></value>
>  <description></description>
> </property>
> <property>
>  <name>http.timeout</name>
>  <value>1</value>
>  <description>The default network timeout, in milliseconds.</description>
> </property>
> <property>
>  <name>http.max.delays</name>
>  <value>100</value>
>  <description>The number of times a thread will delay when trying to
>  fetch a page.  Each time it finds that a host is busy, it will wait
>  fetcher.server.delay.  After http.max.delays attepts, it will give
>  up on the page for now.</description>
> </property>
> <property>
>  <name>plugin.includes</name>
>  <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>  <description>Regular expression naming plugin directory names to
>  include.  Any plugin not matching this expression is excluded.
>  In any case you need at least include the nutch-extensionpoints plugin. By
>  default Nutch includes crawling just HTML and plain text via HTTP,
>  and basic indexing and search plugins. In order to use HTTPS please enable
>  protocol-httpclient, but be aware of possible intermittent problems with the
>  underlying commons-httpclient library.
>  </description>
> </property>
> <property>
>  <name>http.proxy.host</name>
>  <value>proxy.ABC.COM</value>
>  <description>The proxy hostname.  If empty, no proxy is used.</description>
> </property>
> <property>
>  <name>http.proxy.port</name>
>  <value>8080</value>
>  <description>The proxy port.</description>
> </property>
> <property>
>  <name>http.proxy.username</name>
>  <value>ABCUSER</value>
>  <description>Username for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  NOTE: For NTLM authentication, do not prefix the username with the
>  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
>  </description>
> </property>
> <property>
>  <name>http.proxy.password</name>
>  <value>X</value>
>  <description>Password for proxy. This will be used by
>  'protocol-httpclient', if the proxy server requests basic, digest
>  and/or NTLM authentication. To use this, 'protocol-httpclient' must
>  be present in the value of 'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.proxy.realm</name>
>  <value>ABC</value>
>  <description>Authentication realm for proxy. Do not define a value
>  if realm is not required or authentication should take place for any
>  realm. NTLM does not use the notion of realms. Specify the domain name
>  of NTLM authentication as the value for this property. To use this,
>  'protocol-httpclient' must be present in the value of
>  'plugin.includes' property.
>  </description>
> </property>
> <property>
>  <name>http.agent.host</name>
>  <value>xxx.xxx.xxx.xx</value>
>  <description>Name or IP address of the host on which the Nutch crawler
>  would be running. Currently this is used by 'protocol-httpclient'
>  plugin.
>  </description>
> </property>
> </configuration>
>
> my crawl-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
> +^http://([a-z0-9]*\.)*yahoo.com/
>
> # skip everything else
> -.
>
>
> my regex-urlfilter.txt is as follows:
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times

Boolean query

2008-06-20 Thread All day coders
Hi there, list! I wonder if it's possible to create a boolean query using the
Nutch API. For instance, I would like to create a query like this:

(site:site1.domain.org OR site:site2.domain.org) AND (football)

Is it possible to do something like that? I spent some time searching the
source code but I didn't find anything.

Regards
Yoanis.
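
One possible workaround, sketched under assumptions rather than taken from the Nutch API: if the query parser in use has no OR operator (not confirmed in this thread), the OR over sites can be emulated by running one query per site and merging the results. The sketch assumes that space-separated terms are all required and that the site: field from the query-site plugin is enabled. SiteUnionSearch and its methods are hypothetical names, and the search function passed in stands in for whatever call actually runs a query and returns result URLs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch.
class SiteUnionSearch {

    /** Builds one query string per site, e.g. "site:site1.domain.org football". */
    static List<String> perSiteQueries(List<String> sites, String terms) {
        List<String> queries = new ArrayList<>();
        for (String site : sites) {
            queries.add("site:" + site + " " + terms);
        }
        return queries;
    }

    /**
     * Runs every per-site query through the supplied search function and merges
     * the hits, keeping first-seen order and dropping duplicate URLs.
     */
    static List<String> searchAnySite(List<String> sites, String terms,
                                      Function<String, List<String>> search) {
        Set<String> merged = new LinkedHashSet<>();
        for (String query : perSiteQueries(sites, terms)) {
            merged.addAll(search.apply(query));
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Stand-in search function that only echoes the query it was given.
        Function<String, List<String>> echo = q -> Arrays.asList("result for [" + q + "]");
        List<String> sites = Arrays.asList("site1.domain.org", "site2.domain.org");
        for (String hit : searchAnySite(sites, "football", echo)) {
            System.out.println(hit);
        }
    }
}
```

The merged list is ordered by site and not by a global relevance score, so results would need re-sorting if ranking across sites matters.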


Re: how do add a new filed and sort on this field

2008-06-20 Thread All day coders
On Thu, Jun 19, 2008 at 3:08 AM, Mr Shore <[EMAIL PROTECTED]> wrote:

> I've just launched nutch in eclipse very hard
> and now I want to add this sorting feature


I assume you want to sort your search results by a certain field. If that's
the case, the Nutch API provides ways of doing so. Take a look at
org.apache.nutch.searcher.NutchBean.search (there are a few overloads).
It would be nice if you explained how you are planning to use Nutch.


>
> but I don't know which file(s) I should modify to make it work.
> I don't want to add a plugin because it's too complicated for me right now;
> what I want to do is just modify a few lines in certain files, and that's all.
> Any advice is greatly appreciated!


Usually, configuring a new plugin just means editing your nutch-site.xml
file and adding the plugin name to the plugin.includes property.

Regards
   Yoanis


Re: how do add a new filed and sort on this field

2008-06-23 Thread All day coders
Well, the field will be both indexed and stored, so its value will be kept
in the index.

On Mon, Jun 23, 2008 at 11:26 AM, Mr Shore <[EMAIL PROTECTED]> wrote:

> I still have a doubt
> will the following statement store a duplicate value of dateString?
>
> new Field("date", dateString, Field.Store.YES, Field.Index.UN_TOKENIZED)
>
> 2008/6/23 Mr Shore <[EMAIL PROTECTED]>:
>
>> It seems I've fixed the problem simply by changing one line in
>> MoreIndexingFilter.java @122
>> doc.add(new Field("date", dateString, Field.Store.NO,
>> Field.Index.UN_TOKENIZED))
>> ==>
>> doc.add(new Field("date", dateString, Field.Store.YES,
>> Field.Index.UN_TOKENIZED))
>> now the results can be sorted by date
>>
>>
>> 2008/6/23 Mr Shore <[EMAIL PROTECTED]>:
>>
>>> I've found the link here
>>> http://wiki.apache.org/nutch/WritingPluginExample-0.9
>>> but not the case fit for me...
>>>
>>> 2008/6/23 Mr Shore <[EMAIL PROTECTED]>:
>>>
>>> Sorry for my late reply,just return from travelling...
>>>> I think you got me right but the solution is not the case
>>>> in my opinion I should modify the indexer to make it sortable by certain
>>>> field,not the searcher
>>>> is it?
>>>> or could you provide a link?
>>>> thanks very much
>>>>
>>>> 2008/6/21 All day coders <[EMAIL PROTECTED]>:
>>>>
>>>>
>>>>>
>>>>> On Thu, Jun 19, 2008 at 3:08 AM, Mr Shore <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> I've just launched nutch in eclipse very hard
>>>>>> and now I want to add this sorting feature
>>>>>
>>>>>
>>>>> I assume you want to sort your search results by a certain field. If
>>>>> that's the case the Nutch API provides
>>>>> ways of doing so. Take a look to org.apache.nutch.searcher.NutchBean.
>>>>> search  (there are a few overload options). Would be nice if you explain 
>>>>> how
>>>>> are you planning to use Nutch.
>>>>>
>>>>>
>>>>>>
>>>>>> but I don't know which file(s) should I modify to make it work?
>>>>>> I don't want to add a plugin because it's too much complicated for me
>>>>>> now
>>>>>> what I want to do is just modify a few rows in certain files and
>>>>>> that's all
>>>>>> Any advice is greatly apreciated!
>>>>>
>>>>>
>>>>> Usually configuring a new plugin is just about modifying your
>>>>> nutch-site.xml file and adding the plugin name in the right place.
>>>>>
>>>>> Regards
>>>>>Yoanis
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: problem with URLS/nutch

2008-06-23 Thread All day coders
Well, if you want to add URLs using the Nutch API, then you should trace the
program until you find the point where the directory containing the list of
URLs is used to load them.
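
A lighter alternative that avoids touching the Hadoop-based code, offered as a sketch under assumptions: keep the URLs in a collection in your own program and write them to a temporary seed directory just before the crawl, then pass that directory where the crawl command expects the urls folder. This assumes the seed directory holds plain text files with one URL per line. SeedDirWriter and its methods are hypothetical names, not part of Nutch:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch.
class SeedDirWriter {

    /** Writes the URLs, one per line, to seed.txt inside a new temporary directory. */
    static Path writeSeedDir(List<String> urls) {
        try {
            Path dir = Files.createTempDirectory("nutch-urls");
            Files.write(dir.resolve("seed.txt"), urls, StandardCharsets.UTF_8);
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the seed file back, mainly to check what was written. */
    static List<String> readSeedDir(Path dir) {
        try {
            return Files.readAllLines(dir.resolve("seed.txt"), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = writeSeedDir(Arrays.asList("http://www.yahoo.com/"));
        // The printed path can then be used in place of the urls folder, e.g.
        // bin/nutch crawl <that path> -dir crawl -depth 3 -topN 50
        System.out.println(dir.toAbsolutePath());
    }
}
```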

On Mon, Jun 23, 2008 at 5:27 AM, yogesh somvanshi <[EMAIL PROTECTED]>
wrote:

> Hello all
>
> I am working on Nutch.
> When you use the standard crawl command, like: bin/nutch crawl urls -dir crawl
> -depth 3 -topN 50
> crawling works well, but I want to remove the need for that urls folder.
> I want to replace the urls folder with some array or map, but when I
> try to make changes to the
> code I see it relies on Hadoop, and it is too hard to change. Is there
> any other option
> for that?
> Yogi