[jira] Updated: (NUTCH-245) DTD Schemas for plugin.xml configuration files in conf directory

2006-04-11 Thread Chris A. Mattmann (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-245?page=all ]

Chris A. Mattmann updated NUTCH-245:


Attachment: NUTCH-245.Mattmann.patch.txt

Here's the patch for the plugin DTD file. I got a lot of info from:

http://help.eclipse.org/help31/index.jsp?topic=/org.eclipse.platform.doc.isv/reference/misc/plugin_manifest.html

i.e., the Eclipse manifest file. It turns out, though, from examining the 
plugin manifest parser code, that many of the elements Eclipse uses are not 
currently used in Nutch. Additionally, I noticed that the element 
"implementation" can basically have any attribute name/value pair on it 
besides "id" and "class", so I wasn't sure how to represent this in DTD 
terminology other than going through all the plugin.xml files for the Nutch 
plugins and adding an #IMPLIED attribute declaration for each of the optional 
attributes used by the different extension point implementations. Maybe 
there's a more elegant way, but for now this works (I ran all the plugin.xml 
files through my XML validator and they check out).
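For illustration, the "implementation" declaration in such a DTD might look roughly like the sketch below; the optional attribute names (contentType, pathSuffix) are made-up stand-ins for whatever the actual plugin.xml files use, not real Nutch attributes:

```dtd
<!-- Illustrative sketch only: "id" and "class" are required, while the
     per-extension-point attributes are declared #IMPLIED (optional).
     contentType and pathSuffix are hypothetical example names. -->
<!ELEMENT implementation ANY>
<!ATTLIST implementation
  id          CDATA #REQUIRED
  class       CDATA #REQUIRED
  contentType CDATA #IMPLIED
  pathSuffix  CDATA #IMPLIED
>
```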

Okay, thanks!

> DTD Schemas for plugin.xml configuration files in conf directory
> 
>
>  Key: NUTCH-245
>  URL: http://issues.apache.org/jira/browse/NUTCH-245
>  Project: Nutch
> Type: New Feature

>   Components: fetcher, indexer, ndfs, searcher, web gui
> Versions: 0.7.2, 0.7.1, 0.7, 0.6, 0.8-dev
>  Environment: Power PC Dual Processor 2.0 Ghz, Mac OS X 10.4, although 
> improvement is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Attachments: NUTCH-245.Mattmann.patch.txt
>
> Currently, the plugin.xml file does not have a DTD or XML Schema associated 
> with it, and most people just go look at an existing plugin's plugin.xml file 
> to determine what are the allowable elements, etc. There should be an 
> explicit plugin DTD file that describes the plugin.xml file. I'll look at the 
> code and attach a plugin.dtd file for the Nutch conf directory later today. 
> This way, people can use the DTD file to automatically (using tools such as 
> XMLSpy) generate plugin.xml files that can then be validated. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-245) DTD Schemas for plugin.xml configuration files in conf directory

2006-04-11 Thread Chris A. Mattmann (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-245?page=all ]

Chris A. Mattmann updated NUTCH-245:


Description: Currently, the plugin.xml file does not have a DTD or XML 
Schema associated with it, and most people just go look at an existing plugin's 
plugin.xml file to determine what are the allowable elements, etc. There should 
be an explicit plugin DTD file that describes the plugin.xml file. I'll look at 
the code and attach a plugin.dtd file for the Nutch conf directory later today. 
This way, people can use the DTD file to automatically (using tools such as 
XMLSpy) generate plugin.xml files that can then be validated.   (was: 
Currently, the plugin.xml file does not have a DTD or XML Schema associated 
with it, and most people just go look at an existing plugin's plugin.xml file 
to determine what are the allowable elements, etc. There should be an explicit 
plugin DTD file that describes the plugin.xml file. I'll look at the code and 
attach a plugin.dtd file for the Nutch conf directory later today. This way, 
people can use the DTD file to automatically (using tools such as XMLSpy) 
generate plugin.xml files that can then be validated. I'm also going to post 
another issue regarding adding an addition to the ant target that builds the 
Nutch website. The addition to the ant target would copy the existing DTD files 
in $NUTCH_HOME/conf to the Nutch website ROOT. That way, we could then 
reference the DTD file in all the XML instance files with something like 
<!DOCTYPE parse-plugins SYSTEM "http://lucene.apache.org/nutch/dtd/parse-plugins.dtd"> 
within the parse-plugins.xml, or similarly for the nutch-site.xml or 
mime-types.xml file.)

Updated the issue to be just a single issue - I may post the one about copying 
the DTDs to the website at a later point.

> DTD Schemas for plugin.xml configuration files in conf directory
> 
>
>  Key: NUTCH-245
>  URL: http://issues.apache.org/jira/browse/NUTCH-245
>  Project: Nutch
> Type: New Feature

>   Components: fetcher, indexer, ndfs, searcher, web gui
> Versions: 0.7.2, 0.7.1, 0.7, 0.6, 0.8-dev
>  Environment: Power PC Dual Processor 2.0 Ghz, Mac OS X 10.4, although 
> improvement is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor

>
> Currently, the plugin.xml file does not have a DTD or XML Schema associated 
> with it, and most people just go look at an existing plugin's plugin.xml file 
> to determine what are the allowable elements, etc. There should be an 
> explicit plugin DTD file that describes the plugin.xml file. I'll look at the 
> code and attach a plugin.dtd file for the Nutch conf directory later today. 
> This way, people can use the DTD file to automatically (using tools such as 
> XMLSpy) generate plugin.xml files that can then be validated. 




[jira] Created: (NUTCH-247) robot parser to restrict.

2006-04-11 Thread Stefan Groschupf (JIRA)
robot parser to restrict.
-

 Key: NUTCH-247
 URL: http://issues.apache.org/jira/browse/NUTCH-247
 Project: Nutch
Type: Bug

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


If the agent name and the robots agents are not properly configured, the robot 
rules parser uses LOG.severe to log the problem, even though it also resolves it. 
Later on, the fetcher thread checks for severe errors and stops if there is one.


RobotRulesParser:

if (agents.size() == 0) {
  agents.add(agentName);
  LOG.severe("No agents listed in 'http.robots.agents' property!");
} else if (!((String) agents.get(0)).equalsIgnoreCase(agentName)) {
  agents.add(0, agentName);
  LOG.severe("Agent we advertise (" + agentName
             + ") not listed first in 'http.robots.agents' property!");
}

Fetcher.FetcherThread:
 if (LogFormatter.hasLoggedSevere()) // something bad happened
break;  

I suggest using warning or something similar instead of severe to log this 
problem.
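The suggested change could look something like the sketch below; it assumes the java.util.logging Logger that LOG appears to wrap, and extracts the logic into a standalone method purely for illustration - it is not the actual Nutch source or patch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

public class RobotRulesSketch {
  private static final Logger LOG = Logger.getLogger("RobotRulesSketch");

  // Same repair logic as the snippet above, but logged at WARNING: the
  // parser fixes the misconfiguration itself, so no severe error is left
  // for Fetcher.FetcherThread to stop on.
  static List<String> normalizeAgents(List<String> agents, String agentName) {
    if (agents.isEmpty()) {
      agents.add(agentName);
      LOG.warning("No agents listed in 'http.robots.agents' property!");
    } else if (!agents.get(0).equalsIgnoreCase(agentName)) {
      agents.add(0, agentName);
      LOG.warning("Agent we advertise (" + agentName
          + ") not listed first in 'http.robots.agents' property!");
    }
    return agents;
  }

  public static void main(String[] args) {
    List<String> agents = new ArrayList<>();
    System.out.println(normalizeAgents(agents, "NutchBot"));
  }
}
```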





RE: Swap with Nutch

2006-04-11 Thread Ledio Ago
You can go even further and load the entire index into RAM using a RAM disk. How 
big of an index are you talking about?

-Ledio

-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED]
Sent: Tuesday, April 11, 2006 3:51 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Swap with Nutch


larryp wrote:
> Hi, I'm trying to get Nutch to load its index into swap, as I believe it will
> give better performance than having it as a file on the hard drive, since it
> will be mapped as virtual memory. Has anyone ever attempted this - any
> suggestion as to how one might force the index into swap?
>
>
> Thanks in advance
>
> larry
> --
> View this message in context: 
> http://www.nabble.com/Swap-with-Nutch-t1434922.html#a3871982
> Sent from the Nutch - Dev forum at Nabble.com.
>
>   
The FSDirectory in Lucene uses org.apache.lucene.store.MMapDirectory 
under the hood, which already uses memory mapping (basically the same 
mechanism as virtual memory).
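For what it's worth, the memory-mapping mechanism itself is easy to see with plain java.nio, independent of Lucene (this sketch uses modern Java APIs; the file contents are just an example):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapDemo {

  // Read a whole file through a memory mapping: the OS maps the file's pages
  // into the process's virtual address space and pages them in on first
  // access, instead of the program issuing explicit read() calls.
  public static String readMapped(Path path) {
    try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r");
         FileChannel ch = raf.getChannel()) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      byte[] bytes = new byte[buf.remaining()];
      buf.get(bytes);
      return new String(bytes, StandardCharsets.UTF_8);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Round-trip helper: write a temp file, then read it back via the mapping.
  public static String demo() {
    try {
      Path tmp = Files.createTempFile("mmap-demo", ".txt");
      Files.write(tmp, "hello from a mapped file".getBytes(StandardCharsets.UTF_8));
      String s = readMapped(tmp);
      Files.delete(tmp);
      return s;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```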

Dennis


Re: Swap with Nutch

2006-04-11 Thread Dennis Kubes

larryp wrote:

Hi, I'm trying to get Nutch to load its index into swap, as I believe it will
give better performance than having it as a file on the hard drive, since it
will be mapped as virtual memory. Has anyone ever attempted this - any
suggestion as to how one might force the index into swap?


Thanks in advance

larry

  
The FSDirectory in Lucene uses org.apache.lucene.store.MMapDirectory 
under the hood, which already uses memory mapping (basically the same 
mechanism as virtual memory).


Dennis


Swap with Nutch

2006-04-11 Thread larryp

Hi, I'm trying to get Nutch to load its index into swap, as I believe it will
give better performance than having it as a file on the hard drive, since it
will be mapped as virtual memory. Has anyone ever attempted this - any
suggestion as to how one might force the index into swap?


Thanks in advance

larry



Re: Microformats Support - HReview

2006-04-11 Thread mikeyc

Thanks.  I'll go through your rel-tag plugin in version 0.8 and use it as a
basis for adding my hreview code. 
--
View this message in context: 
http://www.nabble.com/Microformats-Support---HReview-t1433896.html#a3869485
Sent from the Nutch - Dev forum at Nabble.com.



Re: Microformats Support - HReview

2006-04-11 Thread Jérôme Charron
> I have noticed that there are the beginnings of microformats support
> (rel-tag) in nutch version 0.8.

Hi Mike, I have created this plugin for playing around a little with
microformats.
It can serve as a kind of "tutorial" for people who want to add support for
further microformats.


>   Is anyone still working on adding other
> microformats (hreview, hcard)?

I don't recall anyone having spoken about this on the lists.


> If so, I would be interested in helping and/or collaborating.  I already
> created a simple hreview parser using nutch version 0.7.

You could, for instance, adapt it for nutch 0.8 and then attach the patch to a
JIRA issue.
(I would be interested in committing it to nutch.)

Regards

--
http://motrech.free.fr/
http://www.frutch.org/


Microformats Support - HReview

2006-04-11 Thread mikeyc

I have noticed that there are the beginnings of microformats support
(rel-tag) in nutch version 0.8.  Is anyone still working on adding other
microformats (hreview, hcard)?  

If so, I would be interested in helping and/or collaborating.  I already
created a simple hreview parser using nutch version 0.7.

-Mike
--
View this message in context: 
http://www.nabble.com/Microformats-Support---HReview-t1433896.html#a3868806
Sent from the Nutch - Dev forum at Nabble.com.



Re: PMD integration

2006-04-11 Thread Jérôme Charron
> > Piotr, please keep oro-2.0.8 in pmd-ext
> I do not agree here - we are going to make a new release next week and
> releasing with two versions of oro does not look nice. oro is quite a
> stable product and the changes are in fact minimal:
> http://svn.apache.org/repos/asf/jakarta/oro/trunk/CHANGES

OK for me.
But we cannot make a release without minimal tests.
(I will run some tests on removing oro from nutch's regex code for the post-0.8 release.)

Jérôme


[jira] Commented: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-11 Thread Chris Schneider (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-246?page=comments#action_12374049 ] 

Chris Schneider commented on NUTCH-246:
---

A few more details:

Stefan and I were able to reproduce this problem using either an injection set 
of 4500 URLs or a larger set of DMOZ URLs. With the 4500 URL injection, only 
653 URLs were generated for the first segment, despite the fact that topN was 
set to 500K. I confirmed that nearly all of the 4500 injected URLs passed our 
URL filter and were actually injected into the crawldb.

To eliminate the possibility that the bug had been fixed recently or was due to 
a code modification that we'd made ourselves, we deployed yesterday's sandbox 
version of nutch (2006-04-10), including hadoop-0.1.1.jar (though I believe 
that Stefan had to build it himself because the nutch-0.8-dev.jar didn't match 
the source). We made the absolute minimum changes to nutch-site.xml, 
hadoop-site.xml, and hadoop-env.sh in order to deploy this version properly in 
our cluster (1 jobtracker/namenode machine, 10 tasktracker/datanode machines). 
However, we got the same results (i.e., very few URLs actually generated).

This bug has apparently been present since at least change 382948, but I 
suspect that it may have been present for the entire history of the mapreduce 
implementation of Nutch. It may also be the root cause of NUTCH-136, the 
explanation for which has always left me somewhat dissatisfied. Even if a 
nutch-site.xml containing default properties overrides the desired mapred 
properties (incorrectly) specified in one of the *-default.xml files, and 
therefore sets mapred.map.tasks and mapred.reduce.tasks back to the defaults (2 
and 1, respectively), it's not clear to me exactly how or why you'd end up with 
only a fraction of topN URLs fetched. As Stefan has suggested, it would actually 
seem more plausible for each tasktracker to try to fetch the entire set of URLs 
in this case.

I would suggest that someone with a good understanding of the hadoop 
implementation investigate the first generation job in fine detail, both for 
the case where the mapred properties are specified in an appropriate manner and 
for the case where nutch-site.xml overrides the desired properties, setting 
them back to the defaults.
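For reference, the mapred properties in question are plain Hadoop property overrides of the following form; the file placement and values below are illustrative examples, not recommendations:

```xml
<!-- Illustrative hadoop-site.xml fragment: explicit task counts that a
     nutch-site.xml full of default properties could silently override. -->
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value>
  </property>
</configuration>
```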

> segment size is never as big as topN or crawlDB size in a distributed 
> deployement
> -
>
>  Key: NUTCH-246
>  URL: http://issues.apache.org/jira/browse/NUTCH-246
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Fix For: 0.8-dev

>
> I didn't reopen NUTCH-136 since it may be related to the hadoop split.
> I tested this on two different deployments (one with 10 tasktrackers + 1 
> jobtracker and one with 9 tasktrackers + 1 jobtracker).
> Defining the map and reduce task numbers in a mapred-default.xml (which is in 
> nutch/conf on all boxes) does not solve the problem.
> We verified that it is not a problem of maximum urls per host and also not 
> a problem of the url filter.
> It looks like the first job of the Generator (Selector) already gets too few 
> entries to process. 
> Maybe this is somehow related to split generation or to configuration inside 
> the distributed jobtracker, since it runs in a different jvm than the jobclient.
> However, we were not able to find the source of this problem.
> I think this should be fixed before publishing nutch 0.8. 




[jira] Created: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployement

2006-04-11 Thread Stefan Groschupf (JIRA)
segment size is never as big as topN or crawlDB size in a distributed 
deployement
-

 Key: NUTCH-246
 URL: http://issues.apache.org/jira/browse/NUTCH-246
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


I didn't reopen NUTCH-136 since it may be related to the hadoop split.
I tested this on two different deployments (one with 10 tasktrackers + 1 
jobtracker and one with 9 tasktrackers + 1 jobtracker).
Defining the map and reduce task numbers in a mapred-default.xml (which is in 
nutch/conf on all boxes) does not solve the problem.
We verified that it is not a problem of maximum urls per host and also not a 
problem of the url filter.

It looks like the first job of the Generator (Selector) already gets too few 
entries to process. 
Maybe this is somehow related to split generation or to configuration inside 
the distributed jobtracker, since it runs in a different jvm than the jobclient.
However, we were not able to find the source of this problem.

I think this should be fixed before publishing nutch 0.8. 







Re: nighly build brocken?

2006-04-11 Thread Byron Miller
I didn't even think about that. trying it out now :)

thanks,
-byron

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> Hi Byron,
> 
> This sounds like the url filter problem.
> Please try to remove the "-.*(/.+?)/.*?\1/.*?\1/" rule from
> regex-urlfilter.txt, just as a test, and tell us whether that
> solves the problem.
> Thanks.
> Stefan
> Stefan
> Am 11.04.2006 um 14:43 schrieb Byron Miller:
> 
> > I get the nightly to run, but it never completes
> > anything; it always gets stuck at 98% here and there.
> > I'll try today's build and see what happens.
> >
> > --- Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>
> >> looks like the latest nightly build is broken.
> >> Looks like the jar that comes with the nightly
> build
> >> contains some
> >> patches that are not yet in the svn sources.
> >> Is someone able to get the latest nutch nightly
> to
> >> run?
> >>
> >> Thanks.
> >> Stefan
> >>
> >>
> >>
> >
> >
> ---
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net


Re: nighly build brocken?

2006-04-11 Thread Stefan Groschupf

Hi Byron,

This sounds like the url filter problem.
Please try to remove the "-.*(/.+?)/.*?\1/.*?\1/" rule from 
regex-urlfilter.txt, just as a test, and tell us whether that 
solves the problem.
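For context: that rule (minus its leading "-", which only marks it as an exclude pattern in regex-urlfilter.txt) fires when a URL's path repeats the same segment at least three times, a common symptom of crawler traps. A quick illustration with java.util.regex, using made-up example URLs:

```java
import java.util.regex.Pattern;

public class UrlFilterDemo {
  // The pattern from regex-urlfilter.txt without the leading '-'
  // (backslashes doubled for the Java string literal). The backreferences
  // (\1) make it fire when one path segment, e.g. "/foo", occurs three
  // times; such backtracking-heavy patterns can also be expensive, which is
  // why removing the rule is a useful test.
  static final Pattern TRAP = Pattern.compile(".*(/.+?)/.*?\\1/.*?\\1/");

  static boolean excluded(String url) {
    return TRAP.matcher(url).find();
  }

  public static void main(String[] args) {
    // Hypothetical example URLs: the first repeats "/foo" three times.
    System.out.println(excluded("http://host/foo/bar/foo/baz/foo/qux"));
    System.out.println(excluded("http://host/foo/bar/baz"));
  }
}
```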

Thanks.
Stefan
Am 11.04.2006 um 14:43 schrieb Byron Miller:


I get the nightly to run, but it never completes anything; it
always gets stuck at 98% here and there. I'll try
today's build and see what happens.

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:


Hi,

It looks like the latest nightly build is broken: the jar that comes
with the nightly build contains some patches that are not yet in the
svn sources.
Is anyone able to get the latest nutch nightly to run?

Thanks.
Stefan








---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net




Re: nighly build brocken?

2006-04-11 Thread Byron Miller
I get the nightly to run, but it never completes anything; it
always gets stuck at 98% here and there. I'll try
today's build and see what happens.

--- Stefan Groschupf <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> looks like the latest nightly build is broken.
> Looks like the jar that comes with the nightly build
> contains some  
> patches that are not yet in the svn sources.
> Is someone able to get the latest nutch nightly to
> run?
> 
> Thanks.
> Stefan
> 
> 
>