Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-06 Thread Briggs

This is all I did (and from what I have read, double-checked locking works
correctly in JDK 5):

private static volatile IndexingFilters INSTANCE;

public static IndexingFilters getInstance(final Configuration configuration) {
    if (INSTANCE == null) {
        synchronized (IndexingFilters.class) {
            if (INSTANCE == null) {
                INSTANCE = new IndexingFilters(configuration);
            }
        }
    }
    return INSTANCE;
}

So, I just updated all the code that calls new IndexingFilters(...) to call
IndexingFilters.getInstance(...).  This works for me, though perhaps not for
everyone.  I think the filter interface should be refitted to pass the
configuration instance along to the filters too, or to give a thread a way to
obtain its current configuration, rather than instantiating these things over
and over again.  If a filter is designed to be thread-safe, there is no need
for all this unnecessary object creation.
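One way to sketch the idea in the paragraph above is to cache one filters instance per Configuration rather than forcing a single global singleton. This is only an illustrative sketch using today's java.util.concurrent API, not Nutch code; the cache class is hypothetical, and Configuration/IndexingFilters are replaced by tiny stubs so it stands alone:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-ins for org.apache.hadoop.conf.Configuration and
// org.apache.nutch.indexer.IndexingFilters, so this compiles on its own.
class Configuration {}
class IndexingFilters {
    IndexingFilters(Configuration conf) {}
}

// Hypothetical cache: at most one IndexingFilters per Configuration
// instance, created lazily and shared across threads.
final class IndexingFiltersCache {
    private static final Map<Configuration, IndexingFilters> CACHE =
            new ConcurrentHashMap<>();

    private IndexingFiltersCache() {}

    static IndexingFilters getInstance(Configuration conf) {
        // computeIfAbsent guarantees a single construction per key,
        // even under concurrent callers.
        return CACHE.computeIfAbsent(conf, IndexingFilters::new);
    }
}
```

Threads holding different configurations each get their own instance, but repeated calls with the same configuration never reconstruct the filters.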


On 6/6/07, Briggs [EMAIL PROTECTED] wrote:


FYI, I ran into the same problem.   I wanted my filters to be instantiated
only once, but they not only get instantiated repeatedly, the classloading is
flawed in that it keeps reloading the classes.  So, if you ever dump the stats
from your app (use 'jmap -histo') you can see all the classes that have been
loaded. You will notice, if you have been running nutch for a while, classes
being loaded thousands of times and never unloaded. My quick fix was to just
edit all the main plugin points (URLFilters.java, IndexingFilters.java, etc.)
and make them all singletons.  I haven't had time to look into the
classloading facility.  There is a bit of a bug in there (IMHO), but some
people may not want singletons.  Still, there needs to be a way of
instantiating a new plugin without instantiating a new classloader every time
a plugin is requested.  These classloaders seem to never get garbage
collected.
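The leak described here, a fresh classloader created per plugin request, can be avoided by keying loaders on the plugin id. The following is an illustrative sketch only, not Nutch's actual PluginRepository code; the class name and the empty classpath in the usage are hypothetical:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: one classloader per plugin id, reused on every
// request, so plugin classes are loaded once instead of thousands of times.
final class PluginClassLoaderCache {
    private static final Map<String, ClassLoader> LOADERS =
            new ConcurrentHashMap<>();

    private PluginClassLoaderCache() {}

    static ClassLoader loaderFor(String pluginId, URL[] classpath) {
        // Reuse the loader created for this plugin id the first time;
        // never create a second one for the same id.
        return LOADERS.computeIfAbsent(
                pluginId, id -> new URLClassLoader(classpath));
    }
}
```

With this in place, jmap -histo should show each plugin class loaded once, since repeated lookups return the cached loader.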

Anyway.. that's all I have to say at the moment.



On 6/5/07, Doğacan Güney [EMAIL PROTECTED]  wrote:

 Hi,

 It seems that plugin-loading code is somehow broken. There is some
 discussion going on about this on
 http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y .

 On 6/5/07, Enzo Michelangeli  [EMAIL PROTECTED] wrote:
 I have a question about the loading mechanism of plugin classes. I'm working
 with a custom URLFilter, and I need a singleton object loaded and initialized
 by the first instance of the URLFilter, and shared by other instances (e.g.,
 instantiated by other threads). I was assuming that the URLFilter class was
 being loaded only once even when the filter is used by multiple threads, so I
 tried to use a static member variable of my URLFilter class to hold a
 reference to the object to be shared: but it appears that the supposed
 singleton, actually, isn't, because the method responsible for its
 instantiation finds the static field initialized to null. So: are URLFilter
 classes loaded multiple times by their classloader in Nutch? The wiki page at
 http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem
 seems to suggest otherwise:

   "Until Nutch runtime, only one instance of such a plugin class is alive
   in the Java virtual machine."

 (By the way, what does "Until Nutch runtime" mean here? Before Nutch runtime,
 no class whatsoever is supposed to be alive in the JVM, is it?)

 Enzo
 
 

 --
 Doğacan Güney




--
Conscious decisions by conscious minds are what make reality real





--
Conscious decisions by conscious minds are what make reality real
-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general


[Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Martin Kammerlander
Hi

I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
Unfortunately I get the following error:

[EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test 
-depth 10
crawl started in: crawl.test
rootUrlDir = urls/nutch
threads = 10
depth = 10
Injector: starting
Injector: crawlDb: crawl.test/crawldb
Injector: urlDir: urls/nutch
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Input directory
/scratch/nutch-0.8.1/urls/nutch in local is invalid.
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

Any ideas what is causing that?

regards
martin



Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-06 Thread Briggs

FYI, I ran into the same problem.   I wanted my filters to be instantiated
only once, but they not only get instantiated repeatedly, the classloading is
flawed in that it keeps reloading the classes.  So, if you ever dump the stats
from your app (use 'jmap -histo') you can see all the classes that have been
loaded. You will notice, if you have been running nutch for a while, classes
being loaded thousands of times and never unloaded. My quick fix was to just
edit all the main plugin points (URLFilters.java, IndexingFilters.java, etc.)
and make them all singletons.  I haven't had time to look into the
classloading facility.  There is a bit of a bug in there (IMHO), but some
people may not want singletons.  Still, there needs to be a way of
instantiating a new plugin without instantiating a new classloader every time
a plugin is requested.  These classloaders seem to never get garbage
collected.

Anyway.. that's all I have to say at the moment.



On 6/5/07, Doğacan Güney [EMAIL PROTECTED] wrote:


Hi,

It seems that plugin-loading code is somehow broken. There is some
discussion going on about this on
http://www.nabble.com/forum/ViewPost.jtp?post=10844164&framed=y .

On 6/5/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
 I have a question about the loading mechanism of plugin classes. I'm working
 with a custom URLFilter, and I need a singleton object loaded and initialized
 by the first instance of the URLFilter, and shared by other instances (e.g.,
 instantiated by other threads). I was assuming that the URLFilter class was
 being loaded only once even when the filter is used by multiple threads, so I
 tried to use a static member variable of my URLFilter class to hold a
 reference to the object to be shared: but it appears that the supposed
 singleton, actually, isn't, because the method responsible for its
 instantiation finds the static field initialized to null. So: are URLFilter
 classes loaded multiple times by their classloader in Nutch? The wiki page at
 http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem
 seems to suggest otherwise:

   "Until Nutch runtime, only one instance of such a plugin class is alive
   in the Java virtual machine."

 (By the way, what does "Until Nutch runtime" mean here? Before Nutch runtime,
 no class whatsoever is supposed to be alive in the JVM, is it?)

 Enzo



--
Doğacan Güney





--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] Is fetcher.throttle.bandwidth known to work?

2007-06-06 Thread Matthias Jaekle
Hello Enzo,

we never developed a patch for this issue.

I believe back in 2004, in the Nutch 0.4 version, there was another fetcher
module, which was replaced in the 0.5 version.

This fetcher was able to throttle bandwidth, but it was also very buggy.

So the wiki description would be obsolete.

I am not familiar with all the changes since version 0.7, so it might be good
if somebody could change the wiki.

If you are interested to see how this option was implemented, maybe you
could find the old version in CVS.

Regards,

Matthias




Enzo Michelangeli schrieb:
  Hi Matthias,
 
  I'm writing you about the Nutch config file option
  fetcher.throttle.bandwidth , referenced by you at
  http://wiki.apache.org/nutch/FetchOptions . According to Andrzej
  Bialecki in
  the thread
  
http://www.nabble.com/Is--fetcher.throttle.bandwidth-known-to-work--t3861057.html
 

  ,
  that refers to a private patch not part of Nutch's mainline code base. Is
  that patch available from you for submission to the Nutch team?
 
  Thanks,
 
  Enzo
 
 


Enzo Michelangeli schrieb:
 - Original Message - From: Andrzej Bialecki [EMAIL PROTECTED]
 Sent: Tuesday, June 05, 2007 4:56 PM
 
 [...]
 You can achieve a somewhat similar effect by controlling the number of 
 fetcher threads. I realize this is not as accurate as a specific 
 control mechanism, but so far it was sufficient for most users.

 If this feature is important to you, please provide a patch that 
 implements it, and we'll consider it for inclusion.
 
 I think that for the time being I'll just channel the traffic through a 
 Squid proxy, and use its delay pools feature to throttle the bandwidth 
 (and also its DNS caching, which, as I mentioned a few days ago, I also 
 need...). For Nutch, it might make sense to find the original patch. 
 I'll try to get in touch with Matthias Jaekle, who authored that wiki 
 page where fetcher.throttle.bandwidth was referenced.
 
 Thanks anyway,
 
 Enzo
 
 
 



Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Briggs
is urls/nutch a file or directory?

On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote:
 Hi

 I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
 Unfortunately I get the following error:

 [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test 
 -depth 10
 crawl started in: crawl.test
 rootUrlDir = urls/nutch
 threads = 10
 depth = 10
 Injector: starting
 Injector: crawlDb: crawl.test/crawldb
 Injector: urlDir: urls/nutch
 Injector: Converting injected urls to crawl db entries.
 Exception in thread "main" java.io.IOException: Input directory
 /scratch/nutch-0.8.1/urls/nutch in local is invalid.
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
 at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

 Any ideas what is causing that?

 regards
 martin



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Martin Kammerlander
I see now what's causing the error. /urls/nutch is a file... but you have to
give as input only the urls folder, not the file as I did ;)

ps: is there an irc channel for nutch or 'only' mailing list?

thx
martin

Zitat von Briggs [EMAIL PROTECTED]:

 is urls/nutch a file or directory?

 On 6/6/07, Martin Kammerlander [EMAIL PROTECTED]
 wrote:
  Hi
 
  I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
  Unfortunately I get the following error:
 
  [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test 
  -depth 10
  crawl started in: crawl.test
  rootUrlDir = urls/nutch
  threads = 10
  depth = 10
  Injector: starting
  Injector: crawlDb: crawl.test/crawldb
  Injector: urlDir: urls/nutch
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.io.IOException: Input directory
  /scratch/nutch-0.8.1/urls/nutch in local is invalid.
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
 
  Any ideas what is causing that?
 
  regards
  martin
 


 --
 Conscious decisions by conscious minds are what make reality real







Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Bolle, Jeffrey F.
You must give nutch the URL directory.  It reads the text files in
there for the URLs to inject.  In your case this would be /urls.
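To catch this mistake before a crawl even starts, a small pre-flight check on the injector's urlDir argument can help. This is just a sketch with a hypothetical helper class using plain java.io, not part of Nutch:

```java
import java.io.File;

// Hypothetical pre-flight check for the injector's urlDir argument:
// it must be a directory whose files contain the seed URLs.
final class SeedDirCheck {
    static boolean isValidSeedDir(File dir) {
        if (!dir.isDirectory()) {
            return false;   // a plain file such as urls/nutch fails here
        }
        // Require at least one regular file holding seed URLs.
        File[] entries = dir.listFiles(File::isFile);
        return entries != null && entries.length > 0;
    }
}
```

Calling this on "urls" would succeed, while calling it on "urls/nutch" (a file) would fail, matching the injector's complaint.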


Jeff
 

-Original Message-
From: Martin Kammerlander
[mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 06, 2007 12:03 PM
To: [EMAIL PROTECTED]
Subject: Re: urls/nutch in local is invalid

I see now what's causing the error. /urls/nutch is a file... but you have to
give as input only the urls folder, not the file as I did ;)

ps: is there an irc channel for nutch or 'only' mailing list?

thx
martin

Zitat von Briggs [EMAIL PROTECTED]:

 is urls/nutch a file or directory?

 On 6/6/07, Martin Kammerlander
[EMAIL PROTECTED]
 wrote:
  Hi
 
  I wanted to start a crawl like it is done in the nutch 0.8.x
tutorial.
  Unfortunately I get the following error:
 
  [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test
-depth 10
  crawl started in: crawl.test
  rootUrlDir = urls/nutch
  threads = 10
  depth = 10
  Injector: starting
  Injector: crawlDb: crawl.test/crawldb
  Injector: urlDir: urls/nutch
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.io.IOException: Input directory
  /scratch/nutch-0.8.1/urls/nutch in local is invalid.
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
 
  Any ideas what is causing that?
 
  regards
  martin
 


 --
 Conscious decisions by conscious minds are what make reality real







Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Briggs

I haven't heard of an IRC channel for it, but that would be cool.


On 6/6/07, Martin Kammerlander [EMAIL PROTECTED]
wrote:


I see now what's causing the error. /urls/nutch is a file... but you have to
give as input only the urls folder, not the file as I did ;)

ps: is there an irc channel for nutch or 'only' mailing list?

thx
martin

Zitat von Briggs [EMAIL PROTECTED]:

 is urls/nutch a file or directory?

 On 6/6/07, Martin Kammerlander [EMAIL PROTECTED]
 wrote:
  Hi
 
  I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
  Unfortunately I get the following error:
 
  [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test -depth 10
  crawl started in: crawl.test
  rootUrlDir = urls/nutch
  threads = 10
  depth = 10
  Injector: starting
  Injector: crawlDb: crawl.test/crawldb
  Injector: urlDir: urls/nutch
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.io.IOException: Input directory
  /scratch/nutch-0.8.1/urls/nutch in local is invalid.
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
 
  Any ideas what is causing that?
 
  regards
  martin
 


 --
 Conscious decisions by conscious minds are what make reality real








--
Conscious decisions by conscious minds are what make reality real


[Nutch-general] stackoverflow error

2007-06-06 Thread djames

Hi all,

I got a problem with the parser when I try to crawl 2000 sites with a depth
of 3.  I use the nutch 0.8.1 version, and my setup worked well with other
sites, but this list gave me this error:

2007-06-06 13:49:27,997 WARN  mapred.LocalJobRunner - job_qsjobz
java.lang.StackOverflowError
at org.apache.xerces.dom.ParentNode.getLength(Unknown Source)
at
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305)
at
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)
at
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)
at
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)
at
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)
at
org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:347)

I cut the message here because it is very long.

Could someone help me, please? I don't think there is already an answer in
the forum or in Jira.
Thank you very much for your help.
-- 
View this message in context: 
http://www.nabble.com/stackoverflow-error-tf3879034.html#a10992519
Sent from the Nutch - User mailing list archive at Nabble.com.




Re: [Nutch-general] stackoverflow error

2007-06-06 Thread Andrzej Bialecki
djames wrote:
 Hi all,
 
 I got a problem with the parser when I try to crawl 2000 sites with a depth
 of 3.  I use the nutch 0.8.1 version, and my setup worked well with other
 sites, but this list gave me this error:
 
 2007-06-06 13:49:27,997 WARN  mapred.LocalJobRunner - job_qsjobz
 java.lang.StackOverflowError
   at org.apache.xerces.dom.ParentNode.getLength(Unknown Source)
   at
 org.apache.nutch.parse.html.DOMContentUtils.getOutlinks(DOMContentUtils.java:305)

I've seen this on some occasions, but I haven't discovered the real 
reason for this error yet - for now I suggest that you modify the source 
of DOMContentUtils to artificially limit the level of recursion in 
getOutlinks to something like 200-300.
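The suggested workaround, capping the recursion depth in getOutlinks, might look roughly like the sketch below. The Node type is a minimal stub standing in for the DOM node the real DOMContentUtils method walks, and the cap value is only illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for the DOM node that the real
// DOMContentUtils.getOutlinks recurses over.
class Node {
    List<Node> children = new ArrayList<>();
    String outlink;                     // non-null if this node carries a link
}

final class OutlinkWalker {
    static final int MAX_DEPTH = 300;   // in the suggested 200-300 range

    // Collect outlinks, but refuse to recurse past MAX_DEPTH so a
    // pathologically nested page cannot blow the call stack.
    static void getOutlinks(Node node, List<String> out, int depth) {
        if (depth > MAX_DEPTH) {
            return;                     // bail out instead of overflowing
        }
        if (node.outlink != null) {
            out.add(node.outlink);
        }
        for (Node child : node.children) {
            getOutlinks(child, out, depth + 1);
        }
    }
}
```

Links nested deeper than the cap are silently dropped, which is the trade-off this workaround accepts.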


-- 
Best regards,
Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: [Nutch-general] stackoverflow error

2007-06-06 Thread djames

Thanks a lot for your help
I'll give you a feedback
-- 
View this message in context: 
http://www.nabble.com/stackoverflow-error-tf3879034.html#a10993864
Sent from the Nutch - User mailing list archive at Nabble.com.




[Nutch-general] indexing only special documents

2007-06-06 Thread Martin Kammerlander


hi!

I have a question. If I have, for example, the seed urls and do a crawl based
on those seeds, and I then want to index only pages that contain, for example,
pdf documents, how can I do that?

cheers
martin





Re: [Nutch-general] indexing only special documents

2007-06-06 Thread Briggs
You set that up in your nutch-site.xml file. Open the
nutch-default.xml file (located in NUTCH_INSTALL_DIR/conf) and look
for this element:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>


You'll notice the parse plugin entry that uses the regex
parse-(text|html|pdf|msword|rss).  You remove/add the available
parsers here. So, if you only wanted pdfs, you would use only the pdf
parser: parse-(pdf), or just parse-pdf.

Don't edit the nutch-default.xml file. Create a new nutch-site.xml file
for your customizations.  So, basically, copy the nutch-default.xml
file, remove everything you don't need to override, and there ya go.

I believe that is the correct way.
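Since plugin.includes is just a regular expression matched against plugin directory names, you can sanity-check an edited value before deploying it. A small sketch; the pattern shown is a pared-down, hypothetical pdf-only variant, not the full default:

```java
import java.util.regex.Pattern;

final class PluginIncludesCheck {
    // A pared-down, hypothetical plugin.includes value keeping only the
    // PDF parser among the parse plugins.
    static final Pattern INCLUDES = Pattern.compile(
            "protocol-httpclient|urlfilter-regex|nutch-extensionpoints"
            + "|parse-(pdf)|index-basic|query-(basic|site|url)");

    // Nutch activates a plugin only when its directory name matches
    // the whole expression.
    static boolean isIncluded(String pluginDir) {
        return INCLUDES.matcher(pluginDir).matches();
    }
}
```

With this pattern, "parse-pdf" is activated while "parse-html" and "parse-msword" are excluded, which is the effect the edit above is after.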


On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote:


 hi!

 I have a question. If I have for example the seed urls and do a crawl based o
 that seeds. If I want to index then only pages that contain for example pdf
 documents, how can I do that?

 cheers
 martin





-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] indexing only special documents

2007-06-06 Thread Martin Kammerlander
Wow, thx Briggs, that's pretty cool and it looks easy :) great!! I will try
this out right tomorrow... bit late now here.

Another 2 additional questions:

1. Those parse plugins: where do I find them in the nutch source code? And is
it possible and easy to write one's own parser plugin? I think I'm gonna need
some additional non-standard parser plugin(s).

2. When I do a crawl, is it possible to activate or see some statistics in
nutch for that? I mean that at the end of the indexing process it shows me how
many urls nutch parsed, how many of them contained e.g. pdfs, and additionally
how long the crawling and indexing process took, and so on?

thx for support
martin



Zitat von Briggs [EMAIL PROTECTED]:

 You set that up in your nutch-site.xml file. Open the
 nutch-default.xml file (located in NUTCH_INSTALL_DIR/conf) and look
 for this element:

 <property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with
   the underlying commons-httpclient library.
   </description>
 </property>

 You'll notice the parse plugin entry that uses the regex
 parse-(text|html|pdf|msword|rss).  You remove/add the available
 parsers here. So, if you only wanted pdfs, you would use only the pdf
 parser: parse-(pdf), or just parse-pdf.

 Don't edit the nutch-default.xml file. Create a new nutch-site.xml file
 for your customizations.  So, basically, copy the nutch-default.xml
 file, remove everything you don't need to override, and there ya go.

 I believe that is the correct way.


 On 6/6/07, Martin Kammerlander [EMAIL PROTECTED]
 wrote:
 
 
  hi!
 
  I have a question. If I have for example the seed urls and do a crawl based
  on those seeds. If I want to index then only pages that contain for example
  pdf documents, how can I do that?
 
  cheers
  martin
 
 
 


 --
 Conscious decisions by conscious minds are what make reality real







Re: [Nutch-general] stackoverflow error

2007-06-06 Thread Dennis Kubes
This error is due to a webpage with an extreme nesting of <b> and <i> tags.
For example something like <b><i><b><i>.</i></b></i></b>, but thousands
of levels deep.  It is a form of a spider trap.

I just created NUTCH-497 for this issue and attached a very
rudimentary patch as a workaround.  The patch successfully fixes the
problem, but it is not very robust and has no unit tests as of yet.  I
have run it successfully myself.  I will provide a more robust patch
when time allows, but this should help you for now.
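An alternative to capping recursion depth is to walk the DOM iteratively with an explicit stack, which turns the StackOverflowError into ordinary heap usage. A sketch with a stub node type; this is not the actual NUTCH-497 patch:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Stub node type standing in for the parsed DOM.
class DomNode {
    List<DomNode> children = new ArrayList<>();
    String outlink;
}

final class IterativeOutlinks {
    // Depth-first traversal using an explicit stack: nesting depth costs
    // heap, not call-stack frames, so spider-trap pages cannot overflow it.
    static List<String> collect(DomNode root) {
        List<String> out = new ArrayList<>();
        Deque<DomNode> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            DomNode n = stack.pop();
            if (n.outlink != null) {
                out.add(n.outlink);
            }
            // Push children in reverse so they pop in document order.
            for (int i = n.children.size() - 1; i >= 0; i--) {
                stack.push(n.children.get(i));
            }
        }
        return out;
    }
}
```

Unlike a depth cap, this approach still finds links at any nesting depth; the cost is a heap-allocated stack proportional to that depth.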

Dennis Kubes

djames wrote:
 Thanks a lot for your help
 I'll give you a feedback



Re: [Nutch-general] Hadoop oddity

2007-06-06 Thread Dennis Kubes
If the hosts file on the namenode is not setup correctly it could be 
listening only on localhost.  Make sure your /etc/hosts file looks 
something like this:

127.0.0.1   localhost localhost.localdomain
x.x.x.x yourcomputer.domain.tld
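A quick way to check for the symptom described here, the namenode's hostname resolving to loopback, is to resolve it from Java and inspect the result. A sketch with a hypothetical helper class:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class HostsCheck {
    // Returns true when the given hostname resolves to a loopback
    // address -- the misconfiguration that leaves the namenode
    // listening only on localhost, unreachable from slave nodes.
    static boolean resolvesToLoopback(String hostname) {
        try {
            return InetAddress.getByName(hostname).isLoopbackAddress();
        } catch (UnknownHostException e) {
            return false;   // unresolvable is a different problem
        }
    }
}
```

Run this with the hostname the slaves use to reach the master; if it returns true, fix /etc/hosts as shown above.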

Dennis Kubes

Bolle, Jeffrey F. wrote:
 In theory I have a cluster with 4 nodes.  When running something like
 bin/slaves.sh uptime I get the desired results (all four servers
 respond with their uptimes).  However, when I run a crawl only one
 server, the host (which also acts as a slave), appears under the nodes
 display.  This has happened after the primary server died and was
 rebuilt.  Has anyone experienced this before, or does anyone have
 any tips as to where to begin looking for the problem?  Thanks.
  
 Jeff
 



Re: [Nutch-general] Hadoop oddity

2007-06-06 Thread Bolle, Jeffrey F.
The hosts file looks fine...still only showing 1 node.  

Jeff
 

-Original Message-
From: Dennis Kubes [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 06, 2007 7:42 PM
To: [EMAIL PROTECTED]
Subject: Re: Hadoop oddity

If the hosts file on the namenode is not setup correctly it could be 
listening only on localhost.  Make sure your /etc/hosts file looks 
something like this:

127.0.0.1   localhost localhost.localdomain
x.x.x.x yourcomputer.domain.tld

Dennis Kubes

Bolle, Jeffrey F. wrote:
 In theory I have a cluster with 4 nodes.  When running something like
 bin/slaves.sh uptime I get the desired results (all four servers
 respond with their uptimes).  However, when I run a crawl only one
 server, the host (which also acts as a slave), appears under the nodes
 display.  This has happened after the primary server died and was
 rebuilt.  Has anyone experienced this before, or does anyone have
 any tips as to where to begin looking for the problem?  Thanks.
  
 Jeff
 
 



[Nutch-general] Personal data security (Sicurezza dei dati personali)

2007-06-06 Thread Poste Italiane
Title: Poste Italiane

Dear Poste.it customer,

Please examine this email message with the utmost seriousness and without
delay; it presents the new security measures. Our bank's security department
notifies you that measures have been taken to raise the security level of
online banking, in response to frequent attempts to access bank accounts
illegally. To obtain access to the more secure version of the customer area,
please give your authorization.

CLICK HERE TO GO TO THE AUTHORIZATION PAGE »

Best regards,

The security department

CONFIDENTIAL!

This email contains confidential information and is intended for the
authorized recipient only. If you are not an authorized recipient, please
return the email to us and then delete it from your computer and mail client.

You may neither use nor publish any email, including its links, nor make them
accessible to third parties in any manner whatsoever.

Thank you for your cooperation. Poste italiane S.p.A.



[Nutch-general] ParseData encoding problem

2007-06-06 Thread xu xiong
Hi,

I use nutch 0.9 to crawl some Chinese web sites, and search using the nutch
web portal, but I found that the cached html copy displays incorrectly.
Then I used bin/nutch readseg -dump to dump segments:
ParseText (UTF-8) displays correctly, but the Chinese characters in
Content display incorrectly as '?'. The original html uses the gb2312
charset.

What's the possible cause? And how can I fix it?
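The usual cause of this symptom is a charset mismatch: bytes forced through an encoding that cannot represent Chinese text come out as '?', while a matching round trip is lossless. A minimal, self-contained illustration (using US-ASCII as the "wrong" charset, since GB2312 support varies by JRE; the class is purely illustrative):

```java
import java.nio.charset.StandardCharsets;

final class CharsetDemo {
    // Decoding with the same charset that produced the bytes
    // round-trips losslessly.
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.UTF_8);
    }

    // Forcing the text through a charset that cannot represent it
    // substitutes '?' for every unmappable character -- the same
    // symptom as the dumped Content here.
    static String throughAscii(String s) {
        return new String(s.getBytes(StandardCharsets.US_ASCII),
                          StandardCharsets.US_ASCII);
    }
}
```

So the fix is to make sure the Content bytes are decoded with the page's declared charset (gb2312) rather than a default one.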

Thanks in advance,
Xiong
