[Nutch-general] Separating nutch and hadoop configurations.

2007-07-11 Thread Briggs
I am currently trying to figure out how to deploy Nutch and Hadoop
separately.  I want to configure Hadoop outside of Nutch and have
Nutch use that service, rather than configuring Hadoop within Nutch.
I would think all that Nutch should need to know is the URLs to
connect to Hadoop, but I can't figure out how to get this to work.

Is this possible?  If so, is there some sort of document, or archive
of another list post for this?

Sorry for the ignorance.


-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Separating nutch and hadoop configurations.

2007-07-11 Thread Briggs
Hey, thanks.  My problem was that I also wanted the Nutch conf out of
the Nutch install dir. So, I did set the NUTCH_CONF_DIR variable in my
.bashrc and couldn't understand why it was never being picked up.  Well,
as it happens, that was the one variable I forgot to export!  Doh!

So, it wasn't hard at all. Though, I needed to replace
hadoop-12.whatever.jar with the latest within the Nutch build.  It
seems to be working. Yay.


Thanks.




On 7/11/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Briggs wrote:
  I am currently trying to figure out how to deploy Nutch and Hadoop
  separately.  I want to configure Hadoop outside of Nutch and have
  Nutch use that service, rather than configuring hadoop within nutch.
  I would think all that Nutch should need to know is the urls to
  connect to Hadoop, but can't figure out how to get this to work.
 
  Is this possible?  If so, is there some sort of document, or archive
  of another list post for this?
 
  Sorry for the ignorance.

 If you have a clean hadoop installation up and running (made e.g. from
 one of the official Hadoop builds), it should be enough to put the
 nutch*.job file in ${hadoop.dir}, and copy bin/nutch (possibly with some
 minor modifications - my memory is a little vague on this ...).


 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] NUTCH-479 Support for OR queries - what is this about

2007-07-09 Thread Briggs
Thanks for the answer. That was helpful.

I was sooo wrong.

On 7/7/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
 Briggs wrote:
  Please keep this thread going as I am also curious to know why this
  has been 'forked'.   I am sure that most of this lies within the
  original OPIC filter but I still can't understand why straightforward
  Lucene queries have not been used within the application.

 No, this has actually almost nothing to do with the scoring filters
 (which were added much later).

 The decision to use a different query syntax than the one from Lucene
 was motivated by a few reasons:

 * to avoid the need to support low-level index and searcher operations,
 which the Lucene API would require us to implement.

 * to keep the Nutch core largely independent of Lucene, so that it's
 possible to use Nutch with different back-end searcher implementations.
 This started to materialize only now, with the ongoing effort to use
 Solr as a possible backend.

 * to limit the query syntax to those queries that provide the best tradeoff
 between functionality and performance in a large-scale search engine.


  On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:

  Ok, so I guess what I don't understand is what is the Nutch query
  syntax?

 Query syntax is defined in an informal way on the Help page in
 nutch.war, or here:

 http://wiki.apache.org/nutch/Features

 Formal syntax definition can be gleaned from
 org.apache.nutch.analysis.NutchAnalysis.jj.



 
  The main discussion I found on nutch-user is this:
  http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
  I was wondering why the query syntax is so limited.
  There are no OR queries, there are no fielded queries,
  or fuzzy, or approximate... Why? The underlying index
  supports all these operations.


 Actually, it's possible to configure Nutch to allow raw field queries -
 you need to add a raw field query plugin for this. Please see the
 RawFieldQueryFilter class, and the existing plugins that use fielded
 queries: query-site and query-more. Query-more / DateQueryFilter is
 especially interesting, because it shows how to use raw token values
 from a parsed query to build complex Lucene queries.
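
 For illustration, a rough sketch of what such a raw field query plugin can
 look like, modeled on the query-site plugin mentioned above. The class name
 KeywordQueryFilter and the "keyword" field are made up for this example, and
 the base-class constructor and Configurable methods are recalled from the
 0.8/0.9 sources, so double-check them against your Nutch version:

 // Hypothetical "query-keyword" plugin, in the style of query-site.
 // It exposes raw queries against a "keyword" index field.
 import org.apache.hadoop.conf.Configuration;
 import org.apache.nutch.searcher.RawFieldQueryFilter;

 public class KeywordQueryFilter extends RawFieldQueryFilter {

   private Configuration conf;

   public KeywordQueryFilter() {
     // Tell the base class which raw field this filter handles;
     // RawFieldQueryFilter does the actual query translation.
     super("keyword");
   }

   // QueryFilter extends Configurable, so the subclass carries the Configuration.
   public void setConf(Configuration conf) {
     this.conf = conf;
   }

   public Configuration getConf() {
     return conf;
   }
 }

 The plugin would also need the usual plugin.xml descriptor and an entry in
 plugin.includes before Nutch picks it up.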


 
  I notice by looking at the or.patch file
  (https://issues.apache.org/jira/secure/attachment/12360659/or.patch)
  that one of the programs under consideration is:
  nutch/searcher/Query.java
  The code for this is distinct from
  lucene/search/Query.java


 See above - they are completely different classes, with completely
 different purposes. The use of the same class name is unfortunate and
 misleading.

 Nutch Query class is intended to express queries entered by search
 engine users, in a tokenized and parsed way, so that the rest of Nutch
 may deal with Clauses, Terms and Phrases instead of plain String-s.

 On the other hand, Lucene Query is intended to express arbitrarily
 complex Lucene queries - many of these queries would be prohibitively
 expensive for a large search engine (e.g. wildcard queries).
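
 As a minimal illustration of that split (a sketch assuming the 0.8/0.9 search
 API, where Query.parse() takes the raw string plus a Configuration and
 NutchBean.search() takes the parsed Query; searcher.dir is assumed to point
 at an existing crawl):

 // The user's query string becomes an org.apache.nutch.searcher.Query,
 // not a Lucene Query; the bean translates it for the back-end searcher.
 import org.apache.hadoop.conf.Configuration;
 import org.apache.nutch.searcher.Hits;
 import org.apache.nutch.searcher.NutchBean;
 import org.apache.nutch.searcher.Query;
 import org.apache.nutch.util.NutchConfiguration;

 public class SearchSketch {
   public static void main(String[] args) throws Exception {
     Configuration conf = NutchConfiguration.create();
     NutchBean bean = new NutchBean(conf);   // reads searcher.dir from the conf

     // Parse the raw user input into Nutch's Query (clauses, terms, phrases).
     Query query = Query.parse("nutch hadoop", conf);

     // Ask the bean for the top 10 hits; no Lucene classes appear here.
     Hits hits = bean.search(query, 10);
     System.out.println("total hits: " + hits.getTotal());
   }
 }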


 
  It looks like this is an architecture issue that I don't understand.
  If nutch is an extension of lucene, why does it define a different
  Query class?

 Nutch is NOT an extension of Lucene. It's an application that uses
 Lucene as a library.


   Why don't we just use the Lucene code to query the
  indexes?  Does this have something to do with the nutch webapp
  (nutch.war)?  What is the historical genesis of this issue (or is that
  even relevant)?

 Nutch webapp doesn't have anything to do with it. The limitations in the
 query syntax have different roots (see above).

 --
 Best regards,
 Andrzej Bialecki 
   ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com




-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] NUTCH-479 Support for OR queries - what is this about

2007-07-07 Thread Briggs
Please keep this thread going as I am also curious to know why this
has been 'forked'.   I am sure that most of this lies within the
original OPIC filter but I still can't understand why straightforward
Lucene queries have not been used within the application.



On 7/6/07, Kai_testing Middleton [EMAIL PROTECTED] wrote:
 I've been reading up on NUTCH-479 Support for OR queries but I must be 
 missing something obvious because I don't understand what the JIRA is about:

 https://issues.apache.org/jira/browse/NUTCH-479

Description:
There have been many requests from users to extend Nutch query syntax to add
support for OR queries, in addition to the implicit AND and NOT queries
supported now.

 Ok, so I guess what I don't understand is what is the Nutch query syntax?

 The main discussion I found on nutch-user is this:
 http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
 I was wondering why the query syntax is so limited.
 There are no OR queries, there are no fielded queries,
 or fuzzy, or approximate... Why? The underlying index
 supports all these operations.

 I notice by looking at the or.patch file 
 (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one 
 of the programs under consideration is:
 nutch/searcher/Query.java
 The code for this is distinct from
 lucene/search/Query.java

 It looks like this is an architecture issue that I don't understand.  If 
 nutch is an extension of lucene, why does it define a different Query 
 class?  Why don't we just use the Lucene code to query the indexes?  Does 
 this have something to do with the nutch webapp (nutch.war)?  What is the 
 historical genesis of this issue (or is that even relevant)?






 


-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Reload index

2007-06-20 Thread Briggs
Strange... Here is the quoted, unedited, partially incorrect post... ;-)


I would say that the best thing to do is to create a new nutch bean.

I never cared much for the nutch bean containing logic to store itself
in a servlet context.  I do not believe that this is the place for
such logic.  It should be up to the user to place the nutch bean into
the servlet context and not the bean.  My implementation of a nutch
bean has no knowledge of a servlet context and I believe this
dependency should be removed.  Why should nutch care about such
details?

Anyway, enough with my tiny rant.

You could just create a 'reload.jsp' (or any servlet, or whatever you
want that can get ahold of the servlet context) and do the work...

The current way nutch finds an instance of the search bean is within
the static method
get(ServletContext, Configuration) within the NutchBean class.

So, in your java class, jsp or whatever, just replace the instance
with something like:

servletContext.setAttribute("nutchBean", new NutchBean(yourConfiguration));

Hope that gets you on your way.

You could always edit, or subclass the nutch bean with a
'reload/reinit' method too that could just do the same thing.

On 6/20/07, Naess, Ronny [EMAIL PROTECTED] wrote:
 Thanks, Briggs.

 I will try to create a new NutchBean to see if that solves the reloading
 issue.

 By the way, your former mail does not seem to have reached the
 mailing list. I can't seem to find it anyway.

 -Ronny

 -Original message-
 From: Briggs [mailto:[EMAIL PROTECTED]
 Sent: 20 June 2007 01:22
 To: [EMAIL PROTECTED]
 Subject: Re: Reload index

 By the way, I was wrong about one thing, you can't override the 'get'
 method of nutch bean because it's static. Doh, that was a silly
 oversight.

 But again, if you are using nutch and you need to 'reload' the index,
 you need only to create a new NutchBean (that is if the NutchBean is
 what you are using).

 On 6/19/07, Naess, Ronny [EMAIL PROTECTED] wrote:
  This will reload the application, isn't this correct? This is
  something I do not want, as specified below.

  Is it possible to maybe manipulate the IndexReader part of the Nutch
  web client to read whenever I tell it to, or something like that?

  Or do I have to write my own client from the bottom up?
 
  Regards,
  Ronny
 
  -Original message-
  From: Susam Pal [mailto:[EMAIL PROTECTED]
  Sent: 18 June 2007 17:33
  To: [EMAIL PROTECTED]
  Subject: Re: Reload index
 
  touch $CATALINA_HOME/ROOT/webapps/WEB-INF/web.xml
 
  $CATALINA_HOME is the top level directory of Tomcat. It works for most

  cases.
 
  Regards,
  Susam Pal
  http://susam.in/
 
  On 6/18/07, Naess, Ronny [EMAIL PROTECTED] wrote:
  
   Is there a way to reload the index without restarting the application
   server or reloading the application?

   I have integrated Nutch into our app but we cannot restart or
   reload the app every time we have created a new index.
  
  
   Regards,
   Ronny
  
 
 
 
 


 --
 Conscious decisions by conscious minds are what make reality real





-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Reload index

2007-06-19 Thread Briggs
By the way, I was wrong about one thing, you can't override the 'get'
method of nutch bean because it's static. Doh, that was a silly
oversight.

But again, if you are using nutch and you need to 'reload' the index,
you need only to create a new NutchBean (that is if the NutchBean is
what you are using).

On 6/19/07, Naess, Ronny [EMAIL PROTECTED] wrote:
 This will reload the application, isn't this correct? This is something
 I do not want, as specified below.

 Is it possible to maybe manipulate the IndexReader part of the Nutch web
 client to read whenever I tell it to, or something like that?

 Or do I have to write my own client from the bottom up?

 Regards,
 Ronny

 -Original message-
 From: Susam Pal [mailto:[EMAIL PROTECTED]
 Sent: 18 June 2007 17:33
 To: [EMAIL PROTECTED]
 Subject: Re: Reload index

 touch $CATALINA_HOME/ROOT/webapps/WEB-INF/web.xml

 $CATALINA_HOME is the top level directory of Tomcat. It works for most
 cases.

 Regards,
 Susam Pal
 http://susam.in/

 On 6/18/07, Naess, Ronny [EMAIL PROTECTED] wrote:
 
  Is there a way to reload the index without restarting the application
  server or reloading the application?

  I have integrated Nutch into our app but we cannot restart or reload
  the app every time we have created a new index.
 
 
  Regards,
  Ronny
 





-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Reload index

2007-06-18 Thread Briggs
I would say that the best thing to do is to create a new nutch bean.

I never cared much for the nutch bean containing logic to store itself
in a servlet context.  I do not believe that this is the place for
such logic.  It should be up to the user to place the nutch bean into
the servlet context and not the bean.  My implementation of a nutch
bean has no knowledge of a servlet context and I believe this
dependency should be removed.  Why should nutch care about such
details?

Anyway, enough with my tiny rant.

You could just create a 'reload.jsp' (or any servlet, or whatever you
want that can get ahold of the servlet context) and do the work...

The current way nutch finds an instance of the search bean is within
the static method
get(ServletContext, Configuration) within the NutchBean class.

So, in your java class, jsp or whatever, just replace the instance
with something like:

servletContext.setAttribute("nutchBean", new NutchBean(yourConfiguration));

Hope that gets you on your way.

You could always edit, or subclass the nutch bean with a
'reload/reinit' method too that could just do the same thing.
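
A minimal sketch of that approach as a servlet (assuming the 0.8/0.9
NutchBean(Configuration) constructor; the "nutchBean" attribute key is the one
NutchBean.get() appears to look up, so verify it against your version):

// "Reload" servlet that swaps in a fresh NutchBean so subsequent searches
// see the new index. Mapping it to a URL like /reload is up to web.xml.
import java.io.IOException;
import javax.servlet.ServletContext;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class ReloadIndexServlet extends HttpServlet {
  protected void doGet(HttpServletRequest req, HttpServletResponse res)
      throws ServletException, IOException {
    ServletContext ctx = getServletContext();
    Configuration conf = NutchConfiguration.create();

    // Replace the cached bean; the old instance is simply dropped here.
    ctx.setAttribute("nutchBean", new NutchBean(conf));
    res.getWriter().println("index reloaded");
  }
}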

On 6/18/07, Susam Pal [EMAIL PROTECTED] wrote:
 touch $CATALINA_HOME/ROOT/webapps/WEB-INF/web.xml

 $CATALINA_HOME is the top level directory of Tomcat. It works for most cases.

 Regards,

 Susam Pal
 http://susam.in/

 On 6/18/07, Naess, Ronny [EMAIL PROTECTED] wrote:
 
  Is there a way to reload the index without restarting the application
  server or reloading the application?

  I have integrated Nutch into our app but we cannot restart or reload
  the app every time we have created a new index.
 
 
  Regards,
  Ronny
 



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] fetch failing while crawling

2007-06-15 Thread Briggs
Yeah, you still don't have the agent configured.  All your values for
the agent (the value/value needs a value) are blank.  So, you need
to at least confugure an agent name.



On 6/15/07, karan thakral [EMAIL PROTECTED] wrote:
 I'm using crawl under Cygwin while working on Windows,

 but the crawl output is not proper.

 During fetch it says the document could not be fetched: java runtime
 exception, agent not configured.

 My nutch-site.xml is as follows:

 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>
 <property>
   <name>http.agent.name</name>
   <value></value>
   <description>HTTP 'User-Agent' request header. MUST NOT be empty -
   please set this to a single word uniquely related to your organization.

   NOTE: You should also check other related properties:

   http.robots.agents
   http.agent.description
   http.agent.url
   http.agent.email
   http.agent.version

   and set their values appropriately.

   </description>
 </property>

 <property>
   <name>http.agent.description</name>
   <value></value>
   <description>Further description of our bot- this text is used in
   the User-Agent header.  It appears in parenthesis after the agent name.
   </description>
 </property>

 <property>
   <name>http.agent.url</name>
   <value></value>
   <description>A URL to advertise in the User-Agent header.  This will
 appear in parenthesis after the agent name. Custom dictates that this
 should be a URL of a page explaining the purpose and behavior of this
 crawler.
   </description>
 </property>

 <property>
   <name>http.agent.email</name>
   <value></value>
   <description>An email address to advertise in the HTTP 'From' request
 header and User-Agent header. A good practice is to mangle this
 address (e.g. 'info at example dot com') to avoid spamming.
   </description>
 </property>
 </configuration>

   but still there's an error.

 Also, please throw some light on searching for information through the web
 interface after the crawl is made successful.
 --
 With Regards
 Karan Thakral



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] fetch failing while crawling

2007-06-15 Thread Briggs
Oh and as for the web interface, take a look at the wiki page:

http://wiki.apache.org/nutch/NutchTutorial

The bottom of the page has a section on searching.

On 6/15/07, Briggs [EMAIL PROTECTED] wrote:
  Yeah, you still don't have the agent configured.  All your values for
  the agent (each <value></value> needs a value) are blank.  So, you need
  to at least configure an agent name.



 On 6/15/07, karan thakral [EMAIL PROTECTED] wrote:
   I'm using crawl under Cygwin while working on Windows,

   but the crawl output is not proper.

   During fetch it says the document could not be fetched: java runtime
   exception, agent not configured.

   My nutch-site.xml is as follows:
 
   <?xml version="1.0"?>
   <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

   <!-- Put site-specific property overrides in this file. -->

   <configuration>
   <property>
     <name>http.agent.name</name>
     <value></value>
     <description>HTTP 'User-Agent' request header. MUST NOT be empty -
     please set this to a single word uniquely related to your organization.

     NOTE: You should also check other related properties:

     http.robots.agents
     http.agent.description
     http.agent.url
     http.agent.email
     http.agent.version

     and set their values appropriately.

     </description>
   </property>

   <property>
     <name>http.agent.description</name>
     <value></value>
     <description>Further description of our bot- this text is used in
     the User-Agent header.  It appears in parenthesis after the agent name.
     </description>
   </property>

   <property>
     <name>http.agent.url</name>
     <value></value>
     <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
     </description>
   </property>

   <property>
     <name>http.agent.email</name>
     <value></value>
     <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
     </description>
   </property>
   </configuration>
 
     but still there's an error.

   Also, please throw some light on searching for information through the web
   interface after the crawl is made successful.
  --
  With Regards
  Karan Thakral
 


 --
 Conscious decisions by conscious minds are what make reality real



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Explanation of topN

2007-06-08 Thread Briggs
Well, the quick/simple explanation is:

If you have 5 URLs with their associated Nutch scores:

http://a.com/something1 = 5.0
http://b.com/something2 = 4.0
http://c.com/something3 = 3.0
http://d.com/something4 = 2.0
http://e.com/something5 = 1.0

If you then set Nutch to crawl with topN = 3, then a, b and c will be fetched
and d and e will not.  It just means "give me the 3 best-ranking URLs
from the current crawl database".

On 6/8/07, monkeynuts84 [EMAIL PROTECTED] wrote:

 Can someone give me an explanation of what topN does? I've seen various
 pieces of info but some of them seem to be conflicting. I've noticed in my
 crawls that certain sites are crawled more than others in each iteration of a
 fetch. Is this caused by topN?

 Thanks.
 --
 View this message in context: 
 http://www.nabble.com/Explanation-of-topN-tf3891964.html#a11033441
 Sent from the Nutch - User mailing list archive at Nabble.com.




-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] indexing only special documents

2007-06-07 Thread Briggs

Ronny, your way is probably better.  See, I was only dealing with the
fetched properties.  But, in your case, you don't fetch it, which gets rid
of all that wasted bandwidth.

For dealing with types that can be dealt with via the file extension, this
would probably work better.


On 6/7/07, Naess, Ronny [EMAIL PROTECTED] wrote:



Hi.

Configure crawl-urlfilter.txt.
So you want to add something like +\.pdf$  I guess another way would be
to exclude all others.

Try expanding the line below with html, doc, xls, ppt, etc.:
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$

Or try including
+\.pdf$
#
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|JS|dojo|DOJO|jsp|JSP)$
followed by
-.

Haven't tried it myself, but experiment some and I guess you'll figure it
out pretty soon.

Regards,
Ronny

-Original message-
From: Martin Kammerlander [mailto:[EMAIL PROTECTED]

Sent: 6 June 2007 20:30
To: [EMAIL PROTECTED]
Subject: indexing only special documents



hi!

I have a question. Say I have, for example, the seed URLs and do a crawl
based on those seeds. If I then want to index only pages that contain, for
example, PDF documents, how can I do that?

cheers
martin








--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-06 Thread Briggs

This is all I did (and from what I have read, double-checked locking
works correctly in JDK 5):

// Lazily created, shared instance; volatile is required for
// double-checked locking to be safe under the Java 5 memory model.
private static volatile IndexingFilters INSTANCE;

public static IndexingFilters getInstance(final Configuration configuration) {
  if (INSTANCE == null) {
    synchronized (IndexingFilters.class) {
      // Re-check inside the lock in case another thread created it first.
      if (INSTANCE == null) {
        INSTANCE = new IndexingFilters(configuration);
      }
    }
  }
  return INSTANCE;
}

So, I just updated all the code that calls new IndexingFilters(..) to call
IndexingFilters.getInstance(...).  This works for me, though perhaps not for
everyone.  I think the filter interface should be refitted to allow the
configuration instance to be passed along to the filters too, or to allow a way
for the thread to obtain its current configuration, rather than
instantiating these things over and over again.  If a filter is designed to
be thread-safe, there is no need for all this unnecessary object creation.
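
With that patch in place, the call-site change described above is just the
following (a sketch; getInstance() only exists with the modification shown
earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.indexer.IndexingFilters;
import org.apache.nutch.util.NutchConfiguration;

public class FilterCallSite {
  public static void main(String[] args) {
    Configuration conf = NutchConfiguration.create();
    // Old call site: IndexingFilters filters = new IndexingFilters(conf);
    // New call site: reuse the lazily created shared instance.
    IndexingFilters filters = IndexingFilters.getInstance(conf);
    System.out.println("got filters: " + filters);
  }
}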


On 6/6/07, Briggs [EMAIL PROTECTED] wrote:


FYI, I ran into the same problem.   I wanted my filters to be instantiated
only once, and they not only get instantiated repeatedly, but the
classloading is flawed in that it keeps reloading the classes.  So, if you
ever dump the stats from your app (use 'jmap -histo') you can see all the
classes that have been loaded. You will notice, if you have been running
nutch for a while,  classes being loaded thousands of times and never
unloaded. My quick fix was to just edit all the main plugin points (
URLFilters.java, IndexFilters.java etc) and made them all singletons.  I
haven't had time to look into the classloading facility.  There is a bit of
a bug in there (IMHO), but some people may not want singletons.  But, there
needs to be a way of just instantiating a new plugin, and not instantiating
a new classloader everytime a plugin is requested.  These seem to never get
garbage collected.

Anyway.. that's all I have to say at the moment.



On 6/5/07, Doğacan Güney [EMAIL PROTECTED]  wrote:

 Hi,

 It seems that plugin-loading code is somehow broken. There is some
 discussion going on about this on
 http://www.nabble.com/forum/ViewPost.jtp?post=10844164framed=y .

 On 6/5/07, Enzo Michelangeli  [EMAIL PROTECTED] wrote:
  I have a question about the loading mechanism of plugin classes. I'm
 working
  with a custom URLFilter, and I need a singleton object loaded and
  initialized by the first instance of the URLFilter, and shared by
 other
  instances (e.g., instantiated by other threads). I was assuming that
 the
  URLFilter class was being loaded only once even when the filter is
 used by
  multiple threads, so I tried to use a static member variable of my
 URLFilter
  class to hold a reference to the object to be shared: but it appears
 that
  the supposed singleton, actually, isn't, because the method
 responsible for
  its instantiation finds the static field initialized to null. So: are
  URLFilter classes loaded multiple times by their classloader in Nutch?
 The
  wiki page at
 
 
http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem
  seems to suggest otherwise:
 
  "Until Nutch runtime, only one instance of such a plugin
  class is alive in the Java virtual machine."

  (By the way, what does "Until Nutch runtime" mean here? Before Nutch
  runtime, no class whatsoever is supposed to be alive in the JVM, is
 it?)
 
  Enzo
 
 

 --
 Doğacan Güney




--
Conscious decisions by conscious minds are what make reality real





--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] Loading mechanism of plugin classes and singleton objects

2007-06-06 Thread Briggs

FYI, I ran into the same problem.   I wanted my filters to be instantiated
only once, and they not only get instantiated repeatedly, but the
classloading is flawed in that it keeps reloading the classes.  So, if you
ever dump the stats from your app (use 'jmap -histo') you can see all the
classes that have been loaded. You will notice, if you have been running
nutch for a while,  classes being loaded thousands of times and never
unloaded. My quick fix was to just edit all the main plugin points (
URLFilters.java, IndexFilters.java etc) and made them all singletons.  I
haven't had time to look into the classloading facility.  There is a bit of
a bug in there (IMHO), but some people may not want singletons.  But, there
needs to be a way of just instantiating a new plugin, and not instantiating
a new classloader everytime a plugin is requested.  These seem to never get
garbage collected.

Anyway.. that's all I have to say at the moment.



On 6/5/07, Doğacan Güney [EMAIL PROTECTED] wrote:


Hi,

It seems that plugin-loading code is somehow broken. There is some
discussion going on about this on
http://www.nabble.com/forum/ViewPost.jtp?post=10844164framed=y .

On 6/5/07, Enzo Michelangeli [EMAIL PROTECTED] wrote:
 I have a question about the loading mechanism of plugin classes. I'm
working
 with a custom URLFilter, and I need a singleton object loaded and
 initialized by the first instance of the URLFilter, and shared by other
 instances (e.g., instantiated by other threads). I was assuming that the
 URLFilter class was being loaded only once even when the filter is used
by
 multiple threads, so I tried to use a static member variable of my
URLFilter
 class to hold a reference to the object to be shared: but it appears
that
 the supposed singleton, actually, isn't, because the method responsible
for
 its instantiation finds the static field initialized to null. So: are
 URLFilter classes loaded multiple times by their classloader in Nutch?
The
 wiki page at

http://wiki.apache.org/nutch/WhichTechnicalConceptsAreBehindTheNutchPluginSystem
 seems to suggest otherwise:

 "Until Nutch runtime, only one instance of such a plugin
 class is alive in the Java virtual machine."

 (By the way, what does "Until Nutch runtime" mean here? Before Nutch
 runtime, no class whatsoever is supposed to be alive in the JVM, is it?)

 Enzo



--
Doğacan Güney





--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Briggs
is urls/nutch a file or directory?

On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote:
 Hi

 I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
 Unfortunately I get the following error:

 [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir crawl.test 
 -depth 10
 crawl started in: crawl.test
 rootUrlDir = urls/nutch
 threads = 10
 depth = 10
 Injector: starting
 Injector: crawlDb: crawl.test/crawldb
 Injector: urlDir: urls/nutch
 Injector: Converting injected urls to crawl db entries.
 Exception in thread "main" java.io.IOException: Input directory
 /scratch/nutch-0.8.1/urls/nutch in local is invalid.
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
 at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)

 Any ideas what is causing that?

 regards
 martin



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] urls/nutch in local is invalid

2007-06-06 Thread Briggs

I haven't heard of an IRC channel for it, but that would be cool.


On 6/6/07, Martin Kammerlander [EMAIL PROTECTED]
wrote:


I see now what's causing the error. /urls/nutch is a file... but you have to
give as input only the urls folder, not the file as I did ;)

PS: is there an IRC channel for Nutch, or 'only' the mailing list?

thx
martin

Zitat von Briggs [EMAIL PROTECTED]:

 is urls/nutch a file or directory?

 On 6/6/07, Martin Kammerlander [EMAIL PROTECTED]
 wrote:
  Hi
 
  I wanted to start a crawl like it is done in the nutch 0.8.x tutorial.
  Unfortunately I get the following error:
 
  [EMAIL PROTECTED] nutch-0.8.1]$ bin/nutch crawl urls/nutch -dir 
crawl.test-depth 10
  crawl started in: crawl.test
  rootUrlDir = urls/nutch
  threads = 10
  depth = 10
  Injector: starting
  Injector: crawlDb: crawl.test/crawldb
  Injector: urlDir: urls/nutch
  Injector: Converting injected urls to crawl db entries.
  Exception in thread "main" java.io.IOException: Input directory
  /scratch/nutch-0.8.1/urls/nutch in local is invalid.
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java
:274)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java
:327)
  at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
  at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
 
  Any ideas what is causing that?
 
  regards
  martin
 


 --
 Conscious decisions by conscious minds are what make reality real








--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] indexing only special documents

2007-06-06 Thread Briggs
You set that up in your nutch-site.xml file. Open the
nutch-default.xml file (located in NUTCH_INSTALL_DIR/conf) and look
for this element:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>


You'll notice the parse plugins that use the regex
parse-(text|html|pdf|msword|rss).  You remove/add the available
parsers here. So, if you only wanted PDFs, you would only use the PDF
parser: parse-(pdf) or just parse-pdf.

Don't edit the nutch-default file. Create a new nutch-site.xml file
for your customizations.  So, basically copy the nutch-default.xml
file, remove everything you don't need to override, and there ya go.

I believe that is the correct way.


On 6/6/07, Martin Kammerlander [EMAIL PROTECTED] wrote:


 hi!

 I have a question. Say I have, for example, the seed URLs and do a crawl based on
 those seeds. If I then want to index only pages that contain, for example, PDF
 documents, how can I do that?

 cheers
 martin





-- 
Conscious decisions by conscious minds are what make reality real



[Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs
So, I have been having huge problems with parsing.  It seems that many
URLs are being ignored because the parser plugins throw an exception
saying there is no parser found for what is, reportedly, an
unresolved contentType.  So, if you look at the exception:

  org.apache.nutch.parse.ParseException: parser not found for
contentType= url=http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl

You can see that it says the contentType is empty.  But, if you look at
the headers for this request you can see that the Content-Type header
is set to text/html:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 13:54:19 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: y1851mbb91.JS1
Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

Is there something that I have set up wrong?  This happens on a LOT of
pages/sites.  My current plugins are set at:

protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)


Here is another URL:

http://www.bionews.org.uk/


Same issue with parsing (parser not found for contentType=
url=http://www.bionews.org.uk/), but the header says:

HTTP/1.0 200 OK
Server: Lasso/3.6.5 ID/ACGI
MIME-Version: 1.0
Content-type: text/html
Content-length: 69417


Any clues?  Does nutch look at the headers or not?


-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

So, here is one:

http://hea.sagepub.com/cgi/alerts

Segment Reader reports:

Content::
Version: 2
url: http://hea.sagepub.com/cgi/alerts
base: http://hea.sagepub.com/cgi/alerts
contentType:
metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.04168
Content:

So, I notice when I try to crawl that url specifically, I get a job failed
(array index out of bounds -1 exception).

But if I use curl like:

curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt

I get content and the headers are:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:03:28 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: xlz2cgcww1.JS1
Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, I'm lost.


On 6/1/07, Doğacan Güney [EMAIL PROTECTED] wrote:


Hi,

On 6/1/07, Briggs [EMAIL PROTECTED] wrote:
 So, I have been having huge problems with parsing.  It seems that many
 urls are being ignored because the parser plugins throw and exception
 saying there is no parser found for, what is reportedly, and
 unresolved contentType.  So, if you look at the exception:

   org.apache.nutch.parse.ParseException: parser not found for
 contentType= url=
http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl

 You can see that it says the contentType is .  But, if you look at
 the headers for this request you can see that the Content-Type header
 is set at text/html:

 HTTP/1.1 200 OK
 Date: Fri, 01 Jun 2007 13:54:19 GMT
 Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
 Cache-Control: no-store
 X-Highwire-SessionId: y1851mbb91.JS1
 Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
 Transfer-Encoding: chunked
 Content-Type: text/html

 Is there something that I have set up wrong?  This happens on a LOT of
 pages/sites.  My current plugins are set at:


protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)


 Here is another URL:

 http://www.bionews.org.uk/


 Same issue with parsing (parrser not found for contentType=
 url=http://www.bionews.org.uk/), but the header says:

 HTTP/1.0 200 OK
 Server: Lasso/3.6.5 ID/ACGI
 MIME-Version: 1.0
 Content-type: text/html
 Content-length: 69417


 Any clues?  Does nutch look at the headers or not?

Can you do a
bin/nutch readseg -get <segment> <url> -noparse -noparsetext
-noparsedata -nofetch -nogenerate

And send the result? This should show us what Nutch fetched as content.



 --
 Conscious decisions by conscious minds are what make reality real



--
Doğacan Güney





--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

Looking into the first URL... Don't look at the second, I screwed up on
that.  It's a Disallow, bad example... But I'm working on finding the segment
for the first. Thanks for your quick response; I'll be getting right back
to you.


http://www.bionews.org.uk/
On 6/1/07, Doğacan Güney [EMAIL PROTECTED] wrote:


Hi,

On 6/1/07, Briggs [EMAIL PROTECTED] wrote:
 So, I have been having huge problems with parsing.  It seems that many
 urls are being ignored because the parser plugins throw and exception
 saying there is no parser found for, what is reportedly, and
 unresolved contentType.  So, if you look at the exception:

   org.apache.nutch.parse.ParseException: parser not found for
 contentType= url=
http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl

 You can see that it says the contentType is .  But, if you look at
 the headers for this request you can see that the Content-Type header
 is set at text/html:

 HTTP/1.1 200 OK
 Date: Fri, 01 Jun 2007 13:54:19 GMT
 Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
 Cache-Control: no-store
 X-Highwire-SessionId: y1851mbb91.JS1
 Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
 Transfer-Encoding: chunked
 Content-Type: text/html

 Is there something that I have set up wrong?  This happens on a LOT of
 pages/sites.  My current plugins are set at:


protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)


 Here is another URL:

 http://www.bionews.org.uk/


 Same issue with parsing (parrser not found for contentType=
 url=http://www.bionews.org.uk/), but the header says:

 HTTP/1.0 200 OK
 Server: Lasso/3.6.5 ID/ACGI
 MIME-Version: 1.0
 Content-type: text/html
 Content-length: 69417


 Any clues?  Does nutch look at the headers or not?

Can you do a
bin/nutch readseg -get <segment> <url> -noparse -noparsetext
-noparsedata -nofetch -nogenerate

And send the result? This should show us what Nutch fetched as content.



 --
 Conscious decisions by conscious minds are what make reality real



--
Doğacan Güney





--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] Content Type Not Resolved Correctly?

2007-06-01 Thread Briggs

Here is another example that keeps saying it can't parse it...

SegmentReader: get '
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir'
Content::
Version: 2
url: http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
base:
http://www.annals.org/cgi/eletter-submit/146/9/621?title=Re%3A+Dear+Sir
contentType:
metadata: nutch.segment.name=20070601050840 nutch.crawl.score=3.5455807E-5
Content:

These are the headers:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:38:15 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Window-Target: _top
X-Highwire-SessionId: nh2ukcdpv1.JS1
Set-Cookie: JServSessionIdroot=nh2ukcdpv1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html



So, that's it. Any ideas?



On 6/1/07, Briggs [EMAIL PROTECTED] wrote:



So, here is one:

http://hea.sagepub.com/cgi/alerts

Segment Reader reports:

Content::
Version: 2
url: http://hea.sagepub.com/cgi/alerts
base: http://hea.sagepub.com/cgi/alerts
contentType:
metadata: nutch.segment.name=20070601045920 nutch.crawl.score=0.04168
Content:

So, I notice when I try to crawl that url specifically, I get a job failed
(array index out of bounds -1 exception).

But if I use curl like:

curl -G http://hea.sagepub.com/cgi/alerts --dump-header header.txt

I get content and the headers are:

HTTP/1.1 200 OK
Date: Fri, 01 Jun 2007 15:03:28 GMT
Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
Cache-Control: no-store
X-Highwire-SessionId: xlz2cgcww1.JS1
Set-Cookie: JServSessionIdroot=xlz2cgcww1.JS1; path=/
Transfer-Encoding: chunked
Content-Type: text/html

So, I'm lost.


On 6/1/07, Doğacan Güney [EMAIL PROTECTED] wrote:

 Hi,

 On 6/1/07, Briggs [EMAIL PROTECTED] wrote:
  So, I have been having huge problems with parsing.  It seems that many
  urls are being ignored because the parser plugins throw and exception
  saying there is no parser found for, what is reportedly, and
  unresolved contentType.  So, if you look at the exception:
 
org.apache.nutch.parse.ParseException: parser not found for
  contentType= url=
 http://hea.sagepub.com/cgi/login?uri=%2Fpolicies%2Fterms.dtl
 
  You can see that it says the contentType is .  But, if you look at
  the headers for this request you can see that the Content-Type header
  is set at text/html:
 
  HTTP/1.1 200 OK
  Date: Fri, 01 Jun 2007 13:54:19 GMT
  Server: Apache/1.3.26 (Unix) DAV/1.0.3 ApacheJServ/1.1.2
  Cache-Control: no-store
  X-Highwire-SessionId: y1851mbb91.JS1
  Set-Cookie: JServSessionIdroot=y1851mbb91.JS1; path=/
  Transfer-Encoding: chunked
  Content-Type: text/html
 
  Is there something that I have set up wrong?  This happens on a LOT of

  pages/sites.  My current plugins are set at:
 
 
 
protocol-httpclient|language-identifier|urlfilter-regex|nutch-extensionpoints|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)|index-more|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

 
 
  Here is another URL:
 
  http://www.bionews.org.uk/
 
 
  Same issue with parsing (parrser not found for contentType=
  url= http://www.bionews.org.uk/), but the header says:
 
  HTTP/1.0 200 OK
  Server: Lasso/3.6.5 ID/ACGI
  MIME-Version: 1.0
  Content-type: text/html
  Content-length: 69417
 
 
  Any clues?  Does nutch look at the headers or not?

 Can you do a
 bin/nutch readseg -get <segment> <url> -noparse -noparsetext
 -noparsedata -nofetch -nogenerate

 And send the result? This should show us what Nutch fetched as content.


 
 
  --
  Conscious decisions by conscious minds are what make reality real
 


 --
 Doğacan Güney




--
Conscious decisions by conscious minds are what make reality real





--
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] Nutch on Windows. ssh: command not found

2007-05-30 Thread Briggs
So, when in Cygwin, if you type 'ssh' (without the quotes), do you get
the same error? If so, then you need to go back into the Cygwin setup
and install ssh.


On 5/30/07, Ilya Vishnevsky [EMAIL PROTECTED] wrote:
 Hello. I try to run shell scripts starting Nutch. I use Windows XP, so I
 installed cygwin. When I execute bin/start-all.sh, I get following
 messages:

 localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh:
 command not found

 localhost: /cygdrive/c/nutch/nutch-0.9/bin/slaves.sh: line 45: ssh:
 command not found

 Could you help me with this problem?



-- 
Conscious decisions by conscious minds are what make reality real



[Nutch-general] Speed up indexing....

2007-05-30 Thread Briggs
Anyone have any good configuration ideas for indexing/merging with 0.9
using hadoop on a local fs?  Our segment merging is taking an
extremely long time compared with nutch 0.7.  Currently, I am trying
to merge 300 segments, which amounts to about 1gig of data.  It has
taken hours to merge, and it's still not done. This box has dual zeon
2.8ghz processors with 4 gigs of ram.

So, I figure there must be a better setup in the mapred-default.xml
for a single machine.  Do I increase the file size for I/O buffers,
sort buffers, etc.?  Do I reduce the number of tasks or increase them?
 I'm at a loss.

Any advice would be greatly appreciated.


-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Problem crawling in Nutch 0.9

2007-05-14 Thread Briggs
Just curious, did you happen to limit the number of urls using the
topN switch?

On 5/14/07, Annona Keene [EMAIL PROTECTED] wrote:
 I recently upgraded to 0.9, and I've started encountering a problem. I began 
 with a single url and crawled with a depth of 10, assuming I would get every 
 page on my site. This same configuration worked for me in 0.8.  However, I 
 noticed a particular url that I was especially interested in was not in the 
 index. So I added the url explicitly and crawled again. And it still was not 
 in the index. So I checked the logs, and it is being fetched. So I tried a 
 lower depth, and it worked. With a depth of 6, the url does appear in the 
 index. Any ideas on what would be causing this? I'm very confused.

 Thanks,
 Ann







-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Nutch Indexer

2007-05-01 Thread Briggs
I would assume that it needs these for handling the indexing of the
link scores.  Lucene puts no scoring weight on things such as URLs,
page rank and such. Since Lucene only indexes documents, and
calculates its keyword/query relevancy based only on term vectors (or
whatever), Nutch needs to add the URL scoring and such to the index.



On 5/1/07, hzhong [EMAIL PROTECTED] wrote:

 Hello,

 In Indexer.java,  index(Path indexDir, Path crawlDb, Path linkDb, Path[]
 segments), can someone explain to me why crawlDB and linkDB is needed for
 indexing?

 In Lucene, there's no crawlDB and linkDB for indexing.

 Thank you very much

 Hanna
 --
 View this message in context: 
 http://www.nabble.com/Nutch-Indexer-tf3673420.html#a10264625
 Sent from the Nutch - User mailing list archive at Nabble.com.




-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Nutch Indexer

2007-05-01 Thread Briggs
Man, I should proofread this stuff before I send them. That is all I
have to say.

On 5/1/07, Briggs [EMAIL PROTECTED] wrote:
 I would assume that it need these for handling the indexing of the
 link scores.  Lucene puts no scoring weight on things such as urls,
 page rank and such. Since lucene only indexes documents, and
 calculates its keyword/query relevancy based only on term vectors (or
 whatever) nutch needs to add the url scoring and such to the index.



 On 5/1/07, hzhong [EMAIL PROTECTED] wrote:
 
  Hello,
 
  In Indexer.java,  index(Path indexDir, Path crawlDb, Path linkDb, Path[]
  segments), can someone explain to me why crawlDB and linkDB is needed for
  indexing?
 
  In Lucene, there's no crawlDB and linkDB for indexing.
 
  Thank you very much
 
  Hanna
  --
  View this message in context: 
  http://www.nabble.com/Nutch-Indexer-tf3673420.html#a10264625
  Sent from the Nutch - User mailing list archive at Nabble.com.
 
 


 --
 Conscious decisions by conscious minds are what make reality real



-- 
Conscious decisions by conscious minds are what make reality real



[Nutch-general] Nutch and running crawls within a container.

2007-04-30 Thread Briggs
Version:  Nutch 0.9 (but this applies to just about all versions)

I'm really in a bind.

Is anyone crawling from within a web application, or is everyone
running Nutch using the shell scripts provided?  I am trying to write
a web application around the Nutch crawling facilities, but it seems
that there are huge memory issues when trying to do this.   The
container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K
on the stack) runs out of memory in less than an hour.  When profiling
version 0.7.2 we can see that there is a constant pool of objects that
grow, but never get garbage collected.  So, even when the crawl is
finished, these objects tend to just hang around forever, until we get
the wonderful: java.lang.OutOfMemoryError: PermGen space.  I updated
the application to use Nutch 0.9 and the problem got about 80x worse
(it use to run for about 16 hours, now it runs out of memory in 20
minutes).  We were using 5 concurrent crawlers, meaning we have
Crawl.main running 5 times within the application.

So, the current design is/was to have an event happen within the
system, which would fire off a crawler (currently just calls
org.apache.nutch.crawl.Crawl.main()).  But, this has caused nothing
but grief.  We need to have several crawlers running concurrently. We
didn't want large 'batch' jobs.  The requirement is to crawl a domain
as it comes into the system and not wait for days or hours to run the
job.

Has anyone else attempted to run the crawl in this manner?  Have you
run into the same problems?  Does controlling the fetcher and all the
other instances needed for crawling solve this issue?  There is
nothing in the org.apache.nutch.crawl.Crawl instance, from what I had
seen in the past, that would cause such a memory leak.  This must be
way down somewhere else in the code.

Since Nutch handles so much of its threading, could this be causing the problem?

I am not sure if I should x-post this to the dev group or not.

Anyway, thanks.

Briggs



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Nutch and running crawls within a container.

2007-04-30 Thread Briggs
Well, in nutch 0.7 it was all due to NGramEntry instances held within
hashmaps that never get cleaned up. This code was in the language
plugin, but it has been moved into the nutch codebase.

That wasn't the only problem, but that was a big one.  I thought
removing it would solve the problem, but then another crept up.

On 4/30/07, Sami Siren [EMAIL PROTECTED] wrote:
 Briggs wrote:
  Version:  Nutch 0.9 (but this applies to just about all versions)
 
  I'm really in a bind.
 
  Is anyone crawling from within a web application, or is everyone
  running Nutch using the shell scripts provided?  I am trying to write
  a web application around the Nutch crawling facilities, but it seems
  that there is are huge memory issues when trying to do this.   The
  container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K
  on the stack) runs out of memory in less that an hour.  When profiling
  version 0.7.2 we can see that there is a constant pool of objects that
  grow, but never get garbage collected.  So, even when the crawl is
  finished, these objects tend to just hang around forever, until we get
  the wonderful: java.lang.OutOfMemoryError: PermGen space.  I updated
  the application to use Nutch 0.9 and the problem got about 80x worse

 Have you analyzed in any level of detail what is causing this memory
 wasting?  Have you tried tweaking jvms XX:MaxPermSize?

 I believe that all the classes required by plugins need to be loaded
 multiple times (every time you execute a command where Configuration
 object is created) because of the design of plugin system where every
 plugin has it's own class loader (per configuration).

  So, the current design is/was to have an event happen within the
  system, which would fire off a crawler (currently just calls
  org.apache.nutch.crawl.Crawl.main()).  But, this has caused nothing
  but grief.  We need to have several crawlers running concurrently. We

 You should perhaps use and call the classes directly and take control of
 managing the Configuration object, this way PermGen size is not wasted
 by loading same classes over and over again.

 --
  Sami Siren



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Nutch and running crawls within a container.

2007-04-30 Thread Briggs
I'll look around the code to make sure I am creating only one instance
of Configuration in my classes, and will play around with the
maxpermgen settings.
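
(For anyone attempting the same, a minimal sketch of the "one shared Configuration" idea; NutchConfiguration.create() is the usual factory, but treat the tool constructors below as assumptions to verify against the 0.9 source:)

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.util.NutchConfiguration;

// Build the Configuration once and hand the same instance to every tool,
// instead of letting each Crawl.main() call create its own (and, with it,
// a fresh set of plugin classloaders that eat PermGen).
Configuration conf = NutchConfiguration.create();
Injector injector = new Injector(conf);
Fetcher fetcher   = new Fetcher(conf);
// ... drive inject/generate/fetch/updatedb with these long-lived instances ...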

Any other input from people that have attempted this sort of setup
would be appreciated.

On 4/30/07, Briggs [EMAIL PROTECTED] wrote:
 Well, in nutch 0.7 it was all due to NGramEntry instances held within
 hashmaps that never get cleaned up. This code was in the language
 plugin, but it has been moved into the nutch codebase.

 That wasn't the only problem, but that was a big one.  I though
 removing it would solve the problem, but then another creeped up.

 On 4/30/07, Sami Siren [EMAIL PROTECTED] wrote:
  Briggs wrote:
   Version:  Nutch 0.9 (but this applies to just about all versions)
  
   I'm really in a bind.
  
   Is anyone crawling from within a web application, or is everyone
   running Nutch using the shell scripts provided?  I am trying to write
   a web application around the Nutch crawling facilities, but it seems
   that there is are huge memory issues when trying to do this.   The
   container (tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K
   on the stack) runs out of memory in less that an hour.  When profiling
   version 0.7.2 we can see that there is a constant pool of objects that
   grow, but never get garbage collected.  So, even when the crawl is
   finished, these objects tend to just hang around forever, until we get
   the wonderful: java.lang.OutOfMemoryError: PermGen space.  I updated
   the application to use Nutch 0.9 and the problem got about 80x worse
 
  Have you analyzed in any level of detail what is causing this memory
  wasting?  Have you tried tweaking jvms XX:MaxPermSize?
 
  I believe that all the classes required by plugins need to be loaded
  multiple times (every time you execute a command where Configuration
  object is created) because of the design of plugin system where every
  plugin has it's own class loader (per configuration).
 
   So, the current design is/was to have an event happen within the
   system, which would fire off a crawler (currently just calls
   org.apache.nutch.crawl.Crawl.main()).  But, this has caused nothing
   but grief.  We need to have several crawlers running concurrently. We
 
  You should perhaps use and call the classes directly and take control of
  managing the Configuration object, this way PermGen size is not wasted
  by loading same classes over and over again.
 
  --
   Sami Siren
 


 --
 Conscious decisions by conscious minds are what make reality real



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Removing pages from index immediately

2007-04-27 Thread Briggs
Here is the link to the docs: http://lucene.apache.org/nutch/apidocs/index.html

You would then need to create a filter of 'pruned' urls to ignore if
they are discovered again.  This list can get quite large, but I
really don't know how else to do it.  It would be cool if we could
hack the crawldb (or webdb I believe in your version) to include a
flag of 'good/bad' or something.
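
(A rough sketch of that filter idea, written against the simple 0.7-era URLFilter interface -- the class name and how the pruned list gets populated are made up, and newer versions also expect the Configurable plumbing:)

import java.util.HashSet;
import java.util.Set;
import org.apache.nutch.net.URLFilter;

// Hypothetical: reject any URL that has already been pruned from the index.
public class PrunedUrlFilter implements URLFilter {

    private final Set<String> pruned = new HashSet<String>(); // populate from your pruned-URL list

    public String filter(String urlString) {
        // returning null tells Nutch to drop the URL; otherwise pass it through untouched
        return pruned.contains(urlString) ? null : urlString;
    }
}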


On 4/27/07, Briggs [EMAIL PROTECTED] wrote:
 Isn't this what you are looking for?

 org.apache.nutch.tools.PruneIndexTool.



 On 4/27/07, franklinb4u [EMAIL PROTECTED] wrote:
 
  hi Enis,
  This is franklin ..currently i m using nutch 0.7.2 for my crawling and
  indexing for my search engine...
  i read from ur message that u can delete a particular index directly?if so
  how its possible..i m desperately searching for a clue to do this one...
  my requirement is to delete the porn site's index from my crawled data...
  ur help is highly needed
 
  expecting u to help me in this regards ..
 
  Thanks in advance..
  Franklin.S
 
 
  ogjunk-nutch wrote:
  
   Hi Enis,
  
   Right, I can easily delete the page from the Lucene index, though I'd
   prefer to follow the Nutch protocol and avoid messing something up by
   touching the index directly.  However, I don't want that page to re-appear
   in one of the subsequent fetches.  Well, it won't re-appear, because it
   will remain missing, but it would be great to be able to tell Nutch to
   forget it from everywhere.  Is that doable?
   I could read and re-write the *Db Maps, but that's a lot of IO... just to
   get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
   flags a given page as forget this page as soon as possible and it just
   happens later on.
  
   Thanks,
   Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
   Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
  
   - Original Message 
   From: Enis Soztutar [EMAIL PROTECTED]
   To: [EMAIL PROTECTED]
   Sent: Thursday, April 5, 2007 3:29:55 AM
   Subject: Re: [Nutch-general] Removing pages from index immediately
  
   Since hadoop's map files are write once, it is not possible to delete
   some urls from the crawldb and linkdb. The only thing you can do is to
   create the map files once again without the deleted urls. But running
   the crawl once more as you suggested seems more appropriate. Deleting
   documents from the index is just lucene stuff.
  
   In your case it seems that every once in a while, you crawl the whole
   site, and create the indexes and db's and then just throw the old one
   out. And between two crawls you can delete the urls from the index.
  
   [EMAIL PROTECTED] wrote:
   Hi,
  
   I'd like to be able to immediately remove certain pages from Nutch
   (index, crawldb, linkdb...).
   The scenario is that I'm using Nutch to index a single site or a set of
   internal sites.  Once in a while editors of the site remove a page from
   the site.  When that happens, I want to update at least the index and
   ideally crawldb, linkdb, so that people searching the index don't get the
   missing page in results and end up going there, hitting the 404.
  
   I don't think there is a direct way to do this with Nutch, is there?
   If there really is no direct way to do this, I was thinking I'd just put
   the URL of the recently removed page into the first next fetchlist and
   then somehow get Nutch to immediately remove that page/URL once it hits a
   404.  How does that sound?
  
   Is there a way to configure Nutch to delete the page after it gets a 404
   for it even just once?  I thought I saw the setting for that somewhere a
   few weeks ago, but now I can't find it.
  
   Thanks,
   Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
   Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
  
  
  
  
  
  
   -
   Take Surveys. Earn Cash. Influence the Future of IT
   Join SourceForge.net's Techsay panel and you'll get the chance to share
   your
   opinions on IT  business topics through brief surveys-and earn cash
   http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
   ___
   Nutch-general mailing list
   Nutch-general@lists.sourceforge.net
   https://lists.sourceforge.net/lists/listinfo/nutch-general
  
  
  
  
  
 
  --
  View this message in context: 
  http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
  Sent from the Nutch - User mailing list archive at Nabble.com.
 
 


 --
 Conscious decisions by conscious minds are what make reality real



-- 
Conscious decisions by conscious minds are what make reality real


Re: [Nutch-general] Removing pages from index immediately

2007-04-27 Thread Briggs
Well, it looks like the link I sent you goes to the 0.9 version of the
nutch api.  There is a link error on the nutch project site because
the 0.7.2 doc link points to the 0.9 docs.



On 4/27/07, Briggs [EMAIL PROTECTED] wrote:
 Here is the link to the docs: 
 http://lucene.apache.org/nutch/apidocs/index.html

 You would then need to create a filter of 'pruned' urls to ignore if
 they are discovered again.  This list can get quite large, but I
 really don't know how else to do it.  It would be cool if we could
 hack the crawldb (or webdb I believe in your version) to include a
 flag of 'good/bad' or something.


 On 4/27/07, Briggs [EMAIL PROTECTED] wrote:
  Isn't this what you are looking for?
 
  org.apache.nutch.tools.PruneIndexTool.
 
 
 
  On 4/27/07, franklinb4u [EMAIL PROTECTED] wrote:
  
   hi Enis,
   This is franklin ..currently i m using nutch 0.7.2 for my crawling and
   indexing for my search engine...
   i read from ur message that u can delete a particular index directly?if so
   how its possible..i m desperately searching for a clue to do this one...
   my requirement is to delete the porn site's index from my crawled data...
   ur help is highly needed
  
   expecting u to help me in this regards ..
  
   Thanks in advance..
   Franklin.S
  
  
   ogjunk-nutch wrote:
   
Hi Enis,
   
Right, I can easily delete the page from the Lucene index, though I'd
prefer to follow the Nutch protocol and avoid messing something up by
touching the index directly.  However, I don't want that page to 
re-appear
in one of the subsequent fetches.  Well, it won't re-appear, because it
will remain missing, but it would be great to be able to tell Nutch to
forget it from everywhere.  Is that doable?
I could read and re-write the *Db Maps, but that's a lot of IO... just 
to
get a couple of URLs erased.  I'd prefer a friendly persuasion where 
Nutch
flags a given page as forget this page as soon as possible and it just
happens later on.
   
Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
   
- Original Message 
From: Enis Soztutar [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, April 5, 2007 3:29:55 AM
Subject: Re: [Nutch-general] Removing pages from index immediately
   
Since hadoop's map files are write once, it is not possible to delete
some urls from the crawldb and linkdb. The only thing you can do is to
create the map files once again without the deleted urls. But running
the crawl once more as you suggested seems more appropriate. Deleting
documents from the index is just lucene stuff.
   
In your case it seems that every once in a while, you crawl the whole
site, and create the indexes and db's and then just throw the old one
out. And between two crawls you can delete the urls from the index.
   
[EMAIL PROTECTED] wrote:
Hi,
   
I'd like to be able to immediately remove certain pages from Nutch
(index, crawldb, linkdb...).
The scenario is that I'm using Nutch to index a single site or a set of
internal sites.  Once in a while editors of the site remove a page from
the site.  When that happens, I want to update at least the index and
ideally crawldb, linkdb, so that people searching the index don't get 
the
missing page in results and end up going there, hitting the 404.
   
I don't think there is a direct way to do this with Nutch, is there?
If there really is no direct way to do this, I was thinking I'd just 
put
the URL of the recently removed page into the first next fetchlist and
then somehow get Nutch to immediately remove that page/URL once it 
hits a
404.  How does that sound?
   
Is there a way to configure Nutch to delete the page after it gets a 
404
for it even just once?  I thought I saw the setting for that somewhere 
a
few weeks ago, but now I can't find it.
   
Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
   
   
   
   
   
   
-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share
your
opinions on IT  business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
   
   
   
   
   
  
   --
   View this message in context: 
   http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
   Sent from the Nutch - User

Re: [Nutch-general] Using nutch just for the crawler/fetcher

2007-04-25 Thread Briggs
If you are just looking to have a seed list of domains, and would like
to mirror their content for indexing, why not just use the unix tool
'wget'?  It will mirror the site on your system and then you can just
index that.
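
(Something along the lines of the following, with example.com standing in for one of your seed domains:

wget --mirror --no-parent --convert-links -P mirrored-sites http://www.example.com/

and then point your own parser/indexer at the mirrored-sites directory.)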




On 4/25/07, John Kleven [EMAIL PROTECTED] wrote:
 Hello,

 I am hoping crawl about 3000 domains using the nutch crawler +
 PrefixURLFilter, however, I have no need to actually index the html.
 Ideally, I would just like each domain's raw html pages saved into separate
 directories.  We already have a parser that converts the HTML into indexes
 for our particular application.

 Is there a clean way to accomplish this?

 My current idea is to create a python script (similar to the one already on
 the wiki) that essentially loops through the fetch, update cycles until
 depth is reached, and then simply never actually does the real lucene
 indexing and merging.  Now, here's the there must be a better way part ...
 I would then simply execute the bin/nutch readseg -dump tool via python to
 extract all the html and headers (for each segment) and then, via a regex,
 save each html output back into an html file, and store it in a directory
 according to the domain it came from.

 How stupid/slow is this?  Any better ideas?  I saw someone previously
 mentioned something like what I want to do, and someone responded that it
 was better to just roll your own crawler or something?  I doubt that for
 some reason.  Also, in the future we'd like to take advantage of the
 word/pdf downloading/parsing as well.

 Thanks for what appears to be a great crawler!

 Sincerely,
 John



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Index

2007-04-24 Thread Briggs
On the nutch wiki there is this tutorial:

http://wiki.apache.org/nutch/NutchHadoopTutorial

There is also (it is for version 0.8, but can still work with 0.9):

http://lucene.apache.org/nutch/tutorial8.html


On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote:
 Hi Guys,

 I would like to create a new custom index.
 Do you know if there is any tutorial, document or web page which can help me
 ?

 Thanks,
 E



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Index

2007-04-24 Thread Briggs
Perhaps someone else can chime in on this.  I am not sure exactly
what you are asking.  The indexing is based on Lucene. So, if you need
to understand how the indexing works you will need to look into the
Lucene documentation.   If you are only looking to add custom fields
and such to the index, you could look into the indexing filters of
Nutch.  There are examples on the wiki for that too.



On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote:
 Thanks for your help but i think there is a misunderstanding. I was talking
 about creating a new index class in java based on specific parameters that i
 will defined.

 Do you if there is any web page which can give me more information in order
 to implement in Java this index ?

 E

  On the nutch wiki there is this tutorial:
 
  http://wiki.apache.org/nutch/NutchHadoopTutorial
 
  There is also (it is for version 0.8, but can still work with 0.9):
 
  http://lucene.apache.org/nutch/tutorial8.html
 
 
  On 4/24/07, ekoje ekoje [EMAIL PROTECTED] wrote:
  Hi Guys,
 
  I would like to create a new custom index.
  Do you know if there is any tutorial, document or web page which can
  help me
  ?
 
  Thanks,
  E
 
 
 
  --
  Conscious decisions by conscious minds are what make reality real
 



-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] How to delete already stored indexed fields???

2007-04-20 Thread Briggs
If you look into the BasicIndexingFilter.java plugin source you will
see that this is where those default fields get indexed.  So, you can
either create a new plugin that is configurable for the properties you
want to index, or remove this plugin.   Here is the snippet of code
that is in the filter:


    if (host != null) {
        // add host as un-stored, indexed and tokenized
        doc.add(new Field("host", host, Field.Store.NO, Field.Index.TOKENIZED));
        // add site as un-stored, indexed and un-tokenized
        doc.add(new Field("site", host, Field.Store.NO, Field.Index.UN_TOKENIZED));
    }

    // url is both stored and indexed, so it's both searchable and returned
    doc.add(new Field("url", url.toString(), Field.Store.YES, Field.Index.TOKENIZED));

    // content is indexed, so that it's searchable, but not stored in index
    doc.add(new Field("content", parse.getText(), Field.Store.NO, Field.Index.TOKENIZED));

    // anchors are indexed, so they're searchable, but not stored in index
    try {
        String[] anchors = (inlinks != null ? inlinks.getAnchors() : new String[0]);
        for (int i = 0; i < anchors.length; i++) {
            doc.add(new Field("anchor", anchors[i], Field.Store.NO, Field.Index.TOKENIZED));
        }
    } catch (IOException ioe) {
        if (LOG.isWarnEnabled()) {
            LOG.warn("BasicIndexingFilter: can't get anchors for " + url.toString());
        }
    }


On 4/3/07, Ratnesh,V2Solutions India
[EMAIL PROTECTED] wrote:

 exactly offcourse ,

 I want this only, Do you have any solution for this??

 looking forwards for your reply

 Thnx


 Siddharth Jonathan wrote:
 
  Do you mean how do you get rid of some of the fields that are indexed by
  default? eg. content, anchor text etc.
 
  Jonathan
  On 4/2/07, Ratnesh,V2Solutions India
  [EMAIL PROTECTED]
  wrote:
 
 
  Hi,
  I have written a plugin , which finds no. of Object tags in a html and
  corresponding urls.
  I am storing objects as fields and page url as values.
 
  And finally interested in seeing the search realted with objects
  indexed
  fields not those which is already stored as indexed fields.
 
  So how shall I delete those index fields which is already stored
 
  Looking forward towards your reply(Valuable
  inputs).
 
  Thnx to Nutch Community
  --
  View this message in context:
  http://www.nabble.com/How-to-delete-already-stored-indexed-fieldstf3504164.html#a9786377
  Sent from the Nutch - User mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context: 
 http://www.nabble.com/How-to-delete-already-stored-indexed-fieldstf3504164.html#a9803792
 Sent from the Nutch - User mailing list archive at Nabble.com.




-- 
Conscious decisions by concious minds are what make reality real



Re: [Nutch-general] How to dump all the valid links which has been crawled?

2007-04-20 Thread Briggs
That one is a bit more complicated because it has to do with
complexities of the underlying scoring algorithm(s).  But, basically,
that means "give me the top 35 links within the crawl db and put them
in the file called 'test'".  Top links are calculated by their
relevance, based on how many other links, from other
pages/sites, point to them.

Basically, when the crawler crawls, it stores all discovered links
within the db.   If the crawler finds the same link from multiple
resources (other pages) then that link's score goes up.

That is just a simple explanation, but I think it is close enough.

You may want to look more into the OPIC filter and how that algorithm
works, if you really want to get into the grit of the code.   You can
see how scoring is calculated by running the nutch example web
application and clicking on the 'explain' link on a result.
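
(For reference, the crawldb reader has a few other modes besides -topN -- as of 0.8/0.9, running it with no arguments prints the exact usage. For example:

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -topN 35 test

The first prints overall URL counts and score statistics; the second is the command discussed above.)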




On 4/19/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
 Can you please tell me what is the meaning of this command? what is
 the top 35 links? how  nutch rank the top 35 links?

 bin/nutch readdb crawl/crawldb -topN 35 test

 On 4/19/07, Briggs [EMAIL PROTECTED] wrote:
  Those links are links that were discovered. It does not mean that they
  were fetched, they weren't.
 
  On 4/12/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
   I think I find out the answer to my previous question by doing this:
  
bin/nutch readlinkdb crawl/linkdb/ -dump test
  
  
   But my next question is why the result shows URLs with 'gif', 'js', 
   etc,etc
  
   I have this line in my craw-urlfilter.txt, so i don't except I will
   crawl things like images, javascript files,
  
   # skip image and other suffixes we can't yet parse
   -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$
  
  
   Can you please tell me how to fix my problem?
  
   Thank you.
  
   On 4/11/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
Hi,
   
I read this article about nutch crawling:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
   
How can I dumped out the valid links which has been crawled?
This command described in the article does not work in nutch 0.9. What
should I use instead?
   
bin/nutch readdb crawl-tinysite/db -dumplinks
   
Thank you for any help.
   
  
 
 
  --
  Conscious decisions by concious minds are what make reality real
 



-- 
Conscious decisions by concious minds are what make reality real



Re: [Nutch-general] Classpath and plugins question

2007-04-19 Thread Briggs
Look into org.apache.nutch.plugin.  The custom plugin classloader and
the resource loader reside in there.

On 4/18/07, Antony Bowesman [EMAIL PROTECTED] wrote:
 I'm looking to use the Nutch parsing framework in a separate Lucene project.
 I'd like to be able to use the existing plugins directory structure as-is, so
 wondered Nutch sets up the class loading environment to find all the jar files
 in the plugins directories.

 Any pointers to the Nutch class(es) that do the work?

 Thanks
 Antony






-- 
Conscious decisions by concious minds are what make reality real



Re: [Nutch-general] Classpath and plugins question

2007-04-19 Thread Briggs
I'll add that the PluginRepository is the class that recurses through
your plugins directory, loads each plugin's descriptor file, and then
loads all dependencies for each plugin within its own classloader.
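
(A minimal sketch of getting at it directly, assuming the 0.9-era API -- the static get(conf) accessor is how the rest of Nutch obtains the repository, but verify the names against your checkout:)

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

// Returns the (cached) repository for this Configuration: it scans the
// directories listed in the 'plugin.folders' property, parses each
// plugin.xml descriptor, and gives every plugin its own classloader.
Configuration conf = NutchConfiguration.create();
PluginRepository repo = PluginRepository.get(conf);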

On 4/19/07, Briggs [EMAIL PROTECTED] wrote:
 Look into org.apache.nutch.plugin.  The custom plugin classloader, and
 the resource loadeer reside in there.

 On 4/18/07, Antony Bowesman [EMAIL PROTECTED] wrote:
  I'm looking to use the Nutch parsing framework in a separate Lucene project.
  I'd like to be able to use the existing plugins directory structure as-is, 
  so
  wondered Nutch sets up the class loading environment to find all the jar 
  files
  in the plugins directories.
 
  Any pointers to the Nutch class(es) that do the work?
 
  Thanks
  Antony
 
 
 
 


 --
 Conscious decisions by concious minds are what make reality real



-- 
Conscious decisions by concious minds are what make reality real



[Nutch-general] Nutch and Crawl Frequency

2007-04-19 Thread Briggs
Nutch 0.9

Anyone know if it is possible to be more granular regarding crawl
frequency?  Meaning that I would like some sites to be crawled more
often than others. Like, a news site should be crawled every day, but
your average business website should be crawled every 30 days.  So, is
it possible to specify a crawl frequency for specific urls, or is it
only global for within the crawl db?  I suppose I could have several
crawldbs or something like that, and deal with it.. but, just curious.

Thanks
-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Nutch and Crawl Frequency

2007-04-19 Thread Briggs
Cool, cool.  Thanks!

On 4/19/07, Gal Nitzan [EMAIL PROTECTED] wrote:
 As it is right now... You answered the question yourself :-) ...

 Separate db's and the whole ceremony...


  -Original Message-
  From: Briggs [mailto:[EMAIL PROTECTED]
  Sent: Thursday, April 19, 2007 10:02 PM
  To: [EMAIL PROTECTED]
  Subject: Nutch and Crawl Frequency
 
  Nutch 0.9
 
  Anyone know if it is possible to be more granular regarding crawl
  frequency?  Meaning, that I would like some sites to be crawled more
  often then others. Like, a news site should be crawled every day, but
  your average business website should be crawled every 30 days.  So, is
  it possible to specify a crawl frequency for specific urls, or is it
  only global for within the crawl db?  I suppose I could have several
  crawldbs or something like that, and deal with it.. but, just curious.
 
  Thanks
  --
  Conscious decisions by conscious minds are what make reality real





-- 
Conscious decisions by concious minds are what make reality real



Re: [Nutch-general] Forcing update of some URLs

2007-04-19 Thread Briggs
From what I have gathered, you may want to keep multiple
crawldbs for your crawls.  So, you could have a crawldb for the more
frequently crawled sites and fire off nutch against that db with the
appropriate configs for that job.   I was hoping for the same sort of
mechanism, but it looks like we need to write this for ourselves.
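
(For example -- directory and seed-list names made up -- the frequently recrawled sites and the monthly ones could simply live in separate crawl dirs, each run with its own config:

bin/nutch crawl urls-news -dir crawl-news -depth 2 -topN 1000
bin/nutch crawl urls-regular -dir crawl-regular -depth 3 -topN 10000

and something external like cron decides how often each command runs.)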


On 4/12/07, Arie Karhendana [EMAIL PROTECTED] wrote:
 Hi all,

 I'm a new user of Nutch. I use Nutch primarily to crawl blog and news
 sites. But I noticed that Nutch fetches pages only on some refresh
 interval (30 days default).

 Blog and news sites have unique characteristic that some of their
 pages are updated very frequently (e.g. the main page) so they have to
 be refetched often, while other pages don't need to be refreshed /
 refetched at all (e.g. the news article pages, which eventually will
 become 'obsolete').

 Is there any way to force update some URLs? Can I just 're-inject' the
 URLs to set the next fetch date to 'immediately'?

 Thank you,
 --
 Arie Karhendana



-- 
Conscious decisions by concious minds are what make reality real



Re: [Nutch-general] How to dump all the valid links which has been crawled?

2007-04-19 Thread Briggs
Those links are links that were discovered. It does not mean that they
were fetched, they weren't.

On 4/12/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
 I think I find out the answer to my previous question by doing this:

  bin/nutch readlinkdb crawl/linkdb/ -dump test


 But my next question is why the result shows URLs with 'gif', 'js', etc,etc

 I have this line in my craw-urlfilter.txt, so i don't except I will
 crawl things like images, javascript files,

 # skip image and other suffixes we can't yet parse
 -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP|js|rss|swf)$


 Can you please tell me how to fix my problem?

 Thank you.

 On 4/11/07, Meryl Silverburgh [EMAIL PROTECTED] wrote:
  Hi,
 
  I read this article about nutch crawling:
  http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
 
  How can I dumped out the valid links which has been crawled?
  This command described in the article does not work in nutch 0.9. What
  should I use instead?
 
  bin/nutch readdb crawl-tinysite/db -dumplinks
 
  Thank you for any help.
 



-- 
Conscious decisions by concious minds are what make reality real



[Nutch-general] Source of Outlink and how to get Outlinks in 0.9

2007-04-18 Thread Briggs
Is it possible to determine from which domain(s) an outlink was
located?  The only way I know how is to limit the crawl to a single
domain (so, I would know where the outlink came from). Also, I am
having difficulty trying to figure out how in 0.9 (probably the same
in 0.8) to easily get the outlinks for my segments.  In nutch 0.7.* we
used to do something like:

<snippet>

segmentReader = createSegmentReader(segment);

final FetcherOutput fetcherOutput = new FetcherOutput();
final Content content             = new Content();
final ParseData indexParseData    = new ParseData();
final ParseText parseText         = new ParseText();

while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
    extractOutlinksFromParseData(indexParseData, outlinks);
}

</snippet>

<snippet>
private void extractOutlinksFromParseData(final ParseData indexParseData, final Set<String> outlinks) {

    for (final Outlink outlink : indexParseData.getOutlinks()) {
        if (null != outlink && outlink.getToUrl() != null) {
            outlinks.add(outlink.getToUrl());
        }
    }
}
</snippet>

I am finally making the plunge and attempting to get this thing (my
application) up to date with the latest and greatest!

Thanks for your time!  And once I really get through this code I
promise to start posting answers.

Briggs.

-- 
Conscious decisions by conscious minds are what make reality real



Re: [Nutch-general] Source of Outlink and how to get Outlinks in 0.9

2007-04-18 Thread Briggs
I am adding more info to my post from what I have been looking into...

So, I have found the LinkDbReader and it seems to be able to dump text
out to a file. But, unfortunately, that means I need to
parse it (or I might have missed something).  So, if this is the
correct class, that will have to work... Here is a snippet of the
output of the LinkDbReader from a page that I crawled on one of my
test machines, which has apache documentation installed. The output of
the reader is:

<snippet>
http://httpd.apache.org/        Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: HTTP Server

http://httpd.apache.org/docs-project/   Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Documentation
 fromUrl: http://nutchdev-1/manual/ anchor:

http://www.apache.org/  Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache

http://www.apache.org/foundation/preFAQ.htmlInlinks:
 fromUrl: http://nutchdev-1/ anchor: Apache web server

http://www.apache.org/licenses/LICENSE-2.0  Inlinks:
 fromUrl: http://nutchdev-1/manual/ anchor: Apache License, Version 2.0

</snippet>

So, am I to assume that the format shows outlinks first, then the
Inlinks are where the links were found?  I'll just have to figure out
the format here so I can parse it.  I'll probably write a wrapper that
exports to xml or something to make transformation of this easier.
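
(If parsing the dump gets painful, the programmatic route would be LinkDbReader itself; a very rough sketch only -- the constructor signature and the url key type changed between releases, so treat both as assumptions to verify against the 0.9 source:)

import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

// Look up the inlinks (fromUrl + anchor) recorded for a single URL.
Configuration conf = NutchConfiguration.create();
LinkDbReader linkDb = new LinkDbReader(FileSystem.get(conf), new Path("crawl/linkdb"), conf);
Inlinks inlinks = linkDb.getInlinks(new Text("http://www.apache.org/"));
if (inlinks != null) {
    for (Iterator it = inlinks.iterator(); it.hasNext();) {
        Inlink in = (Inlink) it.next();
        System.out.println(in.getFromUrl() + " -> " + in.getAnchor());
    }
}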

Anyway, am I on the right track?

Briggs.




On 4/18/07, Briggs [EMAIL PROTECTED] wrote:
 Is it possible to determine from which domain(s) an outlink was
 located?  The only way I know how is to limit the crawl to a single
 domain (so, I would know where the outlink came from). Also, I am
 having difficultly trying to figure out how in 0.9 (probably the same
 in 0.8) to easily get the outlinks for my segments.  In nutch 0.7.* we
 use to do something like:

 <snippet>

 segmentReader = createSegmentReader(segment);

 final FetcherOutput fetcherOutput = new FetcherOutput();
 final Content content             = new Content();
 final ParseData indexParseData    = new ParseData();
 final ParseText parseText         = new ParseText();

 while (segmentReader.next(fetcherOutput, content, parseText, indexParseData)) {
     extractOutlinksFromParseData(indexParseData, outlinks);
 }

 </snippet>

 <snippet>
 private void extractOutlinksFromParseData(final ParseData indexParseData, final Set<String> outlinks) {

     for (final Outlink outlink : indexParseData.getOutlinks()) {
         if (null != outlink && outlink.getToUrl() != null) {
             outlinks.add(outlink.getToUrl());
         }
     }
 }
 </snippet>

 I am finally making the plunge and attempting to get this thing (my
 application) up to date with the latest and greatest!

 Thanks for your time!  And once I really get through this code I
 promise to start posting answers.

 Briggs.

 --
 Conscious decisions by conscious minds are what make reality real



-- 
Conscious decisions by concious minds are what make reality real



[Nutch-general] Logger duplicates entries by the thousands

2007-03-23 Thread Briggs
Currently using 0.7.2.

We have a process that runs crawltool from within an application,
perhaps hundreds of times during the course of the day.  The problem I
am seeing is that over time the log statements from my application (I
am using commons logging and Log4j) are also being logged within the
nutch log.  But, the real problem is that over time each log statement
gets repeated by some factor that increases over time/calls.  So,
currently, if I have a debug statement after I call CrawlTool.main(),
I will get 7500 entries in the log for that one statement.  I see a
'memory leak' in the application as this happens because I eventually
run out of it (1.5GB).  Has anyone else seen this problem?  I have to
keep shutting down the app so I can continue.

Any clues?  Does nutch create log appenders in the crawler code, and
is this causing the problem?





-- 
Concious decisions by concious minds are what make reality real



Re: [Nutch-general] Logger duplicates entries by the thousands

2007-03-23 Thread Briggs
Status update...
So, I have the logging 'fixed', removed appenders and such. But I can
see that the logging issue was just a result of something else
happening underneath.  The memory consumption of the application still
grows until an OutOfHeapSpace error is thrown.
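
(For anyone else who hits the duplicated-entries symptom: the cleanup was basically keeping appenders from piling up across CrawlTool runs -- a sketch, assuming plain Log4j 1.x:)

import org.apache.log4j.ConsoleAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

// Drop whatever appenders earlier runs attached to the root logger,
// then attach exactly one, so each statement is logged once.
Logger root = Logger.getRootLogger();
root.removeAllAppenders();
root.addAppender(new ConsoleAppender(new PatternLayout("%d %-5p %c - %m%n")));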

So, still trying to find where that is happening...  It's either Nutch
or ActiveMQ stuff.

Anyway,

Have fun and Cheers!

On 3/23/07, Briggs [EMAIL PROTECTED] wrote:
 Currently using 0.7.2.

 We have a process that runs crawltool from within an application,
 perhaps hundreds of times during the course of the day.  The problem I
 am seeing is that over time the log statements from my application (I
 am using commons logging and Log4j) are also being logged within the
 nutch log.  But, the real problem is that over time each log statement
 gets repeated by some factor that increases over time/calls.  So,
 currently, if I have a debug statement after I call CrawlTool.main(),
 I will get 7500 entries in the log for that one statement.  I see a
 'memory leak' in the application as this happens because I eventually
 run out of it (1.5GB).  Has anyone else seen this problem?  I have to
 keep shutting down the app so I can continue.

 Any clues?  Does nutch create log appenders in the crawler code, and
 is this causing the problem?





 --
 Concious decisions by concious minds are what make reality real



-- 
Concious decisions by concious minds are what make reality real



[Nutch-general] List Domains and adding Boost Values for Custom Fields

2007-01-31 Thread Briggs
So,

(nutch 0.7.2)

Does anyone know if there is such a query in nutch that I could
somehow return a full list of all unique domains that have been
crawled?  I was originally storing each domain's segment separately,
but that ended up being a nightmare when it came to creating search
beans, since the bean opens up each segment on init. So, I am working
on an incremental segment merge tool to handle the thousands of
segments I have and get em down to a few.

Also... What I really need is a pointer at how to do the following:

I have several custom attributes/fields, say "business" and
"confidential", added to a document when it was indexed.  I want to
assign a boost value to the custom fields and have nutch use those
values when it is searching.  Where might I look to find such a thing?
 I do not want to search by those fields, I only want them as part of
nutch's scoring so that if  there are high boost values for those
fields, they will be pushed to the top.
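
(At the Lucene level the simplest lever for "push the whole page up" is a document-level boost applied at index time, e.g. from an indexing filter; a minimal sketch in Lucene 1.9/2.x style, with the field names and boost value invented:)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("business", "true", Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("confidential", "false", Field.Store.YES, Field.Index.UN_TOKENIZED));
// A boost > 1.0 raises this document's score for any query that matches it,
// without the query having to mention these fields at all.
doc.setBoost(2.0f);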

Thanks again!

Briggs




-- 
Concious decisions by concious minds are what make reality real



[Nutch-general] Plugin ClassLoader issues...

2007-01-31 Thread Briggs
So, I am having ClassLoader issues with plugins. It seems that the
PluginRepository does some weird class loading (PluginClassLoader)
when it starts up. Does this mean that my plugin will not inherit the
classpath of my web application that it is loaded within?

A simple example is that my webapp contains spring-2.0.jar. But when I
try to call a Spring class from within my plugin, I get a
NoClassDefFoundError.  So...

But the real issue is that I need to have my plugins to have access to
some business classes that are deployed within my web application.
How does one go about this in a nice way?

-- 
Concious decisions by concious minds are what make reality real



Re: [Nutch-general] Plugin ClassLoader issues...

2007-01-31 Thread Briggs
Well, I found this:

http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading

Arrrgh.  Well, looks like I am going to use JMX to have my plugin talk
to my application.  That way I won't have several copies of my
business jars around.



On 1/31/07, Briggs [EMAIL PROTECTED] wrote:
 So, I am having ClassLoader issues with plugins. It seems that the
 PluginRepository does some wierd class loading (PluginClassLoader)
 when it starts up. Does this mean that my plugin will not inherit the
 classpath of my web application that it is loaded within?

 A simple example is that my webapp contains spring-2.0.jar. But when I
 try to call a spring class from within my plugin, I get a
 NoClassDefFound error.  So

 But the real issue is that I need to have my plugins to have access to
 some business classes that are deployed within my web application.
 How does one go about this in a nice way?

 --
 Concious decisions by concious minds are what make reality real



-- 
Concious decisions by concious minds are what make reality real



[Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Briggs

Has anyone written an API that can merge thousands of segments?  The current
segment merge tool cannot handle this much data as there just isn't enough
RAM available on the box. So, I was wondering if there was a better,
incremental way to handle this.

Currently I have 1 segment for each domain that was crawled and I want to
merge them all into several large segments.  So, if anyone has any pointers
I would appreciate it.  Has anyone else attempted to keep segments at this
granularity?  This doesn't seem to work so well.


<briggs />

Concious decisions by concious minds are what make reality real


Re: [Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Briggs
 Are you running this in a distributed setup, or in local mode? Local
 mode is not designed to cope with such large datasets, so it's likely
 that you will be getting OOM errors during sorting ... I can only
 recommend that you use a distributed setup with several machines, and
 adjust RAM consumption with the number of reduce tasks.

Currently we are running in local mode.  We do not have the setup for
distributing. That is why I want to merge these segments.  Would that
not help?  Insteand of having potentially tens of thousands of
segments, I want to create several large segments and index those.

Sorry for my ignorance, but not really sure how to scale nutch
correctly.  Do you know of a document, or some pointers as to how
segment/index data should be stored?
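
(If staying in local mode, one workaround might be an incremental merge: fold the per-domain segments into a handful of big ones a batch at a time, so no single merge has to hold everything. On 0.8+ that would be repeated runs of something like

bin/nutch mergesegs crawl/segments_merged seg1 seg2 ... segN

over modest-sized batches; on 0.7.x the rough equivalent is org.apache.nutch.tools.SegmentMergeTool. Treat the exact command and options as something to check against your build.)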

<briggs />

Concious decisions by concious minds are what make reality real



Re: [Nutch-general] Merging large sets of segments, help.

2007-01-24 Thread Briggs
Cool, thanks for your responses!

Next time I should probably mention that we are using 0.7.2.  Not
quite sure if we can even think about moving to something 'more
current' as I don't really know the reasons to.

<briggs />

 Most of this information is already available on the Nutch Wiki. All I
 can say is that there is certainly a limit to what you can do using the
 local mode - if you need to handle large numbers of pages you will
 need to migrate to the distributed setup.

 --
 Best regards,
 Andrzej Bialecki 
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Concious decisions by concious minds are what make reality real
