patch for nutch and nutch-daemon.sh
Hi,

Due to a bug in the if statement, it's not possible to invoke the shell scripts through symlinks. Below you will find the patch.

Thanks,
Zaheed

---

$ svn diff nutch
Index: nutch
===
--- nutch       (revision 371849)
+++ nutch       (working copy)
@@ -17,7 +17,7 @@
 while [ -h "$THIS" ]; do
   ls=`ls -ld "$THIS"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     THIS="$link"
   else
     THIS=`dirname "$THIS"`/"$link"

$ svn diff nutch-daemon.sh
Index: nutch-daemon.sh
===
--- nutch-daemon.sh     (revision 371849)
+++ nutch-daemon.sh     (working copy)
@@ -29,7 +29,7 @@
 while [ -h "$this" ]; do
   ls=`ls -ld "$this"`
   link=`expr "$ls" : '.*-> \(.*\)$'`
-  if expr "$link" : '.*/.*' > /dev/null; then
+  if expr "$link" : '/.*' > /dev/null; then
     this="$link"
   else
     this=`dirname "$this"`/"$link"
$
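The fix changes which symlink targets are treated as absolute paths: the old pattern `'.*/.*'` matches any target that merely *contains* a slash (so a relative target such as `bin/nutch` was wrongly followed as if it were absolute), whereas `'/.*'` matches only targets that actually *start* with a slash. A minimal standalone sketch of the corrected loop (the `resolve` helper name is ours, not part of the scripts):

```shell
#!/bin/sh
# Sketch of the corrected symlink-resolution loop from bin/nutch.
# '/.*' matches only targets that start with '/', i.e. absolute ones.
resolve() {
  THIS="$1"
  while [ -h "$THIS" ]; do
    ls=`ls -ld "$THIS"`
    link=`expr "$ls" : '.*-> \(.*\)$'`
    if expr "$link" : '/.*' > /dev/null; then
      THIS="$link"                      # absolute target: follow directly
    else
      THIS=`dirname "$THIS"`/"$link"    # relative target: resolve from dir
    fi
  done
  echo "$THIS"
}

# Example: a relative symlink resolves against its own directory.
dir=`mktemp -d`
echo data > "$dir/real.sh"
ln -s real.sh "$dir/link.sh"
resolve "$dir/link.sh"    # prints "$dir/real.sh"
```

With the old `'.*/.*'` pattern, `real.sh` would have matched too whenever the link text contained a slash, and the script would then have lost the directory prefix.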
Re: lang identifier and nutch analyzer in trunk
Hi,

Is it reasonable to guess the language from the target server's geographical info?

/Jack

On 1/23/06, Jérôme Charron <[EMAIL PROTECTED]> wrote:
> > Any plan to implement this? I mean, move the LanguageIdentifier class
> > into the Nutch core.
>
> As I already suggested on this list, I really would like to move the
> LanguageIdentifier class (and profiles) to an independent Lucene
> sub-project (and the MimeType repository too). I don't remember why,
> but there were some objections about this...
>
> Here is a short status of what I have in mind for the next improvements
> to the LanguageIdentifier / multi-language support:
> * Enhance the LanguageIdentifier API by returning something like an
>   ordered LangDetail[] array when guessing the language (each LangDetail
>   would contain the language code and its score). I have a prototype
>   version of this on my disk, but I haven't taken the time to finalize it.
> * I encountered some identification problems with some specific sites
>   (with Blogger, for instance), and I plan to investigate this point.
> * Another pending task: the analysis (and coding) of multilingual
>   querying support.
>
> Regards
>
> Jérôme
>
> --
> http://motrech.free.fr/
> http://www.frutch.org/

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
xml-parser plugin contribution
Hi,

I have developed an XML parser plugin and have tested it with Nutch 0.7.2. The parser uses namespaces and XPath to map XML nodes to Lucene fields. I'm trying to send the source of the plugin in a zip file, but my message is always rejected (it is considered spam). How can I send the source code?

Best regards.
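The plugin source itself isn't attached here, but the described technique (namespace-aware XPath expressions mapped to field names) can be sketched roughly as follows. The class name and the mapping are our own illustration, and the extracted fields are collected into a plain Map rather than a Lucene Document to keep the example self-contained:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlFieldMapper {

    // Evaluates each XPath expression against the document and stores the
    // result under the corresponding field name. In the real plugin the
    // mapping would presumably come from configuration.
    public static Map<String, String> extract(String xml,
            Map<String, String> xpathToField) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);          // the plugin is namespace-aware
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : xpathToField.entrySet()) {
            String value = xpath.evaluate(e.getKey(), doc);
            fields.put(e.getValue(), value);  // field name -> extracted text
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Nutch</title><body>crawler</body></doc>";
        Map<String, String> mapping = new LinkedHashMap<String, String>();
        mapping.put("/doc/title", "title");
        mapping.put("/doc/body", "content");
        System.out.println(extract(xml, mapping));
        // prints {title=Nutch, content=crawler}
    }
}
```

In a real indexing plugin each map entry would become a `Field` added to the Lucene `Document` instead of a Map entry.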
Re: lang identifier and nutch analyzer in trunk
[EMAIL PROTECTED] wrote:
> I would like to decouple Lang Id from Nutch and move it into Lucene
> contrib/ in the near future. Does that sound ok?

+1 from me.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: lang identifier and nutch analyzer in trunk
I would like to decouple Lang Id from Nutch and move it into Lucene contrib/ in the near future. Does that sound ok?

Otis

----- Original Message -----
From: Stefan Groschupf <[EMAIL PROTECTED]>
To: nutch-dev@lucene.apache.org
Sent: Mon 23 Jan 2006 02:55:46 PM EST
Subject: Re: lang identifier and nutch analyzer in trunk

>> As I already suggested on this list, I really would like to move the
>> LanguageIdentifier class (and profiles) to an independent Lucene
>> sub-project (and the MimeType repository too). I don't remember why,
>> but there were some objections about this...
>
> I think most people agree that it would be worthwhile to untie this
> component from Nutch internals. The only objections were related not
> to the idea itself, but to the management aspects of creating a
> full-blown sub-project, both wrt. the initial setup and the continuing
> maintenance. An alternative solution was proposed (creating a contrib/
> package). This would still help to separate the code from Nutch
> internals, so that it can be used in other projects, but it would
> require much less effort to set up and maintain.

+1. What about the Lucene sandbox, or just opening a SourceForge project
with the Apache 2 license? Then we can use just the jar.

Stefan
Re: Patch for NDFS's df.java
+1. Maybe you can clean things up yourself and create a Jira issue where you can attach this patch. Thanks.

Stefan

On 20.01.2006 at 16:33, Dominik Friedrich wrote:

This is a bug-fixed version. I wasn't running a normal NDFS datanode but a JMX-based version I'm working on, and I didn't notice that the first version always returned 0 when the NDFS datanode was started with no data dir.

best regards,
Dominik

Dominik Friedrich wrote:

This is a patch for NDFS's DF.java. With this patch I was able to run an NDFS datanode on Windows systems without Cygwin. I haven't found a way to get the partition size on Windows, so it's always set to two times the available space. Maybe somebody can clean this up so it can be included in the Nutch trunk.

best regards,
Dominik

Index: DF.java
===
--- DF.java     (revision 370204)
+++ DF.java     (working copy)
@@ -15,81 +15,167 @@
  */
 package org.apache.nutch.ndfs;

-import java.io.File;
+import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;
-import java.io.BufferedReader;
-
 import java.util.StringTokenizer;
-import java.util.Iterator;

-/** Filesystem disk space usage statistics. Uses the unix 'df' program.
- * Tested on Linux, FreeBSD and Cygwin. */
+/**
+ * Filesystem disk space usage statistics. Uses the unix 'df' program. Tested on Linux, FreeBSD and Cygwin.
+ */
 public class DF {
-  private String filesystem;
-  private long capacity;
-  private long used;
-  private long available;
-  private int percentUsed;
-  private String mount;
-
-  public DF(String path) throws IOException {
+  private String filesystem;

-    Process process = Runtime.getRuntime().exec(new String[] {"df","-k",path});
+  private long capacity;

-    try {
-      if (process.waitFor() == 0) {
-        BufferedReader lines =
-          new BufferedReader(new InputStreamReader (process.getInputStream()));
+  private long used;

-        lines.readLine();                         // skip headings
+  private long available;

-        StringTokenizer tokens =
-          new StringTokenizer(lines.readLine(), " \t\n\r\f%");
-
-        this.filesystem = tokens.nextToken();
-        if (!tokens.hasMoreTokens()) {            // for long filesystem name
-          tokens = new StringTokenizer(lines.readLine(), " \t\n\r\f %");
-        }
-        this.capacity = Long.parseLong(tokens.nextToken()) * 1024;
-        this.used = Long.parseLong(tokens.nextToken()) * 1024;
-        this.available = Long.parseLong(tokens.nextToken()) * 1024;
-        this.percentUsed = Integer.parseInt(tokens.nextToken());
-        this.mount = tokens.nextToken();
+  private int percentUsed;

-      } else {
-        throw new IOException
-          (new BufferedReader(new InputStreamReader (process.getErrorStream()))
-           .readLine());
-      }
-    } catch (InterruptedException e) {
-      throw new IOException(e.toString());
-    } finally {
-      process.destroy();
-    }
-  }
+  private String mount;

-  /// ACCESSORS
+  public DF(String path) throws IOException {
+    String os = System.getProperty("os.name");
+    if (os.startsWith("Windows"))
+      dfWidows(path);
+    else
+      dfUnix(path);
+  }

-  public String getFilesystem() { return filesystem; }
-  public long getCapacity() { return capacity; }
-  public long getUsed() { return used; }
-  public long getAvailable() { return available; }
-  public int getPercentUsed() { return percentUsed; }
-  public String getMount() { return mount; }
-
-  public String toString() {
-    return
-      "df -k " + mount + "\n" +
-      filesystem + "\t" +
-      capacity / 1024 + "\t" +
-      used / 1024 + "\t" +
-      available / 1024 + "\t" +
-      percentUsed + "%\t" +
-      mount;
-  }
+  // / ACCESSORS

-  public static void main(String[] args) throws Exception {
-    System.out.println(new DF(args[0]));
-  }
+  /**
+   * @return Returns the filesystem.
+   * @uml.property name="filesystem"
+   */
+  public String getFilesystem() {
+    return filesystem;
+  }
+
+  /**
+   * @return Returns the capacity.
+   * @uml.property name="capacity"
+   */
+  public long getCapacity() {
+    return capacity;
+  }
+
+  /**
+   * @return Returns the used.
+   * @uml.property name="used"
+   */
+  public long getUsed() {
+    return used;
+  }
+
+  /**
+   * @return Returns the available.
+   * @uml.property name="available"
+   */
+  public long getAvailable() {
+    return available;
+  }
+
+  /**
+   * @return Returns the percentUsed.
+   * @uml.property name="percentUsed"
+   */
+  public int getPercentUsed() {
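For reference, the Windows workaround is no longer necessary on current JVMs: since Java 6, `java.io.File` can report partition statistics directly and portably (this API did not exist when the patch above was written). A minimal sketch:

```java
import java.io.File;

public class DiskSpace {
    public static void main(String[] args) {
        // java.io.File reports disk usage portably since Java 6, so no
        // external 'df' process (or Windows-specific workaround) is needed.
        File root = new File(args.length > 0 ? args[0] : ".");
        long capacity = root.getTotalSpace();    // partition size in bytes
        long available = root.getUsableSpace();  // bytes usable by this JVM
        long used = capacity - available;
        int percentUsed = capacity == 0 ? 0 : (int) (used * 100 / capacity);
        System.out.println("capacity=" + capacity
                + " used=" + used
                + " available=" + available
                + " percentUsed=" + percentUsed + "%");
    }
}
```

Unlike the `df`-parsing approach, this needs no process spawning, no output parsing, and works identically on Windows, Linux, and other platforms.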
Re: lang identifier and nutch analyzer in trunk
>> As I already suggested on this list, I really would like to move the
>> LanguageIdentifier class (and profiles) to an independent Lucene
>> sub-project (and the MimeType repository too). I don't remember why,
>> but there were some objections about this...
>
> I think most people agree that it would be worthwhile to untie this
> component from Nutch internals. The only objections were related not
> to the idea itself, but to the management aspects of creating a
> full-blown sub-project, both wrt. the initial setup and the continuing
> maintenance. An alternative solution was proposed (creating a contrib/
> package). This would still help to separate the code from Nutch
> internals, so that it can be used in other projects, but it would
> require much less effort to set up and maintain.

+1. What about the Lucene sandbox, or just opening a SourceForge project
with the Apache 2 license? Then we can use just the jar.

Stefan
Re: protocol-httpclient; maximum total connections
Thanks for finding this bug. Please open a bug report in Jira, and if you like, patches are always welcome. :-)

On 23.01.2006 at 15:00, [EMAIL PROTECTED] wrote:

> Hi,
>
> Protocol-httpclient sets the maximum number of total connections for the
> underlying commons-httpclient to the "fetcher.threads.fetch"
> configuration parameter. However, if the -threads argument is used with
> the fetcher, it doesn't change fetcher.threads.fetch. Whatever number of
> threads is given to the -threads argument, httpclient will use its
> default number of total connections (10). This will affect crawling
> performance. It seems to be a bug. Any comment on this?
>
> A possible solution is adding the line below to the setThreadCount
> function of the Fetcher class:
>
> NutchConf.get().setInt("fetcher.threads.fetch", threadCount);
>
> Also, the fetcher seems to be using lots of memory, maybe due to a
> memory leak. It starts at 10-15%; after several hours the Linux top
> command reports it is using 50-70% of the whole memory. Is anyone
> experiencing this behaviour?
>
> Thanks,
> -orkunt.

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
[jira] Resolved: (NUTCH-127) uncorrect values using -du, or ls does not return items
[ http://issues.apache.org/jira/browse/NUTCH-127?page=all ]

Stefan Groschupf resolved NUTCH-127:
    Resolution: Fixed

I guess it is solved, thanks. If I am able to reproduce it again, I will reopen this or file a new report. Thanks!

> uncorrect values using -du, or ls does not return items
> -------------------------------------------------------
>
>          Key: NUTCH-127
>          URL: http://issues.apache.org/jira/browse/NUTCH-127
>      Project: Nutch
>         Type: Bug
>   Components: ndfs
>     Versions: 0.8-dev, 0.7.2-dev
>     Reporter: Stefan Groschupf
>     Priority: Blocker
>
> The NDFS client returns incorrect values when using -du, or ls does not
> return items. It looks like there is a problem with the virtual file
> structure, since -du only reads the meta data, doesn't it?
> We had moved some data from folder to folder, and after that we noticed
> that a folder with zero items has a size.
>
> [EMAIL PROTECTED] bin/nutch ndfs -du indexes/
> 051118 092409 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
> 051118 092409 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
> 051118 092409 No FS indicated, using default:192.168.200.3:5
> 051118 092409 Client connection to 192.168.200.3:5: starting
> Found 1 items
> /user/nutch/indexes/20051022033721      974606348
>
> [EMAIL PROTECTED] bin/nutch ndfs -du indexes/20051022033721/
> 051118 092416 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
> 051118 092416 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
> 051118 092416 No FS indicated, using default:192.168.200.3:5
> 051118 092416 Client connection to 192.168.200.3:5: starting
> Found 0 items
>
> [EMAIL PROTECTED] bin/nutch ndfs -ls indexes/20051022033721
> 051118 093331 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-default.xml
> 051118 093332 parsing file:/home/nutch/nutch-0.8-dev/conf/nutch-site.xml
> 051118 093332 No FS indicated, using default:192.168.200.3:5
> 051118 093332 Client connection to 192.168.200.3:5: starting
> Found 0 items
>
> So maybe the mv tool has a problem, or the du or ls tool. :-O Any ideas
> where to search for the problem? Debugging NDFS is tricky.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
protocol-httpclient; maximum total connections
Hi,

Protocol-httpclient sets the maximum number of total connections for the underlying commons-httpclient to the "fetcher.threads.fetch" configuration parameter. However, if the -threads argument is used with the fetcher, it doesn't change fetcher.threads.fetch. Whatever number of threads is given to the -threads argument, httpclient will use its default number of total connections (10). This will affect crawling performance. It seems to be a bug. Any comment on this?

A possible solution is adding the line below to the setThreadCount function of the Fetcher class:

NutchConf.get().setInt("fetcher.threads.fetch", threadCount);

Also, the fetcher seems to be using lots of memory, maybe due to a memory leak. It starts at 10-15%; after several hours the Linux top command reports it is using 50-70% of the whole memory. Is anyone experiencing this behaviour?

Thanks,
-orkunt.
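To illustrate the proposed one-line fix, here is a self-contained sketch. The `Conf` class is a toy stand-in for Nutch's NutchConf singleton (an assumption for the example, not the real class); the point is that setThreadCount must write the -threads value back into the shared configuration, since protocol-httpclient sizes its connection pool from "fetcher.threads.fetch" rather than from the Fetcher instance:

```java
import java.util.HashMap;
import java.util.Map;

public class FetcherThreadsDemo {

    // Toy stand-in for Nutch's NutchConf singleton (illustration only).
    static class Conf {
        private static final Conf INSTANCE = new Conf();
        private final Map<String, Integer> props = new HashMap<>();
        static Conf get() { return INSTANCE; }
        void setInt(String key, int value) { props.put(key, value); }
        int getInt(String key, int defaultValue) {
            return props.getOrDefault(key, defaultValue);
        }
    }

    private int threadCount;

    // Proposed fix from the mail: besides storing the value locally,
    // also propagate it into the configuration that other components
    // (here: protocol-httpclient) read.
    public void setThreadCount(int threadCount) {
        this.threadCount = threadCount;
        Conf.get().setInt("fetcher.threads.fetch", threadCount);
    }

    public static void main(String[] args) {
        FetcherThreadsDemo fetcher = new FetcherThreadsDemo();
        fetcher.setThreadCount(50);  // e.g. bin/nutch fetch ... -threads 50
        // httpclient would now see 50 instead of its default of 10.
        System.out.println(Conf.get().getInt("fetcher.threads.fetch", 10));
        // prints 50
    }
}
```

Without the `setInt` call, the configuration lookup would still return the default, reproducing the reported 10-connection bottleneck.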
Re: lang identifier and nutch analyzer in trunk
> +1. Other local modifications which I use frequently:
>
> * exporting a list of supported languages,
> * exporting an NGramProfile of the analyzed text,
> * allow processing of chunks of input (i.e.
>   LanguageIdentifier.identify(char[] buf, int start, int len)) - this is
>   very useful if the text to be analyzed is already present in memory,
>   and the choice of sections (chunks) is made elsewhere, e.g. for
>   documents with clearly outlined sections, or for multi-language
>   documents.

Thanks for these interesting comments, Andrzej => I have added them to my todo list.

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: lang identifier and nutch analyzer in trunk
Jérôme Charron wrote:
>> Any plan to implement this? I mean, move the LanguageIdentifier class
>> into the Nutch core.
>
> As I already suggested on this list, I really would like to move the
> LanguageIdentifier class (and profiles) to an independent Lucene
> sub-project (and the MimeType repository too). I don't remember why,
> but there were some objections about this...

I think most people agree that it would be worthwhile to untie this component from Nutch internals. The only objections were related not to the idea itself, but to the management aspects of creating a full-blown sub-project, both wrt. the initial setup and the continuing maintenance. An alternative solution was proposed (creating a contrib/ package). This would still help to separate the code from Nutch internals, so that it can be used in other projects, but it would require much less effort to set up and maintain.

> Here is a short status of what I have in mind for the next improvements
> to the LanguageIdentifier / multi-language support:
> * Enhance the LanguageIdentifier API by returning something like an
>   ordered LangDetail[] array when guessing the language (each LangDetail
>   would contain the language code and its score). I have a prototype
>   version of this on my disk, but I haven't taken the time to finalize it.

+1. Other local modifications which I use frequently:

* exporting a list of supported languages,
* exporting an NGramProfile of the analyzed text,
* allow processing of chunks of input (i.e. LanguageIdentifier.identify(char[] buf, int start, int len)) - this is very useful if the text to be analyzed is already present in memory, and the choice of sections (chunks) is made elsewhere, e.g. for documents with clearly outlined sections, or for multi-language documents.

> * I encountered some identification problems with some specific sites
>   (with Blogger, for instance), and I plan to investigate this point.
> * Another pending task: the analysis (and coding) of multilingual
>   querying support.

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
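The two API ideas from this exchange, an ordered LangDetail[] result and chunk-based identify(char[], int, int), can be sketched as follows. This is purely a hypothetical illustration of the API shape under discussion: the scoring below is a toy marker-word count, not the real n-gram profiles used by Nutch's LanguageIdentifier, and the class and profile data are our own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class LangIdSketch {

    // Result type as proposed: language code plus a comparable score.
    public static class LangDetail {
        public final String lang;
        public final int score;
        LangDetail(String lang, int score) { this.lang = lang; this.score = score; }
        public String toString() { return lang + ":" + score; }
    }

    // Toy "profiles": a few frequent marker words per language
    // (stand-in for real n-gram profiles).
    private static final Map<String, String[]> PROFILES = Map.of(
            "en", new String[] {" the ", " and ", " of "},
            "fr", new String[] {" le ", " et ", " de "});

    // Chunk-based identification over an in-memory buffer, as proposed.
    public static LangDetail[] identify(char[] buf, int start, int len) {
        String text = " " + new String(buf, start, len).toLowerCase() + " ";
        List<LangDetail> details = new ArrayList<>();
        for (Map.Entry<String, String[]> p : PROFILES.entrySet()) {
            int score = 0;
            for (String marker : p.getValue()) {
                for (int i = text.indexOf(marker); i >= 0;
                         i = text.indexOf(marker, i + 1)) {
                    score++;  // count occurrences of each marker word
                }
            }
            details.add(new LangDetail(p.getKey(), score));
        }
        details.sort((a, b) -> b.score - a.score);  // best guess first
        return details.toArray(new LangDetail[0]);
    }

    public static void main(String[] args) {
        char[] buf = "the cat and the dog".toCharArray();
        System.out.println(Arrays.toString(identify(buf, 0, buf.length)));
        // prints [en:3, fr:0]
    }
}
```

The (start, len) parameters are what make the multi-language use case work: a caller can score each section of a document separately without copying the buffer.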
Re: lang identifier and nutch analyzer in trunk
> Any plan to implement this? I mean, move the LanguageIdentifier class
> into the Nutch core.

As I already suggested on this list, I really would like to move the LanguageIdentifier class (and profiles) to an independent Lucene sub-project (and the MimeType repository too). I don't remember why, but there were some objections about this...

Here is a short status of what I have in mind for the next improvements to the LanguageIdentifier / multi-language support:

* Enhance the LanguageIdentifier API by returning something like an ordered LangDetail[] array when guessing the language (each LangDetail would contain the language code and its score). I have a prototype version of this on my disk, but I haven't taken the time to finalize it.
* I encountered some identification problems with some specific sites (with Blogger, for instance), and I plan to investigate this point.
* Another pending task: the analysis (and coding) of multilingual querying support.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/