Re: lang identifier and nutch analyzer in trunk

2006-01-24 Thread Jérôme Charron
> > I would like to decouple Lang Id from Nutch and move it in Lucene > contrib/ in the near future. > > Does that sound ok? > +1 from me. +1 from me too (if I can have a commit access to contrib code) Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: xml-parser plugin contribution

2006-01-24 Thread Jérôme Charron
> Please use JIRA (http://issues.apache.org/jira/browse/NUTCH) - create a > new issue and attach the file. Perhaps you can use this already existing issue http://issues.apache.org/jira/browse/NUTCH-23 Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
> +1. Other local modifications which I use frequently: > > * exporting a list of supported languages, > > * exporting an NGramProfile of the analyzed text, > > * allow processing of chunks of input (i.e. > LanguageIdentifier.identify(char[] buf, int start, int len) ) - this is > very useful if the

Re: lang identifier and nutch analyzer in trunk

2006-01-23 Thread Jérôme Charron
> Any plan to implement this? I mean move LanguageIdentifier class > into nutch core. As I already suggested on this list, I would really like to move the LanguageIdentifier class (and profiles) to an independent Lucene sub-project (and the MimeType repository too). I don't remember why but t

Re: lang identifier and nutch analyzer in trunk

2006-01-20 Thread Jérôme Charron
> I am wondering Analyzer of nutch in svn trunk is chosen by > languageidentifier plugin or not? (I knew in nutch 0.7.1-dev it did). It's not really chosen by the languageidentifier, but chosen according to the value of the lang attribute (for now, that's right, only the languageidentifier adds this a

Re: ParserFactory test fail

2006-01-10 Thread Jérôme Charron
Hi Stefan, No in fact, I have refactored the code of protocol-http plugins, not html parser. So, I don't think the log4 error comes from this code. Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: HTMLMetaProcessor a bug?

2006-01-10 Thread Jérôme Charron
> the following code would fail in case the meta tags are in upper case > > Node nameNode = attrs.getNamedItem("name"); > Node equivNode = attrs.getNamedItem("http-equiv"); > Node contentNode = attrs.getNamedItem("content"); This code works well, because Nutch HTML Parser u

Re: svn commit: r367137 - in /lucene/nutch/trunk/src: java/org/apache/nutch/net/protocols/ plugin/ plugin/lib-http/ plugin/lib-http/src/ plugin/lib-http/src/java/ plugin/lib-http/src/java/org/ plugin/

2006-01-09 Thread Jérôme Charron
... in fact, not really... really unrelated !!! I remove it immediately. Thanks On 1/9/06, Doug Cutting <[EMAIL PROTECTED]> wrote: > > [EMAIL PROTECTED] wrote: > > --- lucene/nutch/trunk/src/plugin/build.xml (original) > > +++ lucene/nutch/trunk/src/plugin/build.xml Sun Jan 8 16:13:42 2006 > > @@

Re: test suite fails?

2006-01-09 Thread Jérôme Charron
I have the same problem too. I don't understand what happens. In fact, the CommandRunner returns a -1 exit code, but nothing in the error output and the good string in the standard output ("nutch rocks nutch rocks nutch rocks"). All seems to be ok but the exit code. Jérôme On 1/9/06, Piotr Kosior

Re: problems http-client

2006-01-06 Thread Jérôme Charron
> > A related issue is that these two plugins replicate a lot of code. At > > some point we should try to fix that. See: > > > > > http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html I have begun working on this. Nobody else? Can I go on? Jérôme -- http://motrech.fr

Re: [VOTE] Commiter access for Stefan Groschupf

2006-01-05 Thread Jérôme Charron
> Not as late as I am! I'm still catching up on December email... Oops, I forgot to vote: For me, it's 0 I really like all Stefan's support efforts on mailing lists, all his brainstorming, and dev efforts. But I still have in mind a lot of aggressiveness from Stefan on-list and especially off-lis

Re: mapred crawling exception - Job failed!

2006-01-05 Thread Jérôme Charron
> >I gave it a next try this night and I still have troubles. > >This is the very end of my log (full version is attached) and you can > >see another nasty exception: Just a clue... Yesterday, I had exactly the same problem while working on NUTCH-139 issue. The reason was that some metadata were

Re: no static NutchConf

2006-01-05 Thread Jérôme Charron
> Another use case for eliminating the static uses of NutchConf is to > simplify the construction of a configuration gui. It would be nice to > have a web-based interface which permits one to configure parameters and > then have it run the system. Yes, it is a really needed feature. > This sh

Re: no static NutchConf

2006-01-04 Thread Jérôme Charron
> >Excuse me in advance, I probably missed something, but what are the use > >cases for having many NutchConf instances with different values? > Running many different tasks in parallel, each using different config, > inside the same JVM. Ok, I understand this Andrzej, but it is not really what I

Re: no static NutchConf

2006-01-04 Thread Jérôme Charron
> My idea is to be able to use low-level things outside of nutch too. > It may be a philosophical question: in the case of the map file writer > you pass a complete hashmap with a bunch of properties to the object, > but the object only reads one int from this hashmap. I personally > don't like to use

Re: Static initializers

2005-12-20 Thread Jérôme Charron
Andrzej, How do you choose the NutchConf to use ? Here is a short discussion I had with Doug about a kind of dynamic NutchConf inside the same JVM: "... By looking at the mailing lists archives it seems that having some behavior depending on the documents URL is a recurrent problem (for instance

Re: Latest version of Mapred

2005-12-19 Thread Jérôme Charron
> Thanks for the fast response, > Do you know where I can find a compressed version? Here are the nightly builds: http://cvs.apache.org/dist/lucene/nutch/nightly/ Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: vote results.

2005-12-15 Thread Jérôme Charron
> Just continue voting I will continue with my tally sheet. :-) Why not create a wiki page... so that you don't have to do this "bad work". Jérôme

Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Jérôme Charron
> What people think if we collect a list of issues and make a voting > iteration? +1

Re: [Fwd: Crawler submits forms?]

2005-12-13 Thread Jérôme Charron
+1 for a 0.7.2 release. Here are the issues/revisions I can merge to 0.7 branch. These changes mainly concern the parser-factory changes (NUTCH-88) http://issues.apache.org/jira/browse/NUTCH-112 http://issues.apache.org/jira/browse/NUTCH-135 http://svn.apache.org/viewcvs.cgi?rev=356532&view=rev ht

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
> I agree, too. Perhaps we should use the names as they appear in the > Dublin Core for those properties that are defined there A big YES! > - just prepended > them with "X-nutch-" in order to avoid name-clashes with other > properties (e.g. blindly copied from the protocol headers). Another bi

Re: Standard metadata property names in the ParseData metadata

2005-12-13 Thread Jérôme Charron
+1 A simple solution that provides a standard way to access common meta data. Great! -- http://motrech.free.fr/ http://www.frutch.org/

Hard-coded Content-type checks

2005-12-13 Thread Jérôme Charron
Hi, I would like to remove all the hard-coded content-type checks spread over all the parse plugins. In fact, the content-type/plugin-id mapping is now centralized in the parse-plugins.xml file, and there is no longer any need for the parser to check the content-type. The basic idea was: 1. The developer

Re: Google performance bottlenecks ;-) (Re: Lucene performance bottlenecks)

2005-12-09 Thread Jérôme Charron
> The total number of hits (approx) is 2,780,000,000. BTW, I find it > curious that the last 3 or 6 digits always seem to be zeros ... there's > some clever guesstimation involved here. The fact that Google Suggest is > able to return results so quickly would support this suspicion. > For more info

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
> Right, but the URL filters run long before we know the mime type, in > order to try to keep us from fetching lots of stuff we can't process. > The mime type is not known until we've fetched it. Yes, the fetcher can't rely on the document mime-type. The only thing we can use for filtering is the

Re: [Nutch-dev] incremental crawling

2005-12-01 Thread Jérôme Charron
Sounds really good (and it is requested by a lot of nutch users!). +1 Jérôme On 12/1/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > > Matt Kangas wrote: > > #2 should be a pluggable/hookable parameter. "high-scoring" sounds like > > a reasonable default basis for choosing recrawl intervals, but

Re: Urlfilter Patch

2005-12-01 Thread Jérôme Charron
Suggestion: For consistency and ease of Nutch management, why not filter the extensions based on the activated plugins? By looking at the mime-types defined in the parse-plugins.xml file and the activated plugins, we know which content-types will be parsed. So, by getting the file exten

Re: [Nutch-dev] RE: [proposal] Generic Markup Language Parser

2005-11-25 Thread Jérôme Charron
> Do we talk about parsing rdf or do we discuss to store parsed html > text in rdf and convert it via xslt to pure text? > I may misunderstand something. I very like the idea of a general rdf > parser. Back in the days i played around with jena.sf.net > Parsing yes, replace nutch sequence file and

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Jérôme Charron
> Until last years there is one thing I notice that matters in a search > engine - minimalism. If you are honest Stefan, take a closer look at the end of the proposal (here is a copy): Issues Create performance benchmarks and ensure that the new implementation gives at least the same performance

Re: [proposal] Generic Markup Language Parser

2005-11-24 Thread Jérôme Charron
Hi Stefan, And thanks for taking time to read the doc and giving us your feedback. -1! > Xsl is terrible slow! > Xml will blow up memory and storage usage. But there still something I don't understand... Regarding a previous discussion we had about the use of OpenSearch API to replace Servlet =>

[proposal] Generic Markup Language Parser

2005-11-23 Thread Jérôme Charron
Hi, We (Chris Mattmann, François Martelet, Sébastien Le Callonnec and me) have just added a new proposal on the Nutch Wiki: http://wiki.apache.org/nutch/MarkupLanguageParserProposal Here is the Summary of Issue: "Currently, Nutch provides some specific markup language parsing plugins: one for handling H

Re: Lucene or Nutch

2005-11-10 Thread Jérôme Charron
> I would be disappointed by this move - language identifier is an > important component in Nutch. Now the mere fact that it's bundled with > Nutch encourages its proper maintenance. If there is enough drive in > terms of willingness and long-term commitment it would make sense to > move it to a se

Re: Lucene or Nutch

2005-11-09 Thread Jérôme Charron
> Yes, Lucene is the best fit for what you're after. Nutch is built on > Lucene, and adds web crawling on top. You don't need a web crawler, > so using Lucene directly is the best fit - of course you'll have to > write code to integrate Lucene. Erik, I was thinking about it for a while, but don't

Re: standard version of log4j

2005-11-07 Thread Jérôme Charron
> hmmm.. so that means if we want to customize logging > it would be for every plugin potentially? > > Perhaps a common logger would at least make some degree > of sense. I really think it makes sense. When I fixed the issue about plugin dependencies, I began to create a log4j plugin in order to rem

Re: rel=nofollow

2005-10-20 Thread Jérôme Charron
> The attached patch adds support for rel=nofollow. Links which specify > this are ignored. Any objections to committing this? +1 I was recently thinking about adding support for rel="tag" attribute (à la technorati), and more generally add an extension-point for micro-formats support (see http://d

Re: [jira] Created: (NUTCH-103) Vivisimo like treeview and url redirect

2005-10-06 Thread Jérôme Charron
> There is one potential problem that I see -- Nutch plugins require > explicit JAR references. If you want to switch between algorithms you'll > need to either put all Carrot2 JARs in the descriptor, put them in > CLASSPATH before Nutch starts or do some other trickery with class > loading. Only

Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
> I read about the MultiLingualSupport, but I didn't see it in the > repository, I think is cool. The analyzer extension point is defined by the Analyzer abstract class: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/analysis/NutchAnalyzer.java The default analyzer

Re: plugin analyzer

2005-10-04 Thread Jérôme Charron
> I think would be neat to have the NutchAnalyzer also as a plugin, from my > understanding right now if I want to analyze in a different way, I need to > hack the nutch source code, if we are going to have different plugins for > different analyzers that will help. Some specific application may us

Re: Nutch 0.7.1 and Nutch web site

2005-10-04 Thread Jérôme Charron
> The practice I've followed is to have the website reflect the latest > released version. Documentation for older releases can be found by > downloading those releases. Unreleased versions tend not to have good > documentation. > I think it's a good and widely used practice. If the trunk diverges

Re: java.net.MalformedURLException: no protocol for parse-plugins.xml

2005-10-03 Thread Jérôme Charron
> Likely missing file:/. If I get rid of lines 617-622 > > of conf/nutch-default.xml > > Resolved and committed: http://svn.apache.org/viewcvs.cgi?rev=293370&view=rev Thanks Earl. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: java.net.MalformedURLException: no protocol for parse-plugins.xml

2005-10-03 Thread Jérôme Charron
> Likely missing file:/. If I get rid of lines 617-622 > of conf/nutch-default.xml Oops, sorry. I made this last change just after testing the whole patch, and I didn't test it again since I was sure it was a minor change. I am correcting this right now. Sorry. Regards Jérôme -- http://motrech

Re: Classpath for HTML Parser Plugin

2005-09-27 Thread Jérôme Charron
> I noticed that HTML-Parser Plugin has references to xercesImpl.jar which > is placed in > src/plugin/parse-rss/lib/xercesImpl.jar Where do you find references to xercesImpl.jar in the HTML-Parser plugin? (If so, I don't understand how it can compile since the build scripts never import any li

Re: saving log file

2005-09-20 Thread Jérôme Charron
> Following the tutorial, I redirect the log messages to a log file. But, > when crawling 1 million pages, this log file can become huge and writing > log messages to a huge file can slow down the fetching process. Is > there a better way to manage the log? maybe saving it to a series of > smaller

Re: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-16 Thread Jérôme Charron
> > So ... feel free to provide a such plugin. > > If I remember well, Andrzej has already a piece of code to do that. no? > Yes, it comes from another package so I need to wrap it around in the > plugin interfaces, give me a day or two... Thanks Jérôme -- http://motrech.free.fr/ http://www.fru

Re: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-16 Thread Jérôme Charron
> What about a "default-plugin" as Andrzej proposed. The default plugin mechanism is integrated in the parse-plugins descriptor using the "*" content-type > It should behave like > the unix-command "strings". Does this make sense? Are you on it too? But we didn't plan to develop it Otherwise
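
The "*" content-type fallback mentioned above can be pictured with a small lookup table. This is only a sketch under assumptions: the class and method names are hypothetical, the plugin ids are illustrative, and it does not reproduce the actual ParserFactory code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a content-type -> plugin-id table with a
// "*" catch-all entry; this is not the real Nutch ParserFactory.
class ParsePluginTable {
    private final Map<String, String> pluginByContentType = new HashMap<>();

    /** Registers a plugin id for a content-type ("*" means: any other type). */
    void register(String contentType, String pluginId) {
        pluginByContentType.put(contentType, pluginId);
    }

    /** Exact content-type match first, then the "*" entry, else null. */
    String pluginFor(String contentType) {
        String plugin = pluginByContentType.get(contentType);
        return plugin != null ? plugin : pluginByContentType.get("*");
    }
}
```

With "text/html" mapped to one plugin and "*" mapped to another, a lookup for an unlisted type such as "application/unknown" falls through to the "*" entry.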

(NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-15 Thread Jérôme Charron
Hi, Chris, Sébastien and me have worked on a proposal for solving the NUTCH-88 issue. This proposal is available on the Nutch Wiki at http://wiki.apache.org/nutch/ParserFactoryImprovementProposal. Thanks for reading it, commenting it, and voting for it (+ or -). Best regards Jérôme -- http:/

Re: [Nutch-cvs] svn commit: r280179 - in /lucene/nutch/trunk/src/plugin: clustering-carrot2/ creativecommons/ index-basic/ index-more/ languageidentifier/ ontology/ parse-ext/ parse-html/ parse-js/ pa

2005-09-13 Thread Jérôme Charron
> Looks like something broke after this commit. When I run a "nutch crawl" > using the out-of-the-box configuration I get the following (with logging > turned to ALL): OK, I see the problem: I committed the nutch-site.xml file with the property plugin.autoactivation set to false, whereas it mu

Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-09 Thread Jérôme Charron
> Jerome: Give me a shout if you need a hand on this. I'll be happy to > help and as it happens, I'll be available in the next few weeks. Sébastien, Great! As I mentioned in my last comment on JIRA, please synchronize with Chris on this point. I'm currently coding on other subjects and don't have

Re: [jira] Created: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-09-08 Thread Jérôme Charron
> I may have some time to work on this over the next few days if no one else > does. So, if you're taking the lead on this, I volunteer my help if you'd > like it. Hi Chris, Thanks for your help. It seems that nobody starts working on this for now (I planned to do it in the next weeks). First of

Re: RSS Parser Bug!?

2005-09-08 Thread Jérôme Charron
> I'm not necessarily sure that this is a "bug" per se: it's just the fact > that several different content types are potentially possible when any ol' > webserver returns an RSS file. To be honest, I performed a pretty detailed > crawl (100s of thousands of pages) when I originally wrote the pl

Re: RSS Parser Bug!?

2005-09-08 Thread Jérôme Charron
> But other than that, your analysis is correct, probably there should be > an "application/xml" added to the list of handled content types. But > this is further complicated by the fact, that Nutch doesn't do the right > thing now if you have more than one plugin handling the same mime type... I

Re: MS related plugins refactoring

2005-09-06 Thread Jérôme Charron
> > >> You may should discuss such things before you 'committed' a new > >> feature that already exists. > I normally read most of the nutch mails. What was the date and subject? > I may have overlooked this one. I don't know, it's Stefan's sentence, not mine, so please ask Stefan. Regards Jérôme

Plugins dependencies enhancement proposal

2005-09-06 Thread Jérôme Charron
Since plugins can specify dependencies on each other, this raises an administration problem. For a Nutch administrator, it is not user-friendly to specify which plugins to activate/deactivate. With plugin inter-dependencies, the administrator needs to know that a plugin depends on another one w

Re: Naming of lib-plugins, was: AW: MS related plugins refactoring

2005-09-06 Thread Jérôme Charron
> I think pre-fixing the plugin with "lib-" is a good idea to separate such > plugins from "index-", "parse-" etc. > Because of that I would prefer a name like "lib-jakarta-poi". +1 > BTW, what do you think about adding more info to the lib-plugins like the > location of the license and the > d

Re: Nutch debugging log in Tomcat run time

2005-09-06 Thread Jérôme Charron
> The change doesn't reflect in the screen after I > re-compile the Nutch code and re-launch the tomcat. Do you re-deploy the web app? -- http://motrech.free.fr/ http://www.frutch.org/

Re: MS related plugins refactoring

2005-09-05 Thread Jérôme Charron
> > I have just committed some modifications that enable to have some > > dependencies between plugins. > This mechanism already works, since a plugin use jar urls from all > dependent plugins in its own class-loader. Ok. So, after a long private mail exchange with Stefan (thanks for your time an

Re: MS related plugins refactoring

2005-09-05 Thread Jérôme Charron
> You may should discuss such things before you 'committed' a new > feature that already exists. Stefan, I requested your help in a previous mail concerning this point, but you didn't respond... Regards Jérôme

MS related plugins refactoring

2005-09-05 Thread Jérôme Charron
Hi, I have just committed some modifications that make it possible to have dependencies between plugins. I would like to apply this mechanism to the parse-ms* related plugins, which both use jakarta poi code. The idea is: instead of duplicating jakarta poi related jar in each lib directory of parse-ms* p

Re: svn commit: r265503 - in /lucene/nutch/trunk/src: java/org/apache/nutch/clustering/ java/org/apache/nutch/fs/ java/org/apache/nutch/mapReduce/ java/org/apache/nutch/parse/ java/org/apache/nutch/pr

2005-09-04 Thread Jérôme Charron
Hello Piotr, It looks like changes to language identifier caused language identifier > test to fail on Windows again. > First, thanks for testing on windows. If no charset is given it assumes default > platform encoding but test files are probably "UTF-8" based. I have > changed TestLanguageIde

Re: regex-normalize.xml

2005-09-02 Thread Jérôme Charron
> > i think i expressed it wrong. The Question was if its a feature or a bug > that regex-normalize.xml is used only after this changes. the regex-normalize.xml is used only after you specify that you want to use the RegexUrlNormalizer implementation. So it is used only if you specify urlnormal

Re: regex-normalize.xml

2005-09-02 Thread Jérôme Charron
> > to get regex-normalize.xml to work i must put: > in nutch-site.xml > In nutch-default.xml there is set: > Is this a bug or a feature? =) nutch-site.xml overrides properties defined in nutch-default. So: * If you remove urlnormalizer.class property from nutch-default it must still use the on
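
The override rule stated above (a value in nutch-site.xml wins over the same property in nutch-default.xml) can be sketched with plain java.util.Properties. This is an illustration under assumptions, not the NutchConf implementation: the class name LayeredConf is hypothetical and the property values used below are placeholders.

```java
import java.util.Properties;

// Hypothetical stand-in for the nutch-default.xml / nutch-site.xml layering;
// this is not Nutch's NutchConf class.
class LayeredConf {
    // Values that would come from the default file.
    private final Properties defaults = new Properties();
    // Values that would come from the site file; lookups fall back to defaults.
    private final Properties site = new Properties(defaults);

    void setDefault(String key, String value) {
        defaults.setProperty(key, value);
    }

    void setSite(String key, String value) {
        site.setProperty(key, value);
    }

    /** Returns the site value if present, else the default value, else null. */
    String get(String key) {
        return site.getProperty(key);
    }
}
```

Setting urlnormalizer.class only as a default returns the default value; setting it in the site layer as well makes the site value the one returned.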

[info] Did You Mean: Lucene?

2005-09-02 Thread Jérôme Charron
An article in java.net about the Lucene Spell Checker: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html?page=1 Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-09-01 Thread Jérôme Charron
> > There's already commons-logging, in nutch libs, so I think there's no > problem to add commons-lang. > Moreover it is under Apache License, so there's no problem. > I will add it while committing your patch. > No objections for adding commons-lang to the nutch lib. As it is a generic lib, I

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-09-01 Thread Jérôme Charron
> > it works great (see the new function below). But we'll have to add > commons-lang (http://jakarta.apache.org/commons/lang/) to the libraries. > Are there any objections? How is the procedure to add it? There's already commons-logging, in nutch libs, so I think there's no problem to add comm

Re: [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-08-31 Thread Jérôme Charron
Michael, the solution is perhaps to use Jakarta Commons DateUtils.parseDate method: http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/time/DateUtils.html#parseDate(java.lang.String,%20java.lang.String[]) It will give something like: Date parsedDate = DateUtils.parseDate(dates[i
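
DateUtils.parseDate takes a string and an array of patterns and tries the patterns one after another. To show that idea without the commons-lang dependency, here is a sketch that uses only java.text.SimpleDateFormat. It is a stand-in, not the Commons Lang code, and it is not the patch that was committed; the class name and the patterns mentioned below are assumptions chosen for illustration.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Stand-in for the "try several patterns" idea behind DateUtils.parseDate,
// written against the JDK only. Names are hypothetical.
class MultiPatternDateParser {
    /**
     * Tries each pattern in order and returns the first date whose pattern
     * consumes the whole input, or null if no pattern matches.
     */
    static Date parse(String text, String[] patterns) {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            // Assumption: inputs without an explicit zone are read as GMT.
            format.setTimeZone(TimeZone.getTimeZone("GMT"));
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(text, pos);
            if (date != null && pos.getIndex() == text.length()) {
                return date;
            }
        }
        return null;
    }
}
```

A caller would pass the raw header value plus a list such as "EEE, dd MMM yyyy HH:mm:ss zzz" and "yyyy-MM-dd", and treat a null result as "no usable modification date".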

Re: null lang bug? and patch?

2005-08-31 Thread Jérôme Charron
> I am a bit lost but just a quick check - shouldn't it also be committed > in Release-0.7 branch? No, the analyzer extension-point is committed only in trunk. It's a new feature, so I follow Committer's Rules ( http://wiki.apache.org/nutch/Committer's_Rules) ;-) Regards Jérôme -- http://motrech

Re: merge mapred to trunk

2005-08-31 Thread Jérôme Charron
On 8/31/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote: > > Doug Cutting wrote: > > Currently we have three versions of nutch: trunk, 0.7 and mapred. This > > increases the chances for conflicts. I would thus like to merge the > > mapred branch into trunk soon. The soonest I could actually start

Re: [Nutch Wiki] Update of "Committer's Rules" by AndrzejBialecki

2005-08-31 Thread Jérôme Charron
> > Glancing at other Apache projects in subversion, I see that httpd uses > branch names like "2.2.x" and tag names like "2.2.4". That's a little > cryptic. I propose that we use branch names like "branch-2.4" and tag > names like "release-2.4.1". What do folks think? +1 Jérôme -- http://motr

Re: [Nutch-cvs] svn commit: r240359 - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/indexer/ plugin/nutch-extensionpoints/

2005-08-31 Thread Jérôme Charron
> I see several instances of 'analySer' in comments/javadoc and some > > variables. That should probably be changed to american version - > > analyzer, for consistency's sake. > > Corrected/Committed (http://svn.apache.org/viewcvs.cgi?rev=265020&view=rev) Regards Jérôme -- http://motrech.free

Re: Language identifier plugin questions

2005-08-31 Thread Jérôme Charron
Tom, I have created the NUTCH-86 issue to report the needed changes in the LanguageIdentifier we discussed in this thread. The issue is available at http://issues.apache.org/jira/browse/NUTCH-86 Regards Jérôme

Re: Language identifier plugin questions

2005-08-31 Thread Jérôme Charron
> > I agree it is important to have the NGramProfile.getSimilarity() method. > However, I think it is also important that it is consistent with the > scoring > that LanguageIdentifier uses, even if LanguageIdentifier optimises the > implementation. Looking at the code I see that the two scoring m

Re: null lang bug? and patch?

2005-08-31 Thread Jérôme Charron
> I did a little digging and it appears that lang ends > up being null (couldn't quite track down where lang > should have been set). Not sure if it is a proper > fix, but changing doc.getField("lang").stringValue() > to doc.get("lang"), makes my little crawl complete. lang is null cause you don't

Re: Analysis plugins and lucene-analyzers

2005-08-30 Thread Jérôme Charron
> > I personal don't like the activation mechanism. I prefer to have the > 'activated' plugins in the plugin folder and to deactivate just > remove the plugins from the folder. > That is much easier to handle than to manage the plugins in the > folder AND setup them in the configuration file. +

Re: [Nutch-cvs] svn commit: r240359 - in /lucene/nutch/trunk/src: java/org/apache/nutch/analysis/ java/org/apache/nutch/indexer/ plugin/nutch-extensionpoints/

2005-08-28 Thread Jérôme Charron
> > I see several instances of 'analySer' in comments/javadoc and some > variables. That should probably be changed to american version - > analyzer, for consistency's sake. Yes, that's right. Thanks. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Analysis plugins and lucene-analyzers

2005-08-28 Thread Jérôme Charron
> > > I do not object against putting lucene-analyzers-1.9-rc1-dev.jar in > > nutch core but I would like to give another option. I think it is > > possible to create a plugin which contains and exports this library > > and make other analysis plugin depend on it. Yes, that is possible and sure.

Analysis plugins and lucene-analyzers

2005-08-27 Thread Jérôme Charron
Hi, I would like to add some language specific analysis plugins. In this first approach, each plugin would be simply a wrapper of the lucene's analyzers. So each analysis- plugin need to import lucene-analyzers-1.9-rc1-dev.jar in its lib directory. In order to avoid adding this jar in many plug

Re: svn commit: r240254 - in /lucene/nutch/tags/Release-0.7/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang: HTMLLanguageParser.java LanguageIdentifier.java LanguageIndexingFilte

2005-08-26 Thread Jérôme Charron
> It looks like you have committed your changes to tags directory. You > should do it in branches. I think there is no way in SVN to force > immutability of tags :(. Oops, sorry. I am committing my changes in the branches directory right now. Thanks Piotr. Regards Jerome -- http://motrech.free.fr/ http:/

Re: Language identifier plugin questions

2005-08-25 Thread Jérôme Charron
Hi Tom, I've been using the language identifier plugin, which I think is very > nice. Thanks ;-) > I have a few questions which I hope someone might be able to > answer: I will try to... > 1. Why is the NGramProfile getSimilarity() method not called from > LanguageIdentifier? It was used in

Re: svn commit: r240097 - /lucene/nutch/branches/Release-0.7/

2005-08-25 Thread Jérôme Charron
> I can merge all changes done to the trunk into this branch (I will merge > my changes anyway) but I would like to know if others ( I think mainly > Jérôme) are sure they want all their changes merged from trunk. > So should I go with merging everything from trunk? Ok for me. Regards Jérôme --

Re: 0.7 branch

2005-08-23 Thread Jérôme Charron
> > I am going to create a Nutch 0.7 maintenance branch. I would like to > > make 0.7.1 release about 15th of September. > > I would like to add some simple bugfixes there (e.g. README.txt bad > > content, bad targets in plugin build.xml and maybe a couple of > > others). OK for 15.9 for me. What

Re: Failing JUnit test

2005-08-21 Thread Jérôme Charron
> I found it and committed the fix. It was not using UTF-8 encoding > sometimes. Thanks Piotr > But while looking at the code I feel a little bit worried about > LanguageIdentifier.identify(InputStream is) - as it reads bytes from > file in chunks and converts each chunk to a string separately. If m
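
The worry quoted above is easy to reproduce: when each byte chunk is decoded on its own, a multi-byte UTF-8 character that straddles a chunk boundary gets corrupted. The sketch below contrasts that with decoding through an InputStreamReader, which keeps decoder state between reads. It is an illustration only; the class and method names are hypothetical and this is not the LanguageIdentifier code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the languageidentifier plugin.
class ChunkedDecoding {
    /** Naive: decodes each byte chunk separately, so it can split multi-byte characters. */
    static String decodePerChunk(InputStream in, int chunkSize) {
        try {
            StringBuilder out = new StringBuilder();
            byte[] buffer = new byte[chunkSize];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.append(new String(buffer, 0, n, StandardCharsets.UTF_8));
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Safer: the Reader carries partial characters over from one read to the next. */
    static String decodeWithReader(InputStream in, int chunkSize) {
        try {
            StringBuilder out = new StringBuilder();
            Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
            char[] buffer = new char[chunkSize];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.append(buffer, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a two-byte chunk size, the string "Jérôme" survives decodeWithReader unchanged, while decodePerChunk breaks the "é" because its two bytes land in different chunks.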

Re: Failing JUnit test

2005-08-20 Thread Jérôme Charron
> It works on my Linux box - with both JDK 1.4 and 1.5. OK, so it seems to be consistent with my conf. > I will try to track it down. I assume it is an encoding problem of the Ngram profile files, but I have no time this evening. Regards Jérôme

Re: Failing JUnit test

2005-08-19 Thread Jérôme Charron
> I am using JDK 1.5 on > Windows - I can test it on 1.4,1.5 on linux tomorrow - maybe this is the > problem. OK. Thanks Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: Failing JUnit test

2005-08-19 Thread Jérôme Charron
> As I suspect it is a result of latest updates to LanguageIdentifier > > plugin or its tests. > > Piotr, I have just committed a minor change in language identifier plugin unit test. Could you please update your local copy and test again? Jerome

Re: Failing JUnit test

2005-08-19 Thread Jérôme Charron
> expected: but was: > junit.framework.ComparisonFailure: expected: but was: As I suspect it is a result of latest updates to LanguageIdentifier > plugin or its tests. I am not deep into it I will not try to debug it > myself at the moment - so just wanted you to know about the issue. You are rig

svn.apache.org down?

2005-08-19 Thread Jérôme Charron
svn.apache.org down, or the problem is on my side? Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Localized docs files

2005-08-18 Thread Jérôme Charron
Hi While committing some changes in the /src/web/style/nutch-page.xml, I saw that the localized files generated in /docs directory were committed in SVN. Is there a reason for that? (It seems it is not needed since these files are generated from the xml files while building the war). If not, I wil

Re: Language Identifier in Nutch

2005-08-17 Thread Jérôme Charron
Hi Olena I'm currently starting my work with Nutch. My goal is to have a topic > specific (or at least language specific) crawler tool. Is it possible > to apply the LanguageIdentifier plugin on webpages that are not yet > fetched, so that e.g. only French or German pages are crawled? No. The rea

Re: VOTE: clustering plugin update for Rel 0.7

2005-08-15 Thread Jérôme Charron
-1 Maybe it would be a better idea to go for 0.7 branch and schedule a new > 0.7.1 release in short time? But +1 to include it in a 0.7.1 release !! Regards Jérôme -- http://motrech.free.fr/ http://www.frutch.org/

Re: [Nutch-dev] getDiscriptor

2005-07-21 Thread Jérôme Charron
> And also PluginRepository.dependencyIsAvailabel - rename, or > deprecate and correct? Erik, what a good code reviewer you are! You know what I think about deprecated methods (If the probability to be used outside of Nutch, then must be deprecated, if the impact is only on nutch internal code,

Re: svn commit: r220056 - /lucene/nutch/trunk/src/test/org/apache/nutch/plugin/TestPluginSystem.java

2005-07-21 Thread Jérôme Charron
> Do you really feel it is necessary to use deprecation in a pre-1.0 > version like this? I'd be happy to add back the old method signature > and deprecate it, but it seems unnecessary at this stage. Oh yes Erik. I'm wrong, the misspelled method is in the ExtensionPoint class. Since, I think it is

Re: svn commit: r220056 - /lucene/nutch/trunk/src/test/org/apache/nutch/plugin/TestPluginSystem.java

2005-07-21 Thread Jérôme Charron
> For grins I tried to see if I had commit access to fix the > misspelling myself. Lo and behold I do! I hope I didn't step on any > toes by committing this - if so let me know and I'll be more patient > and submit patches. I'm a newbie to Nutch and definitely don't want > to step in to committing

Re: LanguageIdentifier refactoring

2005-07-07 Thread Jérôme Charron
> Mhm. I'm not so sure. The NGramProfile load/save methods are safe, they > both use UTF-8. LanguageIdentifier.identify() seems to be safe, too - > because it only works with Strings, which are not encoded (native > Unicode). So, the only place where it would be problematic seems to be > in the com

Re: LanguageIdentifier refactoring

2005-07-05 Thread Jérôme Charron
> I have an issue with the language detection plugin, which I'm not sure > how to address. The plugin first tries to extract the language > identifier from meta tags. However, meta tag values people put there are > often completely wrong, or follow obscure pseudo-standards. > > Example: there is a

Re: LanguageIdentifier refactoring

2005-06-30 Thread Jérôme Charron
> > I monitor your work, and as soon as you say "go" I'm ready to apply the > patches - but I'd rather avoid doing this every couple of days. So, for > now, I'm waiting for a more or less stable situation... ;-) Ok Andrzej, the last patch seems to be stable. I perform some functional tests on ar

LanguageIdentifier refactoring

2005-06-29 Thread Jérôme Charron
Hi, In my last LanguageIdentifier patch, I split the code, so that the core of this plugin can now be viewed as a standalone lib. I think it could be a good idea to move this language identification lib from Nutch to Lucene (in order to be available in Lucene), and that the LanguageIdenti
