date:20110725

Build failed in Jenkins: Nutch-trunk #1557

2011-07-25 Thread Apache Jenkins Server

See 

Changes:

[jnioche] NUTCH-1045 Mimeutil uses default Tika config unless overridden

--
[...truncated 924 lines...]
A src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaConfig.java
A src/plugin/parse-tika/plugin.xml
A src/plugin/parse-tika/build.xml
A src/plugin/lib-regex-filter
A src/plugin/lib-regex-filter/ivy.xml
A src/plugin/lib-regex-filter/src
A src/plugin/lib-regex-filter/src/test
A src/plugin/lib-regex-filter/src/test/org
A src/plugin/lib-regex-filter/src/test/org/apache
A src/plugin/lib-regex-filter/src/test/org/apache/nutch
A src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter
A src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api
AU src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
A src/plugin/lib-regex-filter/src/java
A src/plugin/lib-regex-filter/src/java/org
A src/plugin/lib-regex-filter/src/java/org/apache
A src/plugin/lib-regex-filter/src/java/org/apache/nutch
A src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter
A src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api
AU src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
AU src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
AU src/plugin/lib-regex-filter/plugin.xml
AU src/plugin/lib-regex-filter/build.xml
A src/plugin/feed
A src/plugin/feed/sample
A src/plugin/feed/sample/rsstest.rss
A src/plugin/feed/ivy.xml
A src/plugin/feed/src
A src/plugin/feed/src/test
A src/plugin/feed/src/test/org
A src/plugin/feed/src/test/org/apache
A src/plugin/feed/src/test/org/apache/nutch
A src/plugin/feed/src/test/org/apache/nutch/parse
A src/plugin/feed/src/test/org/apache/nutch/parse/feed
A src/plugin/feed/src/test/org/apache/nutch/parse/feed/TestFeedParser.java
A src/plugin/feed/src/java
A src/plugin/feed/src/java/org
A src/plugin/feed/src/java/org/apache
A src/plugin/feed/src/java/org/apache/nutch
A src/plugin/feed/src/java/org/apache/nutch/parse
A src/plugin/feed/src/java/org/apache/nutch/parse/feed
A src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
A src/plugin/feed/src/java/org/apache/nutch/indexer
A src/plugin/feed/src/java/org/apache/nutch/indexer/feed
A src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
A src/plugin/feed/plugin.xml
A src/plugin/feed/build.xml
A src/plugin/subcollection
A src/plugin/subcollection/ivy.xml
A src/plugin/subcollection/src
A src/plugin/subcollection/src/test
A src/plugin/subcollection/src/test/org
A src/plugin/subcollection/src/test/org/apache
A src/plugin/subcollection/src/test/org/apache/nutch
A src/plugin/subcollection/src/test/org/apache/nutch/collection
A src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java
A src/plugin/subcollection/src/java
A src/plugin/subcollection/src/java/org
A src/plugin/subcollection/src/java/org/apache
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A

[jira] [Commented] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-25 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070930#comment-13070930
 ] 

Hudson commented on NUTCH-1045:
---

Integrated in Nutch-trunk #1557 (See 
[https://builds.apache.org/job/Nutch-trunk/1557/])
NUTCH-1045 Mimeutil uses default Tika config unless overridden

jnioche : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1150670
Files : 
* /nutch/trunk/conf/tika-mimetypes.xml
* /nutch/trunk/src/java/org/apache/nutch/util/MimeUtil.java
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/CHANGES.txt


> MimeUtil to rely on default config provided by Tika
> ---
>
> Key: NUTCH-1045
> URL: https://issues.apache.org/jira/browse/NUTCH-1045
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.4, 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1045-1.4-v2.patch, NUTCH-1045-1.4.patch
>
>
> We currently provide conf/tika-mimetypes.xml even though it is essentially 
> identical to the one found in tika-core.jar.
> Having a mechanism for specifying a custom tika-mimetypes.xml is good, but 
> if the user hasn't specified one, or if it can't be loaded, then we should 
> rely on Tika's default. This way we won't need to provide 
> conf/tika-mimetypes.xml anymore or keep it in sync with the default Tika one 
> whenever we upgrade Tika.
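
The fallback described in the issue could be sketched roughly as follows. This is a minimal, JDK-only illustration and not the committed MimeUtil change: the class MimeConfigResolver, its resolve method and the DEFAULT_SOURCE label are hypothetical names, and a real implementation would hand the chosen resource to Tika instead of returning a string.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
class MimeConfigResolver {
    // Placeholder standing in for "the mimetypes file bundled in tika-core.jar".
    static final String DEFAULT_SOURCE = "tika-default";

    /**
     * Returns the custom path if one is configured and can be read,
     * otherwise falls back to the default source.
     */
    static String resolve(String customPath) {
        if (customPath == null || customPath.trim().isEmpty()) {
            return DEFAULT_SOURCE; // nothing configured by the user
        }
        Path path = Paths.get(customPath);
        try (InputStream in = Files.newInputStream(path)) {
            in.read(); // probe that the file can actually be opened and read
            return path.toString();
        } catch (IOException e) {
            return DEFAULT_SOURCE; // configured, but could not be loaded
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("no/such/tika-mimetypes.xml"));
    }
}
```

Both calls in main fall back to the default, because the first has no custom path and the second points at a file that is assumed not to exist.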

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (NUTCH-1069) readlinkdb throws exception

2011-07-25 Thread Markus Jelsma (JIRA)

readlinkdb throws exception
---

 Key: NUTCH-1069
 URL: https://issues.apache.org/jira/browse/NUTCH-1069
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.4, 2.0


Reading the linkdb doesn't work on Hadoop 0.20+. The reader treats the _SUCCESS 
file, which newer Hadoop versions write, as data to be read.

A quick fix is to remove the _SUCCESS file.
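
Instead of deleting the marker by hand, a reader could skip such files when listing the linkdb directory. The sketch below shows the idea with plain JDK types only; LinkDbEntryFilter, isDataEntry and the sample file names are illustrative assumptions, not the fix committed for this issue. Skipping names that start with "_" or "." follows the convention Hadoop's file input formats use for hidden files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not a Nutch or Hadoop class.
class LinkDbEntryFilter {
    /** True for names that should be read as data, false for markers such as _SUCCESS. */
    static boolean isDataEntry(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    /** Keeps only the entries that look like data, preserving their order. */
    static List<String> dataEntries(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (isDataEntry(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("_SUCCESS", "part-00000", "part-00001");
        System.out.println(dataEntries(listing)); // prints [part-00000, part-00001]
    }
}
```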


[jira] [Commented] (NUTCH-1034) Create Solr Velocity templates

2011-07-25 Thread Umar Shah (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070696#comment-13070696
 ] 

Umar Shah commented on NUTCH-1034:
--

Use doc.vm.patch and facets.vm.patch to get search results in the Solr browse interface.

The test steps would be:
1. extract Solr 3.x to $solr_base
2. cd $solr_base/example
3. patch solr/conf/velocity/doc.vm
4. patch solr/conf/velocity/facets.vm
5. copy the Nutch schema from $NUTCH_HOME/conf/schema.xml to solr/conf/schema.xml
6. java -jar start.jar
7. run a test crawl against the Solr URL http://localhost:8983/solr
8. search at http://localhost:8983/solr/browse


> Create Solr Velocity templates
> --
>
> Key: NUTCH-1034
> URL: https://issues.apache.org/jira/browse/NUTCH-1034
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: doc.vm.patch, facets.vm.patch
>
>
> Solr has Velocity integration and provides an easy method for creating HTML 
> based front-ends for the search engine. This issue tracks the development of 
> Velocity templates specifically for Nutch users.


[jira] [Updated] (NUTCH-1034) Create Solr Velocity templates

2011-07-25 Thread Umar Shah (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umar Shah updated NUTCH-1034:
-

Attachment: facets.vm.patch

patch facets.vm in solr/conf/velocity to remove default solr example facets




[jira] [Updated] (NUTCH-1034) Create Solr Velocity templates

2011-07-25 Thread Umar Shah (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umar Shah updated NUTCH-1034:
-

Attachment: doc.vm.patch

Patch the doc.vm file in solr/conf/velocity.
This will ensure that Nutch results are displayed when you search using 
$SOLRHOST:$SOLRPORT/solr/browse

> Create Solr Velocity templates
> --
>
> Key: NUTCH-1034
> URL: https://issues.apache.org/jira/browse/NUTCH-1034
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: doc.vm.patch
>
>
> Solr has Velocity integration and provides an easy method for creating HTML 
> based front-ends for the search engine. This issue tracks the development of 
> Velocity templates specifically for Nutch users.


Re: Automaton improvements

2011-07-25 Thread Kirby Bohling

https://issues.apache.org/jira/browse/NUTCH-1068

Issue created, patch attached.  Once I hear back from the author about
getting it included in the upstream library, I'll update the issue.  I'm
really not able to pursue this directly, as I'm not much of a Nutch user at
the moment.  I've lurked on the list because there is some good info, and I
previously used Nutch as part of an R&D project at work.  I use Lucene and
the Automaton library quite a bit, and found out about the Automaton library
here.  It's been a great find for us, so hopefully this is a way I can
contribute back.  Either way, the ASF likely already has better code that
Nutch could just pick up.

I wish the Lucene guys would peel these utility parts out into a separate
library.  I have several places where it'd be useful, and where I really
have no need for all of the core Lucene.  (Also, I use a 3.x version in my
project and this code is only in the 4.x branch, so until that's released
I have to maintain it myself.)

Kirby


On Mon, Jul 25, 2011 at 3:35 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi Kirby,
>
> Thanks for sharing this. It is definitely relevant for Nutch and I am sure
> that there would be quite a few people interested in giving it a try.
> Let's hope that this patch gets into the original library or that the
> Lucene people ship it in a separate jar, in the meantime your patch would
> help comparing performances. Could you please open a new issue on JIRA and
> include the patch + description? It will be easier to comment and track its
> progress.
>
> Thanks a lot
>
> Julien
>
>
> On 25 July 2011 05:01, Kirby Bohling  wrote:
>
>> All,
>>
>>   Not sure how much you guys care, but the Lucene folks (specifically
>> rmuir and mikemcand), made some fairly significant performance speed
>> ups to the Automaton library while working on the Lucene Fuzzy
>> matching optimizations for the 4.0 release.  I've backported them to
>> the Automaton library and trying to get them integrated into the
>> mainline library (with permission from the Lucene devs).  I haven't
>> heard back from the Automaton author, but I figured that enough folks
>> have made noise about how nice performance boost of using Automaton
>> vs. RegEx, that Nutch itself might want to integrate these types of
>> changes, or re-use the ones from Lucene.
>>
>>   The best version of the code itself is here:
>>
>>
>> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>>
>> Nutch would likely only use 1/2-2/3 of those files (only the stuff
>> required to build RegExp).
>>
>> The patch I applied to the latest Automaton library is attached if
>> anybody wants to rebuild and test.  In some mainline code that does a
>> _lot_ of NFA-to-DFA translation, it is a 4x speed up.  For the actual
>> execution of the DFAs, I'm not sure how much faster it actually is (I
>> think 1.5-2.0 as fast).  My patch doesn't include the UTF-32 fixes in
>> the Lucene version (The Lucene code also converts the UTF-32 to UTF-8
>> representation, and uses several Lucene internal implementations of
>> memory growth, sorting, etc, etc).  It is unfortunate that the Lucene
>> version isn't broken out into a utility jar to be re-used.  Lucene has
>> several really nice high performance non-trivial, but highly useful CS
>> data structure implementations.
>>
>> My patch itself applies to the latest Automaton library (1.11-7 as of
>> this writing).  If it is better to use the original Automaton library.
>>  One annoyance of the Automaton library is that you have to submit
>> personal info to get the source, but it is all BSD licensed.  No
>> public repo of source.
>>
>> It might be worth while to port the plugins using the automaton
>> library to use the version from Lucene or one with the patch applied
>> and test the performance.
>>
>> Thanks,
>>Kirby
>>
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

[jira] [Updated] (NUTCH-1068) Automaton performance improvements based on Lucene code base

2011-07-25 Thread Kirby Bohling (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kirby Bohling updated NUTCH-1068:
-

Attachment: automaton.diff

I am not the copyright holder, so I don't believe I can grant a license.  This 
is all based upon code used or written by the Lucene project.  Thus I believe 
it is eligible for inclusion in the ASF projects.



[jira] [Created] (NUTCH-1068) Automaton performance improvements based on Lucene code base

2011-07-25 Thread Kirby Bohling (JIRA)

Automaton performance improvements based on Lucene code base

Key: NUTCH-1068
URL: https://issues.apache.org/jira/browse/NUTCH-1068
Project: Nutch
Issue Type: Improvement
Reporter: Kirby Bohling

The Lucene team maintains a modified Automaton library cut down to precisely
what they need. It includes significant performance enhancements.

I am attempting to backport and shepherd a patch for the original Automaton
library.

The original Lucene code is here:

http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/

The Lucene code is likely slightly faster, as it includes several
micro-optimizations I removed to avoid having to request re-license
permission. I would definitely performance-test the Lucene RegEx against the
patched code. The Lucene code also uses code points rather than characters,
which might make a difference for UTF-16 vs. UTF-32 in obscure cases (I
believe the Lucene code builds a UTF-32 clean DFA for accuracy and then
translates it to a UTF-8 DFA for performance, but I'm not 100% sure. I don't
need or use any of that code, and am currently only worried about ASCII DFAs).
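
The difference between code points and characters can be seen with a few lines of plain Java. This is only an illustration of UTF-16 code units versus Unicode code points, not code from either library; the class CodePointDemo and its helper names are made up for the example.

```java
// Illustration only: counts UTF-16 code units versus Unicode code points.
class CodePointDemo {
    /** Number of UTF-16 code units, i.e. Java chars. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String ascii = "a";
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the Basic Multilingual Plane,
        // so Java stores it as a surrogate pair of two chars.
        String supplementary = new String(Character.toChars(0x1D11E));

        System.out.println(codeUnits(ascii) + " " + codePoints(ascii));                 // prints 1 1
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)); // prints 2 1
    }
}
```

An automaton whose transitions are labelled with chars needs two steps to consume such a symbol, whereas one labelled with code points needs a single step. For ASCII-only input the two views coincide.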

When making heavy use of the NFA-to-DFA transformation, I see a 4x speed up.
It likely has a 1.5-2x speed up for regular expression execution from what I
can tell. The Nutch backend uses this code in a couple of places, and it
likely would lead to performance benefits for those areas.

I will attach my backported version for the Automaton 1.11-7 release. While I
don't own any of the copyright, all of the code is licensed under either the
BSD license or the Apache License 2.0, so it is pretty obviously approved for
ASF usage. I am not ticking the option that marks the patch as usable, as I'm
not the copyright holder. If that is an issue, I'll say "yes"; I just don't
believe I have any legal standing to do so. I don't want to create licensing
issues for the ASF.


[jira] [Updated] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-07-25 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1044:
-

Attachment: NUTCH-1044-1.4.patch

Fixes the score of redirections by giving them the same score as the source of 
the redirect.

> Redirected URLs and possibly all of their outlinked URLs have invalid scores.
> -
>
> Key: NUTCH-1044
> URL: https://issues.apache.org/jira/browse/NUTCH-1044
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, parser
>Affects Versions: 1.3
>Reporter: Nutch User - 1
>Assignee: Julien Nioche
>Priority: Critical
> Fix For: 1.4
>
> Attachments: NUTCH-1044-1.4.patch
>
>
> 1.: 
> http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
> 2.: 
> http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
> Please note that URLs redirected by meta refresh redirection also have 
> invalid scores. For such URLs a CrawlDatum is created on lines 157-177 of 
> ParseOutputFormat.java 
> (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup).
>  The new CrawlDatum's score isn't set anywhere after creation, so it's 
> 1.0f, as can be seen on line 122 of CrawlDatum.java 
> (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).
> It's another question whether the redirected URL's score should simply be 
> passed to the new URL, or whether the redirection should be treated as a link, 
> in which case the new URL's score would be 'originalScore' / ('numberOfOutlinks' 
> + 1).
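
The two scoring options raised in the issue can be compared numerically. The sketch below is illustrative only: RedirectScore, passThrough and asLink are hypothetical names rather than Nutch APIs, and the second method simply applies the formula quoted in the issue text.

```java
// Hypothetical helper for illustration only; not a Nutch class.
class RedirectScore {
    /** Option 1: the redirect target inherits the source score unchanged. */
    static float passThrough(float originalScore) {
        return originalScore;
    }

    /** Option 2: the redirect counts as one more link, so the score is shared. */
    static float asLink(float originalScore, int numberOfOutlinks) {
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        float original = 2.0f;
        System.out.println(passThrough(original)); // prints 2.0
        System.out.println(asLink(original, 3));   // prints 0.5
    }
}
```

With no outlinks the two options give the same result; the more outlinks the source page has, the further the second option lowers the score of the redirect target.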


[jira] [Resolved] (NUTCH-1045) MimeUtil to rely on default config provided by Tika

2011-07-25 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1045.
--

Resolution: Fixed
  Assignee: Julien Nioche

1.4 : Committed revision 1150669
trunk : Committed revision 1150670

Thanks Markus for testing




[jira] [Commented] (NUTCH-717) Make Nutch Solr integration easier

2011-07-25 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070455#comment-13070455
 ] 

Markus Jelsma commented on NUTCH-717:
-

Makes sense indeed! The same would be true for ES: bundle it in the 
plugin-to-be-made.

> Make Nutch Solr integration easier
> --
>
> Key: NUTCH-717
> URL: https://issues.apache.org/jira/browse/NUTCH-717
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Sami Siren
>Priority: Critical
> Fix For: 1.4, 2.0
>
>
> Erik Hatcher proposed we should provide a full solr config dir to be used 
> with Nutch-Solr. Currently we only provide the index schema. It would be 
> considerably easier to set up nutch-solr if we provided the whole conf dir 
> that you could use with solr like:
> java -Dsolr.solr.home= -jar start.jar


[jira] [Commented] (NUTCH-1065) New mvn.template

2011-07-25 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070453#comment-13070453
 ] 

Julien Nioche commented on NUTCH-1065:
--

+1 thanks

> New mvn.template
> 
>
> Key: NUTCH-1065
> URL: https://issues.apache.org/jira/browse/NUTCH-1065
> Project: Nutch
>  Issue Type: Task
>  Components: build
>Affects Versions: 1.4, 2.0
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1065-mvn-template-new.patch, 
> NUTCH-1065-trunk-mvn-template-new.patch
>
>
> Removal of Otis from mvn.template file and addition of myself. This does not 
> alter functionality of any mvn or ivy tasks or files.


[jira] [Commented] (NUTCH-717) Make Nutch Solr integration easier

2011-07-25 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070452#comment-13070452
 ] 

Julien Nioche commented on NUTCH-717:
-

Maybe we could make the indexing backends pluggable first and move the 
SOLR-related stuff to a new plugin? The plugin would have a custom task (e.g. 
startSOLR) as you described, but this would not affect the common build.xml, and 
the various config files would be kept separate from the content of the main 
conf dir. Does that make sense? 



[jira] [Commented] (NUTCH-717) Make Nutch Solr integration easier

2011-07-25 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070450#comment-13070450
 ] 

Markus Jelsma commented on NUTCH-717:
-

We can add a Solr instance with Jetty and deploy it in the runtime directory. 
If a user can simply go to the runtime/solr directory and run java -jar 
start.jar, it greatly reduces the hassle for new users. We can then also move 
our schema.xml to the proper location.



Re: Automaton improvements

2011-07-25 Thread Dawid Weiss

It is actually Robert Muir and Mike McCandless doing the heavy lifting here,
so modesty has nothing to do with it :) I just think it'll stay inside
Lucene because it is often tweaked and tuned. Plus, there is the FSTBuilder
and associated classes which provide yet another way to build and traverse
automata in Lucene (this is not brics-dependent).

Dawid

On Mon, Jul 25, 2011 at 10:59 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi Dawid,
>
> This was a bit of wishful thinking indeed :-) With a bit of luck the
> improvements will be added to brics, but as you pointed out we can always
> use the lucene jar anyway.
>
> BTW you are too modest, you should have pointed to the video of your talk
> in Berlin http://vimeo.com/26517310 which is both informative and
> entertaining
>
> Thanks
>
> Julien
>
>
> On 25 July 2011 09:51, Dawid Weiss  wrote:
>
>>
>> I don't think this will make it into a separate library, Julien. It's a
>> port of brics and done specifically so that it fits Lucene's internal needs.
>> If anything, I would just make Nutch require Lucene as a dependency -- this
>> would provide more stable updates.
>>
>> Dawid
>>
>>
>> On Mon, Jul 25, 2011 at 10:35 AM, Julien Nioche <
>> lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi Kirby,
>>>
>>> Thanks for sharing this. It is definitely relevant for Nutch and I am
>>> sure that there would be quite a few people interested in giving it a try.
>>> Let's hope that this patch gets into the original library or that the
>>> Lucene people ship it in a separate jar, in the meantime your patch would
>>> help comparing performances. Could you please open a new issue on JIRA and
>>> include the patch + description? It will be easier to comment and track its
>>> progress.
>>>
>>> Thanks a lot
>>>
>>> Julien
>>>
>>>
>>> On 25 July 2011 05:01, Kirby Bohling  wrote:
>>>
>>>> All,
>>>>
>>>>   Not sure how much you guys care, but the Lucene folks (specifically
>>>> rmuir and mikemcand), made some fairly significant performance speed
>>>> ups to the Automaton library while working on the Lucene Fuzzy
>>>> matching optimizations for the 4.0 release.  I've backported them to
>>>> the Automaton library and trying to get them integrated into the
>>>> mainline library (with permission from the Lucene devs).  I haven't
>>>> heard back from the Automaton author, but I figured that enough folks
>>>> have made noise about how nice performance boost of using Automaton
>>>> vs. RegEx, that Nutch itself might want to integrate these types of
>>>> changes, or re-use the ones from Lucene.
>>>>
>>>>   The best version of the code itself is here:
>>>>
>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>>>>
>>>> Nutch would likely only use 1/2-2/3 of those files (only the stuff
>>>> required to build RegExp).
>>>>
>>>> The patch I applied to the latest Automaton library is attached if
>>>> anybody wants to rebuild and test.  In some mainline code that does a
>>>> _lot_ of NFA-to-DFA translation, it is a 4x speed up.  For the actual
>>>> execution of the DFAs, I'm not sure how much faster it actually is (I
>>>> think 1.5-2.0 as fast).  My patch doesn't include the UTF-32 fixes in
>>>> the Lucene version (The Lucene code also converts the UTF-32 to UTF-8
>>>> representation, and uses several Lucene internal implementations of
>>>> memory growth, sorting, etc, etc).  It is unfortunate that the Lucene
>>>> version isn't broken out into a utility jar to be re-used.  Lucene has
>>>> several really nice high performance non-trivial, but highly useful CS
>>>> data structure implementations.
>>>>
>>>> My patch itself applies to the latest Automaton library (1.11-7 as of
>>>> this writing).  If it is better to use the original Automaton library.
>>>>  One annoyance of the Automaton library is that you have to submit
>>>> personal info to get the source, but it is all BSD licensed.  No
>>>> public repo of source.
>>>>
>>>> It might be worth while to port the plugins using the automaton
>>>> library to use the version from Lucene or one with the patch applied
>>>> and test the performance.
>>>>
>>>> Thanks,
>>>>Kirby
>>>>
>>>
>>>
>>>
>>> --
>>> *
>>> *Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>>
>>
>>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>

Re: Automaton improvements

2011-07-25 Thread Julien Nioche

Hi Dawid,

This was a bit of wishful thinking indeed :-) With a bit of luck the
improvements will be added to brics, but as you pointed out we can always
use the lucene jar anyway.

BTW you are too modest, you should have pointed to the video of your talk in
Berlin http://vimeo.com/26517310 which is both informative and entertaining

Thanks

Julien

On 25 July 2011 09:51, Dawid Weiss  wrote:

>
> I don't think this will make it into a separate library, Julien. It's a
> port of brics and done specifically so that it fits Lucene's internal needs.
> If anything, I would just make Nutch require Lucene as a dependency -- this
> would provide more stable updates.
>
> Dawid


-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Automaton improvements

2011-07-25 Thread Dawid Weiss

I don't think this will make it into a separate library, Julien. It's a port
of brics and done specifically so that it fits Lucene's internal needs. If
anything, I would just make Nutch require Lucene as a dependency -- this
would provide more stable updates.

Dawid

On Mon, Jul 25, 2011 at 10:35 AM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> Hi Kirby,
>
> Thanks for sharing this. It is definitely relevant for Nutch and I am sure
> that there would be quite a few people interested in giving it a try.
> Let's hope that this patch gets into the original library or that the
> Lucene people ship it in a separate jar. In the meantime, your patch would
> help compare performance. Could you please open a new issue on JIRA and
> include the patch + description? It will be easier to comment and track its
> progress.
>
> Thanks a lot
>
> Julien

Re: .BAT file for running nutch in Windows (no cygwin)

2011-07-25 Thread Julien Nioche

Hi Radim,

Yes, please open a JIRA issue with a description of what you've done and
attach the script.

Thanks

Julien

2011/7/23 Radim Kolar 

> I ported the shell start-up script to a standard Windows .BAT file (tested
> on Windows XP).
>
> Where can I upload it? I need help with testing Nutch under native Windows.
> Should I open a bug report and attach the .BAT file to it?
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Automaton improvements

2011-07-25 Thread Julien Nioche

Hi Kirby,

Thanks for sharing this. It is definitely relevant for Nutch and I am sure
that there would be quite a few people interested in giving it a try.
Let's hope that this patch gets into the original library or that the Lucene
people ship it in a separate jar. In the meantime, your patch would help
compare performance. Could you please open a new issue on JIRA and
include the patch + description? It will be easier to comment and track its
progress.

Thanks a lot

Julien

On 25 July 2011 05:01, Kirby Bohling  wrote:

> All,
>
>   Not sure how much you guys care, but the Lucene folks (specifically
> rmuir and mikemcand) made some fairly significant performance
> improvements to the Automaton library while working on the Lucene fuzzy
> matching optimizations for the 4.0 release.  I've backported them to
> the Automaton library and am trying to get them integrated into the
> mainline library (with permission from the Lucene devs).  I haven't
> heard back from the Automaton author, but I figured that enough folks
> have made noise about the nice performance boost of using Automaton
> vs. RegEx that Nutch itself might want to integrate these types of
> changes, or re-use the ones from Lucene.
>
>   The best version of the code itself is here:
>
>
> http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/
>
> Nutch would likely only use 1/2-2/3 of those files (only the stuff
> required to build RegExp).
>
> The patch I applied to the latest Automaton library is attached if
> anybody wants to rebuild and test.  In some mainline code that does a
> _lot_ of NFA-to-DFA translation, it is a 4x speedup.  For the actual
> execution of the DFAs, I'm not sure how much faster it actually is (I
> think 1.5-2.0x as fast).  My patch doesn't include the UTF-32 fixes in
> the Lucene version (the Lucene code also converts the UTF-32 to UTF-8
> representation, and uses several Lucene-internal implementations of
> memory growth, sorting, etc.).  It is unfortunate that the Lucene
> version isn't broken out into a utility jar for reuse.  Lucene has
> several really nice, high-performance implementations of non-trivial
> but highly useful CS data structures.
>
> My patch itself applies to the latest Automaton library (1.11-7 as of
> this writing), in case it is better to use the original Automaton
> library.  One annoyance of the Automaton library is that you have to
> submit personal info to get the source, though it is all BSD licensed.
> There is no public source repository.
>
> It might be worthwhile to port the plugins that use the Automaton
> library to the version from Lucene, or to one with the patch applied,
> and test the performance.
>
> Thanks,
> Kirby
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

