date:20111003

Re: Choosing an efficient family configuration for GORA HBase

2011-10-03 Thread Ferdy Galema

Ok thanks. I was just wondering whether there were any developments on
this. I'm not sure yet what would be fastest in the case of Nutch. All I
know from our own experience is that it is best practice to group
frequently accessed columns together, but nevertheless to store large
columns in a separate family. Joining multiple families in a Scan should
be no problem performance-wise. (People reporting problems with too many
families probably have a problem with their HBase deployment in
general.)

In short, if the Parser needs Content but most other jobs don't (for
example Generator or DbUpdater, which do need other columns from the
Fetch family), it might be beneficial to optimize the family
configuration to reflect this. This could make Parser jobs slightly
slower but increase the throughput of the other jobs, so that total
throughput may be better. For now we will use the default configuration,
but we will report back on this when we have tried some alternatives.
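
To make the trade-off above concrete, here is a minimal sketch (not Gora or
Nutch code) that reports which column families a job would have to read under
a given field-to-family mapping. The class FamilyPlan, the family names "f",
"p" and "c", and the field names used as strings are all assumptions made for
illustration; the real mapping is whatever the Gora HBase mapping file defines.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Nutch or Gora: given a field -> family
// mapping, report which families a job has to read for its field list.
class FamilyPlan {
  private final Map<String, String> familyOf = new HashMap<>();

  // Assign a field to a column family (all names are illustrative only).
  FamilyPlan map(String field, String family) {
    familyOf.put(field, family);
    return this;
  }

  // Families a scan must open to serve the given fields.
  Set<String> familiesFor(List<String> fields) {
    Set<String> result = new TreeSet<>();
    for (String field : fields) {
      String family = familyOf.get(field);
      if (family == null) {
        throw new IllegalArgumentException("unmapped field: " + field);
      }
      result.add(family);
    }
    return result;
  }

  // Example layout: the large 'content' column sits alone in family "c".
  static FamilyPlan contentSeparated() {
    return new FamilyPlan()
        .map("status", "f")
        .map("fetchTime", "f")
        .map("contentType", "f")
        .map("content", "c")
        .map("outlinks", "p")
        .map("parseStatus", "p");
  }

  public static void main(String[] args) {
    FamilyPlan plan = contentSeparated();
    // A job that never reads 'content' does not touch family "c".
    System.out.println(plan.familiesFor(Arrays.asList("status", "fetchTime")));
    // A parser-like job needs 'content', so it reads more families.
    System.out.println(plan.familiesFor(Arrays.asList("status", "content", "outlinks")));
  }
}
```

Running the same field lists against alternative layouts is a cheap way to see
which jobs would pay for a regrouping before measuring it on a real cluster.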

On 10/01/2011 10:23 PM, Alexis wrote:

Dear Ferdy,

This mapping is user defined. It specifies where Avro fields required
by Nutch jobs are stored in HBase.

You can tweak the schema according to these kinds of considerations by
editing the config file.

So content is populated by the Fetcher job (writes) that downloads the
web page. It is parsed by the Parser job (reads) that extracts the
links and the metadata.

For example, these are the fields that might need to be grouped in the
same column family (but they are not) because they are all required
for the parse step:
From
http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserJob.java?view=markup

static {
  FIELDS.add(WebPage.Field.STATUS);
  FIELDS.add(WebPage.Field.CONTENT);
  FIELDS.add(WebPage.Field.CONTENT_TYPE);
  FIELDS.add(WebPage.Field.SIGNATURE);
  FIELDS.add(WebPage.Field.MARKERS);
  FIELDS.add(WebPage.Field.PARSE_STATUS);
  FIELDS.add(WebPage.Field.OUTLINKS);
  FIELDS.add(WebPage.Field.METADATA);
}

It looks tricky. I've heard that, on the contrary, people usually don't
use more than 3 column families, to avoid slowing down the scans as
you mentioned. Not sure though. If you manage to optimize the config
with big improvements in the processing times, don't hesitate to edit
the wiki page...

On Fri, Sep 30, 2011 at 5:57 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:

Hi,

About the example GORA HBase mapping at:
http://wiki.apache.org/nutch/GORA_HBase

Are there any current developments on improving the configuration for the
column mappings? For example, at first glance it seems that it would be more
efficient to put the fairly big column 'content' in a completely separate
family. This way, scans over the smaller columns that do not need the
'content' column would run much faster, because the scan would completely skip
'content' at the regionserver level. (All columns in each family are stored
in the same file per region.)

Any thoughts on this?

Ferdy.

[jira] [Resolved] (NUTCH-1137) LinkDb / invertlinks: command line arguments ignored

2011-10-03 Thread Markus Jelsma (Resolved) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1137.
--

Resolution: Fixed

Committed for 1.4 in rev. 1178376.

Reused crawldb code instead.
Thanks for opening this issue Sebastian.

 LinkDb / invertlinks: command line arguments ignored
 

 Key: NUTCH-1137
 URL: https://issues.apache.org/jira/browse/NUTCH-1137
 Project: Nutch
  Issue Type: Bug
  Components: linkdb
Affects Versions: 1.3
Reporter: Sebastian Nagel
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4

 Attachments: NUTCH-1137-1.5.patch


 If the tool invertlinks is called with option -dir segmentsDir all remaining
 arguments are ignored:
 {noformat}
 % $NUTCH_HOME/bin/nutch invertlinks linkdb -dir segments -noNormalize 
 -noFilter
 LinkDb: starting at 2011-09-28 23:24:07
 LinkDb: linkdb: linkdb
 LinkDb: URL normalize: true
 LinkDb: URL filter: true
 {noformat}
 (URLs are normalized and filtered despite -noNormalize/-noFilter)
 The patch also restricts the ordering of arguments according to the help text:
 Usage: LinkDb linkdb (-dir segmentsDir | seg1 seg2 ...) [-force] 
 [-noNormalize] [-noFilter]
 (segments must be given before the optional flags)
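
As an illustration of the behaviour described above, here is a minimal sketch
of a parser for that usage line which keeps reading flags after -dir. It is
not the code committed for NUTCH-1137; the class InvertLinksArgs and its field
names are made up for this example, and unlike the patch it does not enforce
the argument ordering.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-alone parser for the usage line quoted above.
// This is NOT the code from LinkDb.java or from the NUTCH-1137 patch.
class InvertLinksArgs {
  String linkDb;
  String segmentsDir;                          // set when -dir is used
  final List<String> segments = new ArrayList<>();
  boolean force = false;
  boolean normalize = true;                    // default, as in the log output
  boolean filter = true;                       // default, as in the log output

  static InvertLinksArgs parse(String[] args) {
    if (args.length < 2) {
      throw new IllegalArgumentException("linkdb and segments are required");
    }
    InvertLinksArgs parsed = new InvertLinksArgs();
    parsed.linkDb = args[0];
    // Walk the whole array: flags that follow -dir must not be dropped.
    for (int i = 1; i < args.length; i++) {
      String arg = args[i];
      if (arg.equals("-dir")) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("-dir needs a directory");
        }
        parsed.segmentsDir = args[++i];
      } else if (arg.equals("-force")) {
        parsed.force = true;
      } else if (arg.equals("-noNormalize")) {
        parsed.normalize = false;
      } else if (arg.equals("-noFilter")) {
        parsed.filter = false;
      } else {
        parsed.segments.add(arg);              // anything else is a segment path
      }
    }
    return parsed;
  }

  public static void main(String[] args) {
    InvertLinksArgs parsed = parse(new String[] {
        "linkdb", "-dir", "segments", "-noNormalize", "-noFilter"});
    System.out.println("URL normalize: " + parsed.normalize);
    System.out.println("URL filter: " + parsed.filter);
  }
}
```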

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: Providing a list of FAQ's with every new subscribe request

2011-10-03 Thread lewis john mcgibbney

Hi Sami,

At the moment I am not in a position to take on the role of mailing list
moderator. But I've found out that the list moderators should be able to
configure the nature of documentation on a per-list basis by emailing
${list}-help@ from their moderator address and following the instructions.

Would it be possible to send out a list of our official FAQs when a new
user confirms their subscription to both user@ and dev@ lists?

What are your thoughts on this?
Thanks

Lewis

On Tue, Sep 27, 2011 at 9:24 PM, Sami Siren ssi...@gmail.com wrote:

 I think moderators can be changed by filing a jira issue (by one of the PMC
 members) to the infra project, for example see
 https://issues.apache.org/jira/browse/INFRA-3511

 Moderation is a simple task: you just let good messages (usually|only coming
 from non-subscribed senders) through and forget about the rest.

 Julien: I am pretty sure I am still a moderator at dev & user - I just
 tried some of the moderator commands and they were successful.
 --
  Sami Siren


 On Tue, Sep 27, 2011 at 9:32 PM, lewis john mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Hi Sami,

 Who is it that we are supposed to speak to regarding moderation? I tried
 to contact the infra@ team but am still awaiting a reply.

 What is included in moderation? I'm completely foreign to all of this, and
 as Julien stated I was not aware that there was anyone directly linked to
 Nutch list moderation. The info on the apache developers area is pretty
 vague and I haven't been able to get much further with this.

 Thanks


 On Tue, Sep 27, 2011 at 6:33 PM, Sami Siren ssi...@gmail.com wrote:

 I am getting moderation emails and I think that there's somebody else
 doing moderation too since the messages get sent to the list without me
 accepting them.

 I would like to step down from the moderator status and have someone else
 do moderation instead, because frankly I have not been doing a great job
 with it. Any volunteers?

 --
  Sami Siren


 On Tue, Sep 27, 2011 at 12:09 AM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

 We don't have moderators for the user and dev lists


 On 26 September 2011 20:09, lewis john mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

 Thanks Markus,

 Who is mailing list moderator? If I can get this info before trying to
 contact infra it would be great.


 On Mon, Sep 26, 2011 at 7:37 PM, Markus Jelsma 
 markus.jel...@openindex.io wrote:

 Sounds like a good idea. I think you need to be ML moderator to make
 changes

 http://www.apache.org/dev/committers.html#mail-moderate

  Hi,
 
  I just signed up to the JUnit users lists and received a really well
  documented FAQ accompaniment when I subscribed. I think this would be a
  great resource for new Nutch users. Does anyone agree/disagree? How do we
  go about configuring this? Is this a request for the infra team?
 
  Thank you




 --
 *Lewis*




 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com





 --
 *Lewis*





-- 
*Lewis*

[jira] [Updated] (NUTCH-1144) Filtering optional in WebGraph

2011-10-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1144:
-

Fix Version/s: (was: 1.5)

 Filtering optional in WebGraph
 --

 Key: NUTCH-1144
 URL: https://issues.apache.org/jira/browse/NUTCH-1144
 Project: Nutch
  Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor

 There is no URL filtering mechanism in the web graph program. When a 
 CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible 
 to remove it from the web graph as well.


[jira] [Resolved] (NUTCH-1144) Filtering optional in WebGraph

2011-10-03 Thread Markus Jelsma (Resolved) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1144.
--

Resolution: Won't Fix

Decided to do filtering and normalizing in one issue.



[jira] [Updated] (NUTCH-1142) Normalization and filtering in WebGraph

2011-10-03 Thread Markus Jelsma (Updated) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Markus Jelsma updated NUTCH-1142:
-

Description:
The WebGraph program performs URL normalization. Since normalization of
outlinks is already performed during the parse, it should become optional. There
is also no URL filtering mechanism in the web graph program. When a CrawlDatum
is removed from the CrawlDB by a URL filter, it should be possible to remove it
from the web graph as well.

was: The WebGraph program performs URL normalization. Since normalization of
outlinks is already performed during the parse, it should become optional.

Patch Info: Patch Available
Summary: Normalization and filtering in WebGraph (was: Normalization
optional in WebGraph)

Normalization and filtering in WebGraph
---

Key: NUTCH-1142
URL: https://issues.apache.org/jira/browse/NUTCH-1142
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.5

The WebGraph program performs URL normalization. Since normalization of
outlinks is already performed during the parse, it should become optional.
There is also no URL filtering mechanism in the web graph program. When a
CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible
to remove it from the web graph as well.
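
A minimal sketch of what "optional" could look like for both steps, assuming
each one can simply be switched off. This is not the WebGraph code and does not
use Nutch's own normalizer or filter classes; LinkCleaner and the two sample
rules (lower-casing, rejecting ".jpg") are hypothetical stand-ins.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not the WebGraph code: both steps are optional and
// are switched off by passing null.
class LinkCleaner {
  private final UnaryOperator<String> normalizer; // null = do not normalize
  private final Predicate<String> filter;         // null = do not filter

  LinkCleaner(UnaryOperator<String> normalizer, Predicate<String> filter) {
    this.normalizer = normalizer;
    this.filter = filter;
  }

  // Returns the (possibly normalized) URL, or null when the filter rejects it.
  String clean(String url) {
    String result = url;
    if (normalizer != null) {
      result = normalizer.apply(result);
    }
    if (result != null && filter != null && !filter.test(result)) {
      return null;
    }
    return result;
  }

  public static void main(String[] args) {
    // Stand-in rules for illustration only.
    UnaryOperator<String> lowerCase = s -> s.toLowerCase(Locale.ROOT);
    Predicate<String> noImages = s -> !s.endsWith(".jpg");

    LinkCleaner both = new LinkCleaner(lowerCase, noImages);
    LinkCleaner neither = new LinkCleaner(null, null);

    // With both steps on, the URL is normalized first and then rejected.
    System.out.println(both.clean("http://Example.com/a.JPG"));
    // With both steps off, the URL passes through untouched.
    System.out.println(neither.clean("http://Example.com/a.JPG"));
  }
}
```

A link for which clean returns null would be left out of the web graph, which
mirrors the case of a CrawlDatum that a URL filter removed from the CrawlDB.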


[jira] [Commented] (NUTCH-1143) Omit anchor in webgraph's LinkDatum

2011-10-03 Thread Markus Jelsma (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119281#comment-13119281
]

Markus Jelsma commented on NUTCH-1143:
--

It seems the anchor field was once used for indexing the best ranking anchor
for a given URL, but that indexing code is legacy. With the current version users
must invert links, pass the linkdb and enable index-anchor to index anchors,
so having an anchor in LinkDatum is obsolete for now.

Instead of completely removing the anchor code we should make it optional. By
doing that we can write indexing code later and pass the webgraph to the
indexer instead of a linkdb.

I opt for defaulting the setting to false (i.e. do not store anchors) since
they are unusable at the moment.

Omit anchor in webgraph's LinkDatum
---

Key: NUTCH-1143
URL: https://issues.apache.org/jira/browse/NUTCH-1143
Project: Nutch
Issue Type: Improvement
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.5

Anchors are stored unchecked in the webgraph, apparently for cosmetic
reasons only. When dealing with hundreds of millions of records they take up
significant space and I/O time.
This issue should add an option to omit the anchor.

[jira] [Resolved] (NUTCH-1058) Upgrade Solr schema to version 1.4

2011-10-03 Thread Markus Jelsma (Resolved) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1058.
--

Resolution: Fixed
  Assignee: Markus Jelsma

Committed for 1.4 in rev. 1178409 and for nutchgora in rev. 1178410.

 Upgrade Solr schema to version 1.4
 --

 Key: NUTCH-1058
 URL: https://issues.apache.org/jira/browse/NUTCH-1058
 Project: Nutch
  Issue Type: Task
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Trivial
 Fix For: 1.4, nutchgora


 The version of our Solr schema should be updated from 1.3 to the current 
 version. I propose to commit the change prior to the 1.4 and 2.0 RCs; the Solr 
 schema version may have incremented more than once at the time of an RC.


[jira] [Updated] (NUTCH-717) Make Nutch Solr integration easier

2011-10-03 Thread Markus Jelsma (Updated) (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-717:


Fix Version/s: (was: 1.4)
   1.5

 Make Nutch Solr integration easier
 --

 Key: NUTCH-717
 URL: https://issues.apache.org/jira/browse/NUTCH-717
 Project: Nutch
  Issue Type: New Feature
Reporter: Sami Siren
Priority: Critical
 Fix For: nutchgora, 1.5


 Erik Hatcher proposed we should provide a full Solr config dir to be used 
 with Nutch-Solr. Now we only provide the index schema. It would be considerably 
 easier to set up nutch-solr if we provided the whole conf dir that you could 
 use with Solr like:
 java -Dsolr.solr.home=<Nutch's Solr Home> -jar start.jar


[jira] [Commented] (NUTCH-1136) Ant pmd target is broken

2011-10-03 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119365#comment-13119365
 ] 

Lewis John McGibbney commented on NUTCH-1136:
-

Would like to commit before RC for 1.4 if possible. 

 Ant pmd target is broken
 

 Key: NUTCH-1136
 URL: https://issues.apache.org/jira/browse/NUTCH-1136
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.4, nutchgora

 Attachments: NUTCH-1136-nutchgora-20110930.patch, 
 NUTCH-1136-trunk-1.4-20110930.patch


 Issuing an 'ant pmd' command results in a failure as follows:
 {code}
 BUILD FAILED
 /home/lewis/ASF/trunk/build.xml:327: taskdef class 
 net.sourceforge.pmd.ant.PMDTask cannot be found
  using the classloader AntClassLoader[]
 {code}
 The resulting fix should address this.


[jira] [Commented] (NUTCH-1109) Add Sonar targets to Ant build.xml

2011-10-03 Thread Lewis John McGibbney (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119367#comment-13119367
 ] 

Lewis John McGibbney commented on NUTCH-1109:
-

Would like to commit before RC for 1.4 if possible. 

 Add Sonar targets to Ant build.xml
 --

 Key: NUTCH-1109
 URL: https://issues.apache.org/jira/browse/NUTCH-1109
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.4, nutchgora
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
  Labels: build
 Fix For: 1.4, nutchgora

 Attachments: NUTCH-1109-branch-1.4-20110910.patch, 
 NUTCH-1109-trunk-1.4-20110927.patch, sonar-ant-task-1.1.jar


 Sonar [1] is an open platform to manage code quality. I was experimenting 
 today with what kind of analysis it allows us to do on a given codebase and 
 was pleasantly surprised with the results. For details on the documentation 
 please see here [2]. It can be easily integrated into our ant build.xml and 
 is an easy way to explicitly identify latent areas of code which we could 
 possibly improve upon. 
 At this stage I wish to highlight some statistics from my findings.
 Running Sonar via the attached patch identifies (based upon the analysis 
 rules from Sonar) that the Branch-1.4 codebase contains issues as follows
 {code}
 Critical 28   
 Major 1,231   
 Minor 356 
 Info  119
 {code}
 These range from a catch statement being identified in o.a.n.crawl.Generator 
 which shouldn't be catching throwable since it includes errors, through to 
 trivial issues such as nested statements which could be combined in the same 
 class.
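
To illustrate the first kind of finding, here is a minimal sketch of why a
handler for Throwable is broader than one for Exception. It is not taken from
o.a.n.crawl.Generator; the class CatchScope and the simulated
StackOverflowError are assumptions chosen only to make the difference visible.

```java
// Illustrative sketch, unrelated to the actual Generator code: compares a
// catch (Throwable) handler with a catch (Exception) handler.
class CatchScope {
  // Returns true if the failure was swallowed by the handler.
  static boolean catchingThrowable(Runnable task) {
    try {
      task.run();
      return false;
    } catch (Throwable t) {   // also catches Error subclasses
      return true;
    }
  }

  // Returns true if the failure was swallowed by the handler.
  static boolean catchingException(Runnable task) {
    try {
      task.run();
      return false;
    } catch (Exception e) {   // Errors are not caught and reach the caller
      return true;
    }
  }

  public static void main(String[] args) {
    Runnable failing = new Runnable() {
      public void run() {
        throw new StackOverflowError("simulated");
      }
    };
    // The Throwable handler hides the Error from the caller.
    System.out.println("swallowed: " + catchingThrowable(failing));
    // The Exception handler lets the Error propagate.
    try {
      catchingException(failing);
    } catch (StackOverflowError e) {
      System.out.println("Error reached the caller");
    }
  }
}
```
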
 Although on the face of it this seems an excellent way to make code more 
 consistent across the board, which may in turn lead to 'better' code, I am in 
 no way saying that this is a step we should take without thinking it 
 through and discussing it at length. I also think that we need to apply a good 
 deal of our own judgement to decide whether any issues flagged up by Sonar 
 should be marked as false positives.
 To conclude, I would like to add that I only decided to open this issue in an 
 attempt to gauge people's views on the direction it takes us in.
 [1] http://www.sonarsource.org/
 [2] http://docs.codehaus.org/display/SONAR/Documentation


Re: Providing a list of FAQ's with every new subscribe request

2011-10-03 Thread Sami Siren

On Mon, Oct 3, 2011 at 3:48 PM, lewis john mcgibbney 
lewis.mcgibb...@gmail.com wrote:


 Would it be possible to send out a list of our official FAQs when a new
 user confirms their subscription to both user@ and dev@ lists?


It seems this is possible. Can you craft a piece of text you would like to
be sent out on successful subscribe and I'll try to set it up.

This is the full list of files ezmlm lists as editable, just in case
someone comes up with something else to customize:

File                Use

bottom              bottom of all responses. General command info.
digest              'administrivia' section of digests.
faq                 frequently asked questions specific to this list.
get_bad             in place of messages not found in the archive.
help                general help (between 'top' and 'bottom').
info                list info. First line should be meaningful on its own.
mod_help            specific help for list moderators.
mod_reject          to sender of rejected post.
mod_request         to message moderators together with post.
mod_sub             to subscriber after moderator confirmed subscribe.
mod_sub_confirm     to subscription mod to request subscribe confirm.
mod_timeout         to sender of timed-out post.
mod_unsub_confirm   to remote admin to request unsubscribe confirm.
sub_bad             to subscriber if confirm was bad.
sub_confirm         to subscriber to request subscribe confirm.
sub_nop             to subscriber after re-subscription.
sub_ok              to subscriber after successful subscription.
top                 top of all responses.
unsub_bad           to subscriber if unsubscribe confirm was bad.
unsub_confirm       to subscriber to request unsubscribe confirm.
unsub_nop           to non-subscriber after unsubscribe.
unsub_ok            to ex-subscriber after successful unsubscribe.

--
 Sami Siren

Build failed in Jenkins: Nutch-trunk #1623

2011-10-03 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-trunk/1623/changes

Changes:

[markus] NUTCH-1058 Upgrade Solr schema to version 1.4

[markus] NUTCH-1137 LinkDB other options ignored with -dir

--
[...truncated 937 lines...]
A  src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/sv.test
A  src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/nl.test
A  src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/it.test
AU src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/test-referencial.txt
A  src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/fi.test
AU src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/el.test
A  src/plugin/language-identifier/src/java
A  src/plugin/language-identifier/src/java/org
A  src/plugin/language-identifier/src/java/org/apache
A  src/plugin/language-identifier/src/java/org/apache/nutch
A  src/plugin/language-identifier/src/java/org/apache/nutch/analysis
A  src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang
AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java
AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/langmappings.properties
AU src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/package.html
AU src/plugin/language-identifier/plugin.xml
AU src/plugin/language-identifier/build.xml
A  src/plugin/feed
A  src/plugin/feed/sample
A  src/plugin/feed/sample/rsstest.rss
A  src/plugin/feed/ivy.xml
A  src/plugin/feed/src
A  src/plugin/feed/src/test
A  src/plugin/feed/src/test/org
A  src/plugin/feed/src/test/org/apache
A  src/plugin/feed/src/test/org/apache/nutch
A  src/plugin/feed/src/test/org/apache/nutch/parse
A  src/plugin/feed/src/test/org/apache/nutch/parse/feed
AU src/plugin/feed/src/test/org/apache/nutch/parse/feed/TestFeedParser.java
A  src/plugin/feed/src/java
A  src/plugin/feed/src/java/org
A  src/plugin/feed/src/java/org/apache
A  src/plugin/feed/src/java/org/apache/nutch
A  src/plugin/feed/src/java/org/apache/nutch/parse
A  src/plugin/feed/src/java/org/apache/nutch/parse/feed
AU src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
A  src/plugin/feed/src/java/org/apache/nutch/indexer
A  src/plugin/feed/src/java/org/apache/nutch/indexer/feed
AU src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
AU src/plugin/feed/plugin.xml
AU src/plugin/feed/build.xml
A  src/plugin/subcollection
A  src/plugin/subcollection/ivy.xml
A  src/plugin/subcollection/src
A  src/plugin/subcollection/src/test
A  src/plugin/subcollection/src/test/org
A  src/plugin/subcollection/src/test/org/apache
A  src/plugin/subcollection/src/test/org/apache/nutch
A  src/plugin/subcollection/src/test/org/apache/nutch/collection
AU src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java
A  src/plugin/subcollection/src/java
A  src/plugin/subcollection/src/java/org
A  src/plugin/subcollection/src/java/org/apache
A  src/plugin/subcollection/src/java/org/apache/nutch
A  src/plugin/subcollection/src/java/org/apache/nutch/collection
AU src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
AU src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
AU src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
AU src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
AU src/plugin/subcollection/README.txt
AU src/plugin/subcollection/plugin.xml
AU src/plugin/subcollection/build.xml
A  src/plugin/index-more
A  src/plugin/index-more/ivy.xml
A  src/plugin/index-more/src
A  src/plugin/index-more/src/test
A  src/plugin/index-more/src/test/org
A  src/plugin/index-more/src/test/org/apache
A  src/plugin/index-more/src/test/org/apache/nutch
A  src/plugin/index-more/src/test/org/apache/nutch/indexer
A  src/plugin/index-more/src/test/org/apache/nutch/indexer/more
AU src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A  src/plugin/index-more/src/java
A

[jira] [Commented] (NUTCH-1058) Upgrade Solr schema to version 1.4

2011-10-03 Thread Hudson (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119888#comment-13119888
 ] 

Hudson commented on NUTCH-1058:
---

Integrated in Nutch-trunk #1623 (See 
[https://builds.apache.org/job/Nutch-trunk/1623/])
NUTCH-1058 Upgrade Solr schema to version 1.4

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1178409
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/schema.xml




[jira] [Commented] (NUTCH-1137) LinkDb / invertlinks: command line arguments ignored

2011-10-03 Thread Hudson (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119889#comment-13119889
 ] 

Hudson commented on NUTCH-1137:
---

Integrated in Nutch-trunk #1623 (See 
[https://builds.apache.org/job/Nutch-trunk/1623/])
NUTCH-1137 LinkDB other options ignored with -dir

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1178376
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/LinkDb.java




[jira] [Commented] (NUTCH-1058) Upgrade Solr schema to version 1.4

2011-10-03 Thread Hudson (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119890#comment-13119890
 ] 

Hudson commented on NUTCH-1058:
---

Integrated in Nutch-nutchgora #25 (See 
[https://builds.apache.org/job/Nutch-nutchgora/25/])
NUTCH-1058 Upgrade Solr schema to version 1.4

markus : 
http://svn.apache.org/viewvc/nutch/branches/nutchgora/viewvc/?view=revroot=revision=1178410
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/schema.xml




Build failed in Jenkins: Nutch-nutchgora #25

2011-10-03 Thread Apache Jenkins Server

See https://builds.apache.org/job/Nutch-nutchgora/25/changes

Changes:

[markus] NUTCH-1058 Upgrade Solr schema to version 1.4

--
[...truncated 2491 lines...]
[ivy:resolve] :: loading settings :: file = 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/classes
[javac] Note: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.

jar:
  [jar] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/test
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/classes

jar:
  [jar] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/test
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/src/plugin/build-plugin.xml:117:
 warning: 'includeantruntime' was not set, defaulting to 
build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/classes

jar:
  [jar] Building jar: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] Copying 1 file to 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-basic

init:
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/classes
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/urlnormalizer-pass/test
[mkdir] Created dir: 
/x1/jenkins/jenkins-slave/workspace/Nutch-nutchgora/nutchgora/build/plugins/urlnormalizer-pass

init-plugin:
