from:"Julien Nioche"

[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-07-10 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1087:
-

Attachment: NUTCH-1087-2.1.patch

Similar patch for 2.x - NOT TESTED YET

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>    Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-07-10 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13410423#comment-13410423
 ] 

Julien Nioche commented on NUTCH-1087:
--

Trunk : committed revision 1359720.
2.x => still needs testing

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html


[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-07-10 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1087:
-

Fix Version/s: 2.1

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>    Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html


[jira] [Created] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-19 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-1433:


 Summary: Upgrade to Tika 1.2
 Key: NUTCH-1433
 URL: https://issues.apache.org/jira/browse/NUTCH-1433
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.6, 2.1





[jira] [Updated] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-19 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1433:
-

Attachment: NUTCH-1433-trunk.patch

patch for trunk - please test

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1433-trunk.patch
>
>



[jira] [Updated] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-20 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1433:
-

Attachment: NUTCH-1433-trunk-2.patch

Dependency on juniversalchardet needed in root ivy.xml

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch
>
>



[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419014#comment-13419014
 ] 

Julien Nioche commented on NUTCH-1433:
--

Markus : I can't reproduce this issue. Are you getting this with trunk?


> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch
>
>



[jira] [Commented] (NUTCH-1341) NotModified time set to now but page not modified

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419084#comment-13419084
 ] 

Julien Nioche commented on NUTCH-1341:
--

Looks like a reasonable thing to do

> NotModified time set to now but page not modified
> -
>
> Key: NUTCH-1341
> URL: https://issues.apache.org/jira/browse/NUTCH-1341
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1341-1.6-1.patch
>
>
> Servers tend to respond with incorrect or no value for LastModified. By 
> comparing signatures or when (fetch.getStatus() == 
> CrawlDatum.STATUS_FETCH_NOTMODIFIED) the reducer correctly sets the 
> db_notmodified status for the CrawlDatum. The modifiedTime value, however, is 
> not set accordingly.


[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419083#comment-13419083
 ] 

Julien Nioche commented on NUTCH-1388:
--

I don't really like the names fixedFetchInterval vs fetchInterval; they are 
confusing and unclear. What about having a single customFetchInterval instead, 
which would be used during the injection and would take precedence when using 
the AdaptiveFetchSchedule? If the default fetch schedule is used, then the custom 
value would obviously be used.

> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch
>
>
> During injection a custom fetch interval can be configured but it is not 
> maintained with an AdaptiveFetchSchedule enabled. 


[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419108#comment-13419108
 ] 

Julien Nioche commented on NUTCH-1388:
--

can't you define the default value in nutch-site.xml? 

> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch
>
>
> During injection a custom fetch interval can be configured but it is not 
> maintained with an AdaptiveFetchSchedule enabled. 


[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419135#comment-13419135
 ] 

Julien Nioche commented on NUTCH-1388:
--

OK got it, thanks

bq. We have to differentiate between a default interval and an interval that 
will never change.
Actually between a default interval (nutch-site.xml), a custom interval that 
can change and a custom interval that never changes.

What about using 'nutch.fetchInterval.fixed' instead of 
nutch.fixedFetchInterval? Purely cosmetic of course ;-)
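
The three kinds of interval distinguished above can be sketched as follows. This is a minimal Python illustration of the precedence only, not Nutch's Java code: the key 'nutch.fetchInterval.fixed' is the name proposed in this comment, while 'nutch.fetchInterval' as the key for the changeable custom interval and both function names are assumptions made for the example.

```python
from typing import Mapping, Tuple

# 'nutch.fetchInterval.fixed' is the name proposed in the comment above.
# 'nutch.fetchInterval' is an assumed key for the custom, changeable interval.
FIXED_KEY = "nutch.fetchInterval.fixed"
CUSTOM_KEY = "nutch.fetchInterval"


def initial_interval(seed_meta: Mapping[str, str],
                     default_interval: int) -> Tuple[int, bool]:
    """Return (interval_in_seconds, is_fixed) for a newly injected URL."""
    if FIXED_KEY in seed_meta:
        # Custom interval that must never change.
        return int(seed_meta[FIXED_KEY]), True
    if CUSTOM_KEY in seed_meta:
        # Custom interval that the schedule may later adjust.
        return int(seed_meta[CUSTOM_KEY]), False
    # No per-URL value: fall back to the configured default.
    return default_interval, False


def next_interval(current: int, is_fixed: bool, adaptive_proposal: int) -> int:
    """Keep a fixed interval; otherwise accept the adaptive schedule's proposal."""
    return current if is_fixed else adaptive_proposal
```

With a default of 2592000 seconds, a seed carrying only the fixed key keeps its value across refetches, whereas a seed carrying the changeable key starts from its custom value and then follows whatever the adaptive schedule proposes.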



> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch
>
>
> During injection a custom fetch interval can be configured but it is not 
> maintained with an AdaptiveFetchSchedule enabled. 


[jira] [Commented] (NUTCH-1388) Optionally maintain custom fetch interval despite AdaptiveFetchSchedule

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419170#comment-13419170
 ] 

Julien Nioche commented on NUTCH-1388:
--

Looks fine +1

> Optionally maintain custom fetch interval despite AdaptiveFetchSchedule
> ---
>
> Key: NUTCH-1388
> URL: https://issues.apache.org/jira/browse/NUTCH-1388
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1388-1.6-1.patch, NUTCH-1388-1.6-2.patch, 
> NUTCH-1388-1.6-3.patch
>
>
> During injection a custom fetch interval can be configured but it is not 
> maintained with an AdaptiveFetchSchedule enabled. 


[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419175#comment-13419175
 ] 

Julien Nioche commented on NUTCH-1433:
--

Committed in trunk : revision 1363794.

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch
>
>



[jira] [Updated] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-20 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1433:
-

Attachment: NUTCH-1433.branch-2.patch

Patch for 2.x - strangely the versions of the dependencies are not the same as for 
trunk. Passes the tests.

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch, 
> NUTCH-1433.branch-2.patch
>
>



[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419260#comment-13419260
 ] 

Julien Nioche commented on NUTCH-1433:
--

Anyone to test the patch for 2.x?

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch, 
> NUTCH-1433.branch-2.patch
>
>



[jira] [Commented] (NUTCH-1433) Upgrade to Tika 1.2

2012-07-20 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419258#comment-13419258
 ] 

Julien Nioche commented on NUTCH-1433:
--

Hmm, probably had a problem with the ivy cache unless the remote pom for Tika 
has changed. Anyway, now getting the same deps as 2.x
Committed the revised plugin.xml in revision 1363842

> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.6, 2.1
>
> Attachments: NUTCH-1433-trunk-2.patch, NUTCH-1433-trunk.patch, 
> NUTCH-1433.branch-2.patch
>
>



[jira] [Commented] (NUTCH-1445) Add ElasticIndexerJob that indexes to elasticsearch

2012-08-06 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429058#comment-13429058
 ] 

Julien Nioche commented on NUTCH-1445:
--

Ferdy - just to reiterate what was said on a previous issue : please give 
people time to review your contribs before committing your own stuff. I am sure 
your code is fine and it does not really affect existing code too much but I 
think it is a good practice that we should try and stick to.

Instead of having multiple commands for the indexing backends, can't we have a 
single job and define the backends (SOLR, ES) via configuration? There is 
an open issue on 'pluggable indexing backends' 
[https://issues.apache.org/jira/browse/NUTCH-1047]; can we discuss this there?



> Add ElasticIndexerJob that indexes to elasticsearch
> ---
>
> Key: NUTCH-1445
> URL: https://issues.apache.org/jira/browse/NUTCH-1445
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: NUTCH-1445-addPropsToConfig.patch, 
> NUTCH-1445-addToNutchScript.patch, NUTCH-1445.patch
>
>
> We have created a new indexer job ElasticIndexerJob that indexes to 
> elasticsearch. It is originally based upon 
> https://github.com/ctjmorgan/nutch-elasticsearch-indexer (Apache2 license), 
> but we have modified it greatly to make it integrate as well as possible into 
> Nutch. The greatest modification is that documents are asynchronously flushed 
> in bulk to elasticsearch.
> Elasticsearch rocks. Both performance and ease of configuration are awesome. 
> You simply deploy a server by unpacking the tar, configure the clustername, 
> start the server and fire away indexing requests. Indices are automatically 
> created. Fields are automapped. (Of course it is recommended to create your 
> own optimized mapping, but that is beyond the scope of this issue). Multiple 
> servers connect without extra configuration, simply by using the same 
> clustername. (By means of multicast). There are tons of advanced options, such 
> as sharding, replication, disk striping etc.
> To give an example of the performance: With 20+ nodes we are able to index 
> over 1M docs (average sized webdocuments) per minute. The best part is that 
> the added documents are almost instantly searchable, so there are no hidden 
> commit costs like those Solr has. This is with out-of-the-box configuration.
> (I will attach patch and commit for Nutch2. Feel free to adapt for trunk.)


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2012-08-06 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429101#comment-13429101
 ] 

Julien Nioche commented on NUTCH-1047:
--

Thanks for your comments Ferdy

bq.  What I've changed in Nutch2.x is that IndexerOutputFormat does not extend 
from FileOutputFormat anymore.

would be good to do the same for 1.x

bq. "whether we will be able to use implementations of NutchIndexWriter from 
within a plugin"
bq. What do you mean with this?

I meant that we need to check whether we can have the NutchIndexWriter 
implementations available in a plugin, which would be nice as we'd have our 
generic commands + the indexing endpoints implementations in their respective 
plugins (e.g. indexer-SOLR, indexer-ES) etc... 
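
The idea of generic commands plus backend implementations chosen by configuration can be sketched roughly as below. This is a Python illustration under stated assumptions, not the NutchIndexWriter API: the class names, the registry and the property name 'indexer.backends' are all hypothetical.

```python
from typing import Dict, List, Mapping, Type


class IndexWriter:
    """Hypothetical stand-in for a backend writer; it just records documents."""

    def __init__(self) -> None:
        self.docs: List[dict] = []

    def write(self, doc: dict) -> None:
        self.docs.append(doc)


class SolrWriter(IndexWriter):
    pass


class ElasticWriter(IndexWriter):
    pass


# In a plugin system each plugin would register itself; here it is a plain dict.
REGISTRY: Dict[str, Type[IndexWriter]] = {"solr": SolrWriter, "es": ElasticWriter}


def writers_from_config(conf: Mapping[str, str]) -> List[IndexWriter]:
    """Instantiate the backends named in the (assumed) 'indexer.backends' property."""
    names = [n.strip() for n in conf.get("indexer.backends", "").split(",")]
    return [REGISTRY[n]() for n in names if n]


def index_all(docs: List[dict], writers: List[IndexWriter]) -> None:
    """The single generic job: send every document to every configured backend."""
    for doc in docs:
        for writer in writers:
            writer.write(doc)
```

In this sketch an unknown backend name raises a KeyError, which makes a misconfiguration visible immediately instead of silently indexing nowhere.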


> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.6
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.


[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

2012-08-15 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434893#comment-13434893
 ] 

Julien Nioche commented on NUTCH-1434:
--

bq.  I haven't added the configuration because it's overridden by the command 
line switch regardless of the nutch-site.xml configuration.

I'd rather do it the way it's done in other parts of the code, i.e. take into account 
any value set in nutch-site.xml if nothing is set on the command line (see for 
instance fetcher.parse), and include the property in nutch-default.xml.
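
The precedence described here (command line first, then nutch-site.xml, then nutch-default.xml) can be sketched as a simple lookup chain. This is an illustrative Python sketch rather than the Hadoop/Nutch configuration code; the function name and the property name 'example.delete.noindex' are hypothetical.

```python
from typing import Mapping, Optional


def resolve_setting(name: str,
                    command_line: Mapping[str, str],
                    site: Mapping[str, str],
                    defaults: Mapping[str, str]) -> Optional[str]:
    """Return the first value found, searching from most to least specific source."""
    for source in (command_line, site, defaults):
        if name in source:
            return source[name]
    return None


# Hypothetical property; the three dicts stand in for the command line,
# nutch-site.xml and nutch-default.xml respectively.
defaults = {"example.delete.noindex": "false"}
site = {"example.delete.noindex": "true"}

# Nothing on the command line: the site value wins over the default.
from_site = resolve_setting("example.delete.noindex", {}, site, defaults)

# A command-line value overrides both files.
from_cli = resolve_setting("example.delete.noindex",
                           {"example.delete.noindex": "false"}, site, defaults)
```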

> Indexer to delete robots noIndex
> 
>
> Key: NUTCH-1434
> URL: https://issues.apache.org/jira/browse/NUTCH-1434
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.5.1
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1434-1.6-1.patch, NUTCH-1434-1.6-2.patch
>
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does 
> is remove the title and content fields from the parsed data. It does not stop 
> those pages from being indexed, nor can it delete existing pages from the 
> index if they change.


[jira] [Commented] (NUTCH-1434) Indexer to delete robots noIndex

2012-08-15 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13434927#comment-13434927
 ] 

Julien Nioche commented on NUTCH-1434:
--

Well, let's do configuration only then. After all it can be set on the command 
line with -D just as well + it means that we don't have to change the code 
reading the params etc...

> Indexer to delete robots noIndex
> 
>
> Key: NUTCH-1434
> URL: https://issues.apache.org/jira/browse/NUTCH-1434
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Affects Versions: 1.5.1
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1434-1.6-1.patch, NUTCH-1434-1.6-2.patch
>
>
> Nutch does not treat pages with meta robots="noindex" properly. All it does 
> is remove the title and content fields from the parsed data. It does not stop 
> those pages from being indexed, nor can it delete existing pages from the 
> index if they change.


[jira] [Commented] (NUTCH-1233) Rely on Tika for outlink extraction

2012-08-23 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13440271#comment-13440271
 ] 

Julien Nioche commented on NUTCH-1233:
--

Would be good to add some tests to illustrate the difference in behaviour + 
make sure that we are getting what we want

> Rely on Tika for outlink extraction
> ---
>
> Key: NUTCH-1233
> URL: https://issues.apache.org/jira/browse/NUTCH-1233
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.6
>
> Attachments: NUTCH-1233-1.5-wip.patch, NUTCH-1233-1.6-1.patch, 
> NUTCH-1233-1.6-2.patch
>
>
> Tika provides outlink extraction features that are not used in Nutch. To be 
> able to use it in Nutch we need Tika to return the rel attr value of each 
> link, which it currently doesn't. There's a patch for Tika 1.1. If that patch 
> is included in Tika and we upgraded to that new version this issue can be 
> worked on. Here's preliminary code that does both Tika and current outlink 
> extraction. This also includes parts of the Boilerpipe code.


[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-09-07 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450481#comment-13450481
 ] 

Julien Nioche commented on NUTCH-1459:
--

commit ref please Ferdy, thanks!

> Remove dead code (phase2) from InjectorJob
> --
>
> Key: NUTCH-1459
> URL: https://issues.apache.org/jira/browse/NUTCH-1459
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: nutch-1459.txt
>
>



[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-09-07 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450492#comment-13450492
 ] 

Julien Nioche commented on NUTCH-1459:
--

the branch reference but even more so the actual commit ref. You can do it for 
this one, can't you?

> Remove dead code (phase2) from InjectorJob
> --
>
> Key: NUTCH-1459
> URL: https://issues.apache.org/jira/browse/NUTCH-1459
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: nutch-1459.txt
>
>



[jira] [Commented] (NUTCH-1459) Remove dead code (phase2) from InjectorJob

2012-09-07 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450515#comment-13450515
 ] 

Julien Nioche commented on NUTCH-1459:
--

Nah, that's perfect! ;-)

> Remove dead code (phase2) from InjectorJob
> --
>
> Key: NUTCH-1459
> URL: https://issues.apache.org/jira/browse/NUTCH-1459
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.1
>
> Attachments: nutch-1459.txt
>
>



[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-09-13 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454790#comment-13454790
 ] 

Julien Nioche commented on NUTCH-1467:
--

bq. I will work on it soon but i am thinking of working on tika parser so that 
it can get all the attributes by default, index them and send it to solr 
'attr_*' dynamic field, so that instead of specifying manually any attributes 
will be accepted. That would be helpful i think than the parse-metatags.

A big fat -1 from me. It is definitely not a good idea to index all the possible 
attributes by default.

Adding a test illustrating the new behaviour for this issue would have been 
good. +1 to being able to store multiple values instead of relying on a 
separator by convention

Markus - my understanding is that committers mark an issue as resolved but it's 
up to the author of the issue to confirm that all is done by closing it.
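
The point above about storing multiple values rather than joining them with a separator can be sketched as follows. This is a minimal Python illustration, not the parse-metatags plugin code; the function name is hypothetical, and lower-casing tag names is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collect_metatags(tags: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group meta tag contents by name, keeping every value in document order."""
    values: Dict[str, List[str]] = defaultdict(list)
    for name, content in tags:
        # Assumption: tag names are compared case-insensitively.
        values[name.lower()].append(content)
    return dict(values)


# Two tags with the same name both survive, instead of the last one winning.
parsed = collect_metatags([
    ("keywords", "nutch"),
    ("Keywords", "crawler"),
    ("author", "someone"),
])
```

Keeping a list per name means a value that happens to contain the separator character cannot be split by mistake later on.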

> nutch 1.5.1 not able to parse mutliValued metatags
> --
>
> Key: NUTCH-1467
> URL: https://issues.apache.org/jira/browse/NUTCH-1467
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: kiran
>Priority: Minor
> Fix For: 1.6
>
> Attachments: patch.txt
>
>
> Hi,
> I have been able to parse metatags in an html page using 
> http://wiki.apache.org/nutch/IndexMetatags. It does not work quite well when 
> there are two metatags with same name but two different contents. 
> Does anyone encounter this kind of issue ?  
> Are there any changes that need to be made to the config files to make it 
> work ?
> When there are two tags with same name and different content, it takes the 
> value of the later tag and saves it rather than creating a multiValue field.
> Edit: I have attached the patch for the file and it is provided by DLA 
> (Digital Library and Archives) http://scholar.lib.vt.edu/ of Virginia Tech. 
> Many Thanks,


[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-09-13 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454856#comment-13454856
 ] 

Julien Nioche commented on NUTCH-1467:
--

Hi Kiran

Thank you for your comments. Re: indexing all attributes, this could be done by 
adding the option to parse-metatags and allowing values to be set using regular 
expressions in index-metadata.

Don't worry about being slow; no one's in a hurry and we are all learning from 
each other.

[jira] [Commented] (NUTCH-1467) nutch 1.5.1 not able to parse mutliValued metatags

2012-10-03 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468436#comment-13468436
 ] 

Julien Nioche commented on NUTCH-1467:
--

Thanks Kiran. See http://wiki.apache.org/nutch/HowToContribute for info on 
patches 


> nutch 1.5.1 not able to parse mutliValued metatags
> --
>
> Key: NUTCH-1467
> URL: https://issues.apache.org/jira/browse/NUTCH-1467
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: kiran
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1467-trunk.patch, Patch_HTMLMetaProcessor.patch, 
> Patch_HTMLMetaTags.patch, Patch_MetadataIndexer.patch, 
> Patch_MetaTagsParser.patch, patch.txt

[jira] [Updated] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1475:
-

Affects Version/s: (was: nutchgora)
   1.5.1

This is an issue for the 1.x branch as well 

> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> --
>
> Key: NUTCH-1475
> URL: https://issues.apache.org/jira/browse/NUTCH-1475
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1, 1.5.1
> Environment: All
>Reporter: James Sullivan
>Priority: Minor
>  Labels: index-more, plugins
> Attachments: index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Among other fields, the more plugin for Nutch 2.x provides a "last modified" 
> and "date" field for the Solr index. The "last modified" field is the last 
> modified date from the http headers if available, if not available it is left 
> empty. Currently, the "date" field is the same as the "last modified" field 
> unless that field is empty in which case getFetchTime is used as a fall back. 
> I think getFetchTime is not a good fall back as it is the next fetch time and 
> often a month or more in the future which doesn't make sense for the date 
> field. Users do not expect webpages/documents with future dates. A more 
> sensible fallback would be current date at the time it is indexed. 
> This is possible by simply changing line 97 of 
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
>  from
> time = page.getFetchTime(); // use fetch time
> to
> time = new Date().getTime();
> Users interested in the getFetchTime value can still get it from the "tstamp" 
> field.
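
The fallback proposed in the description can be sketched in isolation as 
below. This is not the MoreIndexingFilter code: the class DateFallback, the 
method resolveDate and the convention that a non-positive value means "no 
Last-Modified header" are assumptions made for illustration only.

```java
import java.util.Date;

// Hypothetical stand-alone version of the fallback logic; the actual proposal
// is the one-line change to MoreIndexingFilter quoted in the issue.
final class DateFallback {

  // Assumption: lastModified <= 0 means no Last-Modified header was available.
  static long resolveDate(long lastModified, long indexingTimeMillis) {
    if (lastModified > 0) {
      return lastModified;     // prefer the value from the HTTP headers
    }
    return indexingTimeMillis; // otherwise use the time of indexing
  }

  public static void main(String[] args) {
    long now = new Date().getTime();
    // With no Last-Modified value the "date" field becomes the indexing time.
    System.out.println(resolveDate(0L, now) == now);
  }
}
```

Passing the indexing time in as a parameter keeps the sketch testable; the 
proposed patch simply calls new Date().getTime() at that point.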

[jira] [Commented] (NUTCH-1344) BasicURLNormalizer to normalize https same as http

2012-10-10 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473066#comment-13473066
 ] 

Julien Nioche commented on NUTCH-1344:
--

Good catch Sebastian. Please commit to both trunk and 2.x

> BasicURLNormalizer to normalize https same as http 
> ---
>
> Key: NUTCH-1344
> URL: https://issues.apache.org/jira/browse/NUTCH-1344
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: nutchgora, 1.6
>Reporter: Sebastian Nagel
> Attachments: NUTCH-1344.patch
>
>
> Most of the normalization done by BasicURLNormalizer (lowercasing host, 
> removing default port, removal of page anchors, cleaning . and .. in the path) 
> is not done for URLs with protocol https.
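
As an illustration of the steps listed in the description, applied equally to 
http and https, here is a minimal sketch built on java.net.URI. It is not the 
BasicURLNormalizer implementation or the attached patch; the class name 
HttpAndHttpsNormalizer is hypothetical, and invalid input makes URI.create 
throw IllegalArgumentException.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical sketch, not Nutch code: the same normalization steps are
// applied whether the scheme is http or https.
final class HttpAndHttpsNormalizer {

  static String normalize(String urlString) {
    // URI.normalize() removes "." segments and resolves ".." segments.
    URI uri = URI.create(urlString).normalize();
    String scheme = uri.getScheme();
    String host = uri.getHost();
    if (scheme == null || host == null) {
      return urlString; // not a hierarchical URL with a host, leave untouched
    }
    scheme = scheme.toLowerCase(Locale.ROOT);
    boolean isHttps = scheme.equals("https");
    if (!isHttps && !scheme.equals("http")) {
      return urlString; // other protocols are out of scope for this sketch
    }
    int defaultPort = isHttps ? 443 : 80;
    int port = uri.getPort();

    StringBuilder sb = new StringBuilder();
    sb.append(scheme).append("://").append(host.toLowerCase(Locale.ROOT));
    if (port != -1 && port != defaultPort) {
      sb.append(':').append(port); // keep only non-default ports
    }
    String path = uri.getRawPath();
    sb.append(path == null || path.isEmpty() ? "/" : path);
    if (uri.getRawQuery() != null) {
      sb.append('?').append(uri.getRawQuery());
    }
    return sb.toString(); // the #anchor (fragment) is intentionally dropped
  }

  public static void main(String[] args) {
    System.out.println(normalize("HTTPS://Example.COM:443/a/./b/../c?x=1#top"));
  }
}
```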

[jira] [Commented] (NUTCH-1475) Nutch 2.1 Index-More Plugin -- A better fall back value for date field

2012-10-11 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13474198#comment-13474198
 ] 

Julien Nioche commented on NUTCH-1475:
--

Nope, looks like a reasonable thing to do

> Nutch 2.1 Index-More Plugin -- A better fall back value for date field
> --
>
> Key: NUTCH-1475
> URL: https://issues.apache.org/jira/browse/NUTCH-1475
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1, 1.5.1
> Environment: All
>Reporter: James Sullivan
>Priority: Minor
>  Labels: index-more, plugins
> Fix For: 1.6, 2.2
>
> Attachments: index-more-1xand2x.patch, index-more-2x.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h

[jira] [Commented] (NUTCH-710) Support for rel="canonical" attribute

2012-10-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13477716#comment-13477716
 ] 

Julien Nioche commented on NUTCH-710:
-

Iwan : sure, feel free to send a patch if you want to help it happen

> Support for rel="canonical" attribute
> -
>
> Key: NUTCH-710
> URL: https://issues.apache.org/jira/browse/NUTCH-710
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 1.1
>Reporter: Frank McCown
>Priority: Minor
> Fix For: 1.6, 2.2
>
>
> There is a new rel="canonical" attribute which is
> now being supported by Google, Yahoo, and Live:
> http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
> Adding support for this attribute value will potentially reduce the number of 
> URLs crawled and indexed and reduce duplicate page content.
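
To show what a parser would have to pick up, here is a rough sketch that 
extracts the canonical URL from a page. No patch exists for this issue at this 
point, so nothing below reflects a Nutch implementation; the class 
CanonicalLinkFinder and its regex-based matching are assumptions, and a real 
parse filter would more likely walk the parsed DOM.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Nutch code: finds <link rel="canonical" href="...">.
final class CanonicalLinkFinder {

  private static final Pattern LINK =
      Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);
  private static final Pattern REL_CANONICAL =
      Pattern.compile("\\brel\\s*=\\s*[\"']canonical[\"']", Pattern.CASE_INSENSITIVE);
  private static final Pattern HREF =
      Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

  static Optional<String> find(String html) {
    Matcher link = LINK.matcher(html);
    while (link.find()) {
      String tag = link.group();
      // Only <link> tags that declare rel="canonical" are of interest.
      if (REL_CANONICAL.matcher(tag).find()) {
        Matcher href = HREF.matcher(tag);
        if (href.find()) {
          return Optional.of(href.group(1));
        }
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    String html = "<head><link rel=\"canonical\" href=\"http://example.com/page\"/></head>";
    System.out.println(find(html).orElse("(none)"));
  }
}
```

A crawler could then treat the returned URL as the preferred one for the page 
when deciding what to index, which is how the attribute reduces duplicates.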

[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-10-19 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13479919#comment-13479919
 ] 

Julien Nioche commented on NUTCH-1477:
--

Thanks Mike. I confirm the issue. 
Did you recompile the Webpage class from the AVRO defs when using the latest 
version of AVRO? Could be an incompatibility between the versions.
Going back to the original problem I don't think the problem comes from AVRO as 
we would have it with the other backends as well. As for the MemStore I don't 
think it is used for anything else than tests.

> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>Reporter: Mike Baranczak
>
> Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. 
> Injection job throws NullPointerException, see below. No error when I switch 
> to MemStore.
> java.lang.NullPointerException
>   at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55)
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245)
>   at 
> org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54)
>   at 
> org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60)
>   at 
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
>   at 
> org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>   at 
> org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185)
>   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

[jira] [Resolved] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-10-20 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1087.
--

Resolution: Fixed

Nutch 2.x : Committed revision 1400390.

We can open a new issue if there are any problems with the script. It should be 
a good starting point.

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>    Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

[jira] [Resolved] (NUTCH-1433) Upgrade to Tika 1.2

2012-10-20 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1433.
--

Resolution: Fixed

Committed revision 1400397.



> Upgrade to Tika 1.2
> ---
>
> Key: NUTCH-1433
> URL: https://issues.apache.org/jira/browse/NUTCH-1433
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1433.branch-2.patch, NUTCH-1433-trunk-2.patch, 
> NUTCH-1433-trunk.patch
>
>


[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-10-25 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1477:
-

Fix Version/s: 2.2
 Assignee: Julien Nioche

> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>Reporter: Mike Baranczak
>Assignee: Julien Nioche
> Fix For: 2.2

[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-10-25 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1477:
-

Attachment: webpage.avsc

Modified avro schema which allows fields to be null

> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>Reporter: Mike Baranczak
>Assignee: Julien Nioche
> Fix For: 2.2
>
> Attachments: webpage.avsc

[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-10-25 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484148#comment-13484148
 ] 

Julien Nioche commented on NUTCH-1477:
--

I found in 
http://mail-archives.apache.org/mod_mbox/avro-user/200910.mbox/%3c4ae78503.50...@apache.org%3E
 that we probably need to explicitly allow for null values in the schema (see 
attachment). 
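
For readers unfamiliar with why the schema has to allow null explicitly, the 
toy encoder below mimics the Avro binary encoding rules: a plain "string" is a 
length followed by UTF-8 bytes and has no representation for null, whereas a 
union writes a branch index first. This is an illustration only. It is NOT 
Avro's BinaryEncoder, the class NullableStringEncoding is hypothetical, and the 
union order ["string", "null"] is an assumption rather than a copy of the 
attached webpage.avsc.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Toy encoder for illustration; it is NOT Avro's BinaryEncoder.
final class NullableStringEncoding {

  // Avro writes int/long values as zig-zag encoded variable-length integers.
  static void writeLong(ByteArrayOutputStream out, long n) {
    long z = (n << 1) ^ (n >> 63);
    while ((z & ~0x7FL) != 0) {
      out.write((int) ((z & 0x7F) | 0x80));
      z >>>= 7;
    }
    out.write((int) z);
  }

  // Plain "string" schema: length, then UTF-8 bytes. A null value cannot be
  // represented, so this throws NullPointerException when s is null.
  static void writeString(ByteArrayOutputStream out, String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeLong(out, utf8.length);
    out.write(utf8, 0, utf8.length);
  }

  // Union ["string", "null"]: the branch index comes first, then the value.
  static void writeNullableString(ByteArrayOutputStream out, String s) {
    if (s == null) {
      writeLong(out, 1); // branch 1 is "null", which itself takes zero bytes
    } else {
      writeLong(out, 0); // branch 0 is "string"
      writeString(out, s);
    }
  }

  static byte[] encodeNullable(String s) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeNullableString(out, s);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    System.out.println(encodeNullable(null).length); // one byte: the branch index
    System.out.println(encodeNullable("a").length);  // index + length + one char
  }
}
```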

I tried recompiling the schemas with {{ant compile-avro-schema}} but the 
classes generated do not compile and are nowhere near as complete as the 
original ones. More worryingly the same is true with the original schema. I 
assumed that the code in org.apache.nutch.storage could be generated from the 
schemas.

Any idea?

> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>Reporter: Mike Baranczak
>Assignee: Julien Nioche
> Fix For: 2.2
>
> Attachments: webpage.avsc

[jira] [Updated] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-10-25 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1477:
-

Priority: Critical  (was: Major)

> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>Reporter: Mike Baranczak
>Assignee: Julien Nioche
>Priority: Critical
> Fix For: 2.2
>
> Attachments: webpage.avsc

[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-10-25 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13484169#comment-13484169
 ] 

Julien Nioche commented on NUTCH-1477:
--

Found a clue in https://issues.apache.org/jira/browse/NUTCH-842. Not sure what 
the point of compile-avro-schema is but we need to compile the schemas with 
gora and not just avro. The generated classes now compile fine.

Using the modified schema fails at compilation as the generated objects don't 
have accessors e.g. getContentType()



> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>Reporter: Mike Baranczak
>Assignee: Julien Nioche
>Priority: Critical
> Fix For: 2.2
>
> Attachments: webpage.avsc

[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-10-26 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485172#comment-13485172
 ] 

Julien Nioche commented on NUTCH-1477:
--

Hi Lewis

bq. Do you suggest we update the patch in NUTCH-842 with the correct package 
name for Gora in the Nutch build.xml file and remove the ant 
compile-avro-schema target?

yes, until someone can explain what that target is useful for?

bq. If no accessors are generated then is this not a problem with the Gora 
compiler? If so we should open a ticket over there and link the issues.

It is indeed. It looks like the Gora compiler can't deal with the ["string", 
"null"] union. I will create an issue in GORA land.



> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>    Reporter: Mike Baranczak
>Assignee: Julien Nioche
>Priority: Critical
> Fix For: 2.2
>
> Attachments: webpage.avsc

[jira] [Created] (NUTCH-1482) Rename HTMLParseFilter

2012-10-29 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-1482:


 Summary: Rename HTMLParseFilter
 Key: NUTCH-1482
 URL: https://issues.apache.org/jira/browse/NUTCH-1482
 Project: Nutch
  Issue Type: Task
  Components: parser
Affects Versions: 1.5.1
Reporter: Julien Nioche


See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
better reflect what it does and I think we should do the same for 1.x.

any objections?

[jira] [Commented] (NUTCH-1482) Rename HTMLParseFilter

2012-10-31 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487624#comment-13487624
 ] 

Julien Nioche commented on NUTCH-1482:
--

Having 2 extension points would be a bit of an overkill IMHO - there aren't any 
changes in the methods and people just need to do a minor change to the core 
and xml config which I don't think is unreasonable when moving from one version 
to the next as long as it is mentioned in the Wiki.

BTW maybe we should organize the CHANGES.txt a bit differently and organise it 
by type of change (optimisation - bug fix - incompatible change) as done in 
other projects instead of simply listing the JIRAs

> Rename HTMLParseFilter
> --
>
> Key: NUTCH-1482
> URL: https://issues.apache.org/jira/browse/NUTCH-1482
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.5.1
>Reporter: Julien Nioche
>
> See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
> better reflect what it does and I think we should do the same for 1.x.
> any objections?

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488728#comment-13488728
 ] 

Julien Nioche commented on NUTCH-1480:
--

Hi Lewis

bq. Can I run multiple Solr servers in pseudo distributed mode?

SOLR is completely separate from Hadoop and has nothing to do with local vs 
distrib. You can run several instances of SOLR on the same machine if that is 
your question. Just specify a different port when starting it from the command 
line, with a separate SOLR home.

Markus,

Just to make sure I understand - this sends ALL the documents to ALL the SOLR 
instances specified, right? 

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.
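
The behaviour described in the issue (read a comma-delimited list of URLs, then send every document to every server) can be sketched as follows. This is only an illustration under stated assumptions, not the code from the attached patch: `parse_server_urls`, `RecordingServer` and `write_to_all` are hypothetical names, and the stand-in class merely records documents instead of talking to Solr.

```python
def parse_server_urls(value):
    """Split a comma-delimited property value into a list of URLs.

    Surrounding whitespace and empty entries (e.g. a trailing comma)
    are dropped.
    """
    return [url.strip() for url in value.split(",") if url.strip()]


class RecordingServer:
    """Hypothetical stand-in for a Solr client; it only records documents."""

    def __init__(self, url):
        self.url = url
        self.docs = []

    def add(self, doc):
        self.docs.append(doc)


def write_to_all(servers, doc):
    """Send the same document to every configured server."""
    for server in servers:
        server.add(doc)
```

With a value such as `"http://solr1:8983/solr, http://solr2:8983/solr"` (made-up hosts), both stand-in servers end up holding the same documents.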

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488738#comment-13488738
 ] 

Julien Nioche commented on NUTCH-1480:
--

OK thanks. What about having a mechanism for specifying a way of distributing 
the docs with the replicate-to-all being one of the options? Could do 
consistent hashing maybe? I expect that most people would want to shard.
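
A minimal sketch of the consistent hashing idea mentioned above, assuming documents are routed by their id (for instance the URL). The class name, the number of virtual points and the choice of SHA-256 are assumptions made for this illustration; nothing here comes from Nutch or SolrJ.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps document ids to one of several servers via a hash ring."""

    def __init__(self, servers, replicas=100):
        # Each server gets `replicas` virtual points on the ring so that
        # documents spread fairly evenly across the servers.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def server_for(self, doc_id):
        # Pick the first ring point after the document's hash, wrapping
        # around to the start of the ring when the hash is past the end.
        index = bisect.bisect(self._points, self._hash(doc_id))
        return self._ring[index % len(self._ring)][1]
```

The point of the ring, as opposed to `hash(id) % n`, is that adding a server only moves the documents that now belong to the new server; all other documents stay where they were.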

off topic re-deduplication : I think we've hit the limits of the current 
mechanism which I assume was based on the one we had when Nutch was managing 
its own Lucene indices. It's not reasonable to pump ALL the docs from SOLR into 
Hadoop to dedup and I'd rather have map reduce jobs to find the duplicates 
based on the crawldb and send the deletion commands to SOLR. And this would 
work for ElasticSearch as well. Am pretty sure there is a JIRA for this 
somewhere 
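
The deduplication approach suggested above could look roughly like this. It is a sketch under assumptions: each crawldb entry is reduced to a hypothetical `(url, signature, score)` tuple, the highest-scoring URL per signature is kept, and the returned URLs stand for the deletion commands that would be sent to the index. In a map reduce job the grouping by signature would be done by the shuffle and the selection by the reducer.

```python
from collections import defaultdict


def find_duplicates(records):
    """Return the URLs to delete from the index.

    `records` is an iterable of (url, signature, score) tuples. For each
    signature the highest-scoring URL is kept; ties are broken by the
    shorter URL and then alphabetically, so the result is deterministic.
    """
    by_signature = defaultdict(list)
    for url, signature, score in records:
        by_signature[signature].append((url, score))

    to_delete = []
    for entries in by_signature.values():
        entries.sort(key=lambda entry: (-entry[1], len(entry[0]), entry[0]))
        # Everything after the first entry is a duplicate of the kept URL.
        to_delete.extend(url for url, _ in entries[1:])
    return sorted(to_delete)
```

Only the URLs and signatures travel through the job, so the documents themselves never need to be pulled out of the index.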

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2012-11-01 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488786#comment-13488786
 ] 

Julien Nioche commented on NUTCH-1480:
--

nope. I meant implementing the distribution to the shards on the Nutch side 
without relying on the CloudSolrServer. Having said that we want to move to 
SOLR4 and if we get that from SOLR for cheap then that's even better

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.6
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URL's using Configuration.getString(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.

[jira] [Updated] (NUTCH-1487) Nutch parse fails first time for PDF files and works on reparse

2012-11-01 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1487:
-

Component/s: storage
 parser

> Nutch parse fails first time for PDF files and works on reparse
> ---
>
> Key: NUTCH-1487
> URL: https://issues.apache.org/jira/browse/NUTCH-1487
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, storage
>Affects Versions: 2.1
>Reporter: kiran
>  Labels: mysql
>
> The parser fails to parse PDF files in one go; it only succeeds after the 
> parse command has been re-run as many times as there are PDF files, as 
> discussed on the mailing list here 
> (http://www.mail-archive.com/user%40nutch.apache.org/msg07952.html) 

[jira] [Updated] (NUTCH-1487) Nutch parse fails first time for PDF files and works on reparse

2012-11-01 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1487:
-

Labels: mysql  (was: )

> Nutch parse fails first time for PDF files and works on reparse
> ---
>
> Key: NUTCH-1487
> URL: https://issues.apache.org/jira/browse/NUTCH-1487
> Project: Nutch
>  Issue Type: Bug
>  Components: parser, storage
>Affects Versions: 2.1
>Reporter: kiran
>  Labels: mysql
>
> The parser fails to parse PDF files in one go; it only succeeds after the 
> parse command has been re-run as many times as there are PDF files, as 
> discussed on the mailing list here 
> (http://www.mail-archive.com/user%40nutch.apache.org/msg07952.html) 

[jira] [Resolved] (NUTCH-747) inject&Index metadatas and inherit these metadatas to all matching suburls

2012-11-05 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-747.
-

Resolution: Implemented

This has since been made possible thanks to:
- Metadata injection (https://issues.apache.org/jira/browse/NUTCH-655)
- urlmeta plugin
- index-metadata plugin


> inject&Index metadatas and inherit these metadatas to all matching suburls
> --
>
> Key: NUTCH-747
> URL: https://issues.apache.org/jira/browse/NUTCH-747
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, injector
>Reporter: Marko Bauhardt
> Attachments: index-metadata.patch, metadata.patch
>
>
> Hi.
> The following two patches support
> + injecting metadata for urls into a metadatadb
> url.com   :  : 
>  ...
> ...
> + updates the parse_data metadata of a shard and writes the metadata to all 
> fetched urls that start with a url from the metadatadb
> + this patch supports inheritance of metadata by all matching suburls
> The second patch implements an index-metadata plugin.
> + this plugin extracts all metadata from the parse_data of a shard and indexes 
> it. Which metadata is indexed can be configured in plugin.properties.
> + to index, for example, the lang, configure plugin.properties with: 
> lang=STORE,UNTOKENIZED
> + this means that the index plugin extracts metadata values with the key "lang"; 
> if any exist, all values are indexed as stored and untokenized
> Example
> create start url's in "/tmp/urls/start/urls.txt"
> http://lucene.apache.org/nutch/apidocs-1.0/index.html
> http://lucene.apache.org/nutch/apidocs-0.9/index.html
> create metadata url's in "/tmp/urls/metadata/urls.txt"
> http://lucene.apache.org/nutch/apidocs-1.0/ version:1.0
> http://lucene.apache.org/nutch/apidocs-0.9/ version:0.9
> Inject Urls
> bin/nutch inject crawldb /tmp/urls/start/
> bin/nutch org.apache.nutch.crawl.metadata.MetadataInjector metadatadb 
> /tmp/urls/metadata/
> Fetch & Parse & Update
> bin/nutch generate crawldb segments
> bin/nutch fetch segments/20090806105717/
> bin/nutch org.apache.nutch.crawl.metadata.ParseDataUpdater metadatadb 
> segments/20090806105717
> bin/nutch updatedb crawldb/ segments/20090806105717/
> Fetch & Parse & Update Again
> ...
> Index
> bin/nutch invertlinks linkdb -dir segments/
> bin/nutch index index crawldb/ linkdb/ segments/20090806105717 
> segments/20090806110127
> Check your Index
> All urls starting with "http://lucene.apache.org/nutch/apidocs-1.0/ " are 
> indexed with "version:1.0".
> All urls starting with "http://lucene.apache.org/nutch/apidocs-0.9/ " are 
> indexed with "version:0.9".
> This issue is somewhat related to NUTCH-655
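
The prefix-based inheritance in the example above can be sketched as follows. The function name is hypothetical, and the rule that a longer (more specific) prefix overrides a shorter one is an assumption of this sketch; the issue text does not say how overlapping prefixes are resolved.

```python
def inherited_metadata(url, metadata_by_prefix):
    """Merge the metadata of every prefix that the URL starts with."""
    merged = {}
    # Visit shorter prefixes first so that more specific ones override them.
    for prefix in sorted(metadata_by_prefix, key=len):
        if url.startswith(prefix):
            merged.update(metadata_by_prefix[prefix])
    return merged


metadatadb = {
    "http://lucene.apache.org/nutch/apidocs-1.0/": {"version": "1.0"},
    "http://lucene.apache.org/nutch/apidocs-0.9/": {"version": "0.9"},
}
```

Looking up `http://lucene.apache.org/nutch/apidocs-1.0/index.html` in this table yields the `version` metadata of the 1.0 prefix, while a URL outside both prefixes inherits nothing.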

[jira] [Commented] (NUTCH-1477) NPE when injecting with DataFileAvroStore

2012-12-07 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13526256#comment-13526256
 ] 

Julien Nioche commented on NUTCH-1477:
--

Hi Alfonso. 
That's right. I must have missed it when writing the modified schema.

> NPE when injecting with DataFileAvroStore
> -
>
> Key: NUTCH-1477
> URL: https://issues.apache.org/jira/browse/NUTCH-1477
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Affects Versions: 2.1
> Environment: Java 1.6.0_35
>Reporter: Mike Baranczak
>Assignee: Julien Nioche
>Priority: Critical
> Fix For: 2.2
>
> Attachments: webpage.avsc
>
>
> Fresh installation of Nutch 2.1, configured to use DataFileAvroStore. 
> Injection job throws NullPointerException, see below. No error when I switch 
> to MemStore.
> java.lang.NullPointerException
>   at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:133)
>   at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:176)
>   at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:171)
>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:72)
>   at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:89)
>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:55)
>   at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:245)
>   at org.apache.gora.avro.store.DataFileAvroStore.put(DataFileAvroStore.java:54)
>   at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60)
>   at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:639)
>   at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:185)
>   at org.apache.nutch.crawl.InjectorJob$UrlMapper.map(InjectorJob.java:85)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
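
The trace ends in BinaryEncoder.writeString, which suggests that a null value reached a field whose schema type is a plain string rather than a union with null. A small check along these lines can help find such fields before writing. It is a sketch only: it works on an Avro schema loaded as a plain dictionary, the helper name is hypothetical, and the example schema is made up rather than taken from the attached webpage.avsc.

```python
def null_violations(schema, record):
    """Return the names of fields that are None although the schema
    type does not allow null (i.e. is not a union containing "null")."""
    violations = []
    for field in schema["fields"]:
        field_type = field["type"]
        nullable = field_type == "null" or (
            isinstance(field_type, list) and "null" in field_type
        )
        if record.get(field["name"]) is None and not nullable:
            violations.append(field["name"])
    return violations


# Made-up schema for illustration; the field names are hypothetical.
example_schema = {
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "title", "type": ["null", "string"]},
    ],
}
```

For a record in which both fields are unset, only `url` is reported, since `title` is declared as a union that includes null.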

[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Attachment: NUTCH-840-trunk.patch

Modified version of the patch to fix the tests post NUTCH-797

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 2.2
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
> NUTCH-840v2.patch
>
>
> We don't have test for HTML in parse-tika so I'll copy them from the old 
> parse-html plugin

[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527362#comment-13527362
 ] 

Julien Nioche commented on NUTCH-840:
-

The tests now run OK with the patch I just attached.

bq. There is a problem here where the new tests (for parse-tika) also seem to 
be executed against (within?) other plugin testing scenarios

can you give more detail on this please Lewis?

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.2
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
> NUTCH-840v2.patch
>
>
> We don't have test for HTML in parse-tika so I'll copy them from the old 
> parse-html plugin

[jira] [Updated] (NUTCH-840) Port tests from parse-html to parse-tika

2012-12-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-840:


Affects Version/s: 1.6
Fix Version/s: 1.7

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1, 1.6
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
> NUTCH-840v2.patch
>
>
> We don't have test for HTML in parse-tika so I'll copy them from the old 
> parse-html plugin

[jira] [Updated] (NUTCH-891) Nutch build should not depend on unversioned local deps

2012-12-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-891:


Affects Version/s: 2.1

Probably not an issue anymore. marking it as 2.x to triage unversioned issues, 
will check later

> Nutch build should not depend on unversioned local deps
> ---
>
> Key: NUTCH-891
> URL: https://issues.apache.org/jira/browse/NUTCH-891
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Andrzej Bialecki 
> Attachments: gora-49_v1.patch, gora.build.patch
>
>
> The fix in NUTCH-873 introduces an unknown variable to the build process. 
> Since local ivy artifacts are unversioned, different people that install Gora 
> jars at different points in time will use the same artifact id but in fact 
> the artifacts (jars) will differ because they will come from different 
> revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
> won't be repeatable across different environments.
> As much as it pains the ivy purists ;) until Gora publishes versioned 
> artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars 
> built from a known external rev. We can add a README that contains commit id 
> from Gora.

[jira] [Closed] (NUTCH-807) JSParseFilter produces malformed URL

2012-12-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-807.
---

Resolution: Won't Fix

Closing old issues. The JSParseFilter is known to generate noisy URLS and is 
not used by default anymore. This won't get fixed

> JSParseFilter produces malformed URL
> 
>
> Key: NUTCH-807
> URL: https://issues.apache.org/jira/browse/NUTCH-807
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.0.0
> Environment: Redhat 2.6.18-128.1.6.el5PAE  i686 i686 i386 GNU/Linux
>Reporter: Minyao Zhu
>
> This was found when crawling the site http://zhidao.baidu.com/ (a Chinese 
> language site).
> It appears this page contains JavaScript which confused JSParseFilter, which 
> produced a URL like this:
> http://zhidao.baidu.com/){if(A===46){baidu.hide(
> Not sure about the impact/scope of this issue in general. The observation for 
> this specific site is that far fewer pages got crawled.
> Thanks.

[jira] [Resolved] (NUTCH-62) Add html META tag information into metaData in index-more plugin

2012-12-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-62?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-62.


Resolution: Implemented

This can be done in a more flexible way using index-metadata
https://issues.apache.org/jira/browse/NUTCH-1264

> Add html META tag information into metaData in index-more plugin
> 
>
> Key: NUTCH-62
> URL: https://issues.apache.org/jira/browse/NUTCH-62
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Jack Tang
>Priority: Trivial
> Attachments: index-more.patch.zip
>
>
> Now (version dev-0.7), only some metaData in the http response, such as type, 
> date and content-length, is available in the index-more plugin, and we cannot 
> index/store the meta data in the html header ( exactly)

[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata

2012-12-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1267:
-

Assignee: Julien Nioche

> urlmeta to delegate indexing to index-metadata
> --
>
> Key: NUTCH-1267
> URL: https://issues.apache.org/jira/browse/NUTCH-1267
> Project: Nutch
>  Issue Type: Sub-task
>  Components: indexer
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
>


[jira] [Updated] (NUTCH-1267) urlmeta to delegate indexing to index-metadata

2012-12-08 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1267:
-

  Description: Ideally we should get rid of urlmeta altogether and add 
the transmission of the meta to the outlinks in the core classes - not as a 
plugin. URLMeta is also a terrible name :-(
Affects Version/s: 1.6

> urlmeta to delegate indexing to index-metadata
> --
>
> Key: NUTCH-1267
> URL: https://issues.apache.org/jira/browse/NUTCH-1267
> Project: Nutch
>  Issue Type: Sub-task
>  Components: indexer
>Affects Versions: 1.6
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
>
> Ideally we should get rid of urlmeta altogether and add the transmission of 
> the meta to the outlinks in the core classes - not as a plugin. URLMeta is 
> also a terrible name :-(

[jira] [Closed] (NUTCH-412) plugin to parse the feed-url (rss/atom) of a blog

2012-12-10 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche closed NUTCH-412.
---

Resolution: Implemented

6 years later ;-) 
the feed and parse-tika plugins can handle feeds  

> plugin to parse the feed-url (rss/atom) of a blog
> -
>
> Key: NUTCH-412
> URL: https://issues.apache.org/jira/browse/NUTCH-412
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Renaud Richardet
>Priority: Minor
> Attachments: FeedUrlFilter.java, plugin_parse-feedUrl2.diff, 
> plugin_parse-feedUrl.diff
>
>
> A plugin that extracts the feed-url (rss/atom) of a blog by retrieving the 
> href from the  element (if found), and stores it in metadata. 
> The meta can be accessed with 
> parse.getData().getMeta("feedUrl");
> you can test this plugin with the main method of HtmlParser.
> Thanks for a feedback.

[jira] [Resolved] (NUTCH-648) debian style autocomplete

2012-12-10 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-648.
-

Resolution: Won't Fix

see comments above

> debian style autocomplete
> -
>
> Key: NUTCH-648
> URL: https://issues.apache.org/jira/browse/NUTCH-648
> Project: Nutch
>  Issue Type: Improvement
> Environment: debian, and other linux
>Reporter: Jim
>Priority: Minor
>
> Here is a suggested improvement: at the end of this file is a debian 
> style bash autocomplete script. Just place it into /etc/bash_completion.d/ with 
> filename nutch, and you can tab complete at the command prompt, i.e.
> bash> nutch [tab][tab]
>    crawl readdb convdb mergedb readlinkdb inject generate freegen fetch fetch2 parse
>    readseg mergesegs updatedb invertlinks mergelinkdb index merge dedup plugin server
> bash> nutch c[tab][tab]
>    crawl convdb
> etc.
> This also includes optional parameters, and filename completion where it 
> can be used. I really like having this when typing in long nutch commands, 
> and think it would be a great addition to the project.
> The file borrows heavily from the corresponding svn file that does the 
> same thing.
> File begins here:
> shopt -s extglob
> _nutch()
> {
>     local cur cmds cmdOpts optsParam opt
>     local i
>     COMPREPLY=()
>     cur=${COMP_WORDS[COMP_CWORD]}
>     # Possible expansions
>     cmds='crawl readdb convdb mergedb readlinkdb inject generate freegen fetch fetch2 parse readseg mergesegs updatedb invertlinks \
>           mergelinkdb index merge dedup plugin server'
>     if [[ $COMP_CWORD -eq 1 ]] ; then
>         COMPREPLY=( $( compgen -W "$cmds" -- $cur ) )
>         return 0
>     fi
>     # options that require a parameter
>     # This needs to be filled in better
>     optsParam="-topN|-depth"
>     # if not typing an option, or if the previous option required a
>     # parameter, then fallback on ordinary filename expansion
>     if [[ "$cur" != -* ]] || \
>        [[ ${COMP_WORDS[COMP_CWORD-1]} == @($optsParam) ]] ; then
>         return 0
>     fi
>     # possible options for the command
>     cmdOpts=
>     case ${COMP_WORDS[1]} in
>         crawl)
>             cmdOpts="-dir -threads -depth -topN"
>             ;;
>         readdb)
>             cmdOpts="-stats -dump -topN -url"
>             ;;
>         convdb)
>             cmdOpts="-withMetadata"
>             ;;
>         mergedb)
>             cmdOpts="-normalize -filter"
>             ;;
>         readlinkdb)
>             cmdOpts="-dump -url"
>             ;;
>         inject)
>             cmdOpts=""
>             ;;
>         generate)
>             cmdOpts="-force -topN -numFetchers -adddays -noFilter"
>             ;;
>         freegen)
>             cmdOpts="-filter -normalize"
>             ;;
>         fetch)
>             cmdOpts="-threads -noParsing"
>             ;;
>         fetch2)
>             cmdOpts="-threads -noParsing"
>             ;;
>         parse)
>             cmdOpts=""
>             ;;
>         readseg)
>             cmdOpts="-dump -list -get -nocontent -nofetch -nogenerate -noparse -noparsedata -noparsetext -dir"
>             ;;
>         mergesegs)
>             cmdOpts="-dir -filter -slice"
>             ;;
>         updatedb)
>             cmdOpts="-dir -force -normalize -filter -noAdditions"
>             ;;
>         invertlinks)
>             cmdOpts="-dir -force -noNormalize -noFilter"
>             ;;
>         mergelinkdb)
>             cmdOpts="-normalize -filter"
>             ;;
>         index)
>             cmdOpts=""
>             ;;
>         merge)
>             cmdOpts="-workingdir"
>             ;;
>         dedup)
>             cmdOpts=""
>             ;;
>         plugin)
>             cmdOpts=""
>             ;;
>         server)
>             cmdOpts=""
>             ;;
>         *)
>             ;;
>     esac
>     # take out options already given
>     for (( i=2; i<=$COMP_CWORD-1; ++i )) ; do
>         opt=${COMP_WORDS[$i]}
>         cmdOpts=" $cmdOpts "
>         cmdOpts=${cmdOpts/ ${opt} / }
>         # skip next option if this one requires a parameter
>         if [[ $opt == @($optsParam) ]] ; then
>             ((++i))
>         fi
>     done
>     COMPREPLY=( $( compgen -W "$cmdOpts" -- $cur ) )
>     return 0
> }
> complete -F _nutch -o default nutch

[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls

2012-12-10 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1314:
-

Fix Version/s: 2.2
   1.7

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused tasks to stall. 
> The regex plugins (normalizing/filtering) processed single urls for hours, if 
> not hanging indefinitely.
> My suggestion is to limit the outlink target url length as soon as possible. It 
> is a configurable limit; the default is 3000. This should be long 
> enough for most uses, but strict enough to make sure the regex 
> plugins do not choke on urls that are too long. Please see the attached patch for 
> the Nutchgora implementation.
> I'd like to hear what you think about this.
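
The proposed limit amounts to a length check applied to outlink targets before the regex normalizers and filters see them. A minimal sketch, not the attached patch: the function name is hypothetical, and treating a negative limit as "no limit" is an assumption of this sketch. Only the default of 3000 comes from the description.

```python
DEFAULT_MAX_OUTLINK_LENGTH = 3000  # default named in the issue description


def filter_outlinks(outlinks, max_length=DEFAULT_MAX_OUTLINK_LENGTH):
    """Drop outlink target URLs longer than `max_length` characters.

    A negative `max_length` disables the check (assumption of this sketch).
    """
    if max_length < 0:
        return list(outlinks)
    return [url for url in outlinks if len(url) <= max_length]
```

Because the check is a plain length comparison, it costs almost nothing compared with running a regex over a very long URL.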

[jira] [Resolved] (NUTCH-1347) fetcher politeness related to map-reduce

2012-12-19 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1347.
--

Resolution: Not A Problem

> fetcher politeness related to map-reduce
> 
>
> Key: NUTCH-1347
> URL: https://issues.apache.org/jira/browse/NUTCH-1347
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.4
>Reporter: behnam nikbakht
>  Labels: fetch
>
> When Nutch is running on Hadoop, based on the map-reduce concept, each map task 
> works on its own data, so each fetcher map task works with its own 
> queues and does not know anything about the other queues. It therefore enforces 
> the delay between successive requests and the maximum concurrent requests 
> policies only on its own queues. With a simple test we found that this is not a 
> good politeness mechanism when we have multiple map tasks.
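
The resolution above carries no explanation. One possible reading, offered here as an assumption rather than as the project's stated reasoning, is that per-task queues are sufficient when all URLs of a host are routed to the same map task, since no two tasks then contact the same host. A minimal sketch of such host-based partitioning follows; the function name and example URLs are hypothetical, and this is not Nutch's own partitioner.

```python
import zlib
from urllib.parse import urlsplit


def partition_by_host(urls, num_tasks):
    """Assign each URL to a task based on a stable hash of its host.

    All URLs of one host land in the same partition, so a per-task
    politeness queue sees every request made to that host.
    """
    partitions = [[] for _ in range(num_tasks)]
    for url in urls:
        host = urlsplit(url).hostname or ""
        index = zlib.crc32(host.encode("utf-8")) % num_tasks
        partitions[index].append(url)
    return partitions
```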

[jira] [Commented] (NUTCH-1331) limit crawler to defined depth

2012-12-19 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535970#comment-13535970
 ] 

Julien Nioche commented on NUTCH-1331:
--

Any objections or shall I commit this new plugin?

> limit crawler to defined depth
> --
>
> Key: NUTCH-1331
> URL: https://issues.apache.org/jira/browse/NUTCH-1331
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator, parser, storage
>Affects Versions: 1.4
>Reporter: behnam nikbakht
> Attachments: NUTCH-1331.patch, NUTCH-1331-v2.patch
>
>
> There is a need to limit the crawler to a defined depth. This avoids crawling 
> infinite loops of dynamically generated URLs, which occur on some sites, and 
> lets the crawler concentrate on important URLs.
> One option is to define an iteration limit on the generate, fetch, parse, 
> updatedb cycle, but that works only if every unfetched URL is fetched in each 
> cycle (without recrawling, and with some other considerations).
> Alternatively, we can define a new parameter in CrawlDatum, named depth, and, 
> as in the score-opic algorithm, compute the depth of a link after parsing; 
> the generate step then selects only URLs with a valid depth.


[jira] [Updated] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2012-12-21 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1508:
-

Summary: Port limit crawler to defined depth to 2.x  (was: Port limit 
crawler to defined depth to 23)

> Port limit crawler to defined depth to 2.x
> --
>
> Key: NUTCH-1508
> URL: https://issues.apache.org/jira/browse/NUTCH-1508
> Project: Nutch
>  Issue Type: Improvement
>    Reporter: Julien Nioche
>



[jira] [Created] (NUTCH-1508) Port limit crawler to defined depth to 23

2012-12-21 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-1508:


 Summary: Port limit crawler to defined depth to 23
 Key: NUTCH-1508
 URL: https://issues.apache.org/jira/browse/NUTCH-1508
 Project: Nutch
  Issue Type: Improvement
Reporter: Julien Nioche





[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2012-12-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537804#comment-13537804
 ] 

Julien Nioche commented on NUTCH-1508:
--

Need to port the scoring-depth plugin to Nutch 2.x

> Port limit crawler to defined depth to 2.x
> --
>
> Key: NUTCH-1508
> URL: https://issues.apache.org/jira/browse/NUTCH-1508
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: Julien Nioche
>



[jira] [Updated] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2012-12-21 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1508:
-

Affects Version/s: 2.2

> Port limit crawler to defined depth to 2.x
> --
>
> Key: NUTCH-1508
> URL: https://issues.apache.org/jira/browse/NUTCH-1508
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>    Reporter: Julien Nioche
>



[jira] [Resolved] (NUTCH-1331) limit crawler to defined depth

2012-12-21 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-1331.
--

   Resolution: Fixed
Fix Version/s: 1.7

Thanks Markus

Committed in revision 1424875 for trunk and opened a separate issue for porting 
to 2.x

and documented in nutch-default.xml

{quote}

<property>
  <name>scoring.depth.max</name>
  <value>1000</value>
  <description>Max depth value from seed allowed by default.
  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  to track the distance from the seed it was found from.
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>

{quote}

> limit crawler to defined depth
> --
>
> Key: NUTCH-1331
> URL: https://issues.apache.org/jira/browse/NUTCH-1331
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator, parser, storage
>Affects Versions: 1.4
>Reporter: behnam nikbakht
> Fix For: 1.7
>
> Attachments: NUTCH-1331.patch, NUTCH-1331-v2.patch
>
>
> There is a need to limit the crawler to a defined depth. This avoids crawling 
> infinite loops of dynamically generated URLs, which occur on some sites, and 
> lets the crawler concentrate on important URLs.
> One option is to define an iteration limit on the generate, fetch, parse, 
> updatedb cycle, but that works only if every unfetched URL is fetched in each 
> cycle (without recrawling, and with some other considerations).
> Alternatively, we can define a new parameter in CrawlDatum, named depth, and, 
> as in the score-opic algorithm, compute the depth of a link after parsing; 
> the generate step then selects only URLs with a valid depth.
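
The depth bookkeeping described in the scoring.depth.max documentation can be sketched roughly as follows. This is a minimal illustration in Python, not the plugin's actual code: the function names are hypothetical, and the choice of depth 1 for seeds is an assumption. Only the "_depth_" and "_maxdepth_" keys and the default of 1000 come from the message above.

```python
DEFAULT_MAX_DEPTH = 1000  # mirrors the scoring.depth.max default quoted above


def seed_metadata(max_depth=None):
    """Metadata for a seed URL (assumed depth 1), with an optional per-seed limit."""
    meta = {"_depth_": 1}
    if max_depth is not None:
        meta["_maxdepth_"] = max_depth
    return meta


def outlink_metadata(parent_meta, default_max=DEFAULT_MAX_DEPTH):
    """Return metadata for an outlink, or None if it lies beyond the limit."""
    max_depth = parent_meta.get("_maxdepth_", default_max)
    depth = parent_meta["_depth_"] + 1
    if depth > max_depth:
        return None  # too far from the seed: this link is not followed
    child = {"_depth_": depth}
    # The per-seed limit travels with the page so that deeper links honour it.
    if "_maxdepth_" in parent_meta:
        child["_maxdepth_"] = parent_meta["_maxdepth_"]
    return child


def order_for_generation(candidates):
    """Shallower pages first, as the description says the generator prefers."""
    return sorted(candidates, key=lambda meta: meta["_depth_"])
```

With a seed created by seed_metadata(max_depth=2), the first level of outlinks is kept at depth 2 and the next level is dropped.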


[jira] [Comment Edited] (NUTCH-1331) limit crawler to defined depth

2012-12-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537811#comment-13537811
 ] 

Julien Nioche edited comment on NUTCH-1331 at 12/21/12 11:37 AM:
-

Thanks Markus

Committed in revision 1424875 for trunk and opened a separate issue for porting 
to 2.x

and documented in nutch-default.xml

{noformat}

<property>
  <name>scoring.depth.max</name>
  <value>1000</value>
  <description>Max depth value from seed allowed by default.
  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  to track the distance from the seed it was found from.
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>

{noformat}

  was (Author: jnioche):
Thanks Markus

Committed in revision 1424875 for trunk and opened a separate issue for porting 
to 2.x

and documented in nutch-default.xml

{quote}

<property>
  <name>scoring.depth.max</name>
  <value>1000</value>
  <description>Max depth value from seed allowed by default.
  Can be overriden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  to track the distance from the seed it was found from.
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>

{quote}
  
> limit crawler to defined depth
> --
>
> Key: NUTCH-1331
> URL: https://issues.apache.org/jira/browse/NUTCH-1331
> Project: Nutch
>  Issue Type: New Feature
>  Components: generator, parser, storage
>Affects Versions: 1.4
>Reporter: behnam nikbakht
> Fix For: 1.7
>
> Attachments: NUTCH-1331.patch, NUTCH-1331-v2.patch
>
>
> There is a need to limit the crawler to a defined depth. This avoids crawling 
> infinite loops of dynamically generated URLs, which occur on some sites, and 
> lets the crawler concentrate on important URLs.
> One option is to define an iteration limit on the generate, fetch, parse, 
> updatedb cycle, but that works only if every unfetched URL is fetched in each 
> cycle (without recrawling, and with some other considerations).
> Alternatively, we can define a new parameter in CrawlDatum, named depth, and, 
> as in the score-opic algorithm, compute the depth of a link after parsing; 
> the generate step then selects only URLs with a valid depth.


[jira] [Commented] (NUTCH-1510) Upgrade to Hadoop 1.1.1

2012-12-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538134#comment-13538134
 ] 

Julien Nioche commented on NUTCH-1510:
--

Can you test for 2.x as well? It should work straight out of the box.

> Upgrade to Hadoop 1.1.1
> ---
>
> Key: NUTCH-1510
> URL: https://issues.apache.org/jira/browse/NUTCH-1510
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.6
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.7
>
> Attachments: NUTCH-1510-1.7-1.patch
>
>



[jira] [Commented] (NUTCH-1507) Remove FetcherOutput

2012-12-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538137#comment-13538137
 ] 

Julien Nioche commented on NUTCH-1507:
--

Wouldn't that break the compatibility when trying to read from an existing 
crawlDB?

> Remove FetcherOutput
> 
>
> Key: NUTCH-1507
> URL: https://issues.apache.org/jira/browse/NUTCH-1507
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.6
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1507-1.7-1.patch
>
>
> The FetcherOutput class is not used anywhere and it and its references should 
> be removed.


[jira] [Commented] (NUTCH-1507) Remove FetcherOutput

2012-12-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538151#comment-13538151
 ] 

Julien Nioche commented on NUTCH-1507:
--

bq. This code is used nowhere and only had two references from MapWritable and 
NutchWritable which means nothing by itself.

What I was wondering was whether the change to these 2 classes would change 
their signature and break things, as they are used in the crawldb (correct me if 
I am wrong).



> Remove FetcherOutput
> 
>
> Key: NUTCH-1507
> URL: https://issues.apache.org/jira/browse/NUTCH-1507
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.6
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1507-1.7-1.patch
>
>
> The FetcherOutput class is not used anywhere and it and its references should 
> be removed.


[jira] [Commented] (NUTCH-1507) Remove FetcherOutput

2012-12-21 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538158#comment-13538158
 ] 

Julien Nioche commented on NUTCH-1507:
--

Ok. Not entirely clear to me how this stuff works but it should be fairly easy 
to test anyway. Thanks! 

> Remove FetcherOutput
> 
>
> Key: NUTCH-1507
> URL: https://issues.apache.org/jira/browse/NUTCH-1507
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.6
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1507-1.7-1.patch
>
>
> The FetcherOutput class is not used anywhere and it and its references should 
> be removed.


[jira] [Commented] (NUTCH-1508) Port limit crawler to defined depth to 2.x

2013-01-07 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545757#comment-13545757
 ] 

Julien Nioche commented on NUTCH-1508:
--

Hi Ferdy

I did not see NUTCH-1431 at all :-( 

NUTCH-1331 does the same but in a less intrusive way in terms of code changes, 
and it allows specifying a max distance per seed as well as a global one. Does 
NUTCH-1431 do that as well?

Not sure what the best course of action is. I'd rather we kept the same 
approach in both branches. WDYT?

> Port limit crawler to defined depth to 2.x
> --
>
> Key: NUTCH-1508
> URL: https://issues.apache.org/jira/browse/NUTCH-1508
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.2
>Reporter: Julien Nioche
>



[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-07 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13545958#comment-13545958
 ] 

Julien Nioche commented on NUTCH-1031:
--

Well, we have 2 separate params: http.agent.name, which is a single value sent 
to the servers when fetching, and http.robots.agents, which can have multiple 
values and is used for parsing robots.txt. The value of the latter SHOULD be 
split on commas.

I don't think CC supports multiple values for http.robots.agents, but I'll ask 
Ken to be sure.
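
To make the comma splitting concrete, here is a small sketch in Python. The function name is hypothetical and this is neither Nutch nor crawler-commons code; it only shows how a multi-valued http.robots.agents string could be turned into a list of agent names, trimming whitespace and ignoring empty entries.

```python
def parse_robot_agents(value):
    """Split a comma-separated http.robots.agents value into clean names."""
    return [name.strip() for name in value.split(",") if name.strip()]
```

For example, the value "mybot, mybot-news ,*" would yield the three names mybot, mybot-news and *, in that order.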

> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>Priority: Minor
>  Labels: robots.txt
> Fix For: 1.7
>
> Attachments: NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly


[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika

2013-01-09 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547783#comment-13547783
 ] 

Julien Nioche commented on NUTCH-840:
-

Thanks Lewis. Will commit shortly unless someone has any objections

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1, 1.6
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
> NUTCH-840v2.patch
>
>
> We don't have tests for HTML in parse-tika so I'll copy them from the old 
> parse-html plugin


[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2013-01-10 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1047:
-

Attachment: NUTCH-1047-1.x-v1.patch

This is work in progress.
This patch creates a new endpoint (IndexWriter) that plugins can implement. 
Comes with one such plugin (indexer-solr) and generic code for replacing the 
index and delete jobs. Haven't tested very much. The main difference is that 
the SOLR URL must be passed as a Hadoop param e.g. -D solr.server.url. It could 
also be put in the nutch-site.xml once and for all. 
There will be some cleaning to do once this is stable to remove the SOLR stuff 
in the core code etc...
Please have a look and let me know your thoughts on this
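
As a rough illustration of the idea (not the code in the patch), a pluggable backend endpoint with a generic job driving it might look like the Python sketch below. All class, method and function names here are hypothetical; only the notion of an IndexWriter endpoint and the solr.server.url parameter come from the comment above.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical backend contract; not the interface from the patch."""

    @abstractmethod
    def open(self, conf):
        """Prepare the backend using job configuration values."""

    @abstractmethod
    def write(self, doc):
        """Send one document (a dict with an 'id' key) to the backend."""

    @abstractmethod
    def delete(self, doc_id):
        """Remove a document from the backend by its id."""

    @abstractmethod
    def close(self):
        """Flush and release backend resources."""


class InMemoryWriter(IndexWriter):
    """Stand-in backend that keeps documents in a dict, for illustration."""

    def __init__(self):
        self.url = None
        self.docs = {}
        self.closed = False

    def open(self, conf):
        # A Solr-like backend would read its server URL from the configuration.
        self.url = conf.get("solr.server.url")

    def write(self, doc):
        self.docs[doc["id"]] = doc

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def close(self):
        self.closed = True


def run_index_job(writers, conf, docs, deleted_ids):
    """Generic job: the same loop drives every active backend."""
    for writer in writers:
        writer.open(conf)
        try:
            for doc in docs:
                writer.write(doc)
            for doc_id in deleted_ids:
                writer.delete(doc_id)
        finally:
            writer.close()
```

The point of the design is that the job code never mentions a specific backend: adding another one means providing another writer class, while the index and delete logic stays generic.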

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.


[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2013-01-11 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1047:
-

Attachment: NUTCH-1047-1.x-v2.patch

new version of the patch which removes all SOLR related stuff from the core.
The crawl class assumes that solr is used (but this can be changed) and does 
not do the SOLR dedup anymore. We'll need a better mechanism for the dedup as 
the existing one is SOLR centric and not very scalable.
Quite a drastic modification of the code, but should be for the best.
Please give it a try and let me know your thoughts.
PS: you might need to delete the index.solr package by hand

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>    Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.


[jira] [Created] (NUTCH-1517) CloudSearch indexer

2013-01-11 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-1517:


 Summary: CloudSearch indexer
 Key: NUTCH-1517
 URL: https://issues.apache.org/jira/browse/NUTCH-1517
 Project: Nutch
  Issue Type: New Feature
  Components: indexer
Reporter: Julien Nioche
 Fix For: 1.7


Once we have made the indexers pluggable, we should add a plugin for Amazon 
CloudSearch. See http://aws.amazon.com/cloudsearch/. Apparently it uses a 
JSON-based representation, the Search Data Format (SDF), which we could reuse 
for a file-based indexer.


[jira] [Commented] (NUTCH-1371) Replace Ivy with Maven Ant tasks

2013-01-14 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13552513#comment-13552513
 ] 

Julien Nioche commented on NUTCH-1371:
--

Hi Lewis. Yep the plugins need to be managed in the same way + cleanup the ivy 
stuff etc... 

> Replace Ivy with Maven Ant tasks
> 
>
> Key: NUTCH-1371
> URL: https://issues.apache.org/jira/browse/NUTCH-1371
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Julien Nioche
>Assignee: Lewis John McGibbney
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-1371.patch
>
>
> We might move to Maven altogether but a good intermediate step could be to 
> rely on the maven ant tasks for managing the dependencies. Ivy does a good 
> job but we need to have a pom file anyway for publishing the artefacts which 
> means keeping the pom.xml and ivy.xml contents in sync. Most devs are also 
> more familiar with Maven, and it is well integrated in IDEs. Going the 
> ANT+MVN way also means that we don't have to rewrite the whole building 
> process and can rely on our existing script


[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2013-01-14 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1047:
-

Attachment: NUTCH-1047-1.x-v3.patch

Cleaner version of the patch which removes the content from the solr package, 
adds the dependencies to the indexer-solr plugin in the plugin.xml definition 
and changes the nutch script so that the SOLR related commands work in the same 
way but using the plugin under the bonnet. A few more things to do e.g. 
management of the commits when indexing but we are getting there

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.


[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2013-01-15 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554369#comment-13554369
 ] 

Julien Nioche commented on NUTCH-1087:
--

Hi Sebastian

bq. SEGMENT=`ls $CRAWL_PATH/segments/ | sort -n | tail -n 1`

is not a good option as it won't work in deploy mode, only in local mode, 
whereas using 'hadoop fs -ls' works in both cases.

Julien
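
For illustration only, picking the most recent segment from a file-system listing could be sketched in Python as below. The function name is hypothetical, and the listing layout (a "Found N items" header followed by lines whose last column is the path) is an assumption about what 'hadoop fs -ls' prints; the sketch also assumes segment names are fixed-width timestamps, so the largest name is the newest.

```python
import posixpath


def latest_segment(listing):
    """Return the newest segment path from `hadoop fs -ls`-style output."""
    paths = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the "Found N items" header; keep the path column.
        if not fields or fields[0] == "Found":
            continue
        paths.append(fields[-1])
    if not paths:
        return None
    # Compare by the final path component, i.e. the segment name itself.
    return max(paths, key=posixpath.basename)
```

Because the function works on the listing text rather than on the local disk, the same logic applies whether the listing comes from a local directory or from HDFS.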

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html


[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2013-01-16 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554862#comment-13554862
 ] 

Julien Nioche commented on NUTCH-1087:
--

Apologies Seb, I should (a) not read emails late in the evening after a long 
day (b) check the code before commenting ;-) 

> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>    Assignee: Julien Nioche
>Priority: Minor
> Fix For: 1.6, 2.2
>
> Attachments: NUTCH-1087-1.6-2.patch, NUTCH-1087-1.6-3.patch, 
> NUTCH-1087-2.1-2.patch, NUTCH-1087-2.1.patch, NUTCH-1087.patch
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556026#comment-13556026
 ] 

Julien Nioche commented on NUTCH-1047:
--

Good point Markus, thanks.
The main issue I am struggling with at the moment is what to do with the SOLR 
deduplication. I don't think we can run a MapReduce job from a plugin so it's 
not going to work. One (temporary) option would be to leave it as is so that 
the crawl command works as expected as well as the crawl script and the nutch 
command and we then get rid of it when we have a generic deduplication job. 

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556041#comment-13556041
 ] 

Julien Nioche commented on NUTCH-1047:
--

We definitely need a better mechanism for deduplication. +1 to leave as is for 
now until we have a better option. Slightly annoying for this issue is that it 
means adding it back to the main classes as well as SOLR as dependency, not a 
big deal though.

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch
>
>
> One possible feature would be to add a new endpoint for indexing-backends and 
> make the indexing pluggable. At the moment we are hardwired to SOLR - which is 
> OK - but as other resources like ElasticSearch are becoming more popular it 
> would be better to handle this as plugins. Not sure about the name of the 
> endpoint though : we already have indexing-plugins (which are about 
> generating fields sent to the backends) and moreover the backends are not 
> necessarily for indexing / searching but could be just an external storage 
> e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this 
> could be pertaining to the storage in GORA. 'indexing-backend' is the best 
> name that came to my mind so far - please suggest better ones.
> We should come up with generic map/reduce jobs for indexing, deduplicating 
> and cleaning and maybe add a Nutch extension point there so we can easily 
> hook up indexing, cleaning and deduplicating for various backends.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556054#comment-13556054
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tried, failed. 
Re other issues: wouldn't it make sense to do NUTCH-1047 first, before you 
improve the SOLR backends?

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch
>
>

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556079#comment-13556079
 ] 

Julien Nioche commented on NUTCH-1047:
--

Should not be a big deal as the classes affected by NUTCH-1480 are not modified 
that much by NUTCH-1047 and it also means that you'll get to look at the code 
for this issue which is a good way of reviewing it :-)

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch
>
>

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556090#comment-13556090
 ] 

Julien Nioche commented on NUTCH-1480:
--

I'd rather it was implemented as an extension of NUTCH-945, where we'd have a 
partitioner that sends to all SOLR instances, which is, I believe, what 
NUTCH-1480 is about. There are many cases where we'd want to shard according to 
other criteria, and NUTCH-945 would provide a more generic framework. Does this 
make sense?
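
The difference between sending every document to all instances and sharding by some criterion can be sketched as two routing strategies behind one interface. This is only an illustration under stated assumptions: DocRouter, BroadcastRouter and HashShardRouter are hypothetical names, not classes from NUTCH-945 or NUTCH-1480, and hashing the document id is just one possible sharding criterion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical routing contract; not part of Nutch. */
interface DocRouter {
    /** Returns the indices of the servers that should receive the document. */
    List<Integer> route(String docId, int numServers);
}

/** Sends every document to every server (the multi-server case). */
class BroadcastRouter implements DocRouter {
    public List<Integer> route(String docId, int numServers) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < numServers; i++) {
            targets.add(i);
        }
        return targets;
    }
}

/** Sends each document to exactly one server, chosen by hashing its id. */
class HashShardRouter implements DocRouter {
    public List<Integer> route(String docId, int numServers) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Collections.singletonList(Math.floorMod(docId.hashCode(), numServers));
    }
}

class RoutingDemo {
    public static void main(String[] args) {
        String id = "http://example.org/"; // example id, chosen for illustration
        System.out.println("broadcast: " + new BroadcastRouter().route(id, 3));
        System.out.println("sharded:   " + new HashShardRouter().route(id, 3));
    }
}
```

With such an interface, broadcasting becomes one strategy among several rather than a special case in the writer.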

> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>
> SolrUtils should return an array of SolrServers and read the SolrUrl as a 
> comma delimited list of URLs using Configuration.getStrings(). SolrWriter 
> should be able to handle this list of SolrServers.
> This is useful if you want to send documents to multiple servers if no 
> replication is available or if you want to send documents to multiple NOCs.


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556091#comment-13556091
 ] 

Julien Nioche commented on NUTCH-1047:
--

My suggestion was that you give NUTCH-1047 a try, wait until it is committed, 
then commit your changes to it, not that I'd patch it to include your changes.

BTW I have commented on NUTCH-1480.

Thanks

Julien



 

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch
>
>

[jira] [Commented] (NUTCH-1480) SolrIndexer to write to multiple servers.

2013-01-17 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556100#comment-13556100
 ] 

Julien Nioche commented on NUTCH-1480:
--

Probably depends on whether we want to support both SOLR 3.x and SOLR 4.x. Got 
your point about indexing to multiple clouds, thanks! 


> SolrIndexer to write to multiple servers.
> -
>
> Key: NUTCH-1480
> URL: https://issues.apache.org/jira/browse/NUTCH-1480
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.7
>
> Attachments: NUTCH-1480-1.6.1.patch
>
>

[jira] [Commented] (NUTCH-840) Port tests from parse-html to parse-tika

2013-01-18 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557163#comment-13557163
 ] 

Julien Nioche commented on NUTCH-840:
-

Trunk => Committed revision 1435101.

Anyone willing to port this to 2.x?

> Port tests from parse-html to parse-tika
> 
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.1, 1.6
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.7, 2.2
>
> Attachments: NUTCH-840.patch, NUTCH-840.patch, NUTCH-840-trunk.patch, 
> NUTCH-840v2.patch
>
>
> We don't have tests for HTML in parse-tika, so I'll copy them from the old 
> parse-html plugin.


[jira] [Updated] (NUTCH-1047) Pluggable indexing backends

2013-01-18 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1047:
-

Attachment: NUTCH-1047-1.x-v4.patch

First working patch!
Added the SOLRDedup back into the core classes as it does not seem to be 
possible to run a MapReduce class from within a plugin.
Added two new methods to the IndexWriter interface (commit, update) and fixed 
CleaningJob and the nutch script.
Tried it on a small crawl with the crawl script and it worked as expected.
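
To make the shape of a pluggable backend with write, update and commit operations concrete, here is a minimal sketch. It is an assumption-laden illustration, not the interface from the patch: IndexBackend and InMemoryBackend are hypothetical names, the method signatures are invented for the example, and the real IndexWriter in Nutch may look quite different.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical backend contract; the actual Nutch IndexWriter may differ. */
interface IndexBackend {
    void write(String id, Map<String, String> fields);
    void update(String id, Map<String, String> fields);
    void commit();
}

/** Toy backend that buffers changes and only exposes them after commit(). */
class InMemoryBackend implements IndexBackend {
    private final Map<String, Map<String, String>> pending = new LinkedHashMap<>();
    private final Map<String, Map<String, String>> committed = new LinkedHashMap<>();

    public void write(String id, Map<String, String> fields) {
        // A write replaces any buffered version of the document.
        pending.put(id, new LinkedHashMap<>(fields));
    }

    public void update(String id, Map<String, String> fields) {
        // An update merges the new fields over the latest known version.
        Map<String, String> base = pending.containsKey(id) ? pending.get(id) : committed.get(id);
        Map<String, String> merged = new LinkedHashMap<>();
        if (base != null) {
            merged.putAll(base);
        }
        merged.putAll(fields);
        pending.put(id, merged);
    }

    public void commit() {
        committed.putAll(pending);
        pending.clear();
    }

    int committedCount() {
        return committed.size();
    }

    Map<String, String> get(String id) {
        return committed.get(id);
    }
}

class BackendDemo {
    public static void main(String[] args) {
        InMemoryBackend backend = new InMemoryBackend();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Example");
        backend.write("doc1", doc);
        backend.commit();
        System.out.println(backend.get("doc1"));
    }
}
```

A job driving such a backend would only depend on the interface, which is what lets SOLR, ElasticSearch or a plain storage system be swapped in as plugins.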

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>    Reporter: Julien Nioche
>    Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>

[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons

2013-01-19 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558195#comment-13558195
 ] 

Julien Nioche commented on NUTCH-1031:
--

bq. 1. Continue to have the legacy code for parsing robots file. 
bq. 2. As an add-in, crawler-commons can be employed for the parsing. User can 
pick based on a config parameter with a note indicating that #2 won't work with 
multiple HTTP agents.

Option 2 is overkill IMHO. The existing code works fine, and the point of moving 
to CC was to get rid of some of our code, not to make it bigger with yet another 
configuration. 

Lewis: donating our code is a good idea, but in the case of the robots parsing 
it's more about modifying the existing one in CC. I haven't had time to look at 
robots parsing in CC and am not familiar with it, but it would be a good thing to 
improve it. In the meantime let's go for option 1. Thanks!


> Delegate parsing of robots.txt to crawler-commons
> -
>
> Key: NUTCH-1031
> URL: https://issues.apache.org/jira/browse/NUTCH-1031
> Project: Nutch
>  Issue Type: Task
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>Priority: Minor
>  Labels: robots.txt
> Fix For: 1.7
>
> Attachments: NUTCH-1031.v1.patch
>
>
> We're about to release the first version of Crawler-Commons 
> [http://code.google.com/p/crawler-commons/] which contains a parser for 
> robots.txt files. This parser should also be better than the one we currently 
> have in Nutch. I will delegate this functionality to CC as soon as it is 
> available publicly


[jira] [Created] (NUTCH-1522) Upgrade to Tika 1.3

2013-01-23 Thread Julien Nioche (JIRA)

Julien Nioche created NUTCH-1522:


 Summary: Upgrade to Tika 1.3
 Key: NUTCH-1522
 URL: https://issues.apache.org/jira/browse/NUTCH-1522
 Project: Nutch
  Issue Type: Task
  Components: parser
Reporter: Julien Nioche
Priority: Minor
 Fix For: 1.7, 2.2


http://www.apache.org/dist/tika/CHANGES-1.3.txt


[jira] [Updated] (NUTCH-1482) Rename HTMLParseFilter

2013-01-23 Thread Julien Nioche (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1482:
-

Fix Version/s: 1.7

> Rename HTMLParseFilter
> --
>
> Key: NUTCH-1482
> URL: https://issues.apache.org/jira/browse/NUTCH-1482
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Affects Versions: 1.5.1
>    Reporter: Julien Nioche
> Fix For: 1.7
>
>
> See NUTCH-861 for a background discussion. We have changed the name in 2.x to 
> better reflect what it does and I think we should do the same for 1.x.
> Any objections?


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-25 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562558#comment-13562558
 ] 

Julien Nioche commented on NUTCH-1047:
--

Hi Lufeng. 

The solrindex command in the nutch script works just as before. You can also 
invoke the IndexingJob command and pass it the SOLR URL as a Hadoop parameter 
e.g. {{-D solr.server.url=xx}}
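
For readers unfamiliar with the {{-D}} convention, the following sketch shows the general idea of splitting such options from the positional arguments. It is not Hadoop's own option parser: DashDOptions is a hypothetical class written for this illustration, and the URL and paths in main are example values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustration only; Hadoop ships its own generic option handling. */
class DashDOptions {
    final Map<String, String> properties = new HashMap<>();
    final List<String> remaining = new ArrayList<>();

    static DashDOptions parse(String[] args) {
        DashDOptions out = new DashDOptions();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length && args[i + 1].indexOf('=') > 0) {
                pair = args[++i];        // form: -D key=value
            } else if (arg.startsWith("-D") && arg.indexOf('=') > 2) {
                pair = arg.substring(2); // form: -Dkey=value
            }
            if (pair != null) {
                int eq = pair.indexOf('=');
                out.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            } else {
                out.remaining.add(arg);  // positional argument, left untouched
            }
        }
        return out;
    }

    public static void main(String[] args) {
        DashDOptions parsed = parse(new String[] {
            "-D", "solr.server.url=http://localhost:8983/solr/", "crawldb", "segments"
        });
        System.out.println(parsed.properties + " " + parsed.remaining);
    }
}
```

The point of the design is that the job itself only ever reads the property from its configuration, regardless of whether it came from the command line or from a config file.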

SolrUtils is duplicated indeed because of DeleteDuplicates, which is a 
SOLR-specific implementation. We need to build a generic deduplicator at some 
point and it will use the pluggable backends. I decided to leave the SOLR-based 
one in for now, but if most people don't use it then we should probably shelve 
it. This is a separate issue though.

Thanks for your comments

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>

[jira] [Commented] (NUTCH-1250) parse-html does not parse links with empty anchor

2013-01-25 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13562577#comment-13562577
 ] 

Julien Nioche commented on NUTCH-1250:
--

See comment in DOMContentUtils

{quote}
   * Links without inner structure (tags, text, etc) are discarded, as
   * are links which contain only single nested links and empty text
   * nodes (this is a common DOM-fixup artifact, at least with
   * nekohtml).
{quote}

The solution you suggested would probably generate quite a lot of noise by not 
filtering the links added by Neko. I agree that outlinks without anchors should 
not be filtered. What about testing that they have an href attribute instead of 
testing for the presence of a child node?
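
A rough sketch of the href-based test, using only the JDK's DOM classes. HrefCheck, keepLink and countKeptLinks are hypothetical names and this is not the actual DOMContentUtils code; the sample markup is parsed as XML here, whereas parse-html works on a DOM built by an HTML parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper; not taken from DOMContentUtils. */
class HrefCheck {
    /** Keeps a link when it has a non-empty href, whether or not it has child nodes. */
    static boolean keepLink(Element anchor) {
        return anchor.hasAttribute("href") && !anchor.getAttribute("href").trim().isEmpty();
    }

    /** Counts the <a> elements in a well-formed XHTML fragment that would be kept. */
    static int countKeptLinks(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            NodeList anchors = doc.getElementsByTagName("a");
            int kept = 0;
            for (int i = 0; i < anchors.getLength(); i++) {
                if (keepLink((Element) anchors.item(i))) {
                    kept++;
                }
            }
            return kept;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        // Example markup: an empty link with href, a bare <a>, and a link holding only a comment.
        String sample = "<body><a href=\"http://example.org/\"></a><a></a>"
            + "<a href=\"page.html\"><!-- comment --></a></body>";
        System.out.println("kept links: " + countKeptLinks(sample));
    }
}
```

With this check an empty link that carries an href is kept, while a bare element without href, such as one introduced by DOM fix-up, is still discarded.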



> parse-html does not parse links with empty anchor
> -
>
> Key: NUTCH-1250
> URL: https://issues.apache.org/jira/browse/NUTCH-1250
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
>Reporter: Andreas Janning
> Fix For: 1.7, 2.2
>
> Attachments: DOMContentUtils_v1.patch
>
>
> The parse-html plugin does not generate an outlink if the link has no anchor
> For example the following HTML-Code does not create an Outlink:
> {code:html} 
>   
> {code}
> The JUnit-Test TestDOMContentUtils tries to test this but fails to, since there 
> is a comment inside the <a>-Tag.
> {code:title=TestDOMContentUtils.java|borderStyle=solid}
> new String(" title "
> + ""
> + ""
> + "   "
> + "   "
> + ""), 
> {code}
> When you remove the comment the test fails.
> {code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
> new String(" title "
> + ""
> + "" // no anchor
> + "   "
> + "   "
> + ""), 
> {code}


[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564173#comment-13564173
 ] 

Julien Nioche commented on NUTCH-1047:
--

@tejasp I can reproduce the issue and am looking into it. Somehow the 
configuration does not get passed on properly when using the crawl command. 
Thanks.

Lufeng 
{quote}
But I don't know why not add an option to set IndexerUrl such as bin/nutch 
solrindex -indexurl http://localhost:8983/solr/.
{quote}

Whether it is passed as a parameter or via configuration should not make much 
of a difference. Your suggestion also assumes that the indexing backend can be 
reached via a single URL, which is not necessarily the case: it might not need 
a URL at all or, on the contrary, need multiple URLs. Better to leave that logic 
in the configuration and assume that the backends will find whatever they need 
there.

{quote}
 the correct command to invoke the IndexingJob command is "bin/nutch solrindex 
http://localhost:8983/solr/ crawldb/ segments/20130121115214/ -filter".
{quote}

As explained above, we want to keep compatibility with the existing solrindex 
command and not change its syntax. Underneath it uses the new code based on 
plugins but sets the value of the solr config. There is no shortcut for the 
generic indexing job command in the nutch script yet, but we could add one. For 
now it has to be called in full, e.g. bin/nutch 
org.apache.nutch.indexer.IndexingJob ..., which will make sense when we have 
other indexing backends and not just SOLR.

Think about 'nutch solrindex' as a shortcut for the generic command.
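
One way to picture the shortcut is as a small translation step: the first positional argument becomes a configuration value and the rest is handed to the generic job unchanged. The sketch below is hypothetical (SolrIndexShortcut is not a Nutch class and this is not what the nutch script literally does); only the property name solr.server.url and the example arguments come from the discussion.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical translation of the solrindex shortcut into config plus generic args. */
class SolrIndexShortcut {
    final Map<String, String> config = new LinkedHashMap<>();
    final List<String> genericArgs;

    SolrIndexShortcut(String[] shortcutArgs) {
        if (shortcutArgs.length < 2) {
            throw new IllegalArgumentException("expected: <solr url> <crawldb> [segments...]");
        }
        // The URL is moved out of the argument list and into the configuration.
        config.put("solr.server.url", shortcutArgs[0]);
        // Everything else is passed through to the generic indexing job as-is.
        genericArgs = Arrays.asList(Arrays.copyOfRange(shortcutArgs, 1, shortcutArgs.length));
    }

    public static void main(String[] args) {
        SolrIndexShortcut s = new SolrIndexShortcut(new String[] {
            "http://localhost:8983/solr/", "crawldb/", "segments/20130121115214/", "-filter"
        });
        System.out.println(s.config + " " + s.genericArgs);
    }
}
```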







> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>      Components: indexer
>        Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564196#comment-13564196
 ] 

Julien Nioche commented on NUTCH-1047:
--

Hi Tejas

It will work every time you set it in nutch-site.xml. As for setting it with -D 
in the crawl command - you definitely should not have to do that and this is 
where the bug is. The problem is that, for some reason, the value we take from 
the crawl command is correctly set in the configuration object, but the latter 
is then reloaded or overridden during the call to JobClient.runJob(job) 
(IndexingJob line 120).

BTW the crawl command is deprecated and should be removed at some point as we 
have the crawl script. Could you try using the solrindex command as well as the 
crawl script while I try and solve the problem with the crawl command?

Thanks

Julien



> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-28 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13564263#comment-13564263
 ] 

Julien Nioche commented on NUTCH-1047:
--

Tejas

The crawl script and the solrindex command should work without setting 
"solr.server.url" in nutch-site.xml or using -D as this is handled for you in 
the nutch script. Can you please test without specifying "solr.server.url" in 
nutch-site.xml?

Thanks

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>

[jira] [Commented] (NUTCH-1047) Pluggable indexing backends

2013-01-30 Thread Julien Nioche (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566291#comment-13566291
 ] 

Julien Nioche commented on NUTCH-1047:
--

[~wastl-nagel] a text-based indexer is a good idea. Having one generating data 
in the format used by CloudSearch (see [NUTCH-1517]) would be cool as well. As 
for your concerns: most people currently use the SOLR indexer, which will still 
be the one activated by default. I expect a minority of people will try and use 
something else, and if they do then checking which one is activated is no big 
deal, either via the config file or from the logs. Passing the options via the 
config with -D is not very different from using a standard parameter, with the 
added benefit that it lets us set things in nutch-site.xml once and for all and 
hence make the commands much simpler. As for the list of properties, they would 
vary from backend to backend anyway. Each plugin could have a README describing 
what its options are; compared to having everything in nutch-default.xml, at 
least the descriptions will be contained within the related plugin.

[~tejasp] good catch for the number of args, will fix it. Re usage message: we 
could add a getUsage() method to each backend that the generic command will 
call for all the active indexing plugins. I think the solrindex shortcut is 
just a temporary measure though, until the documentation is up to scratch and 
the user base has got used to the generic commands.
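
The getUsage() idea could look roughly like this: the generic command asks each active backend for its own usage text and concatenates the answers. Everything here is a hypothetical sketch of a proposal that is only floated in this thread; UsageAware, SimpleBackend and UsagePrinter are invented names, and the usage strings are placeholders.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical contract; getUsage() is only proposed in the discussion. */
interface UsageAware {
    String name();
    String getUsage();
}

/** Minimal backend description used for the example. */
class SimpleBackend implements UsageAware {
    private final String name;
    private final String usage;

    SimpleBackend(String name, String usage) {
        this.name = name;
        this.usage = usage;
    }

    public String name() {
        return name;
    }

    public String getUsage() {
        return usage;
    }
}

class UsagePrinter {
    /** Builds one usage message from all active backends, one line per backend. */
    static String buildUsage(List<UsageAware> activeBackends) {
        StringBuilder sb = new StringBuilder("Active indexing backends:\n");
        for (UsageAware backend : activeBackends) {
            sb.append("  ").append(backend.name()).append(": ")
              .append(backend.getUsage()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<UsageAware> active = Arrays.<UsageAware>asList(
            new SimpleBackend("solr", "requires solr.server.url"));
        System.out.print(buildUsage(active));
    }
}
```

This keeps the option descriptions next to the plugin that owns them, in the same spirit as the per-plugin README suggested above.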

Thanks for taking the time to share your thoughts, guys. 

 

 

> Pluggable indexing backends
> ---
>
> Key: NUTCH-1047
> URL: https://issues.apache.org/jira/browse/NUTCH-1047
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Julien Nioche
>    Assignee: Julien Nioche
>  Labels: indexing
> Fix For: 1.7
>
> Attachments: NUTCH-1047-1.x-v1.patch, NUTCH-1047-1.x-v2.patch, 
> NUTCH-1047-1.x-v3.patch, NUTCH-1047-1.x-v4.patch
>
>