date:20081202

Build failed in Hudson: Nutch-trunk #649

2008-12-02 Thread Apache Hudson Server

See http://hudson.zones.apache.org/hudson/job/Nutch-trunk/649/changes

Changes:

[kubes] NUTCH-667: Input Format for working with Content in Hadoop Streaming

[kubes] NUTCH-665: Search Load Testing Tool

[kubes] NUTCH-647: Resolve URLs tool

[kubes] NUTCH-647: Resolve URLs tool

[kubes] NUTCH-663: Upgrade Nutch to use Hadoop 0.19

[kubes] NUTCH-662: Upgrade Nutch to use Lucene 2.4

--
[...truncated 2151 lines...]
A src/plugin/protocol-http/src/test/org/apache
A src/plugin/protocol-http/src/test/org/apache/nutch
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http
A src/plugin/protocol-http/src/java
A src/plugin/protocol-http/src/java/org
A src/plugin/protocol-http/src/java/org/apache
A src/plugin/protocol-http/src/java/org/apache/nutch
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/package.html
AU src/plugin/protocol-http/plugin.xml
AU src/plugin/protocol-http/build.xml
A bin
AUbin/nutch
A docs
A docs/ms
A docs/ms/search.html
A docs/ms/help.html
A docs/ms/about.html
A docs/zh
A docs/zh/search.html
A docs/zh/help.html
A docs/zh/about.html
A docs/ca
A docs/ca/search.html
A docs/ca/help.html
A docs/ca/about.html
A docs/pt
A docs/pt/search.html
A docs/pt/help.html
A docs/pt/about.html
A docs/sr
AUdocs/sr/search.html
AUdocs/sr/help.html
AUdocs/sr/about.html
A docs/sv
A docs/sv/search.html
A docs/sv/help.html
A docs/sv/about.html
A docs/de
A docs/de/search.html
A docs/de/help.html
A docs/de/about.html
A docs/fi
A docs/fi/search.html
A docs/fi/help.html
A docs/fi/about.html
A docs/en
A docs/en/search.html
A docs/en/help.html
A docs/en/about.html
A docs/es
A docs/es/search.html
A docs/es/help.html
A docs/es/about.html
A docs/fr
A docs/fr/search.html
AUdocs/fr/help.html
A docs/fr/about.html
A docs/jp
A docs/jp/search.html
A docs/jp/help.html
A docs/jp/about.html
A docs/nl
A docs/nl/search.html
A docs/nl/help.html
A docs/nl/about.html
A docs/sh
AUdocs/sh/search.html
AUdocs/sh/help.html
AUdocs/sh/about.html
A docs/th
A docs/th/search.html
A docs/th/help.html
A docs/th/about.html
A docs/pl
A docs/pl/search.html
A docs/pl/help.html
A docs/pl/about.html
A docs/it
AUdocs/it/search.html
AUdocs/it/help.html
AUdocs/it/about.html
A docs/img
A docs/img/lang
AUdocs/img/lang/romanian.png
AUdocs/img/lang/bulgarian.png
AUdocs/img/lang/spanish.png
AUdocs/img/lang/danish.png
AUdocs/img/lang/dutch.png
AUdocs/img/lang/icelandic.png
AUdocs/img/lang/hungarian.png
AUdocs/img/lang/russian.png
AUdocs/img/lang/japanese.png
AUdocs/img/lang/turkish.png
AUdocs/img/lang/suomi.png
AUdocs/img/lang/lithuanian.png
AUdocs/img/lang/czech.png
AUdocs/img/lang/greek.png
AUdocs/img/lang/galego.png
AUdocs/img/lang/polish.png
AUdocs/img/lang/latvian.png
AUdocs/img/lang/croatian.png
AUdocs/img/lang/portuguese.png
AUdocs/img/lang/french.png
AUdocs/img/lang/swedish.png
AUdocs/img/lang/german.png
AUdocs/img/lang/chinese.png
AUdocs/img/lang/malaysian.png
AUdocs/img/lang/korean.png
AUdocs/img/lang/arabic.png
AUdocs/img/lang/italian.png
AUdocs/img/lang/brazil.png
AUdocs/img/lang/catala.png
AUdocs/img/lang/thai.png
AUdocs/img/lang/indonesian.png
AUdocs/img/lang/norwegian.png
AUdocs/img/lang/english.png
AUdocs/img/poweredbynutch_01.gif
AUdocs/img/poweredbynutch_02.gif
A docs/img/reiter
AUdocs/img/reiter/reiter_inactive_le.gif
AUdocs/img/reiter/_spacer_cc.gif
AUdocs/img/reiter/reiter_inactive_le1.gif
AUdocs/img/reiter/bg_subnavi.gif
AUdocs/img/reiter/002bg_fle.gif
AUdocs/img/reiter/spacer_66.gif
AUdocs/img/reiter/ul.gif
AUdocs/img/reiter/_bg_reiter.gif
AUdocs/img/reiter/logo_nutch.gif
AU

[jira] Updated: (NUTCH-668) Domain URL Filter

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-668:
---

Attachment: NUTCH-668-1-20081202.patch

Includes the DomainURLFilter and test files.  Domains can be filtered either by 
top-level domain, ignoring subdomains, or by hostname, depending on 
configuration.  A configuration file lists the valid domains, one per line.  
Those domains are used to build a set of valid domains against which URLs are 
validated at runtime.  Only URLs that match a domain in the set are 
considered valid.
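
A minimal sketch of the matching described above, not the code from the attached patch. The class name DomainFilterSketch, the "#" comment convention, and the suffix-walking strategy are assumptions made for illustration. Returning null for a rejected URL follows the usual Nutch URLFilter convention.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a domain/host whitelist filter; not the patch's class. */
final class DomainFilterSketch {
    private final Set<String> allowed = new HashSet<>();

    /** Each element is one line of the (assumed) configuration file. */
    DomainFilterSketch(Iterable<String> lines) {
        for (String line : lines) {
            String entry = line.trim().toLowerCase(Locale.ROOT);
            // Assumption: blank lines and lines starting with '#' are ignored.
            if (!entry.isEmpty() && !entry.startsWith("#")) {
                allowed.add(entry);
            }
        }
    }

    /** Returns the URL if its host or any parent domain is listed, else null. */
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // not parseable as a URI
        }
        if (host == null) {
            return null;
        }
        String candidate = host.toLowerCase(Locale.ROOT);
        // Walk the suffixes: www.example.com, example.com, com.
        while (true) {
            if (allowed.contains(candidate)) {
                return url;
            }
            int dot = candidate.indexOf('.');
            if (dot < 0) {
                return null;
            }
            candidate = candidate.substring(dot + 1);
        }
    }
}
```

With the entries "example.com" and "host.example.org", this sketch would accept http://www.example.com/a and http://host.example.org/x but reject http://other.example.org/.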

> Domain URL Filter
> -
>
> Key: NUTCH-668
> URL: https://issues.apache.org/jira/browse/NUTCH-668
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-668-1-20081202.patch
>
>
> A URLFilter that adds the ability to filter out URLs by top-level domain or 
> by hostname.  A configuration file with a listing of URLs is used to denote 
> accepted URLs.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (NUTCH-668) Domain URL Filter

2008-12-02 Thread Dennis Kubes (JIRA)

Domain URL Filter
-

 Key: NUTCH-668
 URL: https://issues.apache.org/jira/browse/NUTCH-668
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


A URLFilter that adds the ability to filter out URLs by top-level domain or by 
hostname.  A configuration file with a listing of URLs is used to denote 
accepted URLs.


[jira] Resolved: (NUTCH-667) Input Format for working with Content in Hadoop Streaming

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-667.


Resolution: Fixed

Committed with revision 722483

> Input Format for working with Content in Hadoop Streaming
> -
>
> Key: NUTCH-667
> URL: https://issues.apache.org/jira/browse/NUTCH-667
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-667-1-20081126.patch
>
>
> This is a ContentAsText input format that replaces line endings with spaces, 
> which allows Nutch content to be used more effectively inside Hadoop 
> streaming jobs.  Streaming lets MapReduce jobs be written in any language 
> that can communicate through stdin and stdout.
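
A rough sketch of why the line endings matter: Hadoop Streaming hands records to the external program one line at a time, so content with embedded newlines would be split into several records. The helper below is hypothetical and is not the committed input format; it only shows the flattening step.

```java
/** Hypothetical helper showing the line-flattening idea; not the committed class. */
final class SingleLineSketch {
    /** Replaces each line ending (CRLF, CR or LF) with a single space. */
    static String toSingleLine(String content) {
        return content.replaceAll("\\r\\n|\\r|\\n", " ");
    }
}
```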


[jira] Closed: (NUTCH-667) Input Format for working with Content in Hadoop Streaming

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-667.
--


> Input Format for working with Content in Hadoop Streaming
> -
>
> Key: NUTCH-667
> URL: https://issues.apache.org/jira/browse/NUTCH-667
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-667-1-20081126.patch
>
>
> This is a ContentAsText input format that replaces line endings with spaces, 
> which allows Nutch content to be used more effectively inside Hadoop 
> streaming jobs.  Streaming lets MapReduce jobs be written in any language 
> that can communicate through stdin and stdout.


[jira] Closed: (NUTCH-665) Search Load Testing Tool

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-665.
--


> Search Load Testing Tool
> 
>
> Key: NUTCH-665
> URL: https://issues.apache.org/jira/browse/NUTCH-665
> Project: Nutch
>  Issue Type: New Feature
>  Components: searcher
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-665-20081126-1.patch
>
>
> A tool that spawns a number of threads and executes searches against 
> configured search servers.  This is used for light load testing of search 
> servers.


[jira] Resolved: (NUTCH-665) Search Load Testing Tool

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-665.


Resolution: Fixed

Committed with revision 722481

> Search Load Testing Tool
> 
>
> Key: NUTCH-665
> URL: https://issues.apache.org/jira/browse/NUTCH-665
> Project: Nutch
>  Issue Type: New Feature
>  Components: searcher
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-665-20081126-1.patch
>
>
> A tool that spawns a number of threads and executes searches against 
> configured search servers.  This is used for light load testing of search 
> servers.


[jira] Resolved: (NUTCH-647) Resolve URLs tool

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-647.


   Resolution: Fixed
Fix Version/s: 1.0.0

Committed with revision 722478

> Resolve URLs tool
> -
>
> Key: NUTCH-647
> URL: https://issues.apache.org/jira/browse/NUTCH-647
> Project: Nutch
>  Issue Type: New Feature
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch
>
>
> A tool that takes a listing of URLs and attempts to resolve their IP 
> addresses.  Useful for running after the fetcher has run to determine 
> whether DNS problems exist.
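
The idea can be sketched with the standard java.net classes. This is an illustrative single-threaded version under my own assumptions (class name, tab-separated output, "UNRESOLVED" marker); it is not the committed tool.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

/** Hypothetical sketch: resolve the host of each URL and report failures. */
final class ResolveUrlsSketch {
    /** Returns the resolved IP address as text, or null if resolution fails. */
    static String resolve(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null; // no host component to look up
            }
            return InetAddress.getByName(host).getHostAddress();
        } catch (IllegalArgumentException | UnknownHostException e) {
            return null; // malformed URL or DNS lookup failure
        }
    }

    public static void main(String[] args) {
        // One URL per argument; unresolved hosts hint at DNS problems.
        for (String url : args) {
            String ip = resolve(url);
            System.out.println(url + "\t" + (ip == null ? "UNRESOLVED" : ip));
        }
    }
}
```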


[jira] Closed: (NUTCH-647) Resolve URLs tool

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-647.
--


> Resolve URLs tool
> -
>
> Key: NUTCH-647
> URL: https://issues.apache.org/jira/browse/NUTCH-647
> Project: Nutch
>  Issue Type: New Feature
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-647-1-20080818.patch, NUTCH-647-2-20081126.patch
>
>
> A tool that takes a listing of URLs and attempts to resolve their IP 
> addresses.  Useful for running after the fetcher has run to determine 
> whether DNS problems exist.


[jira] Closed: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-663.
--


> Upgrade Nutch to use Hadoop 0.19
> 
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
> NUTCH-663-1-20081126.patch
>
>
> Upgrade Nutch to use a newer Hadoop, version 0.19.0.  This includes 
> performance improvements, bug fixes, and new functionality.  Changes some 
> current APIs.


[jira] Resolved: (NUTCH-663) Upgrade Nutch to use Hadoop 0.19

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-663.


Resolution: Fixed

Committed with revision 722477

> Upgrade Nutch to use Hadoop 0.19
> 
>
> Key: NUTCH-663
> URL: https://issues.apache.org/jira/browse/NUTCH-663
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: hadoop-0.19-native.tar.gz, hadoop-0.19.0-core.jar, 
> NUTCH-663-1-20081126.patch
>
>
> Upgrade Nutch to use a newer Hadoop, version 0.19.0.  This includes 
> performance improvements, bug fixes, and new functionality.  Changes some 
> current APIs.


[jira] Resolved: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes resolved NUTCH-662.


Resolution: Fixed

Committed with revision 722475

> Upgrade Nutch to use Lucene 2.4
> ---
>
> Key: NUTCH-662
> URL: https://issues.apache.org/jira/browse/NUTCH-662
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, 
> lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch
>
>
> Upgrade Nutch to use Lucene 2.4.  This release changes the Lucene file 
> format.  New indexes created by this Lucene version will NOT be readable by 
> older versions.  Lucene 2.4 can read and update older index formats, although 
> updating an older format will convert it to the new format.  There are also 
> some performance and functionality improvements.


[jira] Closed: (NUTCH-662) Upgrade Nutch to use Lucene 2.4

2008-12-02 Thread Dennis Kubes (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes closed NUTCH-662.
--


closed

> Upgrade Nutch to use Lucene 2.4
> ---
>
> Key: NUTCH-662
> URL: https://issues.apache.org/jira/browse/NUTCH-662
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: lucene-analyzers-2.4.0.jar, lucene-core-2.4.0.jar, 
> lucene-misc-2.4.0.jar, NUTCH-662-20081121-1.patch
>
>
> Upgrade Nutch to use Lucene 2.4.  This release changes the Lucene file 
> format.  New indexes created by this Lucene version will NOT be readable by 
> older versions.  Lucene 2.4 can read and update older index formats, although 
> updating an older format will convert it to the new format.  There are also 
> some performance and functionality improvements.


Re: Pending Commits for Nutch Issues

2008-12-02 Thread Susam Pal

I agree with John too. You probably meant $0.02, since 0.02 cents is too
little. It is usually 2 cents. :-P

Regards,
Susam Pal

On Tue, Dec 2, 2008 at 6:09 PM, John Martyniak <[EMAIL PROTECTED]> wrote:

> Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
> integration would be huge.
>
> just my .02 cents.
>
> -John
>
> On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:
>
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
>> review for people, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney
>>
>
>

Re: Pending Commits for Nutch Issues

2008-12-02 Thread Julien Nioche

I agree with John. NUTCH-442 is by far the most popular/watched item in JIRA
and, I think, has already been used by enough different people to be
deemed reliable.

Julien


2008/12/2 John Martyniak <[EMAIL PROTECTED]>

> Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
> integration would be huge.
>
> just my .02 cents.
>
> -John
>
>
> On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:
>
>> And here is a list of issues from me that need more discussion/review:
>>
>> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
>> review for people, for now we can just write a SolrIndexer like Sami
>> Siren's and deal with 442 after 1.0. I would be happy to provide such
>> a patch.
>>
>> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
>> don't know how to fix this one but indexing almost always fails with
>> index-more enabled.
>>
>> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
>> fetch interval correctly: I botched it once so now I am afraid to
>> commit it :D
>>
>> NUTCH-626 - fetcher2 breaks out the domain with
>> db.ignore.external.links set at cross domain redirects: I am going to
>> update the patch and commit it if no objections.
>>
>> Also, I think NUTCH-658 would be a nice feature for 1.0.
>>
>> There are some others but these are the most recent and we really
>> should push 1.0 out the door already :D
>>
>> Oh and finally we should do a review of all libraries in nutch
>> (libraries in plugins included) and update them to latest versions. I
>> am going to open an issue with the intention of updating all the
>> libraries that do not require code changes.
>>
>> --
>> Doğacan Güney
>>
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Pending Commits for Nutch Issues

2008-12-02 Thread John Martyniak

Is NUTCH-442 going to be part of the 1.0 release?  I hope so, Nutch/Solr
integration would be huge.

just my .02 cents.

-John

On Nov 27, 2008, at 12:10 PM, Doğacan Güney wrote:

> And here is a list of issues from me that need more discussion/review:
>
> NUTCH-442 - Integrate Nutch/Solr: If NUTCH-442 is too complex to
> review for people, for now we can just write a SolrIndexer like Sami
> Siren's and deal with 442 after 1.0. I would be happy to provide such
> a patch.
>
> NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException: I
> don't know how to fix this one but indexing almost always fails with
> index-more enabled.
>
> NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
> fetch interval correctly: I botched it once so now I am afraid to
> commit it :D
>
> NUTCH-626 - fetcher2 breaks out the domain with
> db.ignore.external.links set at cross domain redirects: I am going to
> update the patch and commit it if no objections.
>
> Also, I think NUTCH-658 would be a nice feature for 1.0.
>
> There are some others but these are the most recent and we really
> should push 1.0 out the door already :D
>
> Oh and finally we should do a review of all libraries in nutch
> (libraries in plugins included) and update them to latest versions. I
> am going to open an issue with the intention of updating all the
> libraries that do not require code changes.
>
> --
> Doğacan Güney

named parameters in crawl command

2008-12-02 Thread Koch Martina

Hi all,

I've defined a couple of custom parameters for bin/nutch, for example a 
"-conf" parameter that sets the conf dir from the command line.
To use them with the crawl command, I have to adjust the for-loop and the 
if/else statements that process the command-line arguments args[] in 
Crawl.java so that the class knows my new parameters. Otherwise it takes the 
last "unknown" parameter as the URL input directory (the last else-if 
statement). Wouldn't it be better to use a named parameter for the URL 
directory, as for all the other parameters? Then one wouldn't have to change 
Nutch core classes to use custom input parameters, because they would simply 
be discarded if the Java program has no use for them.
What do you think? In my opinion, the change to version 1.0 would be a good 
point in time to introduce a slightly different usage of the standard crawl 
command.

Kind regards,
Martina
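
To make the proposal above concrete, here is a minimal sketch of argument parsing in which the URL directory is named as well. The option name "-urls" is hypothetical, the other four names mirror the crawl options as I understand them, and the class is not part of Nutch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of named-only argument parsing for a crawl-like command. */
final class NamedArgsSketch {
    // "-urls" is an assumed name for the proposed URL directory option.
    private static final Set<String> KNOWN =
        new HashSet<>(Arrays.asList("-urls", "-dir", "-threads", "-depth", "-topN"));

    /** Collects known "-name value" pairs and skips everything else. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (KNOWN.contains(args[i]) && i + 1 < args.length) {
                options.put(args[i], args[++i]); // consume the value too
            }
            // Any other token, e.g. a custom "-conf <dir>" pair, is ignored
            // because no positional argument is left to confuse it with.
        }
        return options;
    }
}
```

With no positional argument left, an input such as "-conf myconf -urls urls -depth 3" yields only the -urls and -depth entries, and the custom -conf pair is dropped without touching the parsing code.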

[jira] Issue Comment Edited: (NUTCH-664) Possibility to update already stored documents.

2008-12-02 Thread Sergey Khilkov (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651458#action_12651458
 ] 

skhil edited comment on NUTCH-664 at 12/2/08 1:29 AM:
---

Good news! So, I'll wait until 1.0 and prepare project for hbase-solr!

  was (Author: skhil):
Good news! So, I'll wait until 1.0 and prepare project for 
hbase-solr/katta/etc!
  
> Possibility to update already stored documents.
> ---
>
> Key: NUTCH-664
> URL: https://issues.apache.org/jira/browse/NUTCH-664
> Project: Nutch
>  Issue Type: Wish
>Reporter: Sergey Khilkov
>Priority: Minor
>
> We have a huge index of stored documents. It is a costly procedure to fetch 
> a page and merge indexes every time we update some information about the 
> page. The information can change 1-3 times per day. At the moment we have to 
> store the changed info in a database, but in that case we have lots of 
> problems with sorting, search restrictions and so on. Lucene itself allows 
> deleting a single document and adding a new one to an existing index. But 
> there is a problem with Hadoop... As I understand it, the Hadoop filesystem 
> cannot write at random positions. But it would be a great feature if Nutch 
> were able to update a created index.

