[jira] [Commented] (NUTCH-1084) ReadDB url throws exception
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741933#comment-14741933 ]

Nadeem Douba commented on NUTCH-1084:
-------------------------------------

I think I found the issue, and I don't think it's related to Nutch. AbstractMapWritable uses the Class.forName method, which throws the CNFE. This is because Class.forName uses the system class loader, which is different from the current thread's context class loader in that it does not include the job jar on its class path. I recompiled hadoop-common to see if it would fix the issue, replacing the Class.forName call with Thread.currentThread().getContextClassLoader().loadClass(class). This seems to fix the issue.

> ReadDB url throws exception
> ---------------------------
>
>                 Key: NUTCH-1084
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1084.patch
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop versions
> 2. it throws "can't find class: org.apache.nutch.protocol.ProtocolStatus" (???)
> The first problem can be remedied by not allowing the injector or updater to
> write the _SUCCESS file. Until now that's the solution implemented for
> similar issues. I've not been successful in making the Hadoop readers simply
> skip the file.
> The second issue seems a bit strange and did not happen on a local checkout.
> I'm not yet sure whether this is a Hadoop issue or something being corrupt in
> the CrawlDB.
> Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class:
> org.apache.nutch.protocol.ProtocolStatus because
> org.apache.nutch.protocol.ProtocolStatus
> 	at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
> 	at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
> 	at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
> 	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
> 	at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
> 	at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
> 	at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
> 	at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
> 	at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
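The class-loader difference described in the comment above can be illustrated with a small, self-contained sketch. This is plain JDK code, not the actual hadoop-common patch; the class name is a stand-in, since org.apache.nutch.protocol.ProtocolStatus lives in the job jar and is not available here:

```java
// Sketch: Class.forName resolves against the caller's defining class loader,
// while Thread.currentThread().getContextClassLoader() can be set (as Hadoop
// does for tasks) to a loader that also sees the job jar. In a plain JVM the
// two loaders coincide, so both calls resolve to the same Class object here;
// inside a Hadoop task they may not, which is what caused the CNFE.
public class LoaderSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in name; in the Nutch case it was
        // org.apache.nutch.protocol.ProtocolStatus from the job jar.
        String name = "java.util.ArrayList";

        // What AbstractMapWritable effectively did:
        Class<?> viaForName = Class.forName(name);

        // The replacement proposed in the comment:
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        Class<?> viaContext = ctx.loadClass(name);

        // In this plain-JVM sketch both resolve identically.
        System.out.println(viaForName == viaContext);
    }
}
```

The point of the fix is that the context class loader is the one Hadoop configures with the job jar on its class path, so classes shipped in the job jar remain resolvable.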
[jira] [Commented] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741901#comment-14741901 ]

Chris A. Mattmann commented on NUTCH-2094:
------------------------------------------

No problem, just switch to branch-2.3 (it should be a branch in the GitHub repo).

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2096) Explicitly indicate browser binary to use when selecting selenium remote option in config
[ https://issues.apache.org/jira/browse/NUTCH-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741693#comment-14741693 ]

Kim Whitehall commented on NUTCH-2096:
--------------------------------------

I can't seem to figure out how to assign this task to myself. Anyhow, I'm working on it.

> Explicitly indicate browser binary to use when selecting selenium remote
> option in config
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2096
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2096
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>            Reporter: Kim Whitehall
>            Priority: Minor
>             Fix For: 1.11
>
> When using the selenium grid, not defining the binary version on nodes that
> have multiple versions of browsers can lead to errors.
> The solution proposed is to extend the DesiredCapabilities capabilities
> provided in the "remote" case of
> $NUTCH_HOME/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
> provided in NUTCH-2083 to explicitly indicate the browser path.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2096) Explicitly indicate browser binary to use when selecting selenium remote option in config
Kim Whitehall created NUTCH-2096:
------------------------------------

             Summary: Explicitly indicate browser binary to use when selecting selenium remote option in config
                 Key: NUTCH-2096
                 URL: https://issues.apache.org/jira/browse/NUTCH-2096
             Project: Nutch
          Issue Type: Improvement
          Components: plugin
            Reporter: Kim Whitehall
            Priority: Minor
             Fix For: 1.11

When using the selenium grid, not defining the binary version on nodes that have multiple versions of browsers can lead to errors.

The solution proposed is to extend the DesiredCapabilities capabilities provided in the "remote" case of $NUTCH_HOME/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java provided in NUTCH-2083 to explicitly indicate the browser path.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
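A Selenium DesiredCapabilities object is, at heart, a map of key/value settings that the client sends to the grid node. A minimal plain-JDK sketch of the intent (not the actual HttpWebClient change): the "firefox_binary" key follows Selenium 2.x conventions for pointing a node at a specific Firefox executable, and the path below is a made-up example.

```java
import java.util.HashMap;
import java.util.Map;

public class RemoteBinarySketch {
    public static void main(String[] args) {
        // Hypothetical capability map; in real code this would be a
        // DesiredCapabilities instance passed to RemoteWebDriver.
        Map<String, Object> capabilities = new HashMap<>();
        capabilities.put("browserName", "firefox");
        // Assumed key name and example path: tells the node which of several
        // installed browser binaries to launch, which is the gap NUTCH-2096
        // proposes to close in the "remote" case.
        capabilities.put("firefox_binary", "/opt/firefox-38/firefox");
        System.out.println(capabilities.get("firefox_binary"));
    }
}
```

With the binary path pinned in the capabilities, a node hosting multiple browser versions no longer has to guess which executable to start.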
[jira] [Updated] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2095:
--------------------------------------------------
    Attachment: NUTCH-2095.patch

> WARC exporter for the CommonCrawlDataDumper
> -------------------------------------------
>
>                 Key: NUTCH-2095
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2095
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, tool
>    Affects Versions: 1.11
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: tools, warc
>         Attachments: NUTCH-2095.patch
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the default
> JACKSON formatter is used.
> {{-warcSize}}: defines a maximum file size for each WARC file; if not
> specified, a default of 1 GB per file is used, as recommended by the WARC
> ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially
> to the Factory and to the Formats. These changes avoid creating a new
> instance of a {{CommonCrawlFormat}} for each URL read from the segments.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper
Jorge Luis Betancourt Gonzalez created NUTCH-2095:
-----------------------------------------------------

             Summary: WARC exporter for the CommonCrawlDataDumper
                 Key: NUTCH-2095
                 URL: https://issues.apache.org/jira/browse/NUTCH-2095
             Project: Nutch
          Issue Type: Improvement
          Components: commoncrawl, tool
    Affects Versions: 1.11
            Reporter: Jorge Luis Betancourt Gonzalez
            Priority: Minor

Adds the possibility of exporting the Nutch segments to WARC files.

From the usage point of view, a couple of new command line options are available:

{{-warc}}: enables exporting into WARC files; if not specified, the default JACKSON formatter is used.

{{-warcSize}}: defines a maximum file size for each WARC file; if not specified, a default of 1 GB per file is used, as recommended by the WARC ISO standard.

The usual {{-gzip}} flag can be used to enable compression on the WARC files.

Some changes to the default {{CommonCrawlDataDumper}} were done, essentially to the Factory and to the Formats. These changes avoid creating a new instance of a {{CommonCrawlFormat}} for each URL read from the segments.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
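The factory change described above, reusing one format instance across all URLs and closing it once, can be sketched with hypothetical stand-in types (the real interface is CommonCrawlFormat; the names below are illustrative, not Nutch's actual API):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class FormatReuseSketch {
    // Hypothetical stand-in for CommonCrawlFormat, with the close() method the
    // issue describes for formats that need a closing step (e.g. a WARC writer
    // finishing its current file).
    interface Format extends Closeable {
        void write(String url);
    }

    static class CountingFormat implements Format {
        int written = 0;
        public void write(String url) { written++; }
        public void close() { /* flush and finish output here */ }
    }

    public static void main(String[] args) throws IOException {
        List<String> urls = Arrays.asList("http://a/", "http://b/", "http://c/");

        // One instance reused for every URL and closed once at the end,
        // instead of constructing a new format per record read from the
        // segments, which is the overhead the refactor removes.
        CountingFormat format = new CountingFormat();
        for (String url : urls) {
            format.write(url);
        }
        format.close();
        System.out.println(format.written);
    }
}
```

Reuse also matters for the {{-warcSize}} option: only a long-lived writer can track how many bytes it has emitted and rotate files at the size limit.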
[jira] [Commented] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741384#comment-14741384 ]

Prerna Satija commented on NUTCH-2094:
--------------------------------------

Hi [~chrismattmann], I opened the git link that you shared, but the repository at the clone link is for Nutch 1.0, while my fix is for a bug in Nutch 2.3. Can you send the clone link for Nutch 2.3?

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741292#comment-14741292 ]

Chris A. Mattmann commented on NUTCH-2094:
------------------------------------------

Hi [~prernasatija], would you be willing to submit a Pull Request/Patch for this issue per http://github.com/apache/nutch/#contributing ? I would be happy to commit it.

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-2094 started by Chris A. Mattmann.
------------------------------------------------

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reopened NUTCH-2094:
--------------------------------------
    Assignee: Chris A. Mattmann

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prerna Satija resolved NUTCH-2094.
----------------------------------
    Resolution: Fixed

I fixed this issue on line 57 of NutchServerPoolExecutor.java. Instead of the line

    runningWorkers.remove(((JobWorker) runnable).getInfo());

I have put

    runningWorkers.remove(((JobWorker) runnable));

This was a bug in the Nutch 2.3 code: runningWorkers is a queue of JobWorker objects, so only an object of type JobWorker should be removed from the queue, not jobWorker.getInfo(), because that attempts to remove a JobInfo-typed object from the runningWorkers queue.

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
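The bug behind this fix is a quiet no-op: Queue.remove(Object) matches by equals, so passing a JobInfo to a queue that holds JobWorker objects removes nothing and the "running" entry lingers, which is why the restarted crawl could not be stopped. A self-contained demonstration with minimal stand-in classes (JobWorker/JobInfo here are sketches, not Nutch's real classes):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueRemoveDemo {
    // Minimal stand-ins for Nutch's JobInfo and JobWorker.
    static class JobInfo { }
    static class JobWorker {
        final JobInfo info = new JobInfo();
        JobInfo getInfo() { return info; }
    }

    public static void main(String[] args) {
        Queue<JobWorker> runningWorkers = new ArrayDeque<>();
        JobWorker worker = new JobWorker();
        runningWorkers.add(worker);

        // Buggy call: the queue contains a JobWorker, not a JobInfo, so
        // remove() finds no match and returns false; the worker stays queued.
        boolean buggy = runningWorkers.remove(worker.getInfo());

        // Fixed call: remove the worker itself, which succeeds.
        boolean fixed = runningWorkers.remove(worker);

        System.out.println(buggy + " " + fixed);  // false true
    }
}
```

The stale JobWorker left in the queue is exactly the state that made a restarted crawl appear unstoppable.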
[jira] [Created] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
Prerna Satija created NUTCH-2094:
------------------------------------

             Summary: When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
                 Key: NUTCH-2094
                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
             Project: Nutch
          Issue Type: Bug
            Reporter: Prerna Satija

I have created a stop button in Nutch webapp to stop a running crawl from the UI on click of a "stop" button. While testing, I found that I am able to stop a crawl successfully but when I restart a stopped crawl and try to stop it, it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper
GitHub user jorgelbg reopened a pull request:

    https://github.com/apache/nutch/pull/55

    WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting the Nutch segments to WARC files.
    From the usage point of view, a couple of new command line options are
    available:

    * `-warc`: enables exporting into WARC files; if not specified, the
      default JACKSON formatter is used.
    * `-warcSize`: defines a maximum file size for each WARC file; if not
      specified, a default of 1 GB per file is used, as recommended by the
      WARC ISO standard.

    The usual `-gzip` flag can be used to enable compression on the WARC
    files.

    Some changes to the default CommonCrawlDataDumper were done, essentially
    to the Factory and to the Formats. These changes avoid creating a new
    instance of a CommonCrawlFormat for each URL read from the segments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DigitalPebble/nutch warc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #55

----
commit 0a627e5a5098a2ad4818b594fe567ea7fdd2c131
Author: Jorge Luis Betancourt
Date:   2015-09-08T13:21:04Z

    Initial version of the CommonCrawlWARCFormat; generates valid metadata,
    response and request records. The request records only provide partial
    information, roughly the same as the CommonCrawl Data Dumper at the
    moment.
commit 1889a0b64d48005499f4de01ed18724087feb0f7
Author: Jorge Luis Betancourt
Date:   2015-09-08T16:37:27Z

    Adding the WARCUtils class and the dependency to the ivy.xml file to
    avoid fetching another hadoop dependency

commit 169e5a4a4172424b31c91e232bb69056b10827c7
Author: Jorge Luis Betancourt
Date:   2015-09-08T18:21:47Z

    Removing the transitive property in the ivy.xml file to avoid any future
    trouble

commit ede35d1aa767741ec5206de7990910fc661983e8
Author: Jorge Luis Betancourt
Date:   2015-09-10T17:57:11Z

    Refactoring the existing code, essentially to avoid creating an instance
    of each CommonCrawlFormat per URL processed; since the format is content
    independent at the moment, the factory should allow creating a format
    without this data. Added a close method to the CommonCrawlFormat
    interface for those cases when the format needs some closing statement.

commit 44beb74172364556f70b6f08d0a8ee511c99eff4
Author: Jorge Luis Betancourt
Date:   2015-09-11T14:34:42Z

    Adding the changes to the main CCDataDumper class to call the WARC
    exporter tool. Changes to the Jackson format to work with the new
    structure. Changes to the FormatFactory to create the right Jackson/WARC
    instance.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper
Github user jorgelbg closed the pull request at:

    https://github.com/apache/nutch/pull/55

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper
GitHub user jorgelbg opened a pull request:

    https://github.com/apache/nutch/pull/55

    WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting the Nutch segments to WARC files.
    From the usage point of view, a couple of new command line options are
    available:

    * `-warc`: enables exporting into WARC files; if not specified, the
      default JACKSON formatter is used.
    * `-warcSize`: defines a maximum file size for each WARC file; if not
      specified, a default of 1 GB per file is used, as recommended by the
      WARC ISO standard.

    The usual `-gzip` flag can be used to enable compression on the WARC
    files.

    Some changes to the default CommonCrawlDataDumper were done, essentially
    to the Factory and to the Formats. These changes avoid creating a new
    instance of a CommonCrawlFormat for each URL read from the segments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DigitalPebble/nutch warc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #55

----
commit 0a627e5a5098a2ad4818b594fe567ea7fdd2c131
Author: Jorge Luis Betancourt
Date:   2015-09-08T13:21:04Z

    Initial version of the CommonCrawlWARCFormat; generates valid metadata,
    response and request records. The request records only provide partial
    information, roughly the same as the CommonCrawl Data Dumper at the
    moment.
commit 1889a0b64d48005499f4de01ed18724087feb0f7
Author: Jorge Luis Betancourt
Date:   2015-09-08T16:37:27Z

    Adding the WARCUtils class and the dependency to the ivy.xml file to
    avoid fetching another hadoop dependency

commit 169e5a4a4172424b31c91e232bb69056b10827c7
Author: Jorge Luis Betancourt
Date:   2015-09-08T18:21:47Z

    Removing the transitive property in the ivy.xml file to avoid any future
    trouble

commit ede35d1aa767741ec5206de7990910fc661983e8
Author: Jorge Luis Betancourt
Date:   2015-09-10T17:57:11Z

    Refactoring the existing code, essentially to avoid creating an instance
    of each CommonCrawlFormat per URL processed; since the format is content
    independent at the moment, the factory should allow creating a format
    without this data. Added a close method to the CommonCrawlFormat
    interface for those cases when the format needs some closing statement.

commit 44beb74172364556f70b6f08d0a8ee511c99eff4
Author: Jorge Luis Betancourt
Date:   2015-09-11T14:34:42Z

    Adding the changes to the main CCDataDumper class to call the WARC
    exporter tool. Changes to the Jackson format to work with the new
    structure. Changes to the FormatFactory to create the right Jackson/WARC
    instance.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Created] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
Markus Jelsma created NUTCH-2093:
------------------------------------

             Summary: Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
                 Key: NUTCH-2093
                 URL: https://issues.apache.org/jira/browse/NUTCH-2093
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.10
            Reporter: Markus Jelsma
            Priority: Minor
             Fix For: 1.11
         Attachments: NUTCH-2093.patch

In IndexerMapReduce, a fetchDatum is passed to the indexing filters. However, when this fetchDatum was created via FreeGenerator, it has no signature attached, and the indexing filters don't see one.

This patch copies the signature from the dbDatum just before the fetchDatum is passed to the indexing filters.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
[ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2093:
---------------------------------
    Attachment: NUTCH-2093.patch

Patch for trunk.

> Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2093
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2093
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>         Attachments: NUTCH-2093.patch
>
> In IndexerMapReduce, a fetchDatum is passed to the indexing filters. However,
> when this fetchDatum was created via FreeGenerator, it has no signature
> attached, and the indexing filters don't see one.
> This patch copies the signature from the dbDatum just before the fetchDatum
> is passed to the indexing filters.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
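The patch's idea, filling in the missing signature from the dbDatum before the fetchDatum reaches the indexing filters, can be sketched with a minimal stand-in class (the real class is org.apache.nutch.crawl.CrawlDatum; Datum below only models the signature field, and is not Nutch's API):

```java
public class SignatureCopySketch {
    // Hypothetical stand-in for CrawlDatum's signature accessors.
    static class Datum {
        private byte[] signature;
        byte[] getSignature() { return signature; }
        void setSignature(byte[] sig) { signature = sig; }
    }

    public static void main(String[] args) {
        // The dbDatum carries a signature computed during parsing/updating.
        Datum dbDatum = new Datum();
        dbDatum.setSignature(new byte[] {1, 2, 3});

        // A fetchDatum created via FreeGenerator arrives without one.
        Datum fetchDatum = new Datum();

        // The fix: copy the signature across just before the fetchDatum is
        // handed to the indexing filters, so they always see a signature.
        if (fetchDatum.getSignature() == null && dbDatum.getSignature() != null) {
            fetchDatum.setSignature(dbDatum.getSignature());
        }

        System.out.println(fetchDatum.getSignature().length);  // 3
    }
}
```

With the copy in place, the FreeGenerator path behaves like the normal generate/fetch path as far as the indexing filters are concerned.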