[GitHub] nutch pull request: Fix the issue of the bad tstamp

2016-02-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/94


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Resolved] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2213.
--
   Resolution: Fixed
Fix Version/s: 1.12

> CommonCrawlDataDumper saves gzipped body in extracted form
> --
>
> Key: NUTCH-2213
> URL: https://issues.apache.org/jira/browse/NUTCH-2213
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl, dumpers
>Reporter: Joris Rau
>Assignee: Chris A. Mattmann
>Priority: Critical
>  Labels: easyfix
> Fix For: 1.12
>
>
> I have downloaded [a WARC 
> file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
>  from the common crawl data. This file contains several gzipped responses 
> which are stored as plaintext (without the gzip encoding).
> I used [warctools|https://github.com/internetarchive/warctools] from Internet 
> Archive to extract the responses out of the WARC file. However this tool 
> expects the Content-Length field to match the actual length of the body in 
> the WARC ([See the issue on 
> github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
>  warctools uses a more up to date version of hanzo warctools which is 
> recommended on the [Common Crawl 
> website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
> file format".
> I have not been using Nutch and can therefore not say which versions are 
> affected by this.
> After reading [the official WARC 
> draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
> could not find out how gzipped content is supposed to be stored. However 
> probably multiple WARC file parsers will have an issue with this.
> It would be nice to know whether you consider this a bug and plan on fixing 
> this and whether this is a major issue which concerns most WARC files of the 
> Common Crawl data or only a small part.
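The mismatch described in the report can be illustrated with a minimal Python sketch. It is not taken from Nutch or warctools: `build_response` and `length_mismatch` are hypothetical helper names, and the sketch assumes the record keeps the original Content-Length header while the body is stored in decoded form.

```python
import gzip


def build_response(body: bytes, declared_length: int) -> bytes:
    """Assemble a minimal HTTP response with the given Content-Length header."""
    head = (
        "HTTP/1.1 200 OK\r\n"
        "Content-Encoding: gzip\r\n"
        f"Content-Length: {declared_length}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def length_mismatch(response: bytes) -> int:
    """Return actual body length minus declared Content-Length (0 = consistent)."""
    head, _, body = response.partition(b"\r\n\r\n")
    declared = None
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            declared = int(value.strip())
    if declared is None:
        raise ValueError("no Content-Length header")
    return len(body) - declared


plain = b"<html>" + b"hello " * 200 + b"</html>"
compressed = gzip.compress(plain)

# Body stored as sent on the wire: header and body agree.
consistent = build_response(compressed, len(compressed))
# Body stored decoded, header left untouched: the lengths disagree.
inconsistent = build_response(plain, len(compressed))

print(length_mismatch(consistent))
print(length_mismatch(inconsistent))
```

A parser that trusts Content-Length would read too few bytes of the second record, which is consistent with the behaviour the reporter describes.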



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-29 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173188#comment-15173188
 ] 

Chris A. Mattmann commented on NUTCH-2213:
--

Fixed, thanks [~jnioche]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 132, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (20/20), 1.98 KiB | 0 bytes/s, done.
Total 20 (delta 10), reused 0 (delta 0)
To https://git-wip-us.apache.org/repos/asf/nutch.git
   15c583e..a3e7420  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann% 
{noformat}







[GitHub] nutch pull request: NUTCH-2213 : do not store the headers verbatim...

2016-02-29 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/88




[jira] [Commented] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173182#comment-15173182
 ] 

ASF GitHub Bot commented on NUTCH-2213:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/88







[jira] [Work started] (NUTCH-2213) CommonCrawlDataDumper saves gzipped body in extracted form

2016-02-29 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2213 started by Chris A. Mattmann.






[GitHub] nutch pull request: Fix the issue of the bad tstamp

2016-02-29 Thread jeremie70
GitHub user jeremie70 opened a pull request:

https://github.com/apache/nutch/pull/94

Fix the issue of the bad tstamp

The tstamp was always equal to "1970-01-01T00:00:00.000Z" because of this.
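For context, "1970-01-01T00:00:00.000Z" is what the Unix epoch (a timestamp of 0 milliseconds) looks like when formatted as an ISO 8601 string. The pull request text does not say what caused the value, so the sketch below only shows that formatting step; `format_tstamp` is a hypothetical helper, not Nutch code, and the idea that the field was left at zero is an assumption.

```python
from datetime import datetime, timezone


def format_tstamp(epoch_millis: int) -> str:
    """Format milliseconds since the Unix epoch as an ISO 8601 UTC string."""
    seconds, millis = divmod(epoch_millis, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}Z"


# A timestamp that was never set (0) renders as the epoch.
print(format_tstamp(0))  # 1970-01-01T00:00:00.000Z
```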

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jeremie70/nutch branch-tstamp

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/94.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #94


commit c98bb7030c4d128b8bcde65df7be922789ed0b02
Author: Jérémie Bourseau 
Date:   2016-02-29T16:02:21Z

fix the issue of the bad tstamp






[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-29 Thread Adnane B. (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171889#comment-15171889
 ] 

Adnane B. commented on NUTCH-:
--

Thank you very much!

> re-fetch deletes all  metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Lewis John McGibbney
> Fix For: 2.3.2
>
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml :
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}
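The effect of the short interval in the reproduction steps can be sketched as follows. This is a simplified illustration, not Nutch's actual scheduling code: `is_due_for_refetch` is a hypothetical helper, and it assumes a page becomes eligible again once the configured number of seconds has passed since its last fetch.

```python
from datetime import datetime, timedelta, timezone


def is_due_for_refetch(last_fetch: datetime, interval_seconds: int,
                       now: datetime) -> bool:
    """Return True once at least interval_seconds have passed since last_fetch."""
    return now >= last_fetch + timedelta(seconds=interval_seconds)


last_fetch = datetime(2016, 2, 29, 12, 0, 0, tzinfo=timezone.utc)

# With the interval set to 60, the page is not yet due after 30 seconds ...
print(is_due_for_refetch(last_fetch, 60, last_fetch + timedelta(seconds=30)))
# ... but it is due after 2 minutes, so the second crawl re-fetches it.
print(is_due_for_refetch(last_fetch, 60, last_fetch + timedelta(minutes=2)))
```

Setting the interval to one minute means the second generate/fetch cycle selects the same pages again, which is what makes the reported behaviour easy to trigger.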





[jira] [Commented] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-29 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171725#comment-15171725
 ] 

Tien Nguyen Manh commented on NUTCH-2236:
-

No problem, just to make it run on Hadoop 2.7.1

> Upgrade to Hadoop 2.7.1
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1





[jira] [Commented] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-29 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171705#comment-15171705
 ] 

Markus Jelsma commented on NUTCH-2236:
--

Hello Tien - what problem does this patch solve?
Thanks!






[jira] [Assigned] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-29 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reassigned NUTCH-2236:


Assignee: Markus Jelsma






[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-29 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated NUTCH-2236:

Fix Version/s: 1.12

> Upgrade to Hadoop 2.7.1
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Fix For: 1.12
>
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1


