[jira] [Resolved] (NUTCH-1921) Optionally disable HTTP if-modified-since header

2015-03-03 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1921.
--
Resolution: Fixed

Committed to trunk in rev. 1663698.
Thanks!

 Optionally disable HTTP if-modified-since header
 

 Key: NUTCH-1921
 URL: https://issues.apache.org/jira/browse/NUTCH-1921
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
 Fix For: 1.10

 Attachments: NUTCH-1921-trunk.patch, NUTCH-1921-trunk.patch


 Records with status fetch_not_modified are not parsed, are not passed through 
 parse filters or index filters, and are not indexed. This is a serious 
 problem if you have modified a parse filter, an indexing filter or any other 
 behaviour in the pipeline, because the changes never show up in the index.
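
The behaviour described above can be illustrated with a minimal sketch. It is not the committed patch: the class name, the method and the boolean switch are hypothetical stand-ins for whatever option the fix introduces. The idea is that when the If-Modified-Since header is not sent, the server cannot answer 304 Not Modified, so the page comes back in full and runs through parsing and indexing again.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ConditionalRequestSketch {

  // HTTP-date formatter (fixed-width day, always GMT).
  private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
      .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
      .withZone(ZoneOffset.UTC);

  /**
   * Builds the request headers for a re-fetch.
   * sendIfModifiedSince is a hypothetical switch, not a real Nutch property.
   */
  public static Map<String, String> buildHeaders(boolean sendIfModifiedSince,
                                                 long lastFetchMillis) {
    Map<String, String> headers = new LinkedHashMap<>();
    headers.put("Accept", "*/*");
    // Only make the request conditional when the switch is on and a
    // previous fetch time is known.
    if (sendIfModifiedSince && lastFetchMillis > 0) {
      headers.put("If-Modified-Since",
          HTTP_DATE.format(Instant.ofEpochMilli(lastFetchMillis)));
    }
    return headers;
  }

  public static void main(String[] args) {
    long lastFetch = Instant.parse("2015-03-03T10:15:30Z").toEpochMilli();
    System.out.println(buildHeaders(true, lastFetch));   // conditional request
    System.out.println(buildHeaders(false, lastFetch));  // unconditional request
  }
}
```

With the switch off, every re-fetch is unconditional, which costs bandwidth but guarantees that changed parse or index filters are applied to all records.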



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-03-03 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1946:

Attachment: NUTCH-1946v3.patch

You need to build the Gora master branch locally for this patch to work.

 Upgrade to Gora 0.6.1
 -

 Key: NUTCH-1946
 URL: https://issues.apache.org/jira/browse/NUTCH-1946
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.3.1

 Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, 
 NUTCH-1946v2.patch, NUTCH-1946v3.patch


 Apache Gora was released recently.
 We should upgrade before pushing Nutch 2.3.1 as it will come in very handy 
 for the new Docker containers.





[jira] [Updated] (NUTCH-1946) Upgrade to Gora 0.6.1

2015-03-03 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1946:

Summary: Upgrade to Gora 0.6.1  (was: Upgrade to Gora 0.6)

 Upgrade to Gora 0.6.1
 -

 Key: NUTCH-1946
 URL: https://issues.apache.org/jira/browse/NUTCH-1946
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 2.3.1

 Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, 
 NUTCH-1946v2.patch







[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1949:
-
Component/s: tool
 storage
 linkdb
 crawldb

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Fix For: 1.10

 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf


 We are going to develop a {{CommonCrawlDataDumper.java}} class.
 {{CommonCrawlDataDumper}} is a tool that performs the following steps:
 # deserialize the crawled data from Nutch
 # map the deserialized data onto the proper JSON structure
 # serialize the data into [CBOR|http://cbor.io] format
 # optionally, compress the serialized data using {{gzip}}
 The tool must accept either a single Nutch segment or a directory
 containing segments as input.
 Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support 
 and code.
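
The last three steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the proposed {{CommonCrawlDataDumper}}: the record is a plain string map standing in for deserialized Nutch data, the field names are invented, and the CBOR encoder is hand-written and covers only a flat map of text strings. A real tool would more likely use a CBOR library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

public class CborDumpSketch {

  // Writes a CBOR head: 3-bit major type plus a non-negative length argument.
  static void writeHead(ByteArrayOutputStream out, int major, int n) {
    int mt = major << 5;
    if (n < 24) {
      out.write(mt | n);            // argument fits in the head byte
    } else if (n < 0x100) {
      out.write(mt | 24);           // one-byte argument follows
      out.write(n);
    } else if (n < 0x10000) {
      out.write(mt | 25);           // two-byte big-endian argument follows
      out.write(n >> 8);
      out.write(n);
    } else {
      out.write(mt | 26);           // four-byte big-endian argument follows
      out.write(n >> 24);
      out.write(n >> 16);
      out.write(n >> 8);
      out.write(n);
    }
  }

  // Major type 3: UTF-8 text string.
  static void writeText(ByteArrayOutputStream out, String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeHead(out, 3, utf8.length);
    out.write(utf8, 0, utf8.length);
  }

  // Major type 5: map of key/value pairs, here text to text only.
  public static byte[] toCbor(Map<String, String> record) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeHead(out, 5, record.size());
    for (Map.Entry<String, String> e : record.entrySet()) {
      writeText(out, e.getKey());
      writeText(out, e.getValue());
    }
    return out.toByteArray();
  }

  // Serializes the record to CBOR and optionally gzips the result.
  public static byte[] dump(Map<String, String> record, boolean compress) {
    byte[] cbor = toCbor(record);
    if (!compress) {
      return cbor;
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(cbor);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // Hypothetical record; these keys are not the actual Common Crawl schema.
    Map<String, String> record = new LinkedHashMap<>();
    record.put("url", "http://example.org/");
    record.put("content", "<html>...</html>");
    System.out.println(dump(record, false).length + " bytes as CBOR");
    System.out.println(dump(record, true).length + " bytes as gzipped CBOR");
  }
}
```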





[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1949:
-
Assignee: Lewis John McGibbney  (was: Giuseppe Totaro)

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf







[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1949:
-
Fix Version/s: 1.10

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Fix For: 1.10

 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf







[jira] [Resolved] (NUTCH-1950) File name too long when bin/nutch dump

2015-03-03 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-1950.
--
Resolution: Fixed
  Assignee: Chris A. Mattmann

Great work, you guys. Looks good to me! Seb, it looks like they addressed your 
comment. Committed to trunk in r1663847; this closes #9.

 File name too long when bin/nutch dump
 --

 Key: NUTCH-1950
 URL: https://issues.apache.org/jira/browse/NUTCH-1950
 Project: Nutch
  Issue Type: Bug
  Components: segment
Affects Versions: 1.10
Reporter: Chong Li
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.10

   Original Estimate: 48h
  Remaining Estimate: 48h

 When running bin/nutch dump in version 1.10-trunk, an exception is thrown 
 saying "File name too long". When crawling, a URL may be longer than 255 
 bytes, and Nutch saves the file using the URL as the file name. The content 
 can be stored in segments, but when the files are dumped to the local file 
 system, a file name cannot be longer than 255 bytes.
 FileDumper.java needs to be changed to handle this exception.
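
One possible way to keep dumped file names within the limit is sketched below. This is an assumption for illustration, not necessarily what the committed FileDumper.java change does: the class, the method names and the choice of an MD5 suffix are all hypothetical. Names that already fit are left alone; longer names are cut on a character boundary and given a hash of the full name so that distinct URLs still map to distinct files.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SafeFileNameSketch {

  // Typical per-component limit on local file systems, as cited in the issue.
  static final int MAX_BYTES = 255;

  // Hex-encoded MD5 of the full name (32 ASCII characters).
  static String md5Hex(String s) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5")
          .digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder sb = new StringBuilder();
      for (byte b : digest) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      // MD5 must be provided by every Java platform implementation.
      throw new IllegalStateException(e);
    }
  }

  /** Returns the name unchanged if it fits, else a truncated name plus hash. */
  public static String safeName(String name) {
    if (name.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES) {
      return name;
    }
    String suffix = "_" + md5Hex(name);          // 33 bytes, pure ASCII
    int budget = MAX_BYTES - suffix.length();
    StringBuilder prefix = new StringBuilder();
    int used = 0;
    // Copy whole code points so a multi-byte character is never split.
    for (int i = 0; i < name.length(); ) {
      int cp = name.codePointAt(i);
      int len = new String(Character.toChars(cp))
          .getBytes(StandardCharsets.UTF_8).length;
      if (used + len > budget) {
        break;
      }
      prefix.appendCodePoint(cp);
      used += len;
      i += Character.charCount(cp);
    }
    return prefix + suffix;
  }

  public static void main(String[] args) {
    StringBuilder url = new StringBuilder("http___example.org_");
    for (int i = 0; i < 300; i++) {
      url.append('x');
    }
    String safe = safeName(url.toString());
    System.out.println(safe.getBytes(StandardCharsets.UTF_8).length);
  }
}
```

Measuring the length in UTF-8 bytes rather than characters matters because the limit applies to the encoded name, and URLs may contain non-ASCII characters.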





[jira] [Commented] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346258#comment-14346258
 ] 

Lewis John McGibbney commented on NUTCH-1949:
-

This patch has been reviewed by [~jnioche], [~chrismattmann] and [~lewismc].
There is a roadmap to make this an indexing plugin. I will commit by EoB 
tomorrow unless there are objections, and we can open another issue to port 
it to an indexing plugin.

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf







[GitHub] nutch pull request: fix for NUTCH-1950 contributed by xzjh

2015-03-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/9


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-03-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346254#comment-14346254
 ] 

ASF GitHub Bot commented on NUTCH-1950:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/9


 File name too long when bin/nutch dump
 --

 Key: NUTCH-1950
 URL: https://issues.apache.org/jira/browse/NUTCH-1950
 Project: Nutch
  Issue Type: Bug
  Components: segment
Affects Versions: 1.10
Reporter: Chong Li
Priority: Minor
 Fix For: 1.10

   Original Estimate: 48h
  Remaining Estimate: 48h






[jira] [Commented] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346364#comment-14346364
 ] 

Chris A. Mattmann commented on NUTCH-1949:
--

+1

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf







[jira] [Commented] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format

2015-03-03 Thread Jorge Luis Betancourt Gonzalez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346265#comment-14346265
 ] 

Jorge Luis Betancourt Gonzalez commented on NUTCH-1949:
---

+1 

 Dump out the Nutch data into the Common Crawl format
 ---

 Key: NUTCH-1949
 URL: https://issues.apache.org/jira/browse/NUTCH-1949
 Project: Nutch
  Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
 Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, 
 CommonCrawlDataDumper_v02.pdf







[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump

2015-03-03 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346320#comment-14346320
 ] 

Hudson commented on NUTCH-1950:
---

SUCCESS: Integrated in Nutch-trunk #2999 (See 
[https://builds.apache.org/job/Nutch-trunk/2999/])
Fix for NUTCH-1950 File name too long contributed by xzjh jsx...@gmail.com 
and Chong Li. This closes #9. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1663847)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java


 File name too long when bin/nutch dump
 --

 Key: NUTCH-1950
 URL: https://issues.apache.org/jira/browse/NUTCH-1950
 Project: Nutch
  Issue Type: Bug
  Components: segment
Affects Versions: 1.10
Reporter: Chong Li
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.10

   Original Estimate: 48h
  Remaining Estimate: 48h



