[jira] [Resolved] (NUTCH-1921) Optionally disable HTTP if-modified-since header
[ https://issues.apache.org/jira/browse/NUTCH-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1921.
Resolution: Fixed

Committed for trunk in rev. 1663698. Thanks!

Optionally disable HTTP if-modified-since header

Key: NUTCH-1921
URL: https://issues.apache.org/jira/browse/NUTCH-1921
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.9
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.10
Attachments: NUTCH-1921-trunk.patch, NUTCH-1921-trunk.patch

Records with fetch_not_modified are not parsed, are not passed through parse filters or index filters, and are not indexed. This is a huge problem if you have modified a parse filter, an indexing filter, or any other behaviour in the pipeline, because the changes never show up in the index.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
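The idea behind NUTCH-1921 can be illustrated with a minimal sketch of a fetcher that only sends the conditional header when a switch allows it. This is not the committed patch: the function name and the `send_if_modified_since` flag are hypothetical and exist only for illustration, and the sketch is in Python rather than Nutch's Java.

```python
from email.utils import formatdate


def build_request_headers(last_modified_epoch, send_if_modified_since=True):
    """Return request headers, adding If-Modified-Since only when enabled.

    `send_if_modified_since` is a hypothetical switch for illustration; it is
    not the name of the option introduced by the NUTCH-1921 patch.
    """
    headers = {"Accept": "*/*"}
    if send_if_modified_since and last_modified_epoch is not None:
        # HTTP dates are expressed in GMT.
        headers["If-Modified-Since"] = formatdate(last_modified_epoch, usegmt=True)
    return headers
```

With the switch off, the server cannot answer "not modified", so every record is downloaded again and passes through the parse and index filters.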
[jira] [Updated] (NUTCH-1946) Upgrade to Gora 0.6.1
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1946:
Attachment: NUTCH-1946v3.patch

You need to build the Gora master branch locally for this patch to work.

Upgrade to Gora 0.6.1

Key: NUTCH-1946
URL: https://issues.apache.org/jira/browse/NUTCH-1946
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.3.1
Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, NUTCH-1946v2.patch, NUTCH-1946v3.patch

Apache Gora was released recently. We should upgrade before pushing Nutch 2.3.1 as it will come in very handy for the new Docker containers.
[jira] [Updated] (NUTCH-1946) Upgrade to Gora 0.6.1
[ https://issues.apache.org/jira/browse/NUTCH-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1946:
Summary: Upgrade to Gora 0.6.1 (was: Upgrade to Gora 0.6)

Upgrade to Gora 0.6.1

Key: NUTCH-1946
URL: https://issues.apache.org/jira/browse/NUTCH-1946
Project: Nutch
Issue Type: Improvement
Components: storage
Affects Versions: 2.3.1
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 2.3.1
Attachments: NUTCH-1946.patch, NUTCH-1946_Gora_fixes.patch, NUTCH-1946v2.patch

Apache Gora was released recently. We should upgrade before pushing Nutch 2.3.1 as it will come in very handy for the new Docker containers.
[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format
[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-1949:
Component/s: tool, storage, linkdb, crawldb

Dump out the Nutch data into the Common Crawl format

Key: NUTCH-1949
URL: https://issues.apache.org/jira/browse/NUTCH-1949
Project: Nutch
Issue Type: New Feature
Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
Fix For: 1.10
Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, CommonCrawlDataDumper_v02.pdf

We are going to develop a {{CommonCrawlDataDumper.java}} class. The {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
# deserialize the crawled data from Nutch
# map serialized data onto the proper JSON structure
# serialize the data into [CBOR|http://cbor.io] format
# optionally, compress the serialized data using {{gzip}}

This tool has to be able to work with either single Nutch segments or a directory containing segments as input data.

Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and code.
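The four steps in the NUTCH-1949 description can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed {{CommonCrawlDataDumper}}: it is written in Python instead of Java, a plain dict stands in for a deserialized Nutch record, the field names (`url`, `status`, `content`) are hypothetical rather than the real Common Crawl schema, and the tiny `cbor_encode` helper covers only the handful of types the example needs.

```python
import gzip
import struct


def _head(major, n):
    """Encode a CBOR initial byte plus its length/value argument."""
    if n < 24:
        return bytes([(major << 5) | n])
    for info, fmt, limit in ((24, ">B", 1 << 8), (25, ">H", 1 << 16),
                             (26, ">I", 1 << 32), (27, ">Q", 1 << 64)):
        if n < limit:
            return bytes([(major << 5) | info]) + struct.pack(fmt, n)
    raise ValueError("value too large for a CBOR head")


def cbor_encode(obj):
    """Minimal CBOR encoder: None, bool, int, bytes, str, list, dict only."""
    if obj is None:
        return b"\xf6"
    if isinstance(obj, bool):  # must be checked before int
        return b"\xf5" if obj else b"\xf4"
    if isinstance(obj, int):
        return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
    if isinstance(obj, bytes):
        return _head(2, len(obj)) + obj
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _head(3, len(data)) + data
    if isinstance(obj, (list, tuple)):
        return _head(4, len(obj)) + b"".join(cbor_encode(x) for x in obj)
    if isinstance(obj, dict):
        return _head(5, len(obj)) + b"".join(
            cbor_encode(k) + cbor_encode(v) for k, v in obj.items())
    raise TypeError("unsupported type: %r" % type(obj))


def dump_record(record, compress=True):
    """Map one (already deserialized) record to a JSON-like dict, then CBOR."""
    # Hypothetical field names; the real mapping is defined in the attachments.
    doc = {
        "url": record["url"],
        "status": record["status"],
        "content": record["content"],
    }
    payload = cbor_encode(doc)                              # step 3: CBOR
    return gzip.compress(payload) if compress else payload  # step 4: gzip


if __name__ == "__main__":
    rec = {"url": "http://example.com/", "status": 200, "content": b"<html/>"}
    print(len(dump_record(rec, compress=False)), "CBOR bytes")
```

Keeping compression as a flag mirrors the "optionally" in step 4; a real tool would iterate over the records of one segment or of every segment in a directory.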
[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format
[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-1949:
Assignee: Lewis John McGibbney (was: Giuseppe Totaro)

Dump out the Nutch data into the Common Crawl format

Key: NUTCH-1949
URL: https://issues.apache.org/jira/browse/NUTCH-1949
Project: Nutch
Issue Type: New Feature
Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Lewis John McGibbney
Fix For: 1.10
Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, CommonCrawlDataDumper_v02.pdf

We are going to develop a {{CommonCrawlDataDumper.java}} class. The {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
# deserialize the crawled data from Nutch
# map serialized data onto the proper JSON structure
# serialize the data into [CBOR|http://cbor.io] format
# optionally, compress the serialized data using {{gzip}}

This tool has to be able to work with either single Nutch segments or a directory containing segments as input data.

Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and code.
[jira] [Updated] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format
[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-1949:
Fix Version/s: 1.10

Dump out the Nutch data into the Common Crawl format

Key: NUTCH-1949
URL: https://issues.apache.org/jira/browse/NUTCH-1949
Project: Nutch
Issue Type: New Feature
Components: crawldb, linkdb, storage, tool
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
Fix For: 1.10
Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, CommonCrawlDataDumper_v02.pdf

We are going to develop a {{CommonCrawlDataDumper.java}} class. The {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
# deserialize the crawled data from Nutch
# map serialized data onto the proper JSON structure
# serialize the data into [CBOR|http://cbor.io] format
# optionally, compress the serialized data using {{gzip}}

This tool has to be able to work with either single Nutch segments or a directory containing segments as input data.

Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and code.
[jira] [Resolved] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-1950.
Resolution: Fixed
Assignee: Chris A. Mattmann

Great work, you guys, looks good to me! Seb, looks like they addressed your comment! Committed to trunk in r1663847 and this closes #9.

File name too long when bin/nutch dump

Key: NUTCH-1950
URL: https://issues.apache.org/jira/browse/NUTCH-1950
Project: Nutch
Issue Type: Bug
Components: segment
Affects Versions: 1.10
Reporter: Chong Li
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.10
Original Estimate: 48h
Remaining Estimate: 48h

When running bin/nutch dump in version 1.10-trunk, an exception is thrown saying "File name too long". When crawling, a URL may be longer than 255 bytes, and Nutch saves the file using the URL as the file name. The content can be saved in segments, but when the files are dumped to the local file system, the file name cannot be longer than 255 bytes. FileDumper.java needs to be changed to handle this exception.
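One common way to stay under the 255-byte limit described in NUTCH-1950 is to truncate the URL-derived name and append a hash of the full URL so that distinct URLs still map to distinct files. The sketch below shows that approach in Python; it is not necessarily what the committed FileDumper.java change does, and `safe_filename` and `MAX_NAME_BYTES` are hypothetical names chosen for this example.

```python
import hashlib

MAX_NAME_BYTES = 255  # per-file-name limit quoted in the issue


def safe_filename(url, max_bytes=MAX_NAME_BYTES):
    """Derive a file name from a URL that never exceeds `max_bytes` bytes."""
    name = url.replace("/", "_")
    raw = name.encode("utf-8")
    if len(raw) <= max_bytes:
        return name
    # 32 hex characters of a SHA-256 digest keep long URLs distinguishable.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:32]
    keep = max_bytes - len(digest) - 1
    # Truncate on bytes, dropping any multi-byte character cut in half.
    prefix = raw[:keep].decode("utf-8", errors="ignore")
    return prefix + "-" + digest
```

The limit is measured in bytes rather than characters because a URL containing non-ASCII characters can exceed 255 bytes well before it reaches 255 characters.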
[jira] [Commented] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format
[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346258#comment-14346258 ]

Lewis John McGibbney commented on NUTCH-1949:

Review undertaken by [~jnioche], [~chrismattmann] and [~lewismc] on this patch. There is a roadmap to make this an indexing plugin. I will commit by EoB tomorrow unless there are objections, and we can open another issue to get it ported to an indexing plugin.

Dump out the Nutch data into the Common Crawl format

Key: NUTCH-1949
URL: https://issues.apache.org/jira/browse/NUTCH-1949
Project: Nutch
Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, CommonCrawlDataDumper_v02.pdf

We are going to develop a {{CommonCrawlDataDumper.java}} class. The {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
# deserialize the crawled data from Nutch
# map serialized data onto the proper JSON structure
# serialize the data into [CBOR|http://cbor.io] format
# optionally, compress the serialized data using {{gzip}}

This tool has to be able to work with either single Nutch segments or a directory containing segments as input data.

Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and code.
[GitHub] nutch pull request: fix for NUTCH-1950 contributed by xzjh
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/9
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346254#comment-14346254 ]

ASF GitHub Bot commented on NUTCH-1950:

Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/9

File name too long when bin/nutch dump

Key: NUTCH-1950
URL: https://issues.apache.org/jira/browse/NUTCH-1950
Project: Nutch
Issue Type: Bug
Components: segment
Affects Versions: 1.10
Reporter: Chong Li
Priority: Minor
Fix For: 1.10
Original Estimate: 48h
Remaining Estimate: 48h

When running bin/nutch dump in version 1.10-trunk, an exception is thrown saying "File name too long". When crawling, a URL may be longer than 255 bytes, and Nutch saves the file using the URL as the file name. The content can be saved in segments, but when the files are dumped to the local file system, the file name cannot be longer than 255 bytes. FileDumper.java needs to be changed to handle this exception.
[jira] [Commented] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format
[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346364#comment-14346364 ]

Chris A. Mattmann commented on NUTCH-1949:

+1

Dump out the Nutch data into the Common Crawl format

Key: NUTCH-1949
URL: https://issues.apache.org/jira/browse/NUTCH-1949
Project: Nutch
Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, CommonCrawlDataDumper_v02.pdf

We are going to develop a {{CommonCrawlDataDumper.java}} class. The {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
# deserialize the crawled data from Nutch
# map serialized data onto the proper JSON structure
# serialize the data into [CBOR|http://cbor.io] format
# optionally, compress the serialized data using {{gzip}}

This tool has to be able to work with either single Nutch segments or a directory containing segments as input data.

Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and code.
[jira] [Commented] (NUTCH-1949) Dump out the Nutch data into the Common Crawl format
[ https://issues.apache.org/jira/browse/NUTCH-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346265#comment-14346265 ]

Jorge Luis Betancourt Gonzalez commented on NUTCH-1949:

+1

Dump out the Nutch data into the Common Crawl format

Key: NUTCH-1949
URL: https://issues.apache.org/jira/browse/NUTCH-1949
Project: Nutch
Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
Attachments: CommonCrawlDataDumper.pdf, CommonCrawlDataDumper.xlsx, CommonCrawlDataDumper_v02.pdf

We are going to develop a {{CommonCrawlDataDumper.java}} class. The {{CommonCrawlDataDumper}} is a tool able to perform the following steps:
# deserialize the crawled data from Nutch
# map serialized data onto the proper JSON structure
# serialize the data into [CBOR|http://cbor.io] format
# optionally, compress the serialized data using {{gzip}}

This tool has to be able to work with either single Nutch segments or a directory containing segments as input data.

Thanks [~lewismc] and [~chrismattmann] for your great suggestions, support and code.
[jira] [Commented] (NUTCH-1950) File name too long when bin/nutch dump
[ https://issues.apache.org/jira/browse/NUTCH-1950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346320#comment-14346320 ]

Hudson commented on NUTCH-1950:

SUCCESS: Integrated in Nutch-trunk #2999 (See [https://builds.apache.org/job/Nutch-trunk/2999/])
Fix for NUTCH-1950 File name too long contributed by xzjh jsx...@gmail.com and Chong Li. This closes #9. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1663847)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/tools/FileDumper.java

File name too long when bin/nutch dump

Key: NUTCH-1950
URL: https://issues.apache.org/jira/browse/NUTCH-1950
Project: Nutch
Issue Type: Bug
Components: segment
Affects Versions: 1.10
Reporter: Chong Li
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.10
Original Estimate: 48h
Remaining Estimate: 48h

When running bin/nutch dump in version 1.10-trunk, an exception is thrown saying "File name too long". When crawling, a URL may be longer than 255 bytes, and Nutch saves the file using the URL as the file name. The content can be saved in segments, but when the files are dumped to the local file system, the file name cannot be longer than 255 bytes. FileDumper.java needs to be changed to handle this exception.