[jira] [Created] (NUTCH-1557) File extraction and classification for any MIME types from segments
Chao Yan created NUTCH-1557:
---
Summary: File extraction and classification for any MIME types from segments
Key: NUTCH-1557
URL: https://issues.apache.org/jira/browse/NUTCH-1557
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.6
Environment: Hardware: Intel Core i5 2.5 GHz, 4 GB memory. Software: Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
Reporter: Chao Yan
Priority: Minor
The basic idea is to implement a file dumper as a plugin to extract files from Nutch SequenceFiles. The file dumper should detect the content type of each file and dump the files into different directories based on content type. Each extracted file will be renamed based on information from the URL, metadata, and even content. File names should be globally unique and carry the correct file extension. The file dumper should also allow users to specify the formats of the files they want, and it can be extended to apply any criteria to the extracted files. A more advanced goal is to implement it with MapReduce.
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
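The naming scheme in the description (one directory per content type, a unique file name with the right extension) could be sketched roughly as follows. This is only an illustration under assumptions: the class `DumpPathSketch` and its methods are hypothetical and are not taken from Nutch or from any attachment on the issue, and hashing the URL with SHA-256 is just one possible way to get names that are unique per URL.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper; not part of Nutch or of any patch on NUTCH-1557. */
final class DumpPathSketch {

  /** Returns outDir/<content type as a directory name>/<SHA-256 of url>.<extension>. */
  static Path targetPath(Path outDir, String url, String contentType, String extension) {
    // "application/pdf" becomes the directory name "application_pdf".
    String dirName = contentType.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9.+-]", "_");
    return outDir.resolve(dirName).resolve(sha256Hex(url) + "." + extension);
  }

  /** Hex-encoded SHA-256 digest of the UTF-8 bytes of s. */
  static String sha256Hex(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }
}
```

Deriving the name from the URL alone keeps it stable across runs; a real dumper might also mix in metadata or content, as the description suggests.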
[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1556:
Fix Version/s: 2.2
enabling updatedb to accept batchId
Key: NUTCH-1556
URL: https://issues.apache.org/jira/browse/NUTCH-1556
Project: Nutch
Issue Type: Improvement
Affects Versions: 2.2
Reporter: kaveh minooie
Fix For: 2.2
Attachments: NUTCH-1556.patch
The idea here is to be able to run updatedb and fetch for different batchIds simultaneously. I put together a patch. It seems to be working (it does skip the rows that do not match the batchId), but I am worried about whether and how it might affect the sorting in the reduce part. Anyway, check it out. It also changes the command-line usage to this:
Usage: DbUpdaterJob (batchId | -all) [-crawlId id]
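To make the proposed usage line and the row-skipping behaviour concrete, here is a minimal sketch of how the arguments might be parsed and how a row's batchId might be checked. It is not the code in NUTCH-1556.patch; the class `BatchIdArgsSketch` and its members are hypothetical, and treating `-all` as "accept every row" is an assumption based on the usage string.

```java
/** Hypothetical sketch; class and method names are not taken from NUTCH-1556.patch. */
final class BatchIdArgsSketch {
  static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

  final String batchId; // null means "-all": every row is accepted
  final String crawlId; // null when -crawlId was not given

  private BatchIdArgsSketch(String batchId, String crawlId) {
    this.batchId = batchId;
    this.crawlId = crawlId;
  }

  /** Parses "(batchId | -all) [-crawlId id]" and rejects anything else. */
  static BatchIdArgsSketch parse(String[] args) {
    if (args.length == 0) {
      throw new IllegalArgumentException(USAGE);
    }
    String batchId = "-all".equals(args[0]) ? null : args[0];
    String crawlId = null;
    for (int i = 1; i < args.length; i++) {
      if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
        crawlId = args[++i];
      } else {
        throw new IllegalArgumentException(USAGE);
      }
    }
    return new BatchIdArgsSketch(batchId, crawlId);
  }

  /** True when a row with the given batchId should be processed rather than skipped. */
  boolean accepts(String rowBatchId) {
    return batchId == null || batchId.equals(rowBatchId);
  }
}
```

In a MapReduce job a check like `accepts` would typically sit at the top of the map step, so that non-matching rows never reach the reducer; whether that affects the sort order in the reduce part is exactly the open question raised above.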
[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId
[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630657#comment-13630657 ]
Lewis John McGibbney commented on NUTCH-1556:
-
Nice one Kaveh. I will check this out soon.
enabling updatedb to accept batchId
Key: NUTCH-1556
URL: https://issues.apache.org/jira/browse/NUTCH-1556
Re: so why does solrindex-mapping.xml get ignored?
Hi Kaveh,
On Thu, Apr 11, 2013 at 11:53 PM, dev-digest-h...@nutch.apache.org wrote:
so why does solrindex-mapping.xml get ignored? 23089 by: kaveh minooie
why are we doing this?
I have no idea. What is wrong?
[jira] [Commented] (NUTCH-1557) File extraction and classification for any MIME types from segments
[ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630674#comment-13630674 ]
Lewis John McGibbney commented on NUTCH-1557:
-
Hi Chao, Do you have any patch proposal for this? What is your requirement behind this issue?
File extraction and classification for any MIME types from segments
---
Key: NUTCH-1557
URL: https://issues.apache.org/jira/browse/NUTCH-1557
Re: so why does solrindex-mapping.xml get ignored?
The code puts the value under the original key anyway. There is no 'mapping'; it just copies. We have other instructions for copying fields. I think the code should strictly follow the mapping file, and that whole if statement should not be there.
On 04/12/2013 02:54 PM, Lewis John Mcgibbney wrote:
Hi Kaveh,
On Thu, Apr 11, 2013 at 11:53 PM, dev-digest-h...@nutch.apache.org wrote:
so why does solrindex-mapping.xml get ignored? 23089 by: kaveh minooie
why are we doing this?
I have no idea. What is wrong?
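The behaviour asked for in this reply, where a mapped field appears only under its destination name instead of also being copied under the original key, might look like the sketch below. It is an illustration only: `StrictFieldMappingSketch` is a hypothetical class, not the Nutch Solr indexing code, and keeping unmapped fields under their own names is an assumption that the thread does not settle.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the Nutch code that reads solrindex-mapping.xml. */
final class StrictFieldMappingSketch {

  /**
   * Renames fields according to mapping (source name to destination name).
   * A mapped field is written only under its destination name.
   * Assumption: a field with no mapping entry keeps its original name.
   */
  static Map<String, Object> apply(Map<String, String> mapping, Map<String, Object> fields) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (Map.Entry<String, Object> e : fields.entrySet()) {
      String dest = mapping.get(e.getKey());
      out.put(dest != null ? dest : e.getKey(), e.getValue());
    }
    return out;
  }
}
```

With a mapping of `content` to `text`, a document holding `content` and `url` comes out with `text` and `url` only; copying a value to a second field would then be left to a separate, explicit copy instruction.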
[jira] [Commented] (NUTCH-1557) File extraction and classification for any MIME types from segments
[ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630852#comment-13630852 ]
Chao Yan commented on NUTCH-1557:
-
Hi Lewis, I am still trying to build a usable patch. The segment dumper will serve as a plugin for Nutch to dump files from SequenceFiles, but it is still not clear to me which extension point it should be mounted to. The dumper requires a mimes.type file which contains the mapping from mime types to file extensions and a third party library.
File extraction and classification for any MIME types from segments
---
Key: NUTCH-1557
URL: https://issues.apache.org/jira/browse/NUTCH-1557
Attachments: FileDumper.java, readme.txt
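The comment does not say what the mimes.type file looks like. Purely as an assumption, the sketch below takes it to follow the common Apache-style `mime.types` layout (one MIME type per line followed by its extensions, `#` for comments) and picks the first listed extension for each type. The class `MimeExtensionsSketch` is hypothetical and is not the attached FileDumper.java.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; the real format of the mimes.type file is not given in the issue. */
final class MimeExtensionsSketch {

  /**
   * Parses lines of the assumed form "type ext1 ext2 ..." and maps each
   * MIME type to its first extension. Blank lines, comment lines and
   * types without any extension are ignored.
   */
  static Map<String, String> parse(String text) {
    Map<String, String> extensionByType = new HashMap<>();
    for (String rawLine : text.split("\\r?\\n")) {
      String line = rawLine.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      String[] parts = line.split("\\s+");
      if (parts.length >= 2) {
        extensionByType.put(parts[0].toLowerCase(Locale.ROOT), parts[1]);
      }
    }
    return extensionByType;
  }
}
```

A lookup such as `parse(text).get("application/pdf")` would then supply the extension used when naming a dumped file.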
[jira] [Comment Edited] (NUTCH-1557) File extraction and classification for any MIME types from segments
[ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630852#comment-13630852 ]
Chao Yan edited comment on NUTCH-1557 at 4/13/13 1:22 AM:
--
Hi Lewis, I am still trying to build a usable patch. The segment dumper will serve as a plugin for Nutch to dump files from SequenceFiles, but it is still not clear to me which extension point it should be mounted to. The dumper requires a mimes.type file, which contains the mapping from MIME types to file extensions, and it also requires a third-party library.
was (Author: aceyan): Hi Lewis, I am still trying to build a usable patch. The segment dumper will serve as a plugin for Nutch to dump files from SequenceFiles, but I am still not clear that which extension-point it should be mount to. The dumper requires a mimes.type file which contains the mapping from mime types to file extensions and a third party library.
File extraction and classification for any MIME types from segments
---
Key: NUTCH-1557
URL: https://issues.apache.org/jira/browse/NUTCH-1557
Attachments: FileDumper.java, readme.txt
Build failed in Jenkins: Nutch-nutchgora #567
See https://builds.apache.org/job/Nutch-nutchgora/567/ -- [...truncated 2874 lines...] deploy: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-regex copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-regex init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/test [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-suffix init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlfilter-suffix [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/classes [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] Note: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. 
[javac] 1 warning jar: [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/urlfilter-suffix.jar deps-test: deploy: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-suffix copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-suffix init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/test [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-validator init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlfilter-validator [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/classes [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] 1 warning jar: [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/urlfilter-validator.jar deps-test: deploy: [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-validator copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-validator init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlnormalizer-basic [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlnormalizer-basic/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlnormalizer-basic/test [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlnormalizer-basic
Build failed in Jenkins: Nutch-trunk #2166
See https://builds.apache.org/job/Nutch-trunk/2166/ -- [...truncated 3759 lines...] deps-test: deploy: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/test [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlmeta [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 2 source files to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] 1 warning jar: [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/urlmeta.jar deps-test: deploy: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta copy-generated-lib: [copy] Copying 1 file to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/test [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-basic [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds [javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6 [javac] 1 warning jar: [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar deps-test: deploy: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic copy-generated-lib: [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-host/test/data 
[copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-host/test/data init: [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-host/classes [mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-host init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-host [javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: