[jira] [Created] (NUTCH-1557) File extraction and classification for any MIME types from segments

2013-04-12 Thread Chao Yan (JIRA)
Chao Yan created NUTCH-1557:
----------------------------

             Summary: File extraction and classification for any MIME types from segments
                 Key: NUTCH-1557
                 URL: https://issues.apache.org/jira/browse/NUTCH-1557
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.6
         Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software: Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
            Reporter: Chao Yan
            Priority: Minor


The basic idea is to implement a file dumper as a plugin that extracts files 
from Nutch SequenceFiles. The file dumper should detect the content type of 
each file and dump the files into different directories based on that type. 
Each extracted file will be renamed based on information from its URL, 
metadata, and even content. File names should be globally unique and carry the 
correct file extension. The file dumper should also allow users to specify the 
formats of the files they want, and it can be extended to apply any other 
criteria to the extracted files. A more advanced goal is to implement it with 
MapReduce.
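
A rough sketch of the naming scheme described above, for discussion only. It 
assumes a SHA-1 hash of the URL is an acceptable source of unique names and 
that unknown types fall back to "bin". The class DumpNaming, its methods, and 
the three-entry extension table are hypothetical and are not taken from any 
attached patch.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a (URL, MIME type) pair to an output path. */
final class DumpNaming {
    // Illustrative table only; a real dumper would load a fuller mapping.
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("text/html", "html");
        EXTENSIONS.put("application/pdf", "pdf");
        EXTENSIONS.put("image/jpeg", "jpg");
    }

    /** Strips parameters such as "; charset=UTF-8" and lower-cases the type. */
    static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the extension for a MIME type, or "bin" when it is unknown. */
    static String extensionFor(String mimeType) {
        String ext = EXTENSIONS.get(normalize(mimeType));
        return ext != null ? ext : "bin";
    }

    /** One directory per content type, e.g. "application_pdf". */
    static String directoryFor(String mimeType) {
        return normalize(mimeType).replaceAll("[^a-z0-9.+-]", "_");
    }

    /** Hex SHA-1 of the URL; distinct URLs get distinct names, barring hash collisions. */
    static String uniqueName(String url) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(url.getBytes(StandardCharsets.UTF_8));
            return String.format("%040x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 should be available on every JVM", e);
        }
    }

    /** Combines directory, unique name, and extension into a relative path. */
    static String outputPath(String url, String mimeType) {
        return directoryFor(mimeType) + "/" + uniqueName(url) + "." + extensionFor(mimeType);
    }
}
```

Hashing the URL keeps names unique without inspecting the content; names 
derived from metadata or content, as the description also suggests, would need 
additional logic.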

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1556) enabling updatedb to accept batchId

2013-04-12 Thread Lewis John McGibbney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1556:
----------------------------------------

    Fix Version/s: 2.2

 enabling updatedb to accept batchId
 -----------------------------------

              Key: NUTCH-1556
              URL: https://issues.apache.org/jira/browse/NUTCH-1556
          Project: Nutch
       Issue Type: Improvement
 Affects Versions: 2.2
         Reporter: kaveh minooie
          Fix For: 2.2

      Attachments: NUTCH-1556.patch


 The idea here is to be able to run updatedb and fetch for different 
 batchIds simultaneously. I put together a patch. It seems to be working (it 
 does skip the rows that do not match the batchId), but I am worried about 
 whether and how it might affect the sorting in the reduce part. Anyway, check 
 it out. It also changes the command-line usage to this:
 Usage: DbUpdaterJob (batchId | -all) [-crawlId id]
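
For illustration, the usage line above could be parsed along these lines. 
This is a minimal sketch and not the code in NUTCH-1556.patch: the class 
UpdateArgs and its accepts() method are hypothetical names, and treating 
"-all" as "no batch filter" is an assumption based on the description.

```java
/** Hypothetical parser for: DbUpdaterJob (batchId | -all) [-crawlId id] */
final class UpdateArgs {
    static final String USAGE = "Usage: DbUpdaterJob (batchId | -all) [-crawlId id]";

    final String batchId; // null stands for "-all": do not filter by batch
    final String crawlId; // null when -crawlId is not given

    private UpdateArgs(String batchId, String crawlId) {
        this.batchId = batchId;
        this.crawlId = crawlId;
    }

    static UpdateArgs parse(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException(USAGE);
        }
        // The first argument is mandatory: either a batch id or the literal -all.
        String batchId = "-all".equals(args[0]) ? null : args[0];
        String crawlId = null;
        for (int i = 1; i < args.length; i++) {
            if ("-crawlId".equals(args[i]) && i + 1 < args.length) {
                crawlId = args[++i];
            } else {
                throw new IllegalArgumentException(USAGE);
            }
        }
        return new UpdateArgs(batchId, crawlId);
    }

    /** True when a row marked with rowBatchId should be processed; other rows are skipped. */
    boolean accepts(String rowBatchId) {
        return batchId == null || batchId.equals(rowBatchId);
    }
}
```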



[jira] [Commented] (NUTCH-1556) enabling updatedb to accept batchId

2013-04-12 Thread Lewis John McGibbney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630657#comment-13630657 ]

Lewis John McGibbney commented on NUTCH-1556:
---------------------------------------------

Nice one Kaveh. I will check this out soon.




Re: so why does solrindex-mapping.xml get ignored?

2013-04-12 Thread Lewis John Mcgibbney
Hi Kaveh,


On Thu, Apr 11, 2013 at 11:53 PM, dev-digest-h...@nutch.apache.org wrote:


 so why does solrindex-mapping.xml get ignored?
 23089 by: kaveh minooie

 why are we doing this?


I have no idea.
What is wrong?


[jira] [Commented] (NUTCH-1557) File extraction and classification for any MIME types from segments

2013-04-12 Thread Lewis John McGibbney (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630674#comment-13630674 ]

Lewis John McGibbney commented on NUTCH-1557:
---------------------------------------------

Hi Chao,
Do you have any patch proposal for this?
What is your requirement behind this issue?




Re: so why does solrindex-mapping.xml get ignored?

2013-04-12 Thread kaveh minooie
The code puts the value under the original key anyway. There is no 
'mapping'; it just copies. We have another instruction for copying fields. 
I think the code should strictly follow the mapping file, and that 
whole if statement should not be there.
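
To make the difference concrete, here is a toy model of the two behaviors. 
It is a sketch only and not the Nutch indexing code: the class FieldMapper 
and both method names are hypothetical, and letting unmapped fields keep 
their own name in the strict variant is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of field mapping; hypothetical, not the Nutch indexer code. */
final class FieldMapper {
    /** Behavior complained about: the value is also kept under the original key. */
    static Map<String, String> copying(Map<String, String> doc, Map<String, String> mapping) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : doc.entrySet()) {
            String dest = mapping.get(field.getKey());
            if (dest != null) {
                out.put(dest, field.getValue());
            }
            out.put(field.getKey(), field.getValue()); // original key survives
        }
        return out;
    }

    /** Proposed behavior: a mapped field appears only under its destination name. */
    static Map<String, String> strict(Map<String, String> doc, Map<String, String> mapping) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : doc.entrySet()) {
            String dest = mapping.get(field.getKey());
            // Assumption: fields without a mapping entry keep their own name.
            out.put(dest != null ? dest : field.getKey(), field.getValue());
        }
        return out;
    }
}
```

With a single mapping from "content" to "text", copying() yields both fields 
while strict() yields only "text".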



On 04/12/2013 02:54 PM, Lewis John Mcgibbney wrote:

Hi Kaveh,


On Thu, Apr 11, 2013 at 11:53 PM, dev-digest-h...@nutch.apache.org wrote:



so why does solrindex-mapping.xml get ignored?
23089 by: kaveh minooie

why are we doing this?


I have no idea.
What is wrong?




[jira] [Commented] (NUTCH-1557) File extraction and classification for any MIME types from segments

2013-04-12 Thread Chao Yan (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630852#comment-13630852 ]

Chao Yan commented on NUTCH-1557:
---------------------------------

Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a 
plugin for Nutch to dump files from SequenceFiles, but it is still not clear 
to me which extension point it should be mounted to.
The dumper requires a mimes.type file, which contains the mapping from MIME 
types to file extensions, and a third-party library.
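
As an illustration of such a mapping file, the sketch below assumes the 
common Apache-style layout, where each line holds a MIME type followed by one 
or more extensions and '#' starts a comment. The actual layout of the 
mimes.type file is not stated here, so this format, the class MimeTable, and 
the choice to keep the first extension listed are all assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical reader for lines of the form "type ext1 ext2 ...". */
final class MimeTable {
    /** Maps each MIME type to the first extension listed for it. */
    static Map<String, String> parse(String text) {
        Map<String, String> table = new HashMap<>();
        for (String line : text.split("\\r?\\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop trailing comment
            }
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // blank line, or a type with no extension
            }
            table.put(fields[0].toLowerCase(Locale.ROOT), fields[1]);
        }
        return table;
    }
}
```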

 Attachments: FileDumper.java, readme.txt



[jira] [Comment Edited] (NUTCH-1557) File extraction and classification for any MIME types from segments

2013-04-12 Thread Chao Yan (JIRA)

[ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630852#comment-13630852 ]

Chao Yan edited comment on NUTCH-1557 at 4/13/13 1:22 AM:
---------------------------------------------------------

Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a 
plugin for Nutch to dump files from SequenceFiles, but it is still not clear 
to me which extension point it should be mounted to.
The dumper requires a mimes.type file, which contains the mapping from MIME 
types to file extensions, and it also requires a third-party library.

  was (Author: aceyan):
Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a 
plugin for Nutch to dump files from SequenceFiles, but it is still not clear 
to me which extension point it should be mounted to.
The dumper requires a mimes.type file, which contains the mapping from MIME 
types to file extensions, and a third-party library.
  



Build failed in Jenkins: Nutch-nutchgora #567

2013-04-12 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/567/

--
[...truncated 2874 lines...]
deploy:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-regex

copy-generated-lib:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-regex

init:
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/classes
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/test
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-suffix

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-suffix
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] Note: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/urlfilter-suffix/src/java/org/apache/nutch/urlfilter/suffix/SuffixURLFilter.java uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 1 warning

jar:
  [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-suffix/urlfilter-suffix.jar

deps-test:

deploy:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-suffix

copy-generated-lib:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-suffix

init:
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/classes
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/test
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-validator

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlfilter-validator/urlfilter-validator.jar

deps-test:

deploy:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlnormalizer-basic
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlnormalizer-basic/classes
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/urlnormalizer-basic/test
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-nutchgora/2.x/build/plugins/urlnormalizer-basic


Build failed in Jenkins: Nutch-trunk #2166

2013-04-12 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2166/

--
[...truncated 3759 lines...]
deps-test:

deploy:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator

copy-generated-lib:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlfilter-validator

init:
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/test
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlmeta
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 2 source files to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlmeta/urlmeta.jar

deps-test:

deploy:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta

copy-generated-lib:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlmeta

init:
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/test
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 1 source file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/classes
[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-basic/urlnormalizer-basic.jar

deps-test:

deploy:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic

copy-generated-lib:
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-basic
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-host/test/data
 [copy] Copying 1 file to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-host/test/data

init:
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/urlnormalizer-host/classes
[mkdir] Created dir: /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/build/plugins/urlnormalizer-host

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/ivy/ivysettings.xml

compile:
 [echo] Compiling plugin: urlnormalizer-host
[javac] /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/Nutch-trunk/trunk/src/plugin/build-plugin.xml:117: warning: