Build failed in Jenkins: Nutch-nutchgora #1019

2014-05-28 Thread Apache Jenkins Server
See 

--
[...truncated 2968 lines...]
 [echo] Compiling plugin: urlfilter-suffix
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlfilter-validator
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-basic
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 1 source file to 

[javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
[javac] 1 warning

jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 

[mkdir] Created dir: 


[jira] [Commented] (NUTCH-1785) Ability to index raw content

2014-05-28 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011649#comment-14011649
 ] 

Sebastian Nagel commented on NUTCH-1785:


+1 (works, looks reasonable). Open points:
* what about truncated content? (not a blocker)
* conversion via {{new String(content.getContent())}} is needless if base64 is 
true
* this conversion depends on the system's locale (default charset), cf. NUTCH-1693
* but which charset shall we use to convert the byte[] into a String if it 
contains byte values >127?
*# the charset used for parsing is not available to the indexer (it's in the parse 
metadata)
*# maybe ASCII is a good choice, cf. comments in sniffCharacterEncoding 
(parse-html)
*# in any case (for non-ASCII content): the indexing back-ends must consider that 
the String in field binaryContent may need recoding
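
To illustrate the charset points above, here is a minimal Java sketch. It is not taken from the patch: the class name CharsetDecodingSketch and the helper decode are made up for illustration. It decodes the same bytes with three explicit charsets to show how much the result of a byte[] to String conversion depends on the charset chosen, which is also why relying on the platform default is fragile.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Nutch or of the NUTCH-1785 patch.
class CharsetDecodingSketch {

    /** Decodes raw bytes with an explicit charset instead of the platform default. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e; in UTF-8 the last character
        // becomes two bytes (0xC3 0xA9), both with values above 127.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // new String(raw) without a charset would use this system-dependent value.
        System.out.println("platform default: " + Charset.defaultCharset());

        // Correct charset: the original text comes back.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
        // Wrong single-byte charset: garbled text, but every byte is preserved.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        // ASCII: each byte above 127 is replaced by U+FFFD, so data is lost.
        System.out.println(decode(raw, StandardCharsets.US_ASCII));

        // ISO-8859-1 maps each byte to exactly one char, so a back-end could
        // recover the original bytes and recode them later.
        byte[] back = decode(raw, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("round trip via ISO-8859-1 intact: " + Arrays.equals(raw, back));
    }
}
```

The sketch only shows the trade-off: decoding as ASCII is predictable but lossy for non-ASCII bytes, while a single-byte charset such as ISO-8859-1 keeps the bytes recoverable at the cost of garbled text until it is recoded.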


> Ability to index raw content
>
> Key: NUTCH-1785
> URL: https://issues.apache.org/jira/browse/NUTCH-1785
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
> NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch
>
> Some use cases require Nutch to actually write the raw content to a configured 
> indexing back-end. Since Content is never read, a plugin is out of the 
> question and therefore we need to force IndexJob to process Content as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (NUTCH-1785) Ability to index raw content

2014-05-28 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1785:
-

Attachment: NUTCH-1785-trunk.patch

Updated patch to reflect the changed schemas.



[jira] [Commented] (NUTCH-1785) Ability to index raw content

2014-05-28 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010918#comment-14010918
 ] 

Markus Jelsma commented on NUTCH-1785:
--

Julien, stored="true" indexed="false" may be more suitable for Nutch in 
general. In our use case we actually index it and pass it through some 
analysis. Let's change it to stored="true" indexed="false".

lufeng, the -base64 option is passed on the command line. If a segment contains 
both types of content, it must be set because we cannot index such bytes.
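
As a rough illustration of why the flag matters for mixed segments, the following Java sketch shows the two conversions side by side. It is not the code from the patch: the class RawContentFieldSketch and the helper toFieldValue are hypothetical, UTF-8 is an assumed charset for the non-base64 path, and java.util.Base64 requires Java 8, so the actual patch may well use a different encoder.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class, not part of Nutch or of the NUTCH-1785 patch.
class RawContentFieldSketch {

    /**
     * Hypothetical helper: turns fetched bytes into the String written to the
     * index field, either Base64-encoded or decoded as text.
     */
    static String toFieldValue(byte[] content, boolean base64) {
        if (base64) {
            // Safe for any byte sequence; the back-end can decode it again.
            return Base64.getEncoder().encodeToString(content);
        }
        // Assumption for this sketch: textual content is UTF-8.
        return new String(content, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] html = "<html>ok</html>".getBytes(StandardCharsets.UTF_8);
        // "%PDF" followed by arbitrary binary bytes, standing in for a non-text document.
        byte[] binary = {0x25, 0x50, 0x44, 0x46, (byte) 0xFF, (byte) 0xD8, 0x00};

        // Text survives the plain String conversion.
        System.out.println(toFieldValue(html, false));

        // Binary content survives only when Base64 is used.
        byte[] viaBase64 = Base64.getDecoder().decode(toFieldValue(binary, true));
        byte[] viaString = toFieldValue(binary, false).getBytes(StandardCharsets.UTF_8);
        System.out.println("base64 round trip intact: " + Arrays.equals(binary, viaBase64));
        System.out.println("string round trip intact: " + Arrays.equals(binary, viaString));
    }
}
```

With a segment that mixes HTML and binary documents, only the Base64 path keeps every document recoverable, at the cost of the field no longer being directly readable or analyzable as text.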



[jira] [Commented] (NUTCH-1785) Ability to index raw content

2014-05-28 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010913#comment-14010913
 ] 

Julien Nioche commented on NUTCH-1785:
--

Not sure it makes sense to index the binary content at all. I expect this would 
be used e.g. for providing a cache functionality, in which case it wouldn't be 
searched on but should definitely be stored.




[jira] [Commented] (NUTCH-1785) Ability to index raw content

2014-05-28 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010889#comment-14010889
 ] 

lufeng commented on NUTCH-1785:
---

+1, tested OK with Elasticsearch 1.2.0.

One question: why convert the content byte[] to a String? If one segment 
contains both HTML and PDF or MP3 content, how should the base64 parameter be set?
