[jira] [Updated] (NUTCH-2130) copyField rawcontent creates error within schema.xml

2015-11-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2130:

Fix Version/s: (was: 2.4)
   2.3.1

> copyField rawcontent creates error within schema.xml
> 
>
> Key: NUTCH-2130
> URL: https://issues.apache.org/jira/browse/NUTCH-2130
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.3.1
>
> Attachments: NUTCH-2130.patch
>
>
> The presence of the rawcontent copyField within the Nutch Solr schema.xml is 
> creating problems for users when attempting to index NutchDocuments into Solr.
> The rawcontent field is produced by the 
> [index-html|https://github.com/apache/nutch/tree/2.x/src/plugin/index-html] 
> plugin; however, when committing this feature we forgot to add the field 
> definition to schema.xml before applying the copyField instruction.
> There are two ways to resolve this:
>  * remove rawcontent from copyField, or
>  * add rawcontent as a field prior to its copyField definition.
> I propose to do the latter and will submit a patch ASAP unless someone else 
> is able to do so.
>  
> This was explained on [this 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13885.html]
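
The mismatch described above (a copyField that refers to a field with no definition) can be detected before indexing. The following is a minimal sketch, not the check Solr itself performs: the schema fragment, the function name `undefined_copyfield_names` and the `stored`/`indexed` attribute values are assumptions made for illustration, and dynamic fields and wildcard sources are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened schema fragment; a real Nutch schema.xml is much longer.
SCHEMA = """<schema name="nutch" version="1.5">
  <fields>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="text" type="text" stored="false" indexed="true" multiValued="true"/>
  </fields>
  <copyField source="content" dest="text"/>
  <copyField source="rawcontent" dest="text"/>
</schema>"""


def undefined_copyfield_names(schema_xml):
    """Return copyField source/dest names that have no <field> definition.

    Simplification: <dynamicField> patterns and wildcard sources are not handled.
    """
    root = ET.fromstring(schema_xml)
    defined = {field.get("name") for field in root.iter("field")}
    missing = []
    for copy_field in root.iter("copyField"):
        for attr in ("source", "dest"):
            name = copy_field.get(attr)
            if name not in defined and name not in missing:
                missing.append(name)
    return missing


# Second resolution from the issue text: define the field before it is copied.
# The attribute values below are assumptions, not taken from the attached patch.
FIXED_SCHEMA = SCHEMA.replace(
    "</fields>",
    '  <field name="rawcontent" type="string" stored="true" indexed="false"/>\n  </fields>',
)

print(undefined_copyfield_names(SCHEMA))
print(undefined_copyfield_names(FIXED_SCHEMA))
```

Either resolution listed in the issue makes the check pass: dropping the copyField line or adding the field definition.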



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2130) copyField rawcontent creates error within schema.xml

2015-11-12 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003551#comment-15003551
 ] 

Lewis John McGibbney commented on NUTCH-2130:
-

+1 Seb please commit Sir



[jira] [Updated] (NUTCH-2130) copyField rawcontent creates error within schema.xml

2015-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2130:
---
Attachment: NUTCH-2130.patch

Patch attached:
- field specification of "rawcontent" is added to schema.xml
- should be "string" because the index-html plugin converts the binary 
content to a String
- removed the copyField statement for "rawcontent": it's a rare use case to 
tokenize HTML and make HTML elements, attributes, JavaScript elements searchable

> copyField rawcontent creates error within schema.xml
> 
>
> Key: NUTCH-2130
> URL: https://issues.apache.org/jira/browse/NUTCH-2130
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.3.1
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 2.4
>
> Attachments: NUTCH-2130.patch
>
>
> The presence of the rawcontent copyField within the Nutch Solr schema.xml is 
> creating problems for users when attempting to index NutchDocuments into Solr.
> The rawcontent field is produced by the 
> [index-html|https://github.com/apache/nutch/tree/2.x/src/plugin/index-html] 
> plugin; however, when committing this feature we forgot to add the field 
> definition to schema.xml before applying the copyField instruction.
> There are two ways to resolve this:
>  * remove rawcontent from copyField, or
>  * add rawcontent as a field prior to its copyField definition.
> I propose to do the latter and will submit a patch ASAP unless someone else 
> is able to do so.
>  
> This was explained on [this 
> thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13885.html]





[jira] [Updated] (NUTCH-2169) Integrate index-html into Nutch build

2015-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2169:
---
Attachment: NUTCH-2169.patch

Patch to integrate index-html into ant build and javadoc. Also cleans up code 
and documentation.

> Integrate index-html into Nutch build
> -
>
> Key: NUTCH-2169
> URL: https://issues.apache.org/jira/browse/NUTCH-2169
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2169.patch
>
>
> The plugin index-html (added by NUTCH-1944) is loosely integrated:
> - code is in Nutch version control
> - no build (compile, javadoc generation)
> - src/plugin/index-html/src/java/org/apache/nutch/indexer/html/package.html 
> contains a description of how to do the integration
> Well, the plugin should be available just by adding it to plugin.includes 
> without any extra effort.





[jira] [Created] (NUTCH-2169) Integrate index-html into Nutch build

2015-11-12 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2169:
--

 Summary: Integrate index-html into Nutch build
 Key: NUTCH-2169
 URL: https://issues.apache.org/jira/browse/NUTCH-2169
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 2.3.1
Reporter: Sebastian Nagel
 Fix For: 2.3.1


The plugin index-html (added by NUTCH-1944) is loosely integrated:
- code is in Nutch version control
- no build (compile, javadoc generation)
- src/plugin/index-html/src/java/org/apache/nutch/indexer/html/package.html 
contains a description of how to do the integration

Well, the plugin should be available just by adding it to plugin.includes 
without any extra effort.
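
To illustrate what "adding it to plugin.includes" amounts to: the property value is a regular expression over plugin ids. The sketch below assumes the expression must match the whole id; the property value, the plugin list and the function name `included` are hypothetical examples, not a recommended configuration.

```python
import re

# Hypothetical plugin.includes value; a real nutch-site.xml will list different plugins.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika)"
    r"|index-(basic|anchor|html)|indexer-solr"
)


def included(plugin_ids, includes=PLUGIN_INCLUDES):
    """Return the plugin ids selected by the plugin.includes expression.

    Assumption: an id is selected only when the expression matches it entirely.
    """
    pattern = re.compile(includes)
    return [plugin_id for plugin_id in plugin_ids if pattern.fullmatch(plugin_id)]


# Hypothetical set of plugin directories present in the build.
available = ["protocol-http", "parse-html", "parse-tika",
             "index-basic", "index-html", "index-more"]
selected = included(available)
print(selected)
```

With the plugin built as part of the normal ant run, extending the expression with `html` in the `index-(...)` group would be the only step a user needs.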





[jira] [Updated] (NUTCH-2168) Parse-tika fails to retrieve parser

2015-11-12 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2168:
---
Attachment: NUTCH-2168.patch

Attached patch: use constructor of TikaConfig which passes the plugin's own 
class loader. Parser implementations are then successfully retrieved.

> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
> http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser 
> for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
> content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.





[jira] [Created] (NUTCH-2168) Parse-tika fails to retrieve parser

2015-11-12 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2168:
--

 Summary: Parse-tika fails to retrieve parser
 Key: NUTCH-2168
 URL: https://issues.apache.org/jira/browse/NUTCH-2168
 Project: Nutch
  Issue Type: Bug
  Components: parser
Affects Versions: 2.3.1
Reporter: Sebastian Nagel
 Fix For: 2.3.1


The plugin parse-tika fails to parse most (all?) kinds of document types (PDF, 
xlsx, ...) when run via ParserChecker or ParserJob:
{noformat}
2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
http://localhost/pdftest.pdf
2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser for 
mime-type application/pdf
2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
content http://localhost/pdftest.pdf of type application/pdf
{noformat}

The same document is successfully parsed by TestPdfParser.





[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002820#comment-15002820
 ] 

Hudson commented on NUTCH-2165:
---

FAILURE: Integrated in Nutch-trunk #3308 (See 
[https://builds.apache.org/job/Nutch-trunk/3308/])
NUTCH-2165 - Fix FileDumper hard coded part-# folder (joyce: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1714104])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/tools/FileDumper.java


> FileDumper Util hard codes part-# folder name
> -
>
> Key: NUTCH-2165
> URL: https://issues.apache.org/jira/browse/NUTCH-2165
> Project: Nutch
>  Issue Type: Bug
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
> Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folder name is hard coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167],
>  which could prove problematic.
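
The general remedy for a hard-coded part folder is to enumerate whatever part directories exist under the segment's content directory. The sketch below shows that idea only; it is not the committed FileDumper change, and the directory names and helper functions are made up for the demonstration.

```python
import tempfile
from pathlib import Path


def content_part_dirs(segment_dir):
    """Return every part-* directory under <segment>/content, sorted by name."""
    content = Path(segment_dir) / "content"
    return sorted(p for p in content.glob("part-*") if p.is_dir())


def build_demo_segment(root):
    """Create a throwaway segment with two part directories (hypothetical names)."""
    segment = Path(root) / "segments" / "20151102194433"
    for part in ("part-00000", "part-00001"):
        (segment / "content" / part).mkdir(parents=True)
        (segment / "content" / part / "data").touch()
    return segment


with tempfile.TemporaryDirectory() as tmp:
    segment = build_demo_segment(tmp)
    part_names = [p.name for p in content_part_dirs(segment)]
    print(part_names)
```

A dumper that reads only the first entry would silently skip the records stored in the second part directory.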





Build failed in Jenkins: Nutch-trunk #3308

2015-11-12 Thread Apache Jenkins Server
See 

Changes:

[joyce] NUTCH-2165 - Fix FileDumper hard coded part-# folder

--
[...truncated 14561 lines...]
test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
3.657 sec
[junit] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.351 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] Running org.apache.nutch.tika.TestRTFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
0.06 sec
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.586 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor
[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.258 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-querystring
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.027 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex

deps-test-compile:

compile-test:

jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-regex
[junit] Running 
org.apache.nutch.net.urlnormalizer.querystring.TestQuerystringURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.3 
sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-slash

deps-test-compile:

compile-test:

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-slash
[junit] Running 
org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.448 sec
[junit] Running 
org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.546 sec

BUILD SUCCESSFUL
Total time: 10 minutes 13 seconds
[xUnit] [INFO] - Starting to record.
[xUnit] [INFO] - Processing JUnit
[xUnit] [INFO] - [JUnit] - 34 test report file(s) were found with the pattern 
'trunk/build/test/TEST-*.xml' relative to 
' for the testing framework 
'JUnit'.
[xUnit] [ERROR] - Test reports were found but not all of them are new. Did all 
the tests run?
  * 

 is 3 min 14 sec old
  * 

 is 3 min 9 sec old
  * 

 is 3 min 5 sec old
  * 

 is 3 min 1 sec old
  * 

 is 2 min 20 sec old
  * 


[jira] [Resolved] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-12 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2165.
--
Resolution: Fixed

Committed in r1714104



[jira] [Commented] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-12 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002604#comment-15002604
 ] 

Michael Joyce commented on NUTCH-2165:
--

Thanks [~lewismc], I'll merge shortly



[jira] [Comment Edited] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2015-11-12 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002598#comment-15002598
 ] 

Lewis John McGibbney edited comment on NUTCH-2165 at 11/12/15 6:39 PM:
---

+1 [~mjoyce] verified on small sample crawl
{code}lmcgibbn@LMC-032857 /usr/local/trunk_new1/runtime/local(joshua) $ 
./bin/nutch dump -flatdir -mimeStats -outputDir 
/usr/local/trunk_new1/esdswg_crawl/dump -segment 
/usr/local/trunk_new1/esdswg_crawl/segments
Dumper File Stats:
TOTAL Stats:
[
{"mimeType":"text/html","count":"2809"}
{"mimeType":"application/octet-stream","count":"267"}
]
Total count: 3076

FILTERED Stats:
[
{"mimeType":"text/html","count":"2809"}
{"mimeType":"application/octet-stream","count":"267"}
]
Total filtered count: 3076{code}

The directory layout follows... please note the multiple segment content data files.

{code}
lmcgibbn@LMC-032857 /usr/local/trunk_new1/esdswg_crawl(joshua) $ tree
.
├── crawldb
│   ├── current
│   │   └── part-0
│   │       ├── data
│   │       └── index
│   └── old
│       ├── part-0
│       │   ├── data
│       │   └── index
│       └── part-1
│           ├── data
│           └── index
├── dump
├── linkdb
│   └── current
│       └── part-0
│           ├── data
│           └── index
├── pstats
│   └── part-r-0
├── segments
│   ├── 20151102194433
│   │   ├── content
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_fetch
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_generate
│   │   │   └── part-0
│   │   ├── crawl_parse
│   │   │   ├── part-0
│   │   │   └── part-1
│   │   ├── parse_data
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   └── parse_text
│   │       ├── part-0
│   │       │   ├── data
│   │       │   └── index
│   │       └── part-1
│   │           ├── data
│   │           └── index
│   ├── 20151102194500
│   │   ├── content
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_fetch
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_generate
│   │   │   └── part-0
│   │   ├── crawl_parse
│   │   │   ├── part-0
│   │   │   └── part-1
│   │   ├── parse_data
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   └── parse_text
│   │       ├── part-0
│   │       │   ├── data
│   │       │   └── index
│   │       └── part-1
│   │           ├── data
│   │           └── index
│   ├── 20151102194552
│   │   ├── content
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_fetch
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_generate
│   │   │   └── part-0
│   │   ├── crawl_parse
│   │   │   ├── part-0
│   │   │   └── part-1
│   │   ├── parse_data
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   └── parse_text
│   │       ├── part-0
│   │       │   ├── data
│   │       │   └── index
│   │       └── part-1
│   │           ├── data
│   │           └── index
│   ├── 20151102194903
│   │   ├── content
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_fetch
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   ├── crawl_generate
│   │   │   └── part-0
│   │   ├── crawl_parse
│   │   │   ├── part-0
│   │   │   └── part-1
│   │   ├── parse_data
│   │   │   ├── part-0
│   │   │   │   ├── data
│   │   │   │   └── index
│   │   │   └── part-1
│   │   │       ├── data
│   │   │       └── index
│   │   └── parse_text
│   │       ├── part-0
│   │       │   ├── data
│   │       │   └── index
│   │       └── part-1
│   │           ├── data
│   │           └── index
│   ├── 20151102195503
│   │   ├── content
│   │   │   ├── part-0
│   │   │   │   ├─

Re: [jira] [Commented] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-12 Thread Mattmann, Chris A (3980)
We’ll run into file name length issues - Giuseppe had the same problem,
and so did students from USC who used it, hence the solution we have
now. I think having nested directory structures is probably the best
bet, and making it configurable.

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: "Michael Joyce (JIRA)" 
Reply-To: "dev@nutch.apache.org" 
Date: Thursday, November 12, 2015 at 11:17 AM
To: "dev@nutch.apache.org" 
Subject: [jira] [Commented] (NUTCH-2166) Add reverse URL format to dump
tool

>
>[ 
>https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002328#comment-15002328
> ] 
>
>Michael Joyce commented on NUTCH-2166:
>--
>
>Small change in dump format. Instead of making a bajillion nested folders
>it seems like it might be nicer to simply use the reverse URL as the file
>name.
>
>So the file for 
>http://bar.foo.com:8983/to/index.htm
>Would dump to the encoded
>/com%2Ffoo%2Fbar%2F8983%2Fhttp%2Fto%2Findex.htm
>
>Of course, we may then run into file name length issues this way. Perhaps
>having both eventually will be useful?
>
>> Add reverse URL format to dump tool
>> ---
>>
>> Key: NUTCH-2166
>> URL: https://issues.apache.org/jira/browse/NUTCH-2166
>> Project: Nutch
>>  Issue Type: Improvement
>>  Components: tool
>>Affects Versions: 2.3, 1.10
>>Reporter: Michael Joyce
>>Assignee: Michael Joyce
>> Fix For: 2.4, 1.11
>>
>>
>> Update the FileDumper tool with an option for dumping files to the
>>output directory in reverse URL format.
>> So the file for 
>> http://bar.foo.com:8983/to/index.html?a=b
>> Would dump to
>> /com/foo/bar/8983/http/to/index.html?a=b
>
>
>



[jira] [Commented] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002399#comment-15002399
 ] 

Hudson commented on NUTCH-2167:
---

SUCCESS: Integrated in Nutch-trunk #3307 (See 
[https://builds.apache.org/job/Nutch-trunk/3307/])
NUTCH-2167 CHANGES update (joyce: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1714081])
* trunk/CHANGES.txt
NUTCH-2167 Backport TableUtil tests from 2.x to trunk (joyce: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1714079])
* trunk/src/test/org/apache/nutch/util/TestTableUtil.java
NUTCH-2167 - Backport TableUtil from 2.x to trunk (joyce: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1714078])
* trunk/src/java/org/apache/nutch/util/TableUtil.java


> Backport TableUtil from 2.x for URL reversing
> -
>
> Key: NUTCH-2167
> URL: https://issues.apache.org/jira/browse/NUTCH-2167
> Project: Nutch
>  Issue Type: Sub-task
>  Components: tool
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 1.11
>
>
> The 
> [TableUtil|https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/util/TableUtil.java]
>  file provides a number of helpful utility functions for URL reversing that 
> would be useful to have in 1.x
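
As a rough illustration of what URL reversing means here, the sketch below builds a key from the reversed host name, then the port, then the scheme, followed by the untouched path and query, and converts it back. The key layout (colon-separated `host:port:scheme`) is an assumption and should be checked against the 2.x TableUtil source; the function names are hypothetical, and IPv6 hosts and user-info parts are not handled.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a reversed-host key, e.g. 'com.foo.bar:8983:http/to/index.html?a=b'."""
    parts = urlsplit(url)
    key = ".".join(reversed(parts.hostname.split(".")))
    if parts.port is not None:
        key += ":" + str(parts.port)
    key += ":" + parts.scheme
    key += parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def unreverse_url(key):
    """Invert reverse_url for keys produced by the function above."""
    slash = key.find("/")
    head, rest = (key, "") if slash == -1 else (key[:slash], key[slash:])
    fields = head.split(":")  # host[:port]:scheme
    host = ".".join(reversed(fields[0].split(".")))
    scheme = fields[-1]
    port = fields[1] if len(fields) == 3 else None
    netloc = host + (":" + port if port else "")
    return scheme + "://" + netloc + rest


example = "http://bar.foo.com:8983/to/index.html?a=b"
reversed_key = reverse_url(example)
print(reversed_key)
print(unreverse_url(reversed_key))
```

Keys of this shape sort pages of the same domain next to each other, which is the usual motivation for reversing the host.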





[jira] [Commented] (NUTCH-2166) Add reverse URL format to dump tool

2015-11-12 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002328#comment-15002328
 ] 

Michael Joyce commented on NUTCH-2166:
--

Small change in dump format. Instead of making a bajillion nested folders it 
seems like it might be nicer to simply use the reverse URL as the file name.

So the file for 
http://bar.foo.com:8983/to/index.htm
Would dump to the encoded
/com%2Ffoo%2Fbar%2F8983%2Fhttp%2Fto%2Findex.htm

Of course, we may then run into file name length issues this way. Perhaps 
having both eventually will be useful?

> Add reverse URL format to dump tool
> ---
>
> Key: NUTCH-2166
> URL: https://issues.apache.org/jira/browse/NUTCH-2166
> Project: Nutch
>  Issue Type: Improvement
>  Components: tool
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Michael Joyce
> Fix For: 2.4, 1.11
>
>
> Update the FileDumper tool with an option for dumping files to the output 
> directory in reverse URL format.
> So the file for 
> http://bar.foo.com:8983/to/index.html?a=b
> Would dump to
> /com/foo/bar/8983/http/to/index.html?a=b
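
Both layouts discussed in this issue (nested folders and a single percent-encoded file name) can be derived from the same reversed components. The following is a sketch based only on the two examples given above, not the FileDumper implementation; the function names are hypothetical.

```python
from urllib.parse import quote, urlsplit


def reverse_url_path(url):
    """Nested layout: reversed host labels, port, scheme, then the original path."""
    parts = urlsplit(url)
    segments = list(reversed(parts.hostname.split(".")))
    if parts.port is not None:
        segments.append(str(parts.port))
    segments.append(parts.scheme)
    nested = "/" + "/".join(segments)
    path = parts.path.lstrip("/")
    if path:
        nested += "/" + path
    if parts.query:
        nested += "?" + parts.query
    return nested


def reverse_url_file_name(url):
    """Flat layout: the nested path with every inner '/' percent-encoded."""
    return "/" + quote(reverse_url_path(url).lstrip("/"), safe="")


nested_example = reverse_url_path("http://bar.foo.com:8983/to/index.html?a=b")
flat_example = reverse_url_file_name("http://bar.foo.com:8983/to/index.htm")
print(nested_example)
print(flat_example)
```

The flat form grows with the full URL length, which is where the file name length concern raised in the comment comes from; the nested form spreads the same characters across directory levels.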





[jira] [Resolved] (NUTCH-2167) Backport TableUtil from 2.x for URL reversing

2015-11-12 Thread Michael Joyce (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce resolved NUTCH-2167.
--
Resolution: Fixed

TableUtil copied over in r1714078 and tests copied over in r1714079



[jira] [Commented] (NUTCH-2160) Upgrade Selenium Java to 2.48.2

2015-11-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002285#comment-15002285
 ] 

Hudson commented on NUTCH-2160:
---

SUCCESS: Integrated in Nutch-trunk #3306 (See 
[https://builds.apache.org/job/Nutch-trunk/3306/])
NUTCH-2160 Upgrade Selenium Java to 2.48.2 (lewismc: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1714071])
* trunk/CHANGES.txt
* trunk/src/plugin/lib-selenium/howto_upgrade_selenium.txt
* trunk/src/plugin/lib-selenium/ivy.xml
* trunk/src/plugin/lib-selenium/plugin.xml


> Upgrade Selenium Java to 2.48.2
> ---
>
> Key: NUTCH-2160
> URL: https://issues.apache.org/jira/browse/NUTCH-2160
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-2160.patch
>
>
> Current Selenium support is pegged at a very old version of Firefox. The 
> attached patch, running with the most recent version of Selenium Java, works 
> with Firefox 38.4.0 very well. The remainder of the lib-selenium dependencies 
> have also been updated.
> Thanks
> [~kwhitehall] can you please scope if you get a wee minute?





[jira] [Commented] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002286#comment-15002286
 ] 

Hudson commented on NUTCH-2120:
---

SUCCESS: Integrated in Nutch-trunk #3306 (See 
[https://builds.apache.org/job/Nutch-trunk/3306/])
NUTCH-2120 Remove MapWritable from trunk codebase (lewismc: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1714068])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/crawl/MapWritable.java


> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.11
>
> Attachments: NUTCH-2120.patch
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).





[jira] [Resolved] (NUTCH-2160) Upgrade Selenium Java to 2.48.2

2015-11-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2160.
-
Resolution: Fixed

Committed revision 1714071

> Upgrade Selenium Java to 2.48.2
> ---
>
> Key: NUTCH-2160
> URL: https://issues.apache.org/jira/browse/NUTCH-2160
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-2160.patch
>
>
> Current Selenium support is pegged at a very old version of Firefox. The 
> attached patch, running with the most recent version of Selenium Java, works 
> with Firefox 38.4.0 very well. The remainder of the lib-selenium dependencies 
> have also been updated.
> Thanks
> [~kwhitehall] can you please scope if you get a wee minute?





[jira] [Updated] (NUTCH-2160) Upgrade Selenium Java to 2.48.2

2015-11-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2160:

Issue Type: Improvement  (was: Bug)

> Upgrade Selenium Java to 2.48.2
> ---
>
> Key: NUTCH-2160
> URL: https://issues.apache.org/jira/browse/NUTCH-2160
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.11
>
> Attachments: NUTCH-2160.patch
>
>
> Current Selenium support is pegged at a very old version of Firefox. The 
> attached patch, running with the most recent version of Selenium Java, works 
> with Firefox 38.4.0 very well. The remainder of the lib-selenium dependencies 
> have also been updated.
> Thanks
> [~kwhitehall] can you please scope if you get a wee minute?





[jira] [Closed] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney closed NUTCH-2120.
---

Committed revision 1714068

> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.11
>
> Attachments: NUTCH-2120.patch
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).





[jira] [Resolved] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-12 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2120.
-
   Resolution: Fixed
Fix Version/s: (was: 1.12)
   1.11

> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.11
>
> Attachments: NUTCH-2120.patch
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).





[jira] [Commented] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-11-12 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002072#comment-15002072
 ] 

Markus Jelsma commented on NUTCH-2120:
--

I'm fine with removing it; we're using Hadoop's MapWritable anyway.


> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Task
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2120.patch
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).


