Re: [VOTE] Moving to Git

2016-01-08 Thread Julien Nioche
+1 to move to Git

Note: I don't think Dennis is on the PMC anymore.

Ju

On 8 January 2016 at 08:46, Chris Mattmann  wrote:

> Hi Everyone,
>
> I proposed this earlier, and we said we’d wait until after the
> 1.11 release. So it’s time to VOTE to move Nutch to Git. So
> far, the following people have expressed +1s and if I don’t hear
> otherwise, I will implicitly count their VOTE from the DISCUSS
> thread:
>
> +1 PMC
>
> Chris Mattmann*
> Sebastian Nagel*
> Michael Joyce*
> Asitang Mishra*
> Dennis Kubes*
> BlackIce
>
> Everyone else (or those above that would like to amend their VOTE),
> please VOTE below. I will leave the VOTE open for at least 72 hours.
>
> [ ] +1 Move the Nutch SCM to Writeable Git repositories at the ASF.
> [ ] +0 No opinion.
> [ ] -1 Don’t move the Nutch SCM to Writeable Git repositories at the
> ASF because…
>
> Please note, I created a page for Tika that is worth checking out and
> perhaps copying over to the Nutch wiki:
>
> http://wiki.apache.org/tika/UsingGit
>
> Please have a look as I think it will help with our workflows too.
>
> Cheers,
> Chris
>
>
>
>
> -----Original Message-----
> From: jpluser 
> Reply-To: "dev@nutch.apache.org" 
> Date: Wednesday, November 18, 2015 at 7:39 PM
> To: "dev@nutch.apache.org" 
> Subject: [DISCUSS] Moving to Git
>
> >Hi All,
> >
> >I propose that we consider moving to ASF-supported writeable Git
> >repos for Nutch. This would entail moving Nutch’s canonical repo
> >from:
> >
> >https://svn.apache.org/repos/asf/nutch
> >
> >TO
> >
> >https://git-wip-us.apache.org/repos/asf/nutch.git
> >
> >
> >We are already accepting PRs and so forth from GitHub and I think
> >many of us are using Git in our regular day-to-day workflows.
> >
> >Thoughts?
> >
> >Cheers,
> >Chris
> >
> >++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: chris.a.mattm...@nasa.gov
> >WWW:  http://sunset.usc.edu/~mattmann/
> >++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++
> >
> >
> >
>
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble 


Re: [VOTE] Moving to Git

2016-01-08 Thread Sebastian Nagel
+1

Sebastian




[jira] [Resolved] (NUTCH-2169) Integrate index-html into Nutch build

2016-01-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2169.

Resolution: Fixed
  Assignee: Sebastian Nagel

Committed to 2.x, r1723794.

> Integrate index-html into Nutch build
> -------------------------------------
>
>                 Key: NUTCH-2169
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2169
>             Project: Nutch
>          Issue Type: Improvement
>          Components: build
>    Affects Versions: 2.3.1
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-2169.patch
>
>
> The plugin index-html (added by NUTCH-1944) is only loosely integrated:
> - code is in Nutch version control
> - no build (compile, javadoc generation)
> - src/plugin/index-html/src/java/org/apache/nutch/indexer/html/package.html 
> contains a description of how to do the integration
> The plugin should be available just by adding it to plugin.includes, 
> without any extra effort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2169) Integrate index-html into Nutch build

2016-01-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089997#comment-15089997
 ] 

Hudson commented on NUTCH-2169:
---

SUCCESS: Integrated in Nutch-nutchgora #1544 (See 
[https://builds.apache.org/job/Nutch-nutchgora/1544/])
NUTCH-2169 Integrate index-html into Nutch build (snagel: 
[http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev=1723794])
* 2.x/CHANGES.txt
* 2.x/build.xml
* 2.x/default.properties
* 2.x/src/plugin/build.xml
* 
2.x/src/plugin/index-html/src/java/org/apache/nutch/indexer/html/HtmlIndexingFilter.java
* 2.x/src/plugin/index-html/src/java/org/apache/nutch/indexer/html/README.md
* 
2.x/src/plugin/index-html/src/java/org/apache/nutch/indexer/html/package-info.java
* 2.x/src/plugin/index-html/src/java/org/apache/nutch/indexer/html/package.html




[jira] [Resolved] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-1449.
--
Resolution: Fixed

Committed revision 1723688.


> Optionally delete documents skipped by IndexingFilters
> ------------------------------------------------------
>
>                 Key: NUTCH-1449
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1449
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>    Affects Versions: 1.5.1
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1449.patch, NUTCH-1449.patch, NUTCH-1449.patch
>
>
> Add configuration option to delete documents instead of skipping them if the 
> indexing filters return null. This is useful to delete documents with new 
> business logic in the indexing filter chain.
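The behaviour requested in this issue can be illustrated with a minimal sketch. It is written in Python rather than Nutch's Java, and the names `run_filters`, `index_documents` and `delete_skipped` are hypothetical; they are not the configuration option or API added by the patch.

```python
from typing import Callable, List, Optional, Tuple

Doc = dict
Filter = Callable[[Doc], Optional[Doc]]


def run_filters(doc: Doc, filters: List[Filter]) -> Optional[Doc]:
    """Pass the document through each filter; None from any filter rejects it."""
    for f in filters:
        doc = f(doc)
        if doc is None:
            return None
    return doc


def index_documents(docs: List[Doc], filters: List[Filter],
                    delete_skipped: bool = False) -> List[Tuple[str, str]]:
    """Return the (action, url) pairs that would be sent to the index.

    With delete_skipped=False a rejected document is silently skipped.
    With delete_skipped=True (hypothetical flag) a delete is emitted instead,
    so a previously indexed copy is removed.
    """
    actions = []
    for doc in docs:
        result = run_filters(dict(doc), filters)
        if result is not None:
            actions.append(("add", doc["url"]))
        elif delete_skipped:
            actions.append(("delete", doc["url"]))
    return actions
```

With the flag off, a document rejected by new filter logic simply stays in the index from an earlier run; with it on, the rejection becomes an explicit delete.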





[jira] [Updated] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2178:
-
Summary: DeduplicationJob to optionally group on host or domain  (was: 
DeduplicationJob to optionall group on host or domain)

> DeduplicationJob to optionally group on host or domain
> ------------------------------------------------------
>
>                 Key: NUTCH-2178
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2178
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2178.patch
>
>
> Add optional grouping to DeduplicationJob.
> Usage: DeduplicationJob  [-group ]





[jira] [Resolved] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-08 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2178.
--
Resolution: Fixed

Committed to trunk in revision 1723690.




[jira] [Comment Edited] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089073#comment-15089073
 ] 

Markus Jelsma edited comment on NUTCH-1449 at 1/8/16 11:16 AM:
---

Committed to trunk revision 1723688.



was (Author: markus17):
Committed revision 1723688.




[jira] [Commented] (NUTCH-2190) Protocol normalizer

2016-01-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089081#comment-15089081
 ] 

Markus Jelsma commented on NUTCH-2190:
--

I'll also get this one in soon, unless there are objections of course :)

> Protocol normalizer
> -------------------
>
>                 Key: NUTCH-2190
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2190
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2190.patch
>
>
> URL normalizer to normalize protocols for specified hosts/domains, e.g. 
> normalizing http://www.apache.org/ to https://www.apache.org/
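As a rough illustration of the idea (not the plugin's actual implementation, which is in Java and whose configuration format is not shown in this message), a per-host protocol normalizer might look like the following Python sketch. The `HOST_PROTOCOLS` mapping and the function name are assumptions made for the example, and only exact host names are matched, not whole domains.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical configuration: host name -> protocol to enforce.
HOST_PROTOCOLS = {"www.apache.org": "https"}


def normalize_protocol(url, host_protocols=HOST_PROTOCOLS):
    """Rewrite the URL scheme if the host has a configured protocol.

    URLs for hosts without an entry are returned unchanged. An explicit
    port in the URL is left as-is, which a real normalizer would need
    to handle.
    """
    parts = urlsplit(url)
    target = host_protocols.get(parts.hostname or "")
    if target is None or parts.scheme == target:
        return url
    return urlunsplit((target, parts.netloc, parts.path,
                       parts.query, parts.fragment))
```

Calling `normalize_protocol("http://www.apache.org/")` yields the https form from the issue description, while URLs on unlisted hosts pass through untouched.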





[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090337#comment-15090337
 ] 

Lewis John McGibbney commented on NUTCH-2168:
-

+1 for commit [~wastl-nagel] nice catch and debugging!

> Parse-tika fails to retrieve parser
> -----------------------------------
>
>                 Key: NUTCH-2168
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2168
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.1
>            Reporter: Sebastian Nagel
>             Fix For: 2.3.1
>
>         Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
> http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser 
> for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
> content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.





[jira] [Comment Edited] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090337#comment-15090337
 ] 

Lewis John McGibbney edited comment on NUTCH-2168 at 1/9/16 2:03 AM:
-

+1 for commit [~wastl-nagel] nice catch and debugging! If you can commit then I 
will roll the RC tomorrow.


was (Author: lewismc):
+1 for commit [~wastl-nagel] nice catch and debugging!



[jira] [Updated] (NUTCH-2094) Stopping and Restarting a crawl has issues in the Web UI

2016-01-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2094:

Fix Version/s: (was: 2.4)
   2.3.1

> Stopping and Restarting a crawl has issues in the Web UI
> --------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>          Components: web gui
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>             Fix For: 2.3.1
>
>
> I have added a "stop" button to the Nutch webapp to stop a running crawl 
> from the UI. While testing, I found that I can stop a crawl successfully, 
> but when I restart a stopped crawl and try to stop it again, it doesn't 
> stop.





[jira] [Updated] (NUTCH-2166) Add reverse URL format to dump tool

2016-01-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2166:

Fix Version/s: (was: 2.4)

> Add reverse URL format to dump tool
> -----------------------------------
>
>                 Key: NUTCH-2166
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2166
>             Project: Nutch
>          Issue Type: Improvement
>          Components: tool
>    Affects Versions: 2.3, 1.10
>            Reporter: Michael Joyce
>            Assignee: Michael Joyce
>             Fix For: 1.11
>
>         Attachments: NUTCH-2166_joyce_13Nov2015.patch
>
>
> Update the FileDumper tool with an option for dumping files to the output 
> directory in reverse URL format.
> For example, the file for 
> http://bar.foo.com:8983/to/index.html?a=b
> would be dumped to
> /com/foo/bar/8983/http/to/index.html?a=b
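The mapping in the example above can be sketched in a few lines of Python. This is only an illustration of the path layout, not the FileDumper code itself; the function name is made up, and the behaviour for URLs without an explicit port (the port segment is simply omitted here) is an assumption, since the issue does not cover that case.

```python
from urllib.parse import urlsplit


def reverse_url_path(url):
    """Build a dump path of the form /<reversed host>/<port>/<scheme>/<path>."""
    parts = urlsplit(url)
    # bar.foo.com -> ["com", "foo", "bar"]
    segments = list(reversed((parts.hostname or "").split(".")))
    if parts.port is not None:
        segments.append(str(parts.port))
    segments.append(parts.scheme)
    path = "/" + "/".join(segments) + parts.path
    if parts.query:
        path += "?" + parts.query
    return path
```

Reversing the host labels groups files from the same domain under a common directory prefix, which is the main point of the layout.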





[jira] [Updated] (NUTCH-2165) FileDumper Util hard codes part-# folder name

2016-01-08 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2165:

Fix Version/s: (was: 2.4)

> FileDumper Util hard codes part-# folder name
> ---------------------------------------------
>
>                 Key: NUTCH-2165
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2165
>             Project: Nutch
>          Issue Type: Bug
>          Components: tool
>    Affects Versions: 2.3, 1.10
>            Reporter: Michael Joyce
>            Assignee: Michael Joyce
>             Fix For: 1.11
>
>         Attachments: NUTCH-2165_joyce_11Nov2015.patch
>
>
> Hi folks, [~lewismc] and I were just discussing this off list. It seems that 
> the part-# folder name is hard-coded to part-0 in the [FileDumper 
> utility|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/tools/FileDumper.java#L166-L167],
> which could prove problematic.





[jira] [Commented] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089165#comment-15089165
 ] 

Hudson commented on NUTCH-1838:
---

SUCCESS: Integrated in Nutch-trunk #3332 (See 
[https://builds.apache.org/job/Nutch-trunk/3332/])
NUTCH-1838 Host and domain based regex and automaton filtering (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1723710])
* trunk/CHANGES.txt
* 
trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
* 
trunk/src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
* 
trunk/src/plugin/urlfilter-automaton/src/java/org/apache/nutch/urlfilter/automaton/AutomatonURLFilter.java
* trunk/src/plugin/urlfilter-regex/sample/nutch1838.rules
* trunk/src/plugin/urlfilter-regex/sample/nutch1838.urls
* 
trunk/src/plugin/urlfilter-regex/src/java/org/apache/nutch/urlfilter/regex/RegexURLFilter.java
* 
trunk/src/plugin/urlfilter-regex/src/test/org/apache/nutch/urlfilter/regex/TestRegexURLFilter.java


> Host and domain based regex and automaton filtering
> ---------------------------------------------------
>
>                 Key: NUTCH-1838
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1838
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, 
> NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch, NUTCH-1838.patch
>
>
> Both the regex and automaton filters pass all URLs through all rules, although 
> this makes little sense if you have a lot of generated rules for many 
> different hosts or domains. This patch allows users to configure 
> rules for a specific host or domain only, making filtering much more 
> efficient.
> Each rule has an optional hostOrDomain field. A rule is applied if it 
> has no hostOrDomain, or if the URL matches the rule's host name or 
> domain name.
> The following line enables hostOrDomain specific rules:
> {code}
> > www.example.org
> {code}
> The following line disables/resets it again:
> {code}
> <
> {code}
> Full example:
> {code}
> -some generic filter
> +another generic filter
> > www.example.org
> -rule only applied to URL's of www.example.org
> +another rule only applied to URL's of www.example.org
> > apache.org
> -rule only applied to URL's of apache.org
> +another rule only applied to URL's of apache.org
> <
> -more generic rules
> +and another one
> {code}
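To make the scoping rules above concrete, here is a minimal Python sketch of how such a rule file could be parsed and applied. It is not the Nutch implementation (which lives in the Java classes listed in this commit). The function names, the treatment of `#` lines as comments, the suffix-based domain match, and the first-matching-rule-wins policy are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def parse_rules(text):
    """Parse rule lines into (allow, scope, pattern) tuples.

    '> host' starts a host/domain scope, '<' resets it, and lines
    starting with '+' or '-' are accept/reject regex rules.
    """
    rules = []
    scope = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            scope = line[1:].strip()
        elif line.startswith("<"):
            scope = None
        elif line[0] in "+-":
            rules.append((line[0] == "+", scope, re.compile(line[1:])))
    return rules


def applies(scope, host):
    """A rule applies if it is unscoped or its scope matches the URL's host."""
    return scope is None or host == scope or host.endswith("." + scope)


def accept(url, rules):
    """Return the verdict of the first applicable matching rule, else False."""
    host = urlsplit(url).hostname or ""
    for allow, scope, pattern in rules:
        if applies(scope, host) and pattern.search(url):
            return allow
    return False
```

The efficiency gain described in the issue comes from the `applies` check: scoped rules whose host or domain does not match are skipped without evaluating their regular expression.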





[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089297#comment-15089297
 ] 

Markus Jelsma commented on NUTCH-2191:
--

Hi - I 'read' that discussion a couple of weeks ago when I had the 
problem, but I don't completely understand it. Does this mean it is not going 
to work? Do we need to implement something (that appears to be missing) in the 
PluginClassLoader?

> Add protocol-htmlunit
> ---------------------
>
>                 Key: NUTCH-2191
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2191
>             Project: Nutch
>          Issue Type: New Feature
>          Components: protocol
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, as opposed to other JavaScript-enabled headless browsers, a 
> portable library and should therefore be better suited to very large-scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.





[jira] [Commented] (NUTCH-2178) DeduplicationJob to optionally group on host or domain

2016-01-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089099#comment-15089099
 ] 

Hudson commented on NUTCH-2178:
---

SUCCESS: Integrated in Nutch-trunk #3331 (See 
[https://builds.apache.org/job/Nutch-trunk/3331/])
NUTCH-2178 DeduplicationJob to optionally group on host or domain (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1723690])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/crawl/DeduplicationJob.java




[jira] [Commented] (NUTCH-1449) Optionally delete documents skipped by IndexingFilters

2016-01-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089098#comment-15089098
 ] 

Hudson commented on NUTCH-1449:
---

SUCCESS: Integrated in Nutch-trunk #3331 (See 
[https://builds.apache.org/job/Nutch-trunk/3331/])
NUTCH-1449 Optionally delete documents skipped by IndexingFilters (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev=1723688])
* trunk/CHANGES.txt
* trunk/conf/nutch-default.xml
* trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java




[jira] [Commented] (NUTCH-1838) Host and domain based regex and automaton filtering

2016-01-08 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089121#comment-15089121
 ] 

Markus Jelsma commented on NUTCH-1838:
--

Committed to trunk in revision 1723710.




Re: [VOTE] Moving to Git

2016-01-08 Thread Sujen Shah
+1

Regards,
Sujen Shah
M.S - Computer Science (Class of 2016)
University of Southern California
http://www.linkedin.com/in/sujenshah

On Fri, Jan 8, 2016 at 2:58 PM, Julien Nioche  wrote:

> +1 to move to Git
>
> Note : I don't think Dennis is on the PMC anymore
>
> Ju


[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-08 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089545#comment-15089545
 ] 

Chris A. Mattmann commented on NUTCH-2191:
--

Markus, thanks! Check out
https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-interactiveselenium
and the handlers there.
