[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-18 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501674#comment-14501674
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hey Chris,

Will do. I'll try to take a poke at updating this tomorrow/Monday when I have a 
bit of free time.

> Make bin/crawl indexer agnostic
> ---
>
> Key: NUTCH-1987
> URL: https://issues.apache.org/jira/browse/NUTCH-1987
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Michael Joyce
>  Labels: memex
> Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't 
> Solr. For instance, when I want to use the indexer-elastic plugin I still 
> need to call the crawler script with a fake Solr URL, otherwise it will skip 
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files 
> (to mirror the elastic search indexer conf and others) and to make the 
> indexing parameter simply toggle whether indexing does or doesn't occur 
> instead of also trying to configure the indexer at the same time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-1911) Improve DomainStatistics tool command line parsing

2015-04-16 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498689#comment-14498689
 ] 

Michael Joyce commented on NUTCH-1911:
--

Hey folks,

Here's what the output from this looks like

{code}
Usage: DomainStatistics inputDirs outDir mode [numOfReducer]
    inputDirs        Comma separated list of crawldb input directories
                     E.g.: crawl/crawldb/current/
    outDir           Output directory where results should be dumped
    mode             Set statistics gathering mode
                         host    Gather statistics by host
                         domain  Gather statistics by domain
                         suffix  Gather statistics by suffix
                         tld     Gather statistics by top level directory
    [numOfReducers]  Optional number of reduce jobs to use. Defaults to 1.
{code}

> Improve DomainStatistics tool command line parsing
> ---
>
> Key: NUTCH-1911
> URL: https://issues.apache.org/jira/browse/NUTCH-1911
> Project: Nutch
>  Issue Type: Bug
>  Components: util
>Affects Versions: 1.9, 2.2.1
>Reporter: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.11
>
>
> The DomainStatistics tool could be improved based on the comments raised 
> in [this mail thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the 
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work




[jira] [Commented] (NUTCH-1906) Typo in CrawlDbReader command line help

2015-04-16 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498573#comment-14498573
 ] 

Michael Joyce commented on NUTCH-1906:
--

Hi folks,

I'll throw a patch up shortly for this.

> Typo in CrawlDbReader command line help
> ---
>
> Key: NUTCH-1906
> URL: https://issues.apache.org/jira/browse/NUTCH-1906
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.9
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.11
>
>
> Currently the CrawlDbReader tool, when invoked without any command line 
> arguments helps us as follows
> {code}
> [mdeploy@crawl local]$ ./bin/nutch readdb
> Usage: CrawlDbReader  (-stats | -dump  | -topN  
>  [] | -url )
>  directory name where crawldb is located
>   -stats [-sort]  print overall statistics to System.out
>   [-sort] list status sorted by host
>   -dump  [-format normal|csv|crawldb]dump the whole db to a 
> text file in 
>   [-format csv]   dump in Csv format
>   [-format normal]dump in standard format (default option)
>   [-format crawldb]   dump as CrawlDB
>   [-regex ] filter records with expression
>   [-retry ]  minimum retry count
>   [-status ]  filter records by CrawlDatum status
>   -url   print information on  to System.out
>   -topN   []  dump top  urls sorted by score to 
> 
>   [] skip records with scores below this value.
>   This can significantly improve performance.
> {code}
> The code that bothers me is
> {code}
>   -stats [-sort]  print overall statistics to System.out
>   [-sort] list status sorted by host
> {code}
> The inclusion of the double -sort is not necessary or required.
> Having looked through the code there is no other optional flag which we can 
> substitute for the second one (which I thought may lead to this being a 
> placeholder for something else) therefore we can just remove it.




[jira] [Commented] (NUTCH-1964) tmp directory not cleaned up after using commoncrawldump tool

2015-04-16 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498557#comment-14498557
 ] 

Michael Joyce commented on NUTCH-1964:
--

Hey folks, I can't seem to duplicate this and I'm not seeing a problem in the 
code. Any ideas on this?

> tmp directory not cleaned up after using commoncrawldump tool
> -
>
> Key: NUTCH-1964
> URL: https://issues.apache.org/jira/browse/NUTCH-1964
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.10
>
>
> After using the commoncrawldump tool I am seeing a persistent tmp directory 
> in the directory where I invoked the tool from e.g.
> {code}
> [mdeploy@crawl local]$ ls
> bin  conf  lib  logs  plugins  test  tmp_1426114168524-231608436
> {code}
> We need to make sure that this is cleaned up after invoking the tool.
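
Until the cause is tracked down, leftover directories can be removed by hand. A minimal sketch, assuming the directories always follow the `tmp_<digits>-<digits>` naming seen in the listing above; `cleanup_nutch_tmp` is a hypothetical helper, not part of Nutch, and the pattern is inferred from that single example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): remove leftover tmp directories
# from the directory where the tool was invoked.
# Assumption: the name pattern is inferred from one example,
# tmp_1426114168524-231608436.
cleanup_nutch_tmp() {
  local dir="${1:-.}"
  # Only look at direct children of $dir, and only at directories.
  find "$dir" -maxdepth 1 -type d -name 'tmp_[0-9]*-[0-9]*' -exec rm -rf {} +
}

# Example (run from the Nutch runtime directory):
#   cleanup_nutch_tmp .
```

Since this deletes with `rm -rf`, it is worth running the `find` part without `-exec` first to check what would be removed.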




[jira] [Updated] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

2015-04-16 Thread Michael Joyce (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1986:
-
Labels: memex  (was: )

> Clarify Elastic Search Indexer Plugin Settings
> --
>
> Key: NUTCH-1986
> URL: https://issues.apache.org/jira/browse/NUTCH-1986
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation, indexer, plugin
>Affects Versions: 1.9
>Reporter: Michael Joyce
>  Labels: memex
> Fix For: 1.10
>
>
> Was working on getting indexing into elastic search working and realized that 
> the majority of my difficulties were simply me misunderstanding what the 
> config needed. Patch incoming to hopefully clarify what is needed by default, 
> what each option does, and add any helpful defaults.




[jira] [Updated] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-16 Thread Michael Joyce (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1987:
-
Labels: memex  (was: )

> Make bin/crawl indexer agnostic
> ---
>
> Key: NUTCH-1987
> URL: https://issues.apache.org/jira/browse/NUTCH-1987
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Michael Joyce
>  Labels: memex
> Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't 
> Solr. For instance, when I want to use the indexer-elastic plugin I still 
> need to call the crawler script with a fake Solr URL, otherwise it will skip 
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files 
> (to mirror the elastic search indexer conf and others) and to make the 
> indexing parameter simply toggle whether indexing does or doesn't occur 
> instead of also trying to configure the indexer at the same time.




[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional

2015-04-16 Thread Michael Joyce (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1988:
-
Labels: memex  (was: )

> Make nested output directory dump optional
> --
>
> Key: NUTCH-1988
> URL: https://issues.apache.org/jira/browse/NUTCH-1988
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.9
>Reporter: Michael Joyce
>Priority: Minor
>  Labels: memex
> Fix For: 1.10
>
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help 
> avoid naming conflicts in output files. It would be nice to be able to 
> specify that you want the older flat directory output as an optional 
> parameter.




[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497027#comment-14497027
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hey Sebastian, thanks for the feedback.

I agree the positional argument handling is a bit daft. I was aiming more for a 
quick intermediate solution that didn't disrupt too much while getting this 
functionality in there. I'm happy to update this patch with a bit nicer 
handling of arguments or waiting and doing a quick follow-on patch if this gets 
merged. Whatever works for everyone is fine with me.

> Make bin/crawl indexer agnostic
> ---
>
> Key: NUTCH-1987
> URL: https://issues.apache.org/jira/browse/NUTCH-1987
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Michael Joyce
> Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't 
> Solr. For instance, when I want to use the indexer-elastic plugin I still 
> need to call the crawler script with a fake Solr URL, otherwise it will skip 
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files 
> (to mirror the elastic search indexer conf and others) and to make the 
> indexing parameter simply toggle whether indexing does or doesn't occur 
> instead of also trying to configure the indexer at the same time.




[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496755#comment-14496755
 ] 

Michael Joyce commented on NUTCH-1988:
--

Hi folks. Here's an example output run of this.

{code}
[mjjoyce@machine local]$ bin/nutch dump -outputDir ./foodir -segment ../local_elasticsearch_testt/crawl/segments/
[mjjoyce@machine local]$ bin/nutch dump -flatdir -outputDir ./foodir2 -segment ../local_elasticsearch_testt/crawl/segments/
[mjjoyce@machine local]$ ls -R foodir
foodir:
8f  f8

foodir/8f:
a7

foodir/8f/a7:
8d84f847f7310620a9edc4327bbfc133_.html

foodir/f8:
df

foodir/f8/df:
fec7849283af7a0adc77eddefb242b6e_.html
[mjjoyce@machine local]$ ls -R foodir2
foodir2:
8d84f847f7310620a9edc4327bbfc133_.html  fec7849283af7a0adc77eddefb242b6e_.html
[mjjoyce@machine local]$ 
{code}
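
For a dump that was already written in the nested layout, the flat layout can be approximated after the fact. A minimal sketch, assuming file names are unique across the nested directories (as they are in the run above); `flatten_dump` is a hypothetical helper and not part of Nutch, and the destination is assumed to lie outside the source directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nutch): copy every file found under a
# nested dump directory into one flat directory.
# Assumption: $dest is not located inside $src.
flatten_dump() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  find "$src" -type f | while IFS= read -r file; do
    target="$dest/$(basename "$file")"
    if [ -e "$target" ]; then
      # Name collisions are exactly what the nested layout avoids, so a
      # clashing file is reported and skipped instead of overwritten.
      echo "skipping $file: $target already exists" >&2
    else
      cp "$file" "$target"
    fi
  done
}

# Example:
#   flatten_dump ./foodir ./foodir_flat
```

Any skipped file is a case where the flat layout would have lost data, which is the reason NUTCH-1957 introduced nesting in the first place.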

> Make nested output directory dump optional
> --
>
> Key: NUTCH-1988
> URL: https://issues.apache.org/jira/browse/NUTCH-1988
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.9
>Reporter: Michael Joyce
>Priority: Minor
> Fix For: 1.10
>
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help 
> avoid naming conflicts in output files. It would be nice to be able to 
> specify that you want the older flat directory output as an optional 
> parameter.




[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Michael Joyce (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1988:
-
Priority: Minor  (was: Major)

> Make nested output directory dump optional
> --
>
> Key: NUTCH-1988
> URL: https://issues.apache.org/jira/browse/NUTCH-1988
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.9
>Reporter: Michael Joyce
>Priority: Minor
> Fix For: 1.10
>
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help 
> avoid naming conflicts in output files. It would be nice to be able to 
> specify that you want the older flat directory output as an optional 
> parameter.




[jira] [Created] (NUTCH-1988) Make nested output directory dump optional

2015-04-15 Thread Michael Joyce (JIRA)

Michael Joyce created NUTCH-1988:


 Summary: Make nested output directory dump optional
 Key: NUTCH-1988
 URL: https://issues.apache.org/jira/browse/NUTCH-1988
 Project: Nutch
  Issue Type: Improvement
  Components: dumpers
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


NUTCH-1957 added nested directories to the bin/nutch dump output to help avoid 
naming conflicts in output files. It would be nice to be able to specify that 
you want the older flat directory output as an optional parameter.




[jira] [Comment Edited] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426
 ] 

Michael Joyce edited comment on NUTCH-1987 at 4/15/15 3:54 PM:
---

Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the 
number of changes that I'm shoving into a single patch is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echos in the bin/crawl script so it doesn't 
only mention Solr and confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes call 
format changes for people with existing setups and only really requires that a 
single configuration value is added/updated if you want to keep using Solr on 
an existing setup. Note, this change obviously requires documentation updates. 
I'm more than happy to help with those as well but I wasn't including them in 
this ticket.

Thoughts?
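
A minimal sketch of how the positional handling described above might look in a shell script. The function name `parse_crawl_args` and the variable names `SEEDDIR`, `CRAWL_PATH`, `RUN_INDEXER` and `LIMIT` are assumptions for illustration; they are not taken from the actual bin/crawl script or from the patch. Any third argument is treated as the toggle here, so an existing call that still passes a Solr URL would keep indexing.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the real bin/crawl code.
# Usage: parse_crawl_args <seedDir> <crawlDir> [run_indexer] <numRounds>
parse_crawl_args() {
  if [ "$#" -lt 3 ] || [ "$#" -gt 4 ]; then
    echo "Usage: crawl <seedDir> <crawlDir> [run_indexer] <numRounds>" >&2
    return 1
  fi
  SEEDDIR="$1"
  CRAWL_PATH="$2"
  if [ "$#" -eq 4 ]; then
    # The third argument only switches indexing on. Which indexer runs
    # (Solr, Elasticsearch, ...) is assumed to come from the conf files.
    RUN_INDEXER=true
    LIMIT="$4"
  else
    RUN_INDEXER=false
    LIMIT="$3"
  fi
}

# Run the indexer
parse_crawl_args urls/ crawl/ "run_indexer" 1
echo "index=$RUN_INDEXER rounds=$LIMIT"   # prints: index=true rounds=1

# Don't run the indexer
parse_crawl_args urls/ crawl/ 1
echo "index=$RUN_INDEXER rounds=$LIMIT"   # prints: index=false rounds=1
```

Counting arguments keeps the call format close to the current one, at the cost of the positional handling staying somewhat fragile.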


was (Author: mjoyce):
Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the 
number of changes that I'm shoving into a single patch is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echos in the bin/crawl script so it doesn't 
only mention Solr and confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes calling 
formats for people with existing setups and only really requires that a single 
configuration value is added/updated. Note, this change obviously requires 
some/many documentation updates. I'm more than happy to help with those as well 
but I wasn't including them in this ticket.

Thoughts?

> Make bin/crawl indexer agnostic
> ---
>
> Key: NUTCH-1987
> URL: https://issues.apache.org/jira/browse/NUTCH-1987
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Michael Joyce
> Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't 
> Solr. For instance, when I want to use the indexer-elastic plugin I still 
> need to call the crawler script with a fake Solr URL, otherwise it will skip 
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files 
> (to mirror the elastic search indexer conf and others) and to make the 
> indexing parameter simply toggle whether indexing does or doesn't occur 
> instead of also trying to configure the indexer at the same time.




[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426
 ] 

Michael Joyce commented on NUTCH-1987:
--

Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the 
number of changes that I'm shoving into a single patch is to:

* Add solr.server.url to nutch-default and set the value to some sane default 
(http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change 
the call format.
* Update some variable names and echos in the bin/crawl script so it doesn't 
only mention Solr and confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes calling 
formats for people with existing setups and only really requires that a single 
configuration value is added/updated. Note, this change obviously requires 
some/many documentation updates. I'm more than happy to help with those as well 
but I wasn't including them in this ticket.

Thoughts?

> Make bin/crawl indexer agnostic
> ---
>
> Key: NUTCH-1987
> URL: https://issues.apache.org/jira/browse/NUTCH-1987
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Michael Joyce
> Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't 
> Solr. For instance, when I want to use the indexer-elastic plugin I still 
> need to call the crawler script with a fake Solr URL, otherwise it will skip 
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files 
> (to mirror the elastic search indexer conf and others) and to make the 
> indexing parameter simply toggle whether indexing does or doesn't occur 
> instead of also trying to configure the indexer at the same time.




[jira] [Created] (NUTCH-1987) Make bin/crawl indexer agnostic

2015-04-15 Thread Michael Joyce (JIRA)

Michael Joyce created NUTCH-1987:


 Summary: Make bin/crawl indexer agnostic
 Key: NUTCH-1987
 URL: https://issues.apache.org/jira/browse/NUTCH-1987
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


The crawl script makes it a bit challenging to use an indexer that isn't Solr. 
For instance, when I want to use the indexer-elastic plugin I still need to 
call the crawler script with a fake Solr URL, otherwise it will skip the 
indexing step altogether.

{code}
bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
{code}

It would be nice to keep configuration for the Solr indexer in the conf files 
(to mirror the elastic search indexer conf and others) and to make the indexing 
parameter simply toggle whether indexing does or doesn't occur instead of also 
trying to configure the indexer at the same time.




[jira] [Created] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

2015-04-15 Thread Michael Joyce (JIRA)

Michael Joyce created NUTCH-1986:


 Summary: Clarify Elastic Search Indexer Plugin Settings
 Key: NUTCH-1986
 URL: https://issues.apache.org/jira/browse/NUTCH-1986
 Project: Nutch
  Issue Type: Improvement
  Components: documentation, indexer, plugin
Affects Versions: 1.9
Reporter: Michael Joyce
 Fix For: 1.10


Was working on getting indexing into elastic search working and realized that 
the majority of my difficulties were simply me misunderstanding what the config 
needed. Patch incoming to hopefully clarify what is needed by default, what 
each option does, and add any helpful defaults.




[jira] [Commented] (NUTCH-1972) Dockerfile for Nutch 1.x

2015-04-15 Thread Michael Joyce (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496257#comment-14496257
 ] 

Michael Joyce commented on NUTCH-1972:
--

Awesome, thanks for merging [~chrismattmann]!!

> Dockerfile for Nutch 1.x
> 
>
> Key: NUTCH-1972
> URL: https://issues.apache.org/jira/browse/NUTCH-1972
> Project: Nutch
>  Issue Type: Improvement
>  Components: deployment
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.10
>
> Attachments: Joyce-NUTCH-1792-patch.txt
>
>
> Hi folks,
> I noticed that there was a Docker file for Nutch 2.x but I didn't see 
> anything for 1.x. I figured I would throw something up real quick. Note that 
> this currently doesn't install Solr. I didn't need it at the time when I was 
> making this, but I'll work on getting it added before too long.




[jira] [Updated] (NUTCH-1972) Dockerfile for Nutch 1.x

2015-03-19 Thread Michael Joyce (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Joyce updated NUTCH-1972:
-
Attachment: Joyce-NUTCH-1792-patch.txt

Adding patch

> Dockerfile for Nutch 1.x
> 
>
> Key: NUTCH-1972
> URL: https://issues.apache.org/jira/browse/NUTCH-1972
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Michael Joyce
>Priority: Minor
> Fix For: 1.10
>
> Attachments: Joyce-NUTCH-1792-patch.txt
>
>
> Hi folks,
> I noticed that there was a Docker file for Nutch 2.x but I didn't see 
> anything for 1.x. I figured I would throw something up real quick. Note that 
> this currently doesn't install Solr. I didn't need it at the time when I was 
> making this, but I'll work on getting it added before too long.




[jira] [Created] (NUTCH-1972) Dockerfile for Nutch 1.x

2015-03-19 Thread Michael Joyce (JIRA)

Michael Joyce created NUTCH-1972:


 Summary: Dockerfile for Nutch 1.x
 Key: NUTCH-1972
 URL: https://issues.apache.org/jira/browse/NUTCH-1972
 Project: Nutch
  Issue Type: Improvement
Reporter: Michael Joyce
Priority: Minor
 Fix For: 1.10


Hi folks,

I noticed that there was a Docker file for Nutch 2.x but I didn't see anything 
for 1.x. I figured I would throw something up real quick. Note that this 
currently doesn't install Solr. I didn't need it at the time when I was making 
this, but I'll work on getting it added before too long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

[jira] [Commented] (NUTCH-1911) Imeprove DomainStatistics tool command line parsing

[jira] [Commented] (NUTCH-1906) Typo in CrawlDbReader command line help

[jira] [Commented] (NUTCH-1964) tmp directory not cleaned up after using commoncrawldump tool

[jira] [Updated] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

[jira] [Updated] (NUTCH-1987) Make bin/crawl indexer agnostic

[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional

[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional

[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional

[jira] [Created] (NUTCH-1988) Make nested output directory dump optional

[jira] [Comment Edited] (NUTCH-1987) Make bin/crawl indexer agnostic

[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic

[jira] [Created] (NUTCH-1987) Make bin/crawl indexer agnostic

[jira] [Created] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings

[jira] [Commented] (NUTCH-1972) Dockerfile for Nutch 1.x

[jira] [Updated] (NUTCH-1972) Dockerfile for Nutch 1.x

[jira] [Created] (NUTCH-1972) Dockerfile for Nutch 1.x

< 1 2

101 - 118 of 118 matches

Site Navigation

Mail list logo

Footer information