[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501674#comment-14501674 ]

Michael Joyce commented on NUTCH-1987:
--------------------------------------

Hey Chris,

Will do. I'll try to take a poke at updating this tomorrow/Monday when I have a bit of free time.

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>              Labels: memex
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't
> Solr. For instance, when I want to use the indexer-elastic plugin I still
> need to call the crawler script with a fake Solr URL; otherwise it will skip
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files
> (to mirror the elastic search indexer conf and others) and to make the
> indexing parameter simply toggle whether indexing does or doesn't occur
> instead of also trying to configure the indexer at the same time.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-1911) Improve DomainStatistics tool command line parsing
[ https://issues.apache.org/jira/browse/NUTCH-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498689#comment-14498689 ]

Michael Joyce commented on NUTCH-1911:
--------------------------------------

Hey folks,

Here's what the output from this looks like:
{code}
Usage: DomainStatistics inputDirs outDir mode [numOfReducer]
    inputDirs        Comma separated list of crawldb input directories
                     E.g.: crawl/crawldb/current/
    outDir           Output directory where results should be dumped
    mode             Set statistics gathering mode
                       host    Gather statistics by host
                       domain  Gather statistics by domain
                       suffix  Gather statistics by suffix
                       tld     Gather statistics by top level directory
    [numOfReducers]  Optional number of reduce jobs to use. Defaults to 1.
{code}

> Improve DomainStatistics tool command line parsing
> --------------------------------------------------
>
>                 Key: NUTCH-1911
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1911
>             Project: Nutch
>          Issue Type: Bug
>          Components: util
>    Affects Versions: 1.9, 2.2.1
>            Reporter: Lewis John McGibbney
>            Priority: Trivial
>             Fix For: 1.11
>
>
> The DomainStatistics tool could be improved based on the comments in
> [this mail thread|http://www.mail-archive.com/user%40nutch.apache.org/msg13028.html].
> For convenience, I've also pasted them below:
> {quote}
> You cannot just tell it where the crawldb is, you need to tell it where the
> directory is, so specifying current is ok, but not part-*
> {quote}
> Patch should be trivial work
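The usage text and the quoted mailing-list comment together suggest two checks that could run before the arguments reach the tool: the mode has to be one of host, domain, suffix or tld, and each input path should name a crawldb directory such as current rather than a part-* entry inside it. The following is only a sketch of those checks in shell; the function name check_domainstats_args is hypothetical and is not part of Nutch.

```shell
# Hypothetical pre-flight check; not part of Nutch.
# Usage: check_domainstats_args inputDirs outDir mode [numOfReducers]
check_domainstats_args() {
  if [ "$#" -lt 3 ]; then
    echo "expected: inputDirs outDir mode [numOfReducers]" >&2
    return 1
  fi
  input_dirs=$1
  mode=$3

  # The mode must be one of the four values listed in the usage text.
  case "$mode" in
    host|domain|suffix|tld) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac

  # inputDirs is a comma separated list. Reject any entry whose last
  # path component is a part-* entry instead of the containing directory.
  old_ifs=$IFS
  IFS=,
  for dir in $input_dirs; do
    base=${dir%/}
    base=${base##*/}
    case "$base" in
      part-*)
        IFS=$old_ifs
        echo "pass the containing directory, not $dir" >&2
        return 1
        ;;
    esac
  done
  IFS=$old_ifs
  return 0
}
```

A wrapper script could call this first and only invoke the tool when it returns 0.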
[jira] [Commented] (NUTCH-1906) Typo in CrawlDbReader command line help
[ https://issues.apache.org/jira/browse/NUTCH-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498573#comment-14498573 ]

Michael Joyce commented on NUTCH-1906:
--------------------------------------

Hi folks,

I'll throw a patch up shortly for this.

> Typo in CrawlDbReader command line help
> ---------------------------------------
>
>                 Key: NUTCH-1906
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1906
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb
>    Affects Versions: 1.9
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Trivial
>             Fix For: 1.11
>
>
> Currently the CrawlDbReader tool, when invoked without any command line
> arguments helps us as follows
> {code}
> [mdeploy@crawl local]$ ./bin/nutch readdb
> Usage: CrawlDbReader (-stats | -dump | -topN
> [] | -url )
> directory name where crawldb is located
> -stats [-sort] print overall statistics to System.out
> [-sort] list status sorted by host
> -dump [-format normal|csv|crawldb] dump the whole db to a
> text file in
> [-format csv] dump in Csv format
> [-format normal] dump in standard format (default option)
> [-format crawldb] dump as CrawlDB
> [-regex ] filter records with expression
> [-retry ] minimum retry count
> [-status ] filter records by CrawlDatum status
> -url print information on to System.out
> -topN [] dump top urls sorted by score to
>
> [] skip records with scores below this value.
> This can significantly improve performance.
> {code}
> The code that bothers me is
> {code}
> -stats [-sort] print overall statistics to System.out
> [-sort] list status sorted by host
> {code}
> The inclusion of the double -sort is not necessary or required.
> Having looked through the code there is no other optional flag which we can
> substitute for the second one (which I thought may lead to this being a
> placeholder for something else) therefore we can just remove it.
[jira] [Commented] (NUTCH-1964) tmp directory not cleaned up after using commoncrawldump tool
[ https://issues.apache.org/jira/browse/NUTCH-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498557#comment-14498557 ]

Michael Joyce commented on NUTCH-1964:
--------------------------------------

Hey folks,

I can't seem to duplicate this and I'm not seeing a problem in the code. Any ideas on this?

> tmp directory not cleaned up after using commoncrawldump tool
> -------------------------------------------------------------
>
>                 Key: NUTCH-1964
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1964
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl
>    Affects Versions: 1.10
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 1.10
>
>
> After using the commoncrawldump tool I am seeing a persistent tmp directory
> in the directory where I invoked the tool from e.g.
> {code}
> [mdeploy@crawl local]$ ls
> bin conf lib logs plugins test tmp_1426114168524-231608436
> {code}
> We need to make sure that this is cleaned up after invoking the tool.
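One way to check whether a run leaves anything behind is to list the tmp_* directories in the working directory before and after invoking the tool. The helper below is a rough sketch for that comparison. The name list_tmp_dirs is hypothetical, and the tmp_ prefix is taken only from the listing in the report, so it is an assumption about how the leftover directory is named.

```shell
# Hypothetical helper: print the names of tmp_* directories that sit
# directly under the given directory (default: current directory).
list_tmp_dirs() {
  dir=${1:-.}
  for entry in "$dir"/tmp_*; do
    # When nothing matches, the pattern stays literal and is not a directory.
    [ -d "$entry" ] && printf '%s\n' "${entry##*/}"
  done
  return 0
}
```

Saving the output before and after a dump and comparing the two lists would show whether a new directory was left behind.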
[jira] [Updated] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings
[ https://issues.apache.org/jira/browse/NUTCH-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1986:
---------------------------------
    Labels: memex  (was: )

> Clarify Elastic Search Indexer Plugin Settings
> ----------------------------------------------
>
>                 Key: NUTCH-1986
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1986
>             Project: Nutch
>          Issue Type: Improvement
>          Components: documentation, indexer, plugin
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>              Labels: memex
>             Fix For: 1.10
>
>
> Was working on getting indexing into elastic search working and realized that
> the majority of my difficulties were simply me misunderstanding what the
> config needed. Patch incoming to hopefully clarify what is needed by default,
> what each option does, and add any helpful defaults.
[jira] [Updated] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1987:
---------------------------------
    Labels: memex  (was: )

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>              Labels: memex
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't
> Solr. For instance, when I want to use the indexer-elastic plugin I still
> need to call the crawler script with a fake Solr URL; otherwise it will skip
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files
> (to mirror the elastic search indexer conf and others) and to make the
> indexing parameter simply toggle whether indexing does or doesn't occur
> instead of also trying to configure the indexer at the same time.
[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1988:
---------------------------------
    Labels: memex  (was: )

> Make nested output directory dump optional
> ------------------------------------------
>
>                 Key: NUTCH-1988
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1988
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dumpers
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>            Priority: Minor
>              Labels: memex
>             Fix For: 1.10
>
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help
> avoid naming conflicts in output files. It would be nice to be able to
> specify that you want the older flat directory output as an optional
> parameter.
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497027#comment-14497027 ]

Michael Joyce commented on NUTCH-1987:
--------------------------------------

Hey Sebastian, thanks for the feedback.

I agree the positional argument handling is a bit daft. I was aiming more for a quick intermediate solution that didn't disrupt too much while getting this functionality in there. I'm happy to update this patch with a bit nicer handling of arguments, or to wait and do a quick follow-on patch if this gets merged. Whatever works for everyone is fine with me.

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't
> Solr. For instance, when I want to use the indexer-elastic plugin I still
> need to call the crawler script with a fake Solr URL; otherwise it will skip
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files
> (to mirror the elastic search indexer conf and others) and to make the
> indexing parameter simply toggle whether indexing does or doesn't occur
> instead of also trying to configure the indexer at the same time.
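As a point of reference for what "nicer handling of arguments" could look like, one option is an explicit flag in place of a positional toggle. The following is only a sketch under that assumption: the -i flag and the parse_crawl_flags function are hypothetical and do not reflect the patch or the actual bin/crawl script.

```shell
# Hypothetical sketch; not the actual bin/crawl argument handling.
# Usage: parse_crawl_flags [-i] <seedDir> <crawlDir> <numberOfRounds>
#   -i   run the indexing step (off by default)
parse_crawl_flags() {
  RUN_INDEXER=false
  OPTIND=1
  while getopts "i" opt; do
    case "$opt" in
      i) RUN_INDEXER=true ;;
      *) return 1 ;;
    esac
  done
  # Drop the parsed options so only the positional arguments remain.
  shift $((OPTIND - 1))

  if [ "$#" -ne 3 ]; then
    echo "Usage: crawl [-i] <seedDir> <crawlDir> <numberOfRounds>" >&2
    return 1
  fi
  SEED_DIR=$1
  CRAWL_DIR=$2
  ROUNDS=$3
}
```

With a flag, the three positional arguments keep the same meaning whether or not indexing runs, which avoids having the round count move between positions.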
[jira] [Commented] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496755#comment-14496755 ]

Michael Joyce commented on NUTCH-1988:
--------------------------------------

Hi folks.

Here's an example output run of this.
{code}
[mjjoyce@machine local]$ bin/nutch dump -outputDir ./foodir -segment ../local_elasticsearch_testt/crawl/segments/
[mjjoyce@machine local]$ bin/nutch dump -flatdir -outputDir ./foodir2 -segment ../local_elasticsearch_testt/crawl/segments/
[mjjoyce@machine local]$ ls -R foodir
foodir:
8f  f8

foodir/8f:
a7

foodir/8f/a7:
8d84f847f7310620a9edc4327bbfc133_.html

foodir/f8:
df

foodir/f8/df:
fec7849283af7a0adc77eddefb242b6e_.html
[mjjoyce@machine local]$ ls -R foodir2
foodir2:
8d84f847f7310620a9edc4327bbfc133_.html  fec7849283af7a0adc77eddefb242b6e_.html
[mjjoyce@machine local]$
{code}

> Make nested output directory dump optional
> ------------------------------------------
>
>                 Key: NUTCH-1988
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1988
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dumpers
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>            Priority: Minor
>             Fix For: 1.10
>
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help
> avoid naming conflicts in output files. It would be nice to be able to
> specify that you want the older flat directory output as an optional
> parameter.
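For dumps that were already written in the nested layout, the flat layout shown for foodir2 can be approximated after the fact by copying every file into a single directory. This is a minimal sketch, assuming file names are unique across subdirectories. The nesting was introduced precisely because they may not be, so a clashing name is skipped rather than overwritten. flatten_dump is a hypothetical helper, not a Nutch command.

```shell
# Hypothetical helper: copy every file under a nested dump directory
# into one flat directory, skipping names that already exist there.
# Usage: flatten_dump <nestedDir> <flatDir>
flatten_dump() {
  src=$1
  dst=$2
  mkdir -p "$dst" || return 1
  find "$src" -type f | while IFS= read -r file; do
    name=${file##*/}
    if [ -e "$dst/$name" ]; then
      echo "skipping $file: $name already exists in $dst" >&2
    else
      cp "$file" "$dst/$name"
    fi
  done
}
```

The destination should sit outside the source tree, otherwise find would also walk the files being copied.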
[jira] [Updated] (NUTCH-1988) Make nested output directory dump optional
[ https://issues.apache.org/jira/browse/NUTCH-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1988:
---------------------------------
    Priority: Minor  (was: Major)

> Make nested output directory dump optional
> ------------------------------------------
>
>                 Key: NUTCH-1988
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1988
>             Project: Nutch
>          Issue Type: Improvement
>          Components: dumpers
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>            Priority: Minor
>             Fix For: 1.10
>
>
> NUTCH-1957 added nested directories to the bin/nutch dump output to help
> avoid naming conflicts in output files. It would be nice to be able to
> specify that you want the older flat directory output as an optional
> parameter.
[jira] [Created] (NUTCH-1988) Make nested output directory dump optional
Michael Joyce created NUTCH-1988:
---------------------------------

             Summary: Make nested output directory dump optional
                 Key: NUTCH-1988
                 URL: https://issues.apache.org/jira/browse/NUTCH-1988
             Project: Nutch
          Issue Type: Improvement
          Components: dumpers
    Affects Versions: 1.9
            Reporter: Michael Joyce
             Fix For: 1.10


NUTCH-1957 added nested directories to the bin/nutch dump output to help
avoid naming conflicts in output files. It would be nice to be able to
specify that you want the older flat directory output as an optional
parameter.
[jira] [Comment Edited] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426 ]

Michael Joyce edited comment on NUTCH-1987 at 4/15/15 3:54 PM:
---------------------------------------------------------------

Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the number of changes that I'm shoving into a single patch is to:
* Add solr.server.url to nutch-default and set the value to some sane default (http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change the call format.
* Update some variable names and echos in the bin/crawl script so it doesn't only mention Solr and confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes call format changes for people with existing setups and only really requires that a single configuration value is added/updated if you want to keep using Solr on an existing setup.

Note, this change obviously requires documentation updates. I'm more than happy to help with those as well but I wasn't including them in this ticket.

Thoughts?


was (Author: mjoyce):
Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the number of changes that I'm shoving into a single patch is to:
* Add solr.server.url to nutch-default and set the value to some sane default (http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change the call format.
* Update some variable names and echos in the bin/crawl script so it doesn't only mention Solr and confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes calling formats for people with existing setups and only really requires that a single configuration value is added/updated.

Note, this change obviously requires some/many documentation updates. I'm more than happy to help with those as well but I wasn't including them in this ticket.

Thoughts?

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't
> Solr. For instance, when I want to use the indexer-elastic plugin I still
> need to call the crawler script with a fake Solr URL; otherwise it will skip
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files
> (to mirror the elastic search indexer conf and others) and to make the
> indexing parameter simply toggle whether indexing does or doesn't occur
> instead of also trying to configure the indexer at the same time.
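The two call formats envisioned in the comment differ only in argument count, so the toggle can be derived from how many positional arguments were given. The sketch below illustrates that reading of the proposal. The functions parse_crawl_args and indexing_step and their variable names are hypothetical, and the real bin/crawl script may handle this differently.

```shell
# Hypothetical sketch; not the actual bin/crawl argument handling.
# Four arguments: seed dir, crawl dir, indexer toggle, rounds -> index.
# Three arguments: seed dir, crawl dir, rounds -> skip indexing.
parse_crawl_args() {
  case "$#" in
    4)
      SEED_DIR=$1
      CRAWL_DIR=$2
      RUN_INDEXER=true
      ROUNDS=$4
      ;;
    3)
      SEED_DIR=$1
      CRAWL_DIR=$2
      RUN_INDEXER=false
      ROUNDS=$3
      ;;
    *)
      echo "Usage: crawl <seedDir> <crawlDir> [run_indexer] <numberOfRounds>" >&2
      return 1
      ;;
  esac
}

# Report what the indexing step would do for the parsed arguments.
indexing_step() {
  if [ "$RUN_INDEXER" = true ]; then
    echo "Indexing $CRAWL_DIR using the indexer configured in conf/"
  else
    echo "Skipping indexing"
  fi
}
```

In this reading the third argument carries no configuration: its presence alone turns indexing on, and which indexer runs is left to the conf files.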
[jira] [Commented] (NUTCH-1987) Make bin/crawl indexer agnostic
[ https://issues.apache.org/jira/browse/NUTCH-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496426#comment-14496426 ]

Michael Joyce commented on NUTCH-1987:
--------------------------------------

Hi folks,

I'll have a patch up in a bit for this. I think my current plan to minimize the number of changes that I'm shoving into a single patch is to:
* Add solr.server.url to nutch-default and set the value to some sane default (http://127.0.0.1:8983/solr/)
* Make the 'index' calls in the bin/nutch script generic and slightly change the call format.
* Update some variable names and echos in the bin/crawl script so it doesn't only mention Solr and confuse people

I envision a call being something similar to this after these changes:
{code}
# Run the indexer
bin/crawl urls/ crawl/ "run_indexer" 1

# Don't run the indexer
bin/crawl urls/ crawl/ 1
{code}

I don't think this is necessarily the ideal solution but it minimizes calling formats for people with existing setups and only really requires that a single configuration value is added/updated.

Note, this change obviously requires some/many documentation updates. I'm more than happy to help with those as well but I wasn't including them in this ticket.

Thoughts?

> Make bin/crawl indexer agnostic
> -------------------------------
>
>                 Key: NUTCH-1987
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.9
>            Reporter: Michael Joyce
>             Fix For: 1.10
>
>
> The crawl script makes it a bit challenging to use an indexer that isn't
> Solr. For instance, when I want to use the indexer-elastic plugin I still
> need to call the crawler script with a fake Solr URL; otherwise it will skip
> the indexing step altogether.
> {code}
> bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
> {code}
> It would be nice to keep configuration for the Solr indexer in the conf files
> (to mirror the elastic search indexer conf and others) and to make the
> indexing parameter simply toggle whether indexing does or doesn't occur
> instead of also trying to configure the indexer at the same time.
[jira] [Created] (NUTCH-1987) Make bin/crawl indexer agnostic
Michael Joyce created NUTCH-1987:
---------------------------------

             Summary: Make bin/crawl indexer agnostic
                 Key: NUTCH-1987
                 URL: https://issues.apache.org/jira/browse/NUTCH-1987
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.9
            Reporter: Michael Joyce
             Fix For: 1.10


The crawl script makes it a bit challenging to use an indexer that isn't
Solr. For instance, when I want to use the indexer-elastic plugin I still
need to call the crawler script with a fake Solr URL; otherwise it will skip
the indexing step altogether.
{code}
bin/crawl urls/ crawl/ "http://fakeurl.com:9200" 1
{code}
It would be nice to keep configuration for the Solr indexer in the conf files
(to mirror the elastic search indexer conf and others) and to make the
indexing parameter simply toggle whether indexing does or doesn't occur
instead of also trying to configure the indexer at the same time.
[jira] [Created] (NUTCH-1986) Clarify Elastic Search Indexer Plugin Settings
Michael Joyce created NUTCH-1986:
---------------------------------

             Summary: Clarify Elastic Search Indexer Plugin Settings
                 Key: NUTCH-1986
                 URL: https://issues.apache.org/jira/browse/NUTCH-1986
             Project: Nutch
          Issue Type: Improvement
          Components: documentation, indexer, plugin
    Affects Versions: 1.9
            Reporter: Michael Joyce
             Fix For: 1.10


Was working on getting indexing into elastic search working and realized that
the majority of my difficulties were simply me misunderstanding what the
config needed. Patch incoming to hopefully clarify what is needed by default,
what each option does, and add any helpful defaults.
[jira] [Commented] (NUTCH-1972) Dockerfile for Nutch 1.x
[ https://issues.apache.org/jira/browse/NUTCH-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496257#comment-14496257 ]

Michael Joyce commented on NUTCH-1972:
--------------------------------------

Awesome, thanks for merging [~chrismattmann]!!

> Dockerfile for Nutch 1.x
> ------------------------
>
>                 Key: NUTCH-1972
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1972
>             Project: Nutch
>          Issue Type: Improvement
>          Components: deployment
>            Reporter: Michael Joyce
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: Joyce-NUTCH-1792-patch.txt
>
>
> Hi folks,
> I noticed that there was a Docker file for Nutch 2.x but I didn't see
> anything for 1.x. I figured I would throw something up real quick. Note that
> this currently doesn't install Solr. I didn't need it at the time when I was
> making this, but I'll work on getting it added before too long.
[jira] [Updated] (NUTCH-1972) Dockerfile for Nutch 1.x
[ https://issues.apache.org/jira/browse/NUTCH-1972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Joyce updated NUTCH-1972:
---------------------------------
    Attachment: Joyce-NUTCH-1792-patch.txt

Adding patch

> Dockerfile for Nutch 1.x
> ------------------------
>
>                 Key: NUTCH-1972
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1972
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Michael Joyce
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: Joyce-NUTCH-1792-patch.txt
>
>
> Hi folks,
> I noticed that there was a Docker file for Nutch 2.x but I didn't see
> anything for 1.x. I figured I would throw something up real quick. Note that
> this currently doesn't install Solr. I didn't need it at the time when I was
> making this, but I'll work on getting it added before too long.
[jira] [Created] (NUTCH-1972) Dockerfile for Nutch 1.x
Michael Joyce created NUTCH-1972:
---------------------------------

             Summary: Dockerfile for Nutch 1.x
                 Key: NUTCH-1972
                 URL: https://issues.apache.org/jira/browse/NUTCH-1972
             Project: Nutch
          Issue Type: Improvement
            Reporter: Michael Joyce
            Priority: Minor
             Fix For: 1.10


Hi folks,
I noticed that there was a Docker file for Nutch 2.x but I didn't see
anything for 1.x. I figured I would throw something up real quick. Note that
this currently doesn't install Solr. I didn't need it at the time when I was
making this, but I'll work on getting it added before too long.