[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011746#comment-17011746 ]

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on issue #95: NUTCH-2184 Enable IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/95#issuecomment-572532933

Closed in favor of #486
- indexing without a CrawlDb record has already been implemented in NUTCH-2456/#240
- various improvements from this PR have been integrated in #486
- separation of mapper and reducer classes is part of NUTCH-2375/#221

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can
> 'lose' data structures which are currently considered critical, e.g.
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no
> accompanying crawldb or linkdb.
> Absence of the latter is OK as linkdb is optional; however, currently in
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
> crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case
> where you ONLY have segments and want to force an index for every record
> present.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011747#comment-17011747 ]

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on pull request #95: NUTCH-2184 Enable IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/95
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011700#comment-17011700 ]

Hudson commented on NUTCH-2184:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3660 (See [https://builds.apache.org/job/Nutch-trunk/3660/])
NUTCH-2184 Enable IndexingJob to function with no crawldb - make the (snagel: [https://github.com/apache/nutch/commit/c4fade499e9487bb3df85a0f49abfee475bdc280])
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
NUTCH-2184 Enable IndexingJob to function with no crawldb - log if there (snagel: [https://github.com/apache/nutch/commit/57802d105259624bea20ea0e8be4cb3858d5716b])
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011681#comment-17011681 ]

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on pull request #486: NUTCH-2184 Enable IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/486
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980357#comment-16980357 ]

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on pull request #486: NUTCH-2184 Enable IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/486

This PR obsoletes #95 (parts of the work are already done in [NUTCH-2456](https://issues.apache.org/jira/browse/NUTCH-2456)/#240). It
- makes the CrawlDb argument passed to the indexing job optional
  - but does not change the behavior of the indexing job otherwise
  - if there are non-optional arguments, the first of them is expected to be the CrawlDb unless `-nocrawldb` is given
- picks various improvements from PR #95
- and improves the command-line help:
```
Usage: Indexer (<crawldb> | -nocrawldb) (<segment> ... | -dir <segments>) [general options]

Index given segments using configured indexer plugins

The CrawlDb is optional but it is required to send deletion requests for duplicates
and to read the proper document score/boost/weight passed to the indexers.

Required arguments:

    <crawldb>          path to CrawlDb, or
    -nocrawldb         flag to indicate that no CrawlDb shall be used

    <segment> ...      path(s) to segment, or
    -dir <segments>    path to segments/ directory,
                       (all subdirectories are read as segments)

General options:
    -linkdb <linkdb>        use LinkDb to index anchor texts of incoming links
    -params k1=v1&k2=v2...  parameters passed to indexer plugins
                            (via property indexer.additional.params)
    -noCommit               do not call the commit method of indexer plugins
    -deleteGone             send deletion requests for 404s, redirects, duplicates
    -filter                 skip documents with URL rejected by configured URL filters
    -normalize              normalize URLs before indexing
    -addBinaryContent       index raw/binary content in field `binaryContent`
    -base64                 use Base64 encoding for binary content
```
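The rule described in this comment (the first positional argument is the CrawlDb unless `-nocrawldb` is given) can be illustrated with a small sketch. This is not the code from PR #486: the class `IndexerArgsSketch`, its `Parsed` holder and the error messages are hypothetical, and only `-nocrawldb` and `-dir` are handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the argument handling described above; not the Nutch source. */
final class IndexerArgsSketch {

  /** Result of parsing; crawlDb stays null when -nocrawldb is given. */
  static final class Parsed {
    String crawlDb;
    String segmentsDir;
    final List<String> segments = new ArrayList<>();
  }

  static Parsed parse(String[] args) {
    Parsed parsed = new Parsed();
    boolean noCrawlDb = false;
    List<String> positional = new ArrayList<>();

    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      if (arg.equals("-nocrawldb")) {
        noCrawlDb = true;
      } else if (arg.equals("-dir")) {
        if (i + 1 >= args.length) {
          throw new IllegalArgumentException("-dir requires a path");
        }
        parsed.segmentsDir = args[++i];
      } else if (arg.startsWith("-")) {
        // Other options (-linkdb, -params, ...) are left out of this sketch.
        throw new IllegalArgumentException("option not handled in this sketch: " + arg);
      } else {
        positional.add(arg);
      }
    }

    // The first positional argument is the CrawlDb unless -nocrawldb was given.
    if (!noCrawlDb) {
      if (positional.isEmpty()) {
        throw new IllegalArgumentException("expected <crawldb> or -nocrawldb");
      }
      parsed.crawlDb = positional.remove(0);
    }
    parsed.segments.addAll(positional);

    if (parsed.segments.isEmpty() && parsed.segmentsDir == null) {
      throw new IllegalArgumentException("expected <segment> ... or -dir <segments>");
    }
    return parsed;
  }

  public static void main(String[] args) {
    Parsed withDb = parse(new String[] {"crawl/crawldb", "crawl/segments/20200101"});
    System.out.println(withDb.crawlDb + " " + withDb.segments);

    Parsed withoutDb = parse(new String[] {"-nocrawldb", "-dir", "crawl/segments"});
    System.out.println(withoutDb.crawlDb + " " + withoutDb.segmentsDir);
  }
}
```

Collecting the positional arguments first and deciding afterwards keeps the sketch independent of where `-nocrawldb` appears on the command line; whether the real tool accepts the flag in any position is not stated in the comment.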
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244723#comment-16244723 ]

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on a change in pull request #95: NUTCH-2184 Enable IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/95#discussion_r149795839

 ## File path: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ##
 @@ -52,14 +52,34 @@ import org.apache.nutch.protocol.Content;
  import org.apache.nutch.scoring.ScoringFilterException;
  import org.apache.nutch.scoring.ScoringFilters;
 -
 -public class IndexerMapReduce extends Configured implements
 -Mapper,
 -Reducer {
 +import org.apache.nutch.util.NutchConfiguration;
 +
 +/**
 + * This class is typically invoked from within
 + * {@link org.apache.nutch.indexer.IndexingJob}
 + * and handles all MapReduce functionality required
 + * when undertaking indexing.
 + * This is a consequence of one or more indexing plugins
 + * being invoked which extend
 + * {@link org.apache.nutch.indexer.IndexWriter}.
 + * See
 + * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, Collection, JobConf, boolean)}
 + * for details on the specific data structures and parameters required for indexing.
 + *
 + */
 +public class IndexerMapReduce {
    public static final Logger LOG = LoggerFactory
        .getLogger(IndexerMapReduce.class);

 +  // using normalizers and/or filters
 +  private static boolean normalize = false;
 +  private static boolean filter = false;
 +
 +  // url normalizers, filters and job configuration
 +  private static URLNormalizers urlNormalizers;
 +  private static URLFilters urlFilters;

 Review comment:
 This also does not work in distributed mode: mapper and reducer are executed in different tasks/JVMs, see [NUTCH-2375/#221](https://github.com/apache/nutch/pull/221#pullrequestreview-62780003) for the same problem.
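One way to avoid the problem raised in this review comment is to keep all task state in instance fields that each task sets from the job configuration it receives. The sketch below models that idea without Hadoop: a plain `Map` stands in for the job configuration, `SketchMapper` and `SketchReducer` are hypothetical classes, and the property names are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy model of per-task configuration; a Map stands in for the job configuration. */
final class PerTaskConfigSketch {

  // Property names are assumptions made for this illustration.
  static final String URL_NORMALIZING = "indexer.url.normalizers";
  static final String URL_FILTERING = "indexer.url.filters";
  static final String DELETE_GONE = "indexer.delete";

  /** Stand-in for a mapper task: all state is per instance and set in configure(). */
  static final class SketchMapper {
    private boolean normalize;
    private boolean filter;

    void configure(Map<String, String> job) {
      // Boolean.parseBoolean(null) is false, so unset properties disable the feature.
      normalize = Boolean.parseBoolean(job.get(URL_NORMALIZING));
      filter = Boolean.parseBoolean(job.get(URL_FILTERING));
    }

    /** Returns the (possibly normalized) URL, or null if the toy filter rejects it. */
    String map(String url) {
      String result = normalize ? url.trim().toLowerCase(Locale.ROOT) : url;
      if (filter && result.endsWith(".jpg")) {
        return null;
      }
      return result;
    }
  }

  /** Stand-in for a reducer task with its own, independent state. */
  static final class SketchReducer {
    private boolean deleteGone;

    void configure(Map<String, String> job) {
      deleteGone = Boolean.parseBoolean(job.get(DELETE_GONE));
    }

    boolean deletesGone() {
      return deleteGone;
    }
  }

  public static void main(String[] args) {
    Map<String, String> job = new HashMap<>();
    job.put(URL_NORMALIZING, "true");
    job.put(DELETE_GONE, "true");

    // Each task gets its own copy of the job settings, as separate JVMs would.
    SketchMapper mapper = new SketchMapper();
    mapper.configure(new HashMap<>(job));
    SketchReducer reducer = new SketchReducer();
    reducer.configure(new HashMap<>(job));

    System.out.println(mapper.map(" HTTP://Example.COM/A "));
    System.out.println(reducer.deletesGone());
  }
}
```

Because neither class reads a field that the other one writes, the mapper and the reducer behave the same whether they share a JVM or not.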
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068597#comment-16068597 ]

Lewis John McGibbney commented on NUTCH-2184:
---

Hi [~markus17] I need to finish the bloody MR tests over at https://github.com/apache/nutch/pull/95. I am very tight on cycles right now. If you can pick up the patch and want to work with it then by all means please go ahead ;)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062909#comment-16062909 ]

Markus Jelsma commented on NUTCH-2184:
---

Any progress on this one?
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304651#comment-15304651 ]

ASF GitHub Bot commented on NUTCH-2184:
---

Github user naegelejd commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/95#discussion_r64957448

    --- Diff: src/java/org/apache/nutch/indexer/IndexingJob.java ---
    @@ -155,43 +161,146 @@ public void index(Path crawlDb, Path linkDb, List segments,
                 counter.getName());
           }
           long end = System.currentTimeMillis();
    -      LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: "
    -          + TimingUtil.elapsedTime(start, end));
    +      LOG.info("Indexer: finished at {}, elapsed: {}", sdf.format(end),
    +          TimingUtil.elapsedTime(start, end));
         } finally {
           FileSystem.get(job).delete(tmp, true);
         }
       }

       public int run(String[] args) throws Exception {
    -    if (args.length < 2) {
    -      System.err
    -      //.println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]");
    -      .println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]");
    -      IndexWriters writers = new IndexWriters(getConf());
    -      System.err.println(writers.describe());
    -      return -1;
    -    }
    -
    -    final Path crawlDb = new Path(args[0]);
    -    Path linkDb = null;
    -
    -    final List segments = new ArrayList();
    -    String params = null;
    -
    -    boolean noCommit = false;
    -    boolean deleteGone = false;
    -    boolean filter = false;
    -    boolean normalize = false;
    -    boolean addBinaryContent = false;
    -    boolean base64 = false;
    +    // boolean options
    +    Option helpOpt = new Option("h", "help", false, "show this help message");
    +    // argument options
    +    @SuppressWarnings("static-access")
    +    Option crawldbOpt = OptionBuilder
    +        .withArgName("crawldb")
    +        .hasArg()
    +        .withDescription(
    +            "a crawldb directory to use with this tool (optional)")
    +        .create("crawldb");
    +    @SuppressWarnings("static-access")
    +    Option linkdbOpt = OptionBuilder
    +        .withArgName("linkdb")
    +        .hasArg()
    +        .withDescription(
    +            "a linkdb directory to use with this tool (optional)")
    +        .create("linkdb");
    +    @SuppressWarnings("static-access")
    +    Option paramsOpt = OptionBuilder
    +        .withArgName("params")
    +        .hasArg()
    +        .withDescription(
    +            "key value parameters to be used with this tool e.g. k1=v1&k2=v2... (optional)")
    +        .create("params");
    +    @SuppressWarnings("static-access")
    +    Option segOpt = OptionBuilder
    +        .withArgName("segment")
    +        .hasArgs()
    +        .withDescription("the segment(s) to use (either this or --segmentDir is mandatory)")
    +        .create("segment");
    +    @SuppressWarnings("static-access")
    +    Option segmentDirOpt = OptionBuilder
    +        .withArgName("segmentDir")
    +        .hasArg()
    +        .withDescription(
    +            "directory containing one or more segments to be used with this tool "
    +                + "(either this or --segment is mandatory)")
    +        .create("segmentDir");
    +    @SuppressWarnings("static-access")
    +    Option noCommitOpt = OptionBuilder
    +        .withArgName("noCommit")
    +        .withDescription(
    +            "do the commits once and for all the reducers in one go (optional)")
    --- End diff --

    This description is backward: the "-noCommit" option tells the Indexer *not* to do a final commit after the job finishes.
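The semantics the reviewer describes for `-noCommit` can be shown with a small sketch: documents are always written, and the final commit is skipped only when the flag is set. `NoCommitSketch`, `RecordingWriter` and `index` are hypothetical names used for illustration; they are not Nutch classes or methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the -noCommit flag; not the Nutch implementation. */
final class NoCommitSketch {

  /** Hypothetical writer that records what it was asked to do. */
  static final class RecordingWriter {
    final List<String> written = new ArrayList<>();
    int commits;

    void write(String doc) {
      written.add(doc);
    }

    void commit() {
      commits++;
    }
  }

  /** Writes all documents, then commits once unless noCommit is set. */
  static void index(List<String> docs, RecordingWriter writer, boolean noCommit) {
    for (String doc : docs) {
      writer.write(doc);
    }
    if (!noCommit) {
      writer.commit();
    }
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList("doc1", "doc2");

    RecordingWriter committed = new RecordingWriter();
    index(docs, committed, false);
    System.out.println("default: commits=" + committed.commits);

    RecordingWriter uncommitted = new RecordingWriter();
    index(docs, uncommitted, true);
    System.out.println("-noCommit: commits=" + uncommitted.commits);
  }
}
```

Under this reading, a help string along the lines of "do not call the commit method of indexer plugins" describes the flag accurately, whereas the description in the diff above states the opposite.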
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176497#comment-15176497 ]

ASF GitHub Bot commented on NUTCH-2184:
---

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/95#discussion_r54792239

    --- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
    @@ -52,14 +52,34 @@ import org.apache.nutch.protocol.Content;
     import org.apache.nutch.scoring.ScoringFilterException;
     import org.apache.nutch.scoring.ScoringFilters;
    -
    -public class IndexerMapReduce extends Configured implements
    -Mapper,
    -Reducer {
    +import org.apache.nutch.util.NutchConfiguration;
    +
    +/**
    + * This class is typically invoked from within
    + * {@link org.apache.nutch.indexer.IndexingJob}
    + * and handles all MapReduce functionality required
    + * when undertaking indexing.
    + * This is a consequence of one or more indexing plugins
    + * being invoked which extend
    + * {@link org.apache.nutch.indexer.IndexWriter}.
    + * See
    + * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, Collection, JobConf, boolean)}
    + * for details on the specific data structures and parameters required for indexing.
    + *
    + */
    +public class IndexerMapReduce {
       public static final Logger LOG = LoggerFactory
           .getLogger(IndexerMapReduce.class);

    +  // using normalizers and/or filters
    +  private static boolean normalize = false;
    +  private static boolean filter = false;
    +
    +  // url normalizers, filters and job configuration
    +  private static URLNormalizers urlNormalizers;
    +  private static URLFilters urlFilters;
    --- End diff --

    No, this would duplicate the variables. Maybe pull out the nested static classes, see this trial https://github.com/sebastian-nagel/nutch/tree/NUTCH-2184
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176462#comment-15176462 ]

ASF GitHub Bot commented on NUTCH-2184:
---

Github user lewismc commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/95#discussion_r54788893

    --- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
    @@ -52,14 +52,34 @@ import org.apache.nutch.protocol.Content;
     import org.apache.nutch.scoring.ScoringFilterException;
     import org.apache.nutch.scoring.ScoringFilters;
    -
    -public class IndexerMapReduce extends Configured implements
    -Mapper,
    -Reducer {
    +import org.apache.nutch.util.NutchConfiguration;
    +
    +/**
    + * This class is typically invoked from within
    + * {@link org.apache.nutch.indexer.IndexingJob}
    + * and handles all MapReduce functionality required
    + * when undertaking indexing.
    + * This is a consequence of one or more indexing plugins
    + * being invoked which extend
    + * {@link org.apache.nutch.indexer.IndexWriter}.
    + * See
    + * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, Collection, JobConf, boolean)}
    + * for details on the specific data structures and parameters required for indexing.
    + *
    + */
    +public class IndexerMapReduce {
       public static final Logger LOG = LoggerFactory
           .getLogger(IndexerMapReduce.class);

    +  // using normalizers and/or filters
    +  private static boolean normalize = false;
    +  private static boolean filter = false;
    +
    +  // url normalizers, filters and job configuration
    +  private static URLNormalizers urlNormalizers;
    +  private static URLFilters urlFilters;
    --- End diff --

    Thanks @sebastian-nagel, you suggest we create the variables within the mapper and reducer respectively?
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176368#comment-15176368 ]

ASF GitHub Bot commented on NUTCH-2184:
---

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/95#discussion_r54782218

    --- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
    @@ -166,235 +145,310 @@ private String filterUrl(String url) {
         return url;
       }

    -  public void map(Text key, Writable value,
    -      OutputCollector output, Reporter reporter)
    -      throws IOException {
    +  /**
    +   * Implementation of {@link org.apache.hadoop.mapred.Mapper}
    +   * which optionally normalizes then filters a URL before simply
    +   * collecting key and values with the keys being URLs (manifested
    +   * as {@link org.apache.hadoop.io.Text}) and the
    +   * values as {@link org.apache.nutch.crawl.NutchWritable} instances
    +   * of {@link org.apache.nutch.crawl.CrawlDatum}.
    +   */
    +  public static class IndexerMapReduceMapper implements Mapper {
    +
    +    @Override
    +    public void configure(JobConf job) {
    +    }
    +
    +    public void map(Text key, Writable value,
    +        OutputCollector output, Reporter reporter)
    +        throws IOException {
    +
    +      String urlString = filterUrl(normalizeUrl(key.toString()));
    +      if (urlString == null) {
    +        return;
    +      } else {
    +        key.set(urlString);
    +      }
    +
    +      output.collect(key, new NutchWritable(value));
    +    }

    -    String urlString = filterUrl(normalizeUrl(key.toString()));
    -    if (urlString == null) {
    -      return;
    -    } else {
    -      key.set(urlString);
    +    @Override
    +    public void close() throws IOException {
         }

    -    output.collect(key, new NutchWritable(value));
       }

    -  public void reduce(Text key, Iterator values,
    -      OutputCollector output, Reporter reporter)
    -      throws IOException {
    -    Inlinks inlinks = null;
    -    CrawlDatum dbDatum = null;
    -    CrawlDatum fetchDatum = null;
    -    Content content = null;
    -    ParseData parseData = null;
    -    ParseText parseText = null;
    -
    -    while (values.hasNext()) {
    -      final Writable value = values.next().get(); // unwrap
    -      if (value instanceof Inlinks) {
    -        inlinks = (Inlinks) value;
    -      } else if (value instanceof CrawlDatum) {
    -        final CrawlDatum datum = (CrawlDatum) value;
    -        if (CrawlDatum.hasDbStatus(datum)) {
    -          dbDatum = datum;
    -        } else if (CrawlDatum.hasFetchStatus(datum)) {
    -          // don't index unmodified (empty) pages
    -          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
    -            fetchDatum = datum;
    +  /**
    +   * Implementation of {@link org.apache.hadoop.mapred.Reducer}
    +   * which generates {@link org.apache.nutch.indexer.NutchIndexAction}'s
    +   * from combinations of various Nutch data structures. Essentially
    +   * teh result is a key representing a URL and a value representing a
    +   * unit of indexing holding the document and action information.
    +   */
    +  public static class IndexerMapReduceReducer implements Reducer {
    +
    +    private boolean skip = false;
    +    private boolean delete = false;
    +    private boolean deleteRobotsNoIndex = false;
    +    private boolean deleteSkippedByIndexingFilter = false;
    +    private boolean base64 = false;
    +    private IndexingFilters filters;
    +    private ScoringFilters scfilters;
    +
    +    @Override
    +    public void configure(JobConf job) {
    +      Configuration conf = NutchConfiguration.create();
    --- End diff --

    JobConf extends Configuration, there should be no need to create a new Configuration object.
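The reviewer's point rests on ordinary subtyping: since `JobConf` extends `Configuration`, the `job` argument can be passed wherever a `Configuration` is expected. The sketch below shows the same relationship with Hadoop-free stand-ins; `Conf`, `JobConfLike`, `Filters` and the property name are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of "JobConf extends Configuration"; all classes here are stand-ins. */
final class ConfSubclassSketch {

  // Property name assumed for illustration.
  static final String SKIP_NOTMODIFIED = "indexer.skip.notmodified";

  /** Stand-in for Hadoop's Configuration. */
  static class Conf {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
      props.put(key, value);
    }

    boolean getBoolean(String key, boolean defaultValue) {
      String value = props.get(key);
      return value == null ? defaultValue : Boolean.parseBoolean(value);
    }
  }

  /** Stand-in for JobConf, which is a subclass of Configuration. */
  static final class JobConfLike extends Conf {
  }

  /** Stand-in for a component that accepts any Conf. */
  static final class Filters {
    final boolean skip;

    Filters(Conf conf) {
      skip = conf.getBoolean(SKIP_NOTMODIFIED, false);
    }
  }

  /** Stand-in for the reducer: it passes the job configuration straight through. */
  static final class SketchReducer {
    Filters filters;

    void configure(JobConfLike job) {
      // No new configuration object is needed: job already is a Conf.
      filters = new Filters(job);
    }
  }

  public static void main(String[] args) {
    JobConfLike job = new JobConfLike();
    job.set(SKIP_NOTMODIFIED, "true");

    SketchReducer reducer = new SketchReducer();
    reducer.configure(job);
    System.out.println("from job: skip=" + reducer.filters.skip);

    // In this toy model, a freshly created Conf does not see the job's settings.
    System.out.println("from new Conf: skip=" + new Filters(new Conf()).skip);
  }
}
```

The second print is a property of this sketch only; how a freshly created Nutch configuration relates to the job's settings is not covered by the review comment.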
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176363#comment-15176363 ] ASF GitHub Bot commented on NUTCH-2184: --- Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/95#discussion_r54781452 --- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java --- @@ -52,14 +52,34 @@ import org.apache.nutch.protocol.Content; import org.apache.nutch.scoring.ScoringFilterException; import org.apache.nutch.scoring.ScoringFilters; - -public class IndexerMapReduce extends Configured implements -Mapper, -Reducer { +import org.apache.nutch.util.NutchConfiguration; + +/** + * This class is typically invoked from within + * {@link org.apache.nutch.indexer.IndexingJob} + * and handles all MapReduce functionality required + * when undertaking indexing. + * This is a consequence of one or more indexing plugins + * being invoked which extend + * {@link org.apache.nutch.indexer.IndexWriter}. + * See + * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, Collection, JobConf, boolean)} + * for details on the specific data structures and parameters required for indexing. + * + */ +public class IndexerMapReduce { public static final Logger LOG = LoggerFactory .getLogger(IndexerMapReduce.class); + // using normalizers and/or filters + private static boolean normalize = false; + private static boolean filter = false; + + // url normalizers, filters and job configuration + private static URLNormalizers urlNormalizers; + private static URLFilters urlFilters; --- End diff -- Why are these 4 member variables now static? Also, it looks weird if a static variable of the outer class is initialized in a non-static method of one inner class (IndexerMapReduceReducer.config()). The mapper class cannot be used without instantiating the reducer. 
> Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
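The review comment above objects to static fields on the outer class being initialized from the reducer. A minimal sketch of the alternative it implies, where each mapper instance owns its settings and sets them in its own configure step, might look like the following. `MapperStateSketch`, `prepareKey` and the toy normalizer and filter are hypothetical stand-ins for illustration only; they are not the Hadoop or Nutch API and not the code from the pull request.

```java
import java.util.Locale;

// Hypothetical stand-in for a mapper; not the Hadoop or Nutch API.
final class MapperStateSketch {

  // Instance fields: each mapper owns its settings, nothing is static.
  private boolean normalize;
  private boolean filter;

  /** Called once per instance, loosely mirroring a configure(JobConf) hook. */
  void configure(boolean normalize, boolean filter) {
    this.normalize = normalize;
    this.filter = filter;
  }

  /** Returns the URL to emit as key, or null if the record should be skipped. */
  String prepareKey(String url) {
    String result = url;
    if (normalize) {
      // Toy normalizer (assumption): trim and lower-case the URL.
      result = result.trim().toLowerCase(Locale.ROOT);
    }
    if (filter && !result.startsWith("http")) {
      // Toy filter (assumption): only keep http/https URLs.
      return null;
    }
    return result;
  }

  public static void main(String[] args) {
    MapperStateSketch strict = new MapperStateSketch();
    strict.configure(true, true);
    MapperStateSketch lenient = new MapperStateSketch();
    lenient.configure(false, false);

    // The two instances do not share state, and no reducer is involved.
    System.out.println(strict.prepareKey(" HTTP://Example.org/ "));
    System.out.println(strict.prepareKey("ftp://example.org/"));
    System.out.println(lenient.prepareKey("ftp://example.org/"));
  }
}
```

With instance fields, the mapper can be constructed and tested on its own, which is the property the reviewer asks for.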
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176328#comment-15176328 ] ASF GitHub Bot commented on NUTCH-2184: --- Github user sebastian-nagel commented on a diff in the pull request: https://github.com/apache/nutch/pull/95#discussion_r54779042

    --- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
    @@ -166,235 +145,310 @@ private String filterUrl(String url) {
         return url;
       }
    -  public void map(Text key, Writable value,
    -      OutputCollector output, Reporter reporter)
    -      throws IOException {
    +  /**
    +   * Implementation of {@link org.apache.hadoop.mapred.Mapper}
    +   * which optionally normalizes then filters a URL before simply
    +   * collecting key and values with the keys being URLs (manifested
    +   * as {@link org.apache.hadoop.io.Text}) and the
    +   * values as {@link org.apache.nutch.crawl.NutchWritable} instances
    +   * of {@link org.apache.nutch.crawl.CrawlDatum}.
    +   */
    +  public static class IndexerMapReduceMapper implements Mapper {
    +
    +    @Override
    +    public void configure(JobConf job) {
    +    }
    +
    +    public void map(Text key, Writable value,
    +        OutputCollector output, Reporter reporter)
    +        throws IOException {
    +
    +      String urlString = filterUrl(normalizeUrl(key.toString()));
    +      if (urlString == null) {
    +        return;
    +      } else {
    +        key.set(urlString);
    +      }
    +
    +      output.collect(key, new NutchWritable(value));
    +    }
    -    String urlString = filterUrl(normalizeUrl(key.toString()));
    -    if (urlString == null) {
    -      return;
    -    } else {
    -      key.set(urlString);
    +    @Override
    +    public void close() throws IOException {
         }
    -    output.collect(key, new NutchWritable(value));
       }
    -  public void reduce(Text key, Iterator values,
    -      OutputCollector output, Reporter reporter)
    -      throws IOException {
    -    Inlinks inlinks = null;
    -    CrawlDatum dbDatum = null;
    -    CrawlDatum fetchDatum = null;
    -    Content content = null;
    -    ParseData parseData = null;
    -    ParseText parseText = null;
    -
    -    while (values.hasNext()) {
    -      final Writable value = values.next().get(); // unwrap
    -      if (value instanceof Inlinks) {
    -        inlinks = (Inlinks) value;
    -      } else if (value instanceof CrawlDatum) {
    -        final CrawlDatum datum = (CrawlDatum) value;
    -        if (CrawlDatum.hasDbStatus(datum)) {
    -          dbDatum = datum;
    -        } else if (CrawlDatum.hasFetchStatus(datum)) {
    -          // don't index unmodified (empty) pages
    -          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
    -            fetchDatum = datum;
    +  /**
    +   * Implementation of {@link org.apache.hadoop.mapred.Reducer}
    +   * which generates {@link org.apache.nutch.indexer.NutchIndexAction}'s
    +   * from combinations of various Nutch data structures. Essentially
    +   * teh result is a key representing a URL and a value representing a
    --- End diff --

typo teh -> the
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176122#comment-15176122 ] Lewis John McGibbney commented on NUTCH-2184: - sh*t, I didn't push up my assertions. I'll get them into the patch and also cover the other test cases you've quoted. Ta
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175400#comment-15175400 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis! I don't understand the unit test: there is one assertion and it is commented out. I am probably missing something about that mapreduce driver. Also, there doesn't seem to be a test that indexes mock segment data with and without crawldb, which would be nice. Rest looks fine!
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175147#comment-15175147 ] Chris A. Mattmann commented on NUTCH-2184: -- +1, this is much-needed functionality for us on MEMEX and on my USC projects, where we have data collected by students that doesn't always include a crawldb.
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175004#comment-15175004 ] ASF GitHub Bot commented on NUTCH-2184: --- GitHub user lewismc opened a pull request: https://github.com/apache/nutch/pull/95

NUTCH-2184 Enable IndexingJob to function with no crawldb

OK folks, this issue addresses https://issues.apache.org/jira/browse/NUTCH-2184 by
* rebasing the [NUTCH-2184v2.patch](https://issues.apache.org/jira/secure/attachment/12784260/NUTCH-2184v2.patch) against master branch
* making the IndexerMapReduceMapper and IndexerMapReduceReducer in IndexerMapReduce code explicit so that these functions can be tested
* adding in some MRUnit tests for testing the IndexerMapReduceMapper and IndexerMapReduceReducer
* removing some trivial imports which are unused
* formatting ivy.xml which has somehow (again) become a dog's dinner
* adding default constructor to NutchIndexAction()

Any questions, then please let me know. I would really appreciate it if people could pull this code and try it out within your test or local environment. Thanks, also thanks Markus for the original suggestions for tests, etc.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lewismc/nutch NUTCH-2184

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/95.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #95

commit c4429eb7e4a33fc619cea5e5d6c26f54969e4f55
Author: Lewis John McGibbney
Date: 2016-03-02T04:21:52Z

    NUTCH-2184 Enable IndexingJob to function with no crawldb
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083092#comment-15083092 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis!
* It should be no problem. But since IndexerMapReduce is complicated, I would love to have a simple unit test for it so we can guard against breaking things.
* We should make sure the fetchDatum also carries the desired parseData fields that we usually store in the CrawlDatum. This is not always true, see the fix I did for NUTCH-2093. I think it should be possible to index as many fields as with the CrawlDatum. If you implement it as such, then custom indexing filters that use CrawlDatum still work :)
* Well yes, I think I already answered that question myself indeed, silly. This feature would be very handy for small segments but large CrawlDBs!
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074453#comment-15074453 ] Lewis John McGibbney commented on NUTCH-2184: - [~markus17] coming back to this one briefly, I've thought about the points you've raised and wanted to make the following points.
1. This proposed patch does not change core indexing functionality as such, it instead extends (improves???) it to permit indexing of just segments.
2. If you experience the scenarios you've highlighted (e.g. possible configurations for index.*.md and db.parsemeta.to.crawldb), then AFAICT nothing changes... if you have the crawldb then they are used, if not then they are not. If I am wrong here, can you point to the code that I need to have a look at? I didn't originally put the index.*.md and db.parsemeta.to.crawldb functionality in place so a bit of guidance would be nice here.
3. Finally, to address the following
> Also, what is going to happen to transient errors? Records with FETCH_STATUS_RETRY should be ignored.
On the final one I am not sure right now. I will revisit this patch post vacation (2nd week January). If you can provide feedback on 2 above then I'll hammer my way through that first. Thanks [~markus17]
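The open question in point 3, together with the goal of the ticket, can be sketched as a small decision function: index a record when its fetch succeeded and parse output exists, treat the CrawlDb entry as optional, and skip transient (retry) fetches. This is only an illustration under stated assumptions. `IndexDecisionSketch`, `FetchOutcome` and the `requireDbDatum` switch are hypothetical names, not Nutch code; the diff quoted elsewhere in this thread shows Nutch itself using constants such as `CrawlDatum.STATUS_FETCH_NOTMODIFIED`.

```java
// Hypothetical decision logic; not taken from IndexerMapReduce.
final class IndexDecisionSketch {

  /** Hypothetical fetch outcomes standing in for CrawlDatum status codes. */
  enum FetchOutcome { SUCCESS, RETRY, NOT_MODIFIED, GONE }

  /**
   * Decides whether a record should be sent to the index.
   *
   * @param hasDbDatum     true if a CrawlDb entry was found for the URL
   * @param requireDbDatum true to keep the old behavior (CrawlDb mandatory)
   * @param fetch          outcome recorded in the segment, may be null
   * @param hasParseData   true if parse_data holds an entry for the URL
   * @param hasParseText   true if parse_text holds an entry for the URL
   */
  static boolean shouldIndex(boolean hasDbDatum, boolean requireDbDatum,
      FetchOutcome fetch, boolean hasParseData, boolean hasParseText) {
    if (requireDbDatum && !hasDbDatum) {
      return false; // old behavior: no CrawlDb entry, no document
    }
    if (fetch != FetchOutcome.SUCCESS) {
      return false; // retry, not-modified, gone or missing fetch data
    }
    return hasParseData && hasParseText;
  }

  public static void main(String[] args) {
    // Segment-only indexing: no CrawlDb entry, successful fetch.
    System.out.println(shouldIndex(false, false, FetchOutcome.SUCCESS, true, true));
    // Same record when the CrawlDb is still mandatory.
    System.out.println(shouldIndex(false, true, FetchOutcome.SUCCESS, true, true));
    // Transient error: ignored in either mode.
    System.out.println(shouldIndex(true, false, FetchOutcome.RETRY, true, true));
  }
}
```

A test that covers "with and without crawldb", as requested elsewhere in this thread, would exercise both values of the `hasDbDatum` argument against the same segment data.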
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060155#comment-15060155 ] Lewis John McGibbney commented on NUTCH-2184: - Ack On Wednesday, December 16, 2015, Markus Jelsma (JIRA) -- *Lewis*
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060036#comment-15060036 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis - you can use the indexer-dummy in unit tests to easily check indexer output against expected output.
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060023#comment-15060023 ] Lewis John McGibbney commented on NUTCH-2184: - Excellent points Markus, thanks for bringing them up. I'll revisit today. These issues are more difficult to characterize and I expect a few NPEs until we get it right. I think I'll start writing tests for each plugin which relies upon the dbDatum to ensure coverage is sufficient. I'll post a new patch today. On Wednesday, December 16, 2015, Markus Jelsma (JIRA) -- *Lewis*
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059750#comment-15059750 ] Markus Jelsma commented on NUTCH-2184: -- Hello Lewis - keep in mind the possible configurations for index.*.md and db.parsemeta.to.crawldb. Different settings may lead to peculiar situations.
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059489#comment-15059489 ] Lewis John McGibbney commented on NUTCH-2184: - No, just the following (see https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L370-L382): crawl_fetch, crawl_parse, parse_data and parse_text. If you want to index the binary content with -addBinaryContent, then the 'content' subdirectory needs to be present as well. [~sujenshah]
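The subdirectory requirement described in this comment can be checked before an indexing run. The following is a minimal sketch that assumes the segment sits on a local filesystem reachable through `java.nio.file` (segments on HDFS would need the Hadoop FileSystem API instead). `SegmentLayoutSketch`, `missingParts` and `demo` are hypothetical helpers, not part of Nutch; only the directory names and the `-addBinaryContent` option come from the comment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check; not part of Nutch.
final class SegmentLayoutSketch {

  // Subdirectories named in the comment as needed for indexing.
  static final List<String> REQUIRED =
      List.of("crawl_fetch", "crawl_parse", "parse_data", "parse_text");

  /** Lists the subdirectories that indexing would need but that are absent. */
  static List<String> missingParts(Path segment, boolean addBinaryContent) {
    List<String> missing = new ArrayList<>();
    for (String name : REQUIRED) {
      if (!Files.isDirectory(segment.resolve(name))) {
        missing.add(name);
      }
    }
    // 'content' only matters when binary content is to be indexed.
    if (addBinaryContent && !Files.isDirectory(segment.resolve("content"))) {
      missing.add("content");
    }
    return missing;
  }

  /** Builds a throwaway segment lacking parse_text and content, then checks it. */
  static List<String> demo(boolean addBinaryContent) {
    try {
      Path segment = Files.createTempDirectory("segment-sketch");
      for (String name : List.of("crawl_fetch", "crawl_parse", "parse_data")) {
        Files.createDirectory(segment.resolve(name));
      }
      return missingParts(segment, addBinaryContent);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo(false)); // prints [parse_text]
    System.out.println(demo(true));  // prints [parse_text, content]
  }
}
```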
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059478#comment-15059478 ] Sujen Shah commented on NUTCH-2184: --- This patch requires the segment directory to have all its folders (content, crawl_generate, etc.), right?
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059474#comment-15059474 ] Sujen Shah commented on NUTCH-2184: --- This patch requires the segment directory to have all its folders (content, crawl_generate, etc) right ? > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059476#comment-15059476 ] Sujen Shah commented on NUTCH-2184: --- This patch requires the segment directory to have all its folders (content, crawl_generate, etc) right ? > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059475#comment-15059475 ] Sujen Shah commented on NUTCH-2184: --- This patch requires the segment directory to have all its folders (content, crawl_generate, etc) right ? > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059473#comment-15059473 ] Sujen Shah commented on NUTCH-2184: --- This patch requires the segment directory to have all its folders (content, crawl_generate, etc) right ? > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059459#comment-15059459 ] Lewis John McGibbney commented on NUTCH-2184: - I've tested this on scores of segments today and it is working great. > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059417#comment-15059417 ] Sujen Shah commented on NUTCH-2184: --- Thanks [~lewismc] > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059016#comment-15059016 ] Chris A. Mattmann commented on NUTCH-2184: -- +1 > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058977#comment-15058977 ] Lewis John McGibbney commented on NUTCH-2184: -
To describe what this patch does
* implements much clearer command line logic (using commons-cli) for IndexingJob
* permits continuation of IndexerMapReduce based upon exclusion of dbDatum == null in OR logic meaning that we can still index segment(s) even if they are not accompanied by a crawldb CrawlDatum
* undertakes all necessary checks to ensure that a dbDatum object with value null is never accessed
* implements the same error checking in all ScoringFilters which @override the #indexerScore method ensuring that we never try to access a null value within the MR job.

That's it folks. I've tried this with a bunch of test segments and I am now able to index segments without a crawldb or linkdb. It should be noted that this complies with backwards compatibility ensuring that existing scripts, etc. will still work when running the index command.
> Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
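The null handling described in the patch notes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the NUTCH-2184 patch itself: the class `NullSafeIndexingSketch`, its nested `Datum` and `Record` types, and the methods `shouldIndex` and `indexerScore` are all hypothetical stand-ins, and the scoring formula is an assumption chosen only to show where a null `dbDatum` would otherwise be dereferenced.

```java
/**
 * Hypothetical sketch of indexing a segment record when no crawldb entry exists.
 * None of these types are Nutch classes; they only mirror the idea in the comment.
 */
final class NullSafeIndexingSketch {

    /** Stand-in for a crawl status entry (NOT Nutch's CrawlDatum). */
    static final class Datum {
        final float score;
        Datum(float score) { this.score = score; }
    }

    /** Stand-in for everything the reducer has collected for one URL. */
    static final class Record {
        final String url;
        final Datum dbDatum;     // null when the job ran without a crawldb
        final Datum fetchDatum;  // comes from the segment
        final String parseText;  // comes from the segment

        Record(String url, Datum dbDatum, Datum fetchDatum, String parseText) {
            this.url = url;
            this.dbDatum = dbDatum;
            this.fetchDatum = fetchDatum;
            this.parseText = parseText;
        }
    }

    /**
     * Decide whether a record can be indexed. The dbDatum is deliberately
     * left out of the check, so segment-only records are still accepted.
     */
    static boolean shouldIndex(Record r) {
        return r.fetchDatum != null && r.parseText != null;
    }

    /**
     * Assumed scoring rule for illustration: scale the initial score by the
     * crawldb score when one exists, otherwise pass the initial score through.
     */
    static float indexerScore(Datum dbDatum, float initScore) {
        if (dbDatum == null) {
            return initScore; // never touch a null dbDatum
        }
        return dbDatum.score * initScore;
    }

    public static void main(String[] args) {
        Record segmentOnly =
            new Record("http://example.org/", null, new Datum(1.0f), "page text");
        System.out.println("index segment-only record: " + shouldIndex(segmentOnly));
        System.out.println("score without crawldb: " + indexerScore(segmentOnly.dbDatum, 2.0f));
    }
}
```

Keeping the null test inside the scoring method, rather than at every call site, is one way to get the "same error checking in all ScoringFilters" behaviour the comment describes; whether the real patch does it this way is not stated in the thread.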
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058962#comment-15058962 ] Lewis John McGibbney commented on NUTCH-2184: - Issue is logged at NUTCH-2186 > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058955#comment-15058955 ] Lewis John McGibbney commented on NUTCH-2184: - I am going to open another issue which references the above as this can only be reproduced when the -addBinaryContent flag is used. > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > Attachments: NUTCH-2184.patch > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056690#comment-15056690 ] Lewis John McGibbney commented on NUTCH-2184: -
This issue also improves command line parsing for the IndexingJob tool with the following help if invoked without arguments
{code}
lmcgibbn@LMC-032857 /usr/local/trunk_new(joshua) $ ./runtime/local/bin/nutch index
Failed to parse command line
Did not see expected # of arguments, saw 0
usage: IndexingJob [-crawldb ] [-linkdb ] [-params k1=v1&k2=v2...] ( ... | -dir ) [-noCommit] [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]
 -abc,--addBinaryContent   add the raw content of the document to the indexing job (optional)
 -b,--base64               if raw content is added, base64 encode it (optional)
 -c,--crawldb              a crawldb directory to use with this tool (optional)
 -dg,--deleteGone          delete gone documents e.g. documents which no longer exist at the particular resource (optional)
 -f,--filter               filter documents (optional)
 -l,--linkdb               a linkdb directory to use with this tool (optional)
 -n,--normalize            normalize documents (optional)
 -nc,--noCommit            do the commits once and for all the reducers in one go (optional)
 -p,--params               key value parameters to be used with this tool e.g. k1=v1&k2=v2... (optional)
 -s,--segment              a single segment directory to be used with this tool (either this or -segmentDir is mandatory)
 -sd,--segmentDir          a directory containing one or more segments to be used with this tool (either this or -segment is mandatory)
Active IndexWriters :
SolrIndexWriter
  solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
  solr.server.url : URL of the Solr instance (mandatory)
  solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
  solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
  solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
  solr.commit.size : buffer size when sending to Solr (default 1000)
  solr.auth : use authentication (default false)
  solr.auth.username : username for authentication
  solr.auth.password : password for authentication
{code}
> Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
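The option rules in the help text above (crawldb and linkdb optional, one of segment or segmentDir mandatory) can be illustrated with a small validation sketch. The patch itself is said to use commons-cli; the version below swaps that for plain JDK code so it runs without extra dependencies. The class `IndexArgsSketch`, its `parse` method, and the reduced option set are assumptions made for illustration and do not reproduce the real IndexingJob parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical, JDK-only sketch of the argument rules shown in the help text.
 * Only four value-taking options are modelled; boolean flags are left out.
 */
final class IndexArgsSketch {

    /**
     * Parses "-name value" pairs into a map keyed by option name.
     * Throws IllegalArgumentException for unknown options, missing values,
     * or when neither -segment nor -segmentDir is supplied.
     */
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String name = args[i];
            switch (name) {
                case "-crawldb":     // optional
                case "-linkdb":      // optional
                case "-segment":     // this or -segmentDir is mandatory
                case "-segmentDir":
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("Missing value for " + name);
                    }
                    opts.put(name.substring(1), args[++i]);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option " + name);
            }
        }
        if (!opts.containsKey("segment") && !opts.containsKey("segmentDir")) {
            throw new IllegalArgumentException(
                "Either -segment or -segmentDir is mandatory");
        }
        return opts;
    }

    public static void main(String[] args) {
        // A segment-only invocation is accepted: no crawldb, no linkdb.
        System.out.println(parse(new String[] {"-segment", "crawl/segments/example"}));
        try {
            parse(new String[] {"-crawldb", "crawl/crawldb"});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is the final check: a crawldb on its own is not enough input, while a segment on its own is.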
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055417#comment-15055417 ] Chris A. Mattmann commented on NUTCH-2184: -- Nice, bruh > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
[ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053975#comment-15053975 ] Lewis John McGibbney commented on NUTCH-2184: - Working on this right now folks. > Enable IndexingJob to function with no crawldb > -- > > Key: NUTCH-2184 > URL: https://issues.apache.org/jira/browse/NUTCH-2184 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Fix For: 1.12 > > > Sometimes when working with distributed team(s), we have found that we can > 'loose' data structures which are currently considered as critical e.g. > crawldb, linkdb and/or segments. > In my current scenario I have a requirement to index segment data with no > accompanying crawldb or linkdb. > Absence of the latter is OK as linkdb is optional however currently in > [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] > crawldb is mandatory. > This ticket should enhance the IndexerMapReduce code to support the use case > where you ONLY have segments and want to force an index for every record > present. -- This message was sent by Atlassian JIRA (v6.3.4#6332)