[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2020-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011746#comment-17011746
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on issue #95: NUTCH-2184 Enable IndexingJob to 
function with no crawldb
URL: https://github.com/apache/nutch/pull/95#issuecomment-572532933
 
 
   Closed in favor of #486 
   - indexing without a CrawlDb record has already been implemented in 
NUTCH-2456/#240
   - various improvements from this PR have been integrated in #486 
   - separation of mapper and reducer classes is part of NUTCH-2375/#221
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Priority: Major
> Fix For: 1.17
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, crawldb is 
> currently mandatory in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java].
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2020-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011747#comment-17011747
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on pull request #95: NUTCH-2184 Enable IndexingJob to 
function with no crawldb
URL: https://github.com/apache/nutch/pull/95
 
 
   
 



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2020-01-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011700#comment-17011700
 ] 

Hudson commented on NUTCH-2184:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3660 (See 
[https://builds.apache.org/job/Nutch-trunk/3660/])
NUTCH-2184 Enable IndexingJob to function with no crawldb - make the (snagel: 
[https://github.com/apache/nutch/commit/c4fade499e9487bb3df85a0f49abfee475bdc280])
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
NUTCH-2184 Enable IndexingJob to function with no crawldb - log if there 
(snagel: 
[https://github.com/apache/nutch/commit/57802d105259624bea20ea0e8be4cb3858d5716b])
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/java/org/apache/nutch/indexer/IndexingJob.java




[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2020-01-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011681#comment-17011681
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on pull request #486: NUTCH-2184 Enable IndexingJob 
to function with no crawldb
URL: https://github.com/apache/nutch/pull/486
 
 
   
 



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2019-11-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980357#comment-16980357
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on pull request #486: NUTCH-2184 Enable IndexingJob 
to function with no crawldb
URL: https://github.com/apache/nutch/pull/486
 
 
   This PR obsoletes #95 (parts of the work are already done in 
[NUTCH-2456](https://issues.apache.org/jira/browse/NUTCH-2456)/#240). It
   - makes the CrawlDb argument passed to indexing job optional
   - but does not change the behavior of the indexing job otherwise
   - if there are non-optional arguments, the first of them is expected to be the CrawlDb unless `-nocrawldb` is given
   - it picks various improvements from PR #95
   - and improves the command-line help:
   ```
   Usage: Indexer (<crawldb> | -nocrawldb) (<segment> ... | -dir <segments>) [general options]

   Index given segments using configured indexer plugins

   The CrawlDb is optional but it is required to send deletion requests for duplicates
   and to read the proper document score/boost/weight passed to the indexers.

   Required arguments:

       <crawldb>               path to CrawlDb, or
       -nocrawldb              flag to indicate that no CrawlDb shall be used

       <segment> ...           path(s) to segment, or
       -dir <segments>         path to segments/ directory,
                               (all subdirectories are read as segments)

   General options:

       -linkdb <linkdb>        use LinkDb to index anchor texts of incoming links
       -params k1=v1&k2=v2...  parameters passed to indexer plugins
                               (via property indexer.additional.params)

       -noCommit               do not call the commit method of indexer plugins
       -deleteGone             send deletion requests for 404s, redirects, duplicates
       -filter                 skip documents with URL rejected by configured URL filters
       -normalize              normalize URLs before indexing
       -addBinaryContent       index raw/binary content in field `binaryContent`
       -base64                 use Base64 encoding for binary content
   ```
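
The argument rule described in this comment (the first positional argument is taken as the CrawlDb unless `-nocrawldb` is given) could be sketched roughly as follows. This is an illustrative stand-alone sketch, not the code from the pull request: the class `IndexerArgsSketch`, its `Parsed` holder and the error messages are hypothetical names, and the boolean options are only collected, not interpreted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the IndexingJob code from the pull request. */
final class IndexerArgsSketch {

  /** Parse result; crawlDb stays null when -nocrawldb is given. */
  static final class Parsed {
    String crawlDb;
    String linkDb;
    String params;
    String segmentsDir;
    final List<String> segments = new ArrayList<>();
    final List<String> flags = new ArrayList<>();
  }

  static Parsed parse(String[] args) {
    Parsed p = new Parsed();
    // Detect the flag up front so its position on the command line does not matter.
    boolean noCrawlDb = Arrays.asList(args).contains("-nocrawldb");
    for (int i = 0; i < args.length; i++) {
      String arg = args[i];
      switch (arg) {
        case "-nocrawldb":
          break;
        case "-dir":
        case "-linkdb":
        case "-params":
          if (i + 1 >= args.length) {
            throw new IllegalArgumentException(arg + " requires a value");
          }
          String value = args[++i];
          if (arg.equals("-dir")) {
            p.segmentsDir = value;
          } else if (arg.equals("-linkdb")) {
            p.linkDb = value;
          } else {
            p.params = value;
          }
          break;
        default:
          if (arg.startsWith("-")) {
            p.flags.add(arg); // e.g. -noCommit, -deleteGone: collected, not interpreted
          } else if (!noCrawlDb && p.crawlDb == null) {
            p.crawlDb = arg; // first positional argument is the CrawlDb
          } else {
            p.segments.add(arg); // every other positional argument is a segment
          }
      }
    }
    if (p.segments.isEmpty() && p.segmentsDir == null) {
      throw new IllegalArgumentException("no segments given");
    }
    return p;
  }

  public static void main(String[] args) {
    // Example paths are made up for the demonstration.
    Parsed p = parse(new String[] {"-nocrawldb", "-dir", "crawl/segments", "-deleteGone"});
    System.out.println("crawldb=" + p.crawlDb + ", dir=" + p.segmentsDir + ", flags=" + p.flags);
  }
}
```

Scanning for `-nocrawldb` before the main loop is a design choice of this sketch: it keeps a segment path that precedes the flag from being mistaken for the CrawlDb.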
 



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2017-11-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244723#comment-16244723
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

sebastian-nagel commented on a change in pull request #95: NUTCH-2184 Enable 
IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/95#discussion_r149795839
 
 

 ##
 File path: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
 ##
 @@ -52,14 +52,34 @@
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.scoring.ScoringFilterException;
 import org.apache.nutch.scoring.ScoringFilters;
-
-public class IndexerMapReduce extends Configured implements
-Mapper,
-Reducer {
+import org.apache.nutch.util.NutchConfiguration;
+
+/**
+ * This class is typically invoked from within 
+ * {@link org.apache.nutch.indexer.IndexingJob}
+ * and handles all MapReduce functionality required
+ * when undertaking indexing.
+ * This is a consequence of one or more indexing plugins 
+ * being invoked which extend 
+ * {@link org.apache.nutch.indexer.IndexWriter}.
+ * See 
+ * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, 
Collection, JobConf, boolean)}
+ * for details on the specific data structures and parameters required for 
indexing.
+ *
+ */
+public class IndexerMapReduce {
 
   public static final Logger LOG = LoggerFactory
   .getLogger(IndexerMapReduce.class);
 
+  // using normalizers and/or filters
+  private static boolean normalize = false;
+  private static boolean filter = false;
+
+  // url normalizers, filters and job configuration
+  private static URLNormalizers urlNormalizers;
+  private static URLFilters urlFilters;
 
 Review comment:
   This also does not work in distributed mode: mapper and reducer are executed 
in different tasks/JVMs, see 
[NUTCH-2375/#221](https://github.com/apache/nutch/pull/221#pullrequestreview-62780003)
 for the same problem.
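
A minimal sketch of the usual way around this problem, assuming plain Java stand-ins rather than the Hadoop API: each task keeps its settings in instance fields and fills them from the configuration it receives, instead of relying on static fields set in the driver JVM. `java.util.Properties` stands in for the job configuration here, and the class, method and key names (`PerTaskConfigSketch`, `ship`, `sketch.url.filter`, `sketch.url.normalize`) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative sketch; Properties is a stand-in for the job configuration. */
final class PerTaskConfigSketch {

  // Hypothetical property names, used only in this sketch.
  static final String FILTER_KEY = "sketch.url.filter";
  static final String NORMALIZE_KEY = "sketch.url.normalize";

  /** Stand-in for a mapper or reducer task: state lives in instance fields. */
  static final class Task {
    private boolean filter;
    private boolean normalize;

    /** Called once per task with the configuration that task received. */
    void configure(Properties jobConf) {
      filter = Boolean.parseBoolean(jobConf.getProperty(FILTER_KEY, "false"));
      normalize = Boolean.parseBoolean(jobConf.getProperty(NORMALIZE_KEY, "false"));
    }

    boolean filters() {
      return filter;
    }

    boolean normalizes() {
      return normalize;
    }
  }

  /** Simulates shipping the configuration to another JVM: serialize, then re-read. */
  static Properties ship(Properties driverConf) {
    try {
      StringWriter out = new StringWriter();
      driverConf.store(out, null);
      Properties received = new Properties();
      received.load(new StringReader(out.toString()));
      return received;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Properties driverConf = new Properties();
    driverConf.setProperty(FILTER_KEY, "true");

    Task mapper = new Task();
    mapper.configure(ship(driverConf)); // mapper JVM sees only the shipped configuration
    Task reducer = new Task();
    reducer.configure(ship(driverConf)); // reducer JVM likewise

    System.out.println("mapper filters: " + mapper.filters());
    System.out.println("reducer filters: " + reducer.filters());
  }
}
```

Only the serialized configuration crosses the simulated process boundary, which is why values kept in static fields of the driver class would not reach either task.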




> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.14
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2017-06-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068597#comment-16068597
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

Hi [~markus17] I need to finish the bloody MR tests over at 
https://github.com/apache/nutch/pull/95. I am very tight on cycles right now. 
If you can pick up the patch and want to work with it then by all means please 
go ahead ;)



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2017-06-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062909#comment-16062909
 ] 

Markus Jelsma commented on NUTCH-2184:
--

Any progress on this one?



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-05-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15304651#comment-15304651
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

Github user naegelejd commented on a diff in the pull request:

https://github.com/apache/nutch/pull/95#discussion_r64957448
  
--- Diff: src/java/org/apache/nutch/indexer/IndexingJob.java ---
@@ -155,43 +161,146 @@ public void index(Path crawlDb, Path linkDb, 
List segments,
 counter.getName());
   }
   long end = System.currentTimeMillis();
-  LOG.info("Indexer: finished at " + sdf.format(end) + ", elapsed: "
-  + TimingUtil.elapsedTime(start, end));
+  LOG.info("Indexer: finished at {}, elapsed: {}", sdf.format(end),
+  TimingUtil.elapsedTime(start, end));
 } finally {
   FileSystem.get(job).delete(tmp, true);
 }
   }
 
   public int run(String[] args) throws Exception {
-if (args.length < 2) {
-  System.err
-  //.println("Usage: Indexer  [-linkdb ] [-params 
k1=v1&k2=v2...] ( ... | -dir ) [-noCommit] [-deleteGone] 
[-filter] [-normalize]");
-  .println("Usage: Indexer  [-linkdb ] [-params 
k1=v1&k2=v2...] ( ... | -dir ) [-noCommit] [-deleteGone] 
[-filter] [-normalize] [-addBinaryContent] [-base64]");
-  IndexWriters writers = new IndexWriters(getConf());
-  System.err.println(writers.describe());
-  return -1;
-}
-
-final Path crawlDb = new Path(args[0]);
-Path linkDb = null;
-
-final List segments = new ArrayList();
-String params = null;
-
-boolean noCommit = false;
-boolean deleteGone = false;
-boolean filter = false;
-boolean normalize = false;
-boolean addBinaryContent = false;
-boolean base64 = false;
+// boolean options
+Option helpOpt = new Option("h", "help", false, "show this help 
message");
+// argument options
+@SuppressWarnings("static-access")
+Option crawldbOpt = OptionBuilder
+.withArgName("crawldb")
+.hasArg()
+.withDescription(
+"a crawldb directory to use with this tool (optional)")
+.create("crawldb");
+@SuppressWarnings("static-access")
+Option linkdbOpt = OptionBuilder
+.withArgName("linkdb")
+.hasArg()
+.withDescription(
+"a linkdb directory to use with this tool (optional)")
+.create("linkdb");
+@SuppressWarnings("static-access")
+Option paramsOpt = OptionBuilder
+.withArgName("params")
+.hasArg()
+.withDescription(
+"key value parameters to be used with this tool e.g. 
k1=v1&k2=v2... (optional)")
+.create("params");
+@SuppressWarnings("static-access")
+Option segOpt = OptionBuilder
+.withArgName("segment")
+.hasArgs()
+.withDescription("the segment(s) to use (either this or --segmentDir 
is mandatory)")
+.create("segment");
+@SuppressWarnings("static-access")
+Option segmentDirOpt = OptionBuilder
+.withArgName("segmentDir")
+.hasArg()
+.withDescription(
+"directory containing one or more segments to be used with this 
tool "
++ "(either this or --segment is mandatory)")
+.create("segmentDir");
+@SuppressWarnings("static-access")
+Option noCommitOpt = OptionBuilder
+.withArgName("noCommit")
+.withDescription(
+"do the commits once and for all the reducers in one go 
(optional)")
--- End diff --

This description is backward: the "-noCommit" option tells the Indexer 
*not* to do a final commit after the job finishes.
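
The semantics described in this review comment can be sketched as follows. This is a hypothetical stand-alone illustration: the `Committer` interface and the `finish` method are invented for the sketch and are not part of Nutch.

```java
/** Illustrative sketch of the -noCommit semantics; not Nutch code. */
final class NoCommitSketch {

  /** Hypothetical stand-in for whatever issues the final commit to the index. */
  interface Committer {
    void commit();
  }

  /**
   * Runs the final step of an indexing job.
   * Returns true if a commit was issued, false if -noCommit suppressed it.
   */
  static boolean finish(boolean noCommit, Committer committer) {
    if (noCommit) {
      return false; // -noCommit given: leave committing to the caller
    }
    committer.commit();
    return true;
  }

  public static void main(String[] args) {
    System.out.println("default run committed: "
        + finish(false, () -> System.out.println("commit issued")));
    System.out.println("-noCommit run committed: "
        + finish(true, () -> System.out.println("commit issued")));
  }
}
```

An option description consistent with this behavior would say that the commit is skipped, as the later help text for `-noCommit` in this thread does.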


> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.13
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176497#comment-15176497
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/95#discussion_r54792239
  
--- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
@@ -52,14 +52,34 @@
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.scoring.ScoringFilterException;
 import org.apache.nutch.scoring.ScoringFilters;
-
-public class IndexerMapReduce extends Configured implements
-Mapper,
-Reducer {
+import org.apache.nutch.util.NutchConfiguration;
+
+/**
+ * This class is typically invoked from within 
+ * {@link org.apache.nutch.indexer.IndexingJob}
+ * and handles all MapReduce functionality required
+ * when undertaking indexing.
+ * This is a consequence of one or more indexing plugins 
+ * being invoked which extend 
+ * {@link org.apache.nutch.indexer.IndexWriter}.
+ * See 
+ * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, 
Collection, JobConf, boolean)}
+ * for details on the specific data structures and parameters required for 
indexing.
+ *
+ */
+public class IndexerMapReduce {
 
   public static final Logger LOG = LoggerFactory
   .getLogger(IndexerMapReduce.class);
 
+  // using normalizers and/or filters
+  private static boolean normalize = false;
+  private static boolean filter = false;
+
+  // url normalizers, filters and job configuration
+  private static URLNormalizers urlNormalizers;
+  private static URLFilters urlFilters;
--- End diff --

No, this would duplicate the variables. Maybe pull out the nested static 
classes, see this trial https://github.com/sebastian-nagel/nutch/tree/NUTCH-2184


> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered critical, e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176462#comment-15176462
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/nutch/pull/95#discussion_r54788893
  
--- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
@@ -52,14 +52,34 @@
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.scoring.ScoringFilterException;
 import org.apache.nutch.scoring.ScoringFilters;
-
-public class IndexerMapReduce extends Configured implements
-Mapper,
-Reducer {
+import org.apache.nutch.util.NutchConfiguration;
+
+/**
+ * This class is typically invoked from within 
+ * {@link org.apache.nutch.indexer.IndexingJob}
+ * and handles all MapReduce functionality required
+ * when undertaking indexing.
+ * This is a consequence of one or more indexing plugins 
+ * being invoked which extend 
+ * {@link org.apache.nutch.indexer.IndexWriter}.
+ * See 
+ * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, 
Collection, JobConf, boolean)}
+ * for details on the specific data structures and parameters required for 
indexing.
+ *
+ */
+public class IndexerMapReduce {
 
   public static final Logger LOG = LoggerFactory
   .getLogger(IndexerMapReduce.class);
 
+  // using normalizers and/or filters
+  private static boolean normalize = false;
+  private static boolean filter = false;
+
+  // url normalizers, filters and job configuration
+  private static URLNormalizers urlNormalizers;
+  private static URLFilters urlFilters;
--- End diff --

Thanks @sebastian-nagel, you suggest we create the variables within the 
mapper and reducer respectively?




[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176368#comment-15176368
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/95#discussion_r54782218
  
--- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
@@ -166,235 +145,310 @@ private String filterUrl(String url) {
 return url;
   }
 
-  public void map(Text key, Writable value,
-  OutputCollector output, Reporter reporter)
-  throws IOException {
+  /**
+   * Implementation of {@link org.apache.hadoop.mapred.Mapper}
+   * which optionally normalizes then filters a URL before simply
+   * collecting key and values with the keys being URLs (manifested
+   * as {@link org.apache.hadoop.io.Text}) and the 
+   * values as {@link org.apache.nutch.crawl.NutchWritable} instances
+   * of {@link org.apache.nutch.crawl.CrawlDatum}.
+   */
+  public static class IndexerMapReduceMapper implements Mapper {
+
+@Override
+public void configure(JobConf job) {
+}
+
+public void map(Text key, Writable value,
+OutputCollector output, Reporter reporter)
+throws IOException {
+
+  String urlString = filterUrl(normalizeUrl(key.toString()));
+  if (urlString == null) {
+return;
+  } else {
+key.set(urlString);
+  }
+
+  output.collect(key, new NutchWritable(value));
+}
 
-String urlString = filterUrl(normalizeUrl(key.toString()));
-if (urlString == null) {
-  return;
-} else {
-  key.set(urlString);
+@Override
+public void close() throws IOException {
 }
 
-output.collect(key, new NutchWritable(value));
   }
 
-  public void reduce(Text key, Iterator values,
-  OutputCollector output, Reporter reporter)
-  throws IOException {
-Inlinks inlinks = null;
-CrawlDatum dbDatum = null;
-CrawlDatum fetchDatum = null;
-Content content = null;
-ParseData parseData = null;
-ParseText parseText = null;
-
-while (values.hasNext()) {
-  final Writable value = values.next().get(); // unwrap
-  if (value instanceof Inlinks) {
-inlinks = (Inlinks) value;
-  } else if (value instanceof CrawlDatum) {
-final CrawlDatum datum = (CrawlDatum) value;
-if (CrawlDatum.hasDbStatus(datum)) {
-  dbDatum = datum;
-} else if (CrawlDatum.hasFetchStatus(datum)) {
-  // don't index unmodified (empty) pages
-  if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
-fetchDatum = datum;
+  /**
+   * Implementation of {@link org.apache.hadoop.mapred.Reducer}
+   * which generates {@link org.apache.nutch.indexer.NutchIndexAction}'s
+   * from combinations of various Nutch data structures. Essentially 
+   * the result is a key representing a URL and a value representing a
+   * unit of indexing holding the document and action information.
+   */
+  public static class IndexerMapReduceReducer implements Reducer {
+
+private boolean skip = false;
+private boolean delete = false;
+private boolean deleteRobotsNoIndex = false;
+private boolean deleteSkippedByIndexingFilter = false;
+private boolean base64 = false;
+private IndexingFilters filters;
+private ScoringFilters scfilters;
+
+@Override
+public void configure(JobConf job) {
+  Configuration conf = NutchConfiguration.create();
--- End diff --

JobConf extends Configuration, there should be no need to create a new 
Configuration object.
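
To illustrate the point with plain Java stand-ins (not the Hadoop classes themselves): because the job configuration type is a subclass of the general configuration type, `configure` can read its settings directly from the object it is handed. A freshly created configuration would not carry what the driver set on the job. `SketchConfiguration`, `SketchJobConf` and `ReducerConfigureSketch` are hypothetical names, and the key `indexer.delete` is assumed here purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a general configuration object (hypothetical, not Hadoop's class). */
class SketchConfiguration {
  private final Map<String, String> props = new HashMap<>();

  void set(String key, String value) {
    props.put(key, value);
  }

  boolean getBoolean(String key, boolean defaultValue) {
    String value = props.get(key);
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }
}

/** Stand-in for the job configuration: it is itself a configuration. */
class SketchJobConf extends SketchConfiguration {
}

/** Reducer-like class that reads its settings from the job it is given. */
final class ReducerConfigureSketch {
  private boolean delete;

  void configure(SketchJobConf job) {
    // Read straight from the job; no second configuration object is needed.
    delete = job.getBoolean("indexer.delete", false);
  }

  boolean deletes() {
    return delete;
  }

  public static void main(String[] args) {
    SketchJobConf job = new SketchJobConf();
    job.set("indexer.delete", "true"); // set by the driver before the job starts

    ReducerConfigureSketch reducer = new ReducerConfigureSketch();
    reducer.configure(job);
    System.out.println("from job: " + reducer.deletes());

    // A fresh configuration only has defaults, so the driver's setting is absent.
    System.out.println("from fresh configuration: "
        + new SketchConfiguration().getBoolean("indexer.delete", false));
  }
}
```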


> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered as critical e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org

[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176363#comment-15176363
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/95#discussion_r54781452
  
--- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
@@ -52,14 +52,34 @@
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.scoring.ScoringFilterException;
 import org.apache.nutch.scoring.ScoringFilters;
-
-public class IndexerMapReduce extends Configured implements
-    Mapper,
-    Reducer {
+import org.apache.nutch.util.NutchConfiguration;
+
+/**
+ * This class is typically invoked from within 
+ * {@link org.apache.nutch.indexer.IndexingJob}
+ * and handles all MapReduce functionality required
+ * when undertaking indexing.
+ * This is a consequence of one or more indexing plugins 
+ * being invoked which extend 
+ * {@link org.apache.nutch.indexer.IndexWriter}.
+ * See 
+ * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, Collection, JobConf, boolean)}
+ * for details on the specific data structures and parameters required for indexing.
+ *
+ */
+public class IndexerMapReduce {
 
   public static final Logger LOG = LoggerFactory
       .getLogger(IndexerMapReduce.class);
 
+  // using normalizers and/or filters
+  private static boolean normalize = false;
+  private static boolean filter = false;
+
+  // url normalizers, filters and job configuration
+  private static URLNormalizers urlNormalizers;
+  private static URLFilters urlFilters;
--- End diff --

Why are these four member variables now static?
Also, it looks weird if a static variable of the outer class is initialized 
in a non-static method of one inner class (IndexerMapReduceReducer.configure()). 
As a result, the mapper class cannot be used without instantiating the reducer.
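
A minimal sketch of the concern, using hypothetical stand-in classes rather than the code from the pull request: the first variant keeps the shared state in a static field that only the reducer initializes, the second lets the mapper own and initialize its own field.

```java
public class StaticStateSketch {

  // Variant questioned above: state lives in a static field of the outer
  // class but is only set when the reducer is configured.
  static class SharedState {
    static String urlFilters; // stays null until ReducerA.configure() runs

    static class ReducerA {
      void configure(String filters) {
        urlFilters = filters;
      }
    }

    static class MapperA {
      String map(String url) {
        // Throws a NullPointerException if no reducer was configured first.
        return urlFilters.isEmpty() ? url : url + " [" + urlFilters + "]";
      }
    }
  }

  // Alternative: the mapper keeps and initializes its own instance field.
  static class MapperB {
    private String urlFilters = "";

    void configure(String filters) {
      urlFilters = filters;
    }

    String map(String url) {
      return urlFilters.isEmpty() ? url : url + " [" + urlFilters + "]";
    }
  }

  // Returns true if MapperA can be used before any reducer is configured.
  static boolean sharedMapperWorksAlone() {
    SharedState.urlFilters = null; // simulate a fresh JVM: nothing configured
    try {
      new SharedState.MapperA().map("http://example.org/");
      return true;
    } catch (NullPointerException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("MapperA alone works: " + sharedMapperWorksAlone());
    System.out.println("MapperB alone: " + new MapperB().map("http://example.org/"));
  }
}
```

In a MapReduce job the map and reduce tasks typically run in separate JVMs, so a mapper cannot count on a reducer having been configured in the same process.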


> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'lose' data structures which are currently considered as critical e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional; however, currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176328#comment-15176328
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/95#discussion_r54779042
  
--- Diff: src/java/org/apache/nutch/indexer/IndexerMapReduce.java ---
@@ -166,235 +145,310 @@ private String filterUrl(String url) {
     return url;
   }
 
-  public void map(Text key, Writable value,
-      OutputCollector output, Reporter reporter)
-      throws IOException {
+  /**
+   * Implementation of {@link org.apache.hadoop.mapred.Mapper}
+   * which optionally normalizes then filters a URL before simply
+   * collecting key and values with the keys being URLs (manifested
+   * as {@link org.apache.hadoop.io.Text}) and the 
+   * values as {@link org.apache.nutch.crawl.NutchWritable} instances
+   * of {@link org.apache.nutch.crawl.CrawlDatum}.
+   */
+  public static class IndexerMapReduceMapper implements Mapper {
+
+    @Override
+    public void configure(JobConf job) {
+    }
+
+    public void map(Text key, Writable value,
+        OutputCollector output, Reporter reporter)
+        throws IOException {
+
+      String urlString = filterUrl(normalizeUrl(key.toString()));
+      if (urlString == null) {
+        return;
+      } else {
+        key.set(urlString);
+      }
+
+      output.collect(key, new NutchWritable(value));
+    }
 
-    String urlString = filterUrl(normalizeUrl(key.toString()));
-    if (urlString == null) {
-      return;
-    } else {
-      key.set(urlString);
+    @Override
+    public void close() throws IOException {
     }
 
-    output.collect(key, new NutchWritable(value));
   }
 
-  public void reduce(Text key, Iterator values,
-      OutputCollector output, Reporter reporter)
-      throws IOException {
-    Inlinks inlinks = null;
-    CrawlDatum dbDatum = null;
-    CrawlDatum fetchDatum = null;
-    Content content = null;
-    ParseData parseData = null;
-    ParseText parseText = null;
-
-    while (values.hasNext()) {
-      final Writable value = values.next().get(); // unwrap
-      if (value instanceof Inlinks) {
-        inlinks = (Inlinks) value;
-      } else if (value instanceof CrawlDatum) {
-        final CrawlDatum datum = (CrawlDatum) value;
-        if (CrawlDatum.hasDbStatus(datum)) {
-          dbDatum = datum;
-        } else if (CrawlDatum.hasFetchStatus(datum)) {
-          // don't index unmodified (empty) pages
-          if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
-            fetchDatum = datum;
+  /**
+   * Implementation of {@link org.apache.hadoop.mapred.Reducer}
+   * which generates {@link org.apache.nutch.indexer.NutchIndexAction}'s
+   * from combinations of various Nutch data structures. Essentially 
+   * teh result is a key representing a URL and a value representing a
--- End diff --

typo teh -> the




[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-02 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176122#comment-15176122
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

sh*t, I didn't push up my assertions. I'll get them into the patch and also 
cover the other test cases you've quoted. Ta



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175400#comment-15175400
 ] 

Markus Jelsma commented on NUTCH-2184:
--

Hello Lewis! I don't understand the unit test: there is one assertion and it is 
commented out. I am probably missing something about that mapreduce driver. Also, 
there doesn't seem to be a test that indexes mock segment data with and without 
crawldb, which would be nice.

Rest looks fine!



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175147#comment-15175147
 ] 

Chris A. Mattmann commented on NUTCH-2184:
--

+1, this is much-needed functionality for us on MEMEX and on my USC projects, 
where we have data collected by students that doesn't always include a crawldb.



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175004#comment-15175004
 ] 

ASF GitHub Bot commented on NUTCH-2184:
---

GitHub user lewismc opened a pull request:

https://github.com/apache/nutch/pull/95

NUTCH-2184 Enable IndexingJob to function with no crawldb

OK folks, this issue addresses 
https://issues.apache.org/jira/browse/NUTCH-2184 by
 * rebasing the 
[NUTCH-2184v2.patch](https://issues.apache.org/jira/secure/attachment/12784260/NUTCH-2184v2.patch)
 against master branch
 * making the IndexerMapReduceMapper and IndexerMapReduceReducer in 
IndexerMapReduce code explicit so that these functions can be tested
 * adding in some mrunit tests for testing the IndexerMapReduceMapper and 
IndexerMapReduceReducer
 * removing some trivial imports which are unused
 * formatting ivy.xml which has somehow (again) become a dog's dinner
 * adding a default constructor to NutchIndexAction

Any questions, then please let me know. I would really appreciate it if people 
could pull this code and try it out within their test or local environment.
Thanks, also thanks Markus for the original suggestions for tests, etc.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lewismc/nutch NUTCH-2184

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/95.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #95


commit c4429eb7e4a33fc619cea5e5d6c26f54969e4f55
Author: Lewis John McGibbney 
Date:   2016-03-02T04:21:52Z

NUTCH-2184 Enable IndexingJob to function with no crawldb






[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2016-01-05 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083092#comment-15083092
 ] 

Markus Jelsma commented on NUTCH-2184:
--

Hello Lewis!

* It should be no problem. But since IndexerMapReduce is complicated, I would 
love to have a simple unit test for it so we can guard against breaking things.
* We should make sure the fetchDatum also carries the desired parseData fields 
that we usually store in the CrawlDatum. This is not always true, see the fix I 
did for NUTCH-2093. I think it should be possible to index as many fields as 
with the CrawlDatum. If you implement it as such, then custom indexing filters 
that use CrawlDatum still work :)
* Well yes, I think I already answered that question myself indeed, silly.
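
The fallback idea in the second point can be sketched as follows. This is only an illustration under stated assumptions, not the logic of the attached patch: Status and Datum are simplified, hypothetical stand-ins for the CrawlDatum status codes and CrawlDatum itself, and datumForIndexing is an invented helper name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DatumFallbackSketch {

  // Simplified stand-ins for CrawlDatum statuses (hypothetical, not Nutch constants).
  enum Status { DB_FETCHED, FETCH_SUCCESS, FETCH_RETRY, FETCH_NOTMODIFIED }

  // Simplified stand-in for CrawlDatum: a status plus string metadata.
  static class Datum {
    final Status status;
    final Map<String, String> meta = new HashMap<>();

    Datum(Status status) {
      this.status = status;
    }
  }

  /**
   * Picks the datum handed to indexing filters. Returns empty when the
   * segment holds no successful fetch (for example a transient error).
   */
  static Optional<Datum> datumForIndexing(Datum dbDatum, Datum fetchDatum) {
    if (fetchDatum == null || fetchDatum.status != Status.FETCH_SUCCESS) {
      return Optional.empty();
    }
    // Prefer the CrawlDb record; fall back to the segment's fetch datum.
    return Optional.of(dbDatum != null ? dbDatum : fetchDatum);
  }

  public static void main(String[] args) {
    Datum fetch = new Datum(Status.FETCH_SUCCESS);
    fetch.meta.put("Content-Type", "text/html");
    Datum db = new Datum(Status.DB_FETCHED);
    db.meta.put("Content-Type", "text/html");

    // With a CrawlDb record the filters see dbDatum, without one fetchDatum.
    System.out.println(datumForIndexing(db, fetch).get() == db);
    System.out.println(datumForIndexing(null, fetch).get() == fetch);
    System.out.println(datumForIndexing(null, new Datum(Status.FETCH_RETRY)).isPresent());
  }
}
```

For filters to behave the same in both cases, the fallback datum would need to carry the same metadata keys the filters read, which is the requirement raised in the second point.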

This feature would be very handy for small segments but large CrawlDBs! 



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-29 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074453#comment-15074453
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

[~markus17] coming back to this one briefly, I've thought about the points 
you've raised and wanted to respond as follows. 
  1. This proposed patch does not change core indexing functionality as such; 
it instead extends (improves???) it to permit indexing of just segments. 
  2. If you experience the scenarios you've highlighted (e.g. possible 
configurations for index.*.md and db.parsemeta.to.crawldb), then AFAICT 
nothing changes: if you have the crawldb then they are used, if not then they 
are not. If I am wrong here, can you point to the code that I need to have a 
look at? I didn't originally put the index.*.md and db.parsemeta.to.crawldb 
functionality in place, so a bit of guidance would be nice here.
  3. Finally, to address the following
bq. Also, what is going to happen to transient errors? Records with 
FETCH_STATUS_RETRY should be ignored.

On the final one I am not sure right now; I will revisit this patch post 
vacation (2nd week of January). If you can provide feedback on 2 above then I'll 
hammer my way through that first. 
Thanks [~markus17]



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060155#comment-15060155
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

Ack


-- 
*Lewis*




[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060036#comment-15060036
 ] 

Markus Jelsma commented on NUTCH-2184:
--

Hello Lewis - you can use the indexer-dummy in unit tests to easily check 
indexer output against expected output. 



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060023#comment-15060023
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

Excellent points Markus, thanks for bringing them up. I'll revisit today.
These issues are more difficult to characterize and I expect a few NPEs
until we get it right. I think I'll start writing tests for each plugin
which relies upon the dbDatum to ensure coverage is sufficient. I'll post a
new patch today.


-- 
*Lewis*




[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-16 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059750#comment-15059750
 ] 

Markus Jelsma commented on NUTCH-2184:
--

Hello Lewis - keep in mind the possible configurations for index.*.md and 
db.parsemeta.to.crawldb. Different settings may lead to peculiar situations.



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059489#comment-15059489
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

No, just the following (see
https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L370-L382):
crawl_fetch, crawl_parse, parse_data and parse_text.
If you want to index the binary content with -addBinaryContent, then the 
'content' subdirectory needs to be present as well [~sujenshah] 
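
As a rough illustration of that requirement, the sketch below checks a segment directory for the subdirectories named above. It works on the local filesystem via java.nio.file only; the class and method names are made up for this example and are not part of Nutch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SegmentCheckSketch {

  // Subdirectories the indexer reads from every segment (as listed above).
  static final List<String> REQUIRED =
      Arrays.asList("crawl_fetch", "crawl_parse", "parse_data", "parse_text");

  // Returns the names of the required subdirectories missing from the segment.
  static List<String> missingParts(Path segment, boolean addBinaryContent) {
    List<String> needed = new ArrayList<>(REQUIRED);
    if (addBinaryContent) {
      needed.add("content"); // only needed when indexing binary content
    }
    List<String> missing = new ArrayList<>();
    for (String name : needed) {
      if (!Files.isDirectory(segment.resolve(name))) {
        missing.add(name);
      }
    }
    return missing;
  }

  // Creates a temporary directory with the given subdirectories (demo helper).
  static Path newTempSegment(List<String> parts) {
    try {
      Path segment = Files.createTempDirectory("segment");
      for (String name : parts) {
        Files.createDirectory(segment.resolve(name));
      }
      return segment;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    Path segment = newTempSegment(REQUIRED);
    System.out.println(missingParts(segment, false)); // prints []
    System.out.println(missingParts(segment, true));  // prints [content]
  }
}
```

Note that crawl_generate is not in the list, so a segment without it would still pass this check.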



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059478#comment-15059478
 ] 

Sujen Shah commented on NUTCH-2184:
---

This patch requires the segment directory to have all its folders (content, 
crawl_generate, etc.), right?



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059474#comment-15059474
 ] 

Sujen Shah commented on NUTCH-2184:
---

This patch requires the segment directory to have all its folders (content, 
crawl_generate, etc) right ? 

> Enable IndexingJob to function with no crawldb
> --
>
> Key: NUTCH-2184
> URL: https://issues.apache.org/jira/browse/NUTCH-2184
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2184.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 
> 'loose' data structures which are currently considered as critical e.g. 
> crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no 
> accompanying crawldb or linkdb. 
> Absence of the latter is OK as linkdb is optional however currently in 
> [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java]
>  crawldb is mandatory. 
> This ticket should enhance the IndexerMapReduce code to support the use case 
> where you ONLY have segments and want to force an index for every record 
> present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059476#comment-15059476
 ] 

Sujen Shah commented on NUTCH-2184:
---

This patch requires the segment directory to have all its folders (content, 
crawl_generate, etc) right ? 



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059459#comment-15059459
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

I've tested this on scores of segments today and it is working great. 



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059417#comment-15059417
 ] 

Sujen Shah commented on NUTCH-2184:
---

Thanks [~lewismc]



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15059016#comment-15059016
 ] 

Chris A. Mattmann commented on NUTCH-2184:
--

+1



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058977#comment-15058977
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

To describe what this patch does:
 * implements much clearer command-line logic (using commons-cli) for
IndexingJob
 * permits IndexerMapReduce to continue by excluding the dbDatum == null test
from the OR condition, meaning that we can still index segment(s) even if they
are not accompanied by a crawldb CrawlDatum
 * undertakes all necessary checks to ensure that a dbDatum object with value
null is never accessed
 * implements the same null checks in all ScoringFilters which override the
#indexerScore method, ensuring that we never try to access a null value within
the MR job.

That's it folks. I've tried this with a bunch of test segments and I am now
able to index segments without a crawldb or linkdb.
Note that this is backwards compatible, so existing scripts, etc. will still
work when running the index command.
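
The null handling described in the list above can be sketched roughly as follows. This is not the patch itself: `SketchDatum` and `SegmentOnlyIndexingSketch` are hypothetical stand-ins for the Nutch classes, the exact set of values tested in the reducer is an assumption, and multiplying scores is purely illustrative of a scoring filter that would otherwise read from dbDatum.

```java
// Hypothetical stand-in for a CrawlDb record; this is NOT Nutch's CrawlDatum.
final class SketchDatum {
    final String status;
    final float score;

    SketchDatum(String status, float score) {
        this.status = status;
        this.score = score;
    }
}

// Hypothetical sketch of the guards described above, not the real reducer.
final class SegmentOnlyIndexingSketch {

    /**
     * Decides whether a record can be indexed. Assumption: the fetch datum
     * and the parse output are still required. dbDatum is accepted but
     * intentionally not tested, so a missing CrawlDb record no longer
     * causes the record to be skipped.
     */
    static boolean canIndex(SketchDatum fetchDatum, SketchDatum dbDatum,
                            String parseText, String parseData) {
        return fetchDatum != null && parseText != null && parseData != null;
    }

    /**
     * Illustrative scoring step: falls back to the initial score when there
     * is no CrawlDb record, so a null dbDatum is never dereferenced.
     */
    static float indexerScore(float initScore, SketchDatum dbDatum) {
        if (dbDatum == null) {
            return initScore;
        }
        return initScore * dbDatum.score;
    }

    public static void main(String[] args) {
        SketchDatum fetched = new SketchDatum("fetch_success", 1.0f);
        System.out.println(canIndex(fetched, null, "text", "data"));
        System.out.println(indexerScore(2.0f, null));
    }
}
```

Keeping dbDatum in the signature while ignoring it in `canIndex` mirrors the backwards-compatibility point: callers that do have a CrawlDb record pass it exactly as before.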



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058962#comment-15058962
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

Issue is logged at NUTCH-2186



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-15 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058955#comment-15058955
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

I am going to open another issue which references the above as this can only be 
reproduced when the -addBinaryContent flag is used.



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056690#comment-15056690
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

This issue also improves command-line parsing for the IndexingJob tool, which
prints the following help if invoked without arguments:
{code}
lmcgibbn@LMC-032857 /usr/local/trunk_new(joshua) $ ./runtime/local/bin/nutch index
Failed to parse command line Did not see expected # of arguments, saw 0
usage: IndexingJob [-crawldb ] [-linkdb ] [-params
                   k1=v1&k2=v2...] ( ... | -dir )
                   [-noCommit] [-deleteGone] [-filter] [-normalize]
                   [-addBinaryContent] [-base64]
 -abc,--addBinaryContent   add the raw content of the document to the
                           indexing job (optional)
 -b,--base64               if raw content is added, base64 encode it
                           (optional)
 -c,--crawldb              a crawldb directory to use with this tool
                           (optional)
 -dg,--deleteGone          delete gone documents e.g. documents which no
                           longer exist at the particular resource
                           (optional)
 -f,--filter               filter documents (optional)
 -l,--linkdb               a linkdb directory to use with this tool
                           (optional)
 -n,--normalize            normalize documents (optional)
 -nc,--noCommit            do the commits once and for all the reducers in
                           one go (optional)
 -p,--params               key value parameters to be used with this tool
                           e.g. k1=v1&k2=v2... (optional)
 -s,--segment              a single segment directory to be used with this
                           tool (either this or -segmentDir is mandatory)
 -sd,--segmentDir          a directory containing one or more segments to
                           be used with this tool (either this or -segment
                           is mandatory)
Active IndexWriters :
SolrIndexWriter
solr.server.type : Type of SolrServer to communicate with (default 'http' however options include 'cloud', 'lb' and 'concurrent')
solr.server.url : URL of the Solr instance (mandatory)
solr.zookeeper.url : URL of the Zookeeper URL (mandatory if 'cloud' value for solr.server.type)
solr.loadbalance.urls : Comma-separated string of Solr server strings to be used (madatory if 'lb' value for solr.server.type)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.commit.size : buffer size when sending to Solr (default 1000)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
{code}
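
The rule stated in the help text, that a segment or a segment directory must be given while crawldb and linkdb are optional, could be checked along these lines. The patch uses commons-cli for this; the sketch below deliberately swaps in plain list checks to stay dependency-free, and the class and method names are hypothetical. Only the option names are taken from the help output above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical validation sketch; the real IndexingJob relies on commons-cli.
final class IndexArgsSketch {

    /** Returns null when the arguments are acceptable, else an error message. */
    static String validate(List<String> args) {
        boolean hasSegment = args.contains("-s") || args.contains("--segment");
        boolean hasSegmentDir = args.contains("-sd") || args.contains("--segmentDir");
        if (!hasSegment && !hasSegmentDir) {
            return "either a segment or a segment directory is required";
        }
        // -c/--crawldb and -l/--linkdb are optional, so they are not checked.
        return null;
    }

    public static void main(String[] args) {
        String error = validate(Arrays.asList(args));
        System.out.println(error == null ? "arguments look fine" : error);
    }
}
```

Whether both segment options may be combined is not stated in the help text, so the sketch does not reject that case.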



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-13 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055417#comment-15055417
 ] 

Chris A. Mattmann commented on NUTCH-2184:
--

Nice, bruh



[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

2015-12-11 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053975#comment-15053975
 ] 

Lewis John McGibbney commented on NUTCH-2184:
-

Working on this right now folks.
