[ 
https://issues.apache.org/jira/browse/NUTCH-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002183#comment-17002183
 ] 

ASF GitHub Bot commented on NUTCH-1863:
---------------------------------------

sebastian-nagel commented on pull request #490: Fix for NUTCH-1863: Add JSON 
format dump output to readdb command
URL: https://github.com/apache/nutch/pull/490#discussion_r360831289
 
 

 ##########
 File path: src/java/org/apache/nutch/crawl/CrawlDbReader.java
 ##########
 @@ -200,13 +233,84 @@ public synchronized void close(TaskAttemptContext context) throws IOException {
     }
   }
 
-  public static class CrawlDbStatMapper extends
-      Mapper<Text, CrawlDatum, Text, NutchWritable> {
+  public static class CrawlDatumJsonOutputFormat
+      extends FileOutputFormat<Text, CrawlDatum> {
+    protected static class LineRecordWriter
+        extends RecordWriter<Text, CrawlDatum> {
+      private DataOutputStream out;
+      private ObjectMapper jsonMapper = new ObjectMapper();
+      private ObjectWriter jsonWriter;
+
+      public LineRecordWriter(DataOutputStream out) {
+        this.out = out;
+        jsonMapper.getFactory()
+            .configure(JsonGenerator.Feature.ESCAPE_NON_ASCII, true);
+        jsonWriter = jsonMapper.writer(new JsonIndenter());
+        try {
+          out.writeBytes("[");
+        } catch (IOException e) {
+        }
+      }
+
+      public synchronized void write(Text key, CrawlDatum value)
+          throws IOException {
+        Map<String, Object> data = new LinkedHashMap<String, Object>();
+        data.put("url", key.toString());
+        data.put("statusCode", Integer.toString(value.getStatus()));
 
 Review comment:
   JSON supports integers and floats as native number types, so the conversion 
to strings could be dropped. This would result in `"statusCode": 1` instead of 
`"statusCode": "1"`. But maybe this is a matter of taste?
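
   To illustrate the difference, here is a minimal stdlib-only sketch. It is 
not the PR's code and does not use Jackson: the class `StatusCodeJsonSketch` 
and its `toJson` helper are hypothetical, and the literal `1` merely stands in 
for `value.getStatus()`. Jackson's `ObjectMapper` makes the same distinction 
for `Map` values: a `Number` is written as a JSON number, a `String` as a JSON 
string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatusCodeJsonSketch {

  // Hypothetical helper, for illustration only: renders a flat map as JSON.
  // It does no string escaping, so it is not a general-purpose serializer.
  public static String toJson(Map<String, Object> data) {
    StringBuilder sb = new StringBuilder("{");
    boolean first = true;
    for (Map.Entry<String, Object> e : data.entrySet()) {
      if (!first) {
        sb.append(",");
      }
      first = false;
      sb.append('"').append(e.getKey()).append("\":");
      Object v = e.getValue();
      if (v instanceof Number) {
        sb.append(v);                          // numbers are written unquoted
      } else {
        sb.append('"').append(v).append('"');  // everything else is quoted
      }
    }
    return sb.append("}").toString();
  }

  public static void main(String[] args) {
    int status = 1; // assumption: stands in for value.getStatus()

    Map<String, Object> asString = new LinkedHashMap<>();
    asString.put("statusCode", Integer.toString(status));

    Map<String, Object> asNumber = new LinkedHashMap<>();
    asNumber.put("statusCode", status);

    System.out.println(toJson(asString)); // {"statusCode":"1"}
    System.out.println(toJson(asNumber)); // {"statusCode":1}
  }
}
```

   With the numeric form, consumers can compare or aggregate the value without 
parsing it first; the string form only makes sense if the field is treated as 
an opaque label.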
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add JSON format dump output to readdb command
> ---------------------------------------------
>
>                 Key: NUTCH-1863
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1863
>             Project: Nutch
>          Issue Type: New Feature
>          Components: crawldb
>    Affects Versions: 2.3, 1.10
>            Reporter: Lewis John McGibbney
>            Assignee: Shashanka Balakuntala Srinivasa
>            Priority: Major
>             Fix For: 1.17
>
>
> Opening up the ability for third parties to consume Nutch crawldb data as 
> JSON would be a positive thing IMHO.
> This issue should improve the readdb functionality of both 1.X and 2.X to 
> enable JSON dumps of crawldb data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
