[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320056#comment-16320056
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---------------------------------------

nmaro commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-356566774
 
 
   @lewismc The requested changes are done. Please note that:
   
   * I had to extend the elastic http plugin to handle lists of Map objects, 
which it previously just stringified.
   * Any23 couldn't detect as many triples as you expected in your tests, so I 
had to lower the expected number. It's good enough for now; people can still 
expand the Any23 scope once they find out what the problem is.
   * Data is now indexed as follows (example after crawling 
`https://smartive.ch/jobs`):
   
   ```
             "structured_data": [
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"IE-edge,chrome=1\"@de",
                 "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
                 "short_key": "X-UA-Compatible"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"Wir sind smartive \\u2014 eine dynamische, innovative Schweizer Webentwicklungsagentur. Die Realisierung zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und Kunden.\"@de",
                 "key": "<http://vocab.sindice.net/any23#description>",
                 "short_key": "description"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"width=device-width, initial-scale=1, shrink-to-fit=no\"@de",
                 "key": "<http://vocab.sindice.net/any23#viewport>",
                 "short_key": "viewport"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"width=device-width,initial-scale=1\"@de",
                 "key": "<http://vocab.sindice.net/any23#viewport>",
                 "short_key": "viewport"
               },
               {
                 "node": "<https://smartive.ch/jobs>",
                 "value": "\"ie=edge\"@de",
                 "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
                 "short_key": "x-ua-compatible"
               }
             ],
   ```
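
A minimal sketch of how a consumer might post-process entries shaped like the ones above, assuming each `value` is an N-Triples-style literal (quoted text plus an optional `@lang` tag) and that `short_key` is the part of `key` after the last `#` or `/`. The helper names `short_key`, `parse_literal` and `group_by_short_key` are hypothetical and are not part of the plugin or of this pull request.

```python
import re
from collections import defaultdict

# Quoted text followed by an optional language tag, e.g. "ie=edge"@de
_LITERAL = re.compile(r'^"(?P<text>.*)"(?:@(?P<lang>[A-Za-z0-9-]+))?$', re.DOTALL)


def short_key(key):
    """Assumed rule: drop the angle brackets, keep what follows the last '#' or '/'."""
    iri = key.strip("<>")
    return re.split(r"[#/]", iri)[-1]


def parse_literal(value):
    """Split a literal into (text, lang); return (value, None) if it is not quoted."""
    match = _LITERAL.match(value)
    if match is None:
        return value, None
    return match.group("text"), match.group("lang")


def group_by_short_key(entries):
    """Collect the plain text of every value under its short_key, in input order."""
    grouped = defaultdict(list)
    for entry in entries:
        text, _lang = parse_literal(entry["value"])
        grouped[entry["short_key"]].append(text)
    return dict(grouped)


# Four of the entries from the example above, written as Python dicts.
sample = [
    {"node": "<https://smartive.ch/jobs>",
     "value": '"IE-edge,chrome=1"@de',
     "key": "<http://vocab.sindice.net/any23#X-UA-Compatible>",
     "short_key": "X-UA-Compatible"},
    {"node": "<https://smartive.ch/jobs>",
     "value": '"width=device-width, initial-scale=1, shrink-to-fit=no"@de',
     "key": "<http://vocab.sindice.net/any23#viewport>",
     "short_key": "viewport"},
    {"node": "<https://smartive.ch/jobs>",
     "value": '"width=device-width,initial-scale=1"@de',
     "key": "<http://vocab.sindice.net/any23#viewport>",
     "short_key": "viewport"},
    {"node": "<https://smartive.ch/jobs>",
     "value": '"ie=edge"@de',
     "key": "<http://vocab.sindice.net/any23#x-ua-compatible>",
     "short_key": "x-ua-compatible"},
]

if __name__ == "__main__":
    for name, values in group_by_short_key(sample).items():
        print(name, values)
```

Because the same `short_key` can occur more than once per page (`viewport` appears twice in the example), the sketch keeps a list of values per key instead of overwriting earlier ones.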

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> ------------------
>
>                 Key: NUTCH-1129
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1129
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.5
>
>         Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to extract RDF data from HTTP 
> and file resources. Although, as of writing, Any23 is not part of the ASF, 
> the project is working towards integration into the Apache Incubator. Once 
> the project proves its value, this would be an excellent addition to the 
> Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
