[jira] [Created] (JENA-820) Blank Node output under Hadoop can cause identifiers to diverge in multi-stage pipelines

Rob Vesse (JIRA) Thu, 27 Nov 2014 08:04:23 -0800

Rob Vesse created JENA-820:
------------------------------

             Summary: Blank Node output under Hadoop can cause identifiers to diverge in multi-stage pipelines
                 Key: JENA-820
                 URL: https://issues.apache.org/jira/browse/JENA-820
             Project: Apache Jena
          Issue Type: Improvement
          Components: RDF Tools for Hadoop
            Reporter: Rob Vesse
            Assignee: Rob Vesse
             Fix For: Jena 2.12.2



While writing up the documentation for the RDF Tools for Hadoop and enumerating 
the possible issues that blank nodes raise, I discovered an issue that I hadn't 
previously considered.

For a single job, the input and output formats ensure that blank nodes are 
consistently given the same identifier if they have the same syntactic ID and 
are in the same file.  This holds even when a file is read in multiple chunks 
by multiple map tasks.  However, each reduce task by its nature creates its own 
output file, so a single blank node can end up spread over multiple files.

If we then read these files into a subsequent job, a blank node may now appear 
in several files.  Even though it was originally a single node, our allocation 
policy takes the file path into account, so the identifiers diverge and become 
distinct blank nodes, which is incorrect behaviour.

Since there is no clear universal fix for this, what I am considering instead 
is introducing a configuration setting that allows the file path to be ignored 
for the purpose of blank node identifier allocation within a job.  Identifiers 
will then be allocated purely on the basis of the Job ID, and thus the same 
syntactic ID in any file will result in the same blank node identifier.  As the 
user will hopefully have left this turned off for the first job, even if we 
start with the same syntactic ID in different files, the normal allocation 
policy for the first job should ensure unique IDs for the later jobs.
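
To illustrate the divergence and the proposed setting, here is a minimal 
sketch in Python.  It is not the actual Jena implementation: the function 
allocate_bnode_id, the ignore_file_path flag, the job IDs and the file paths 
are all hypothetical, and hashing is only an assumed stand-in for however 
identifiers are really derived.

```python
import hashlib


def allocate_bnode_id(job_id, file_path, label, ignore_file_path=False):
    """Hypothetical allocator: derive a blank node ID from a seed and a label.

    By default the seed is the job ID plus the file path; with
    ignore_file_path=True the seed is the job ID alone.
    """
    seed = job_id if ignore_file_path else f"{job_id}\x00{file_path}"
    return hashlib.sha1(f"{seed}\x00{label}".encode("utf-8")).hexdigest()


# Job 1, default policy: the same label in the same file always maps to the
# same ID, while the same label in different files stays distinct.
node = allocate_bnode_id("job_1", "input/data.nt", "b0")
other = allocate_bnode_id("job_1", "input/other.nt", "b0")

# Assume two reducers each wrote a triple mentioning `node`, so the same
# label now appears in two output files that job 2 reads as input.
default_a = allocate_bnode_id("job_2", "out/part-r-00000", node)
default_b = allocate_bnode_id("job_2", "out/part-r-00001", node)

# With the file path ignored, job 2 maps both occurrences to one ID.
fixed_a = allocate_bnode_id("job_2", "out/part-r-00000", node, ignore_file_path=True)
fixed_b = allocate_bnode_id("job_2", "out/part-r-00001", node, ignore_file_path=True)

print("default policy diverges:", default_a != default_b)
print("path ignored, IDs agree:", fixed_a == fixed_b)
```

In this sketch the first job keeps the default policy, so labels that 
coincide across different input files are already distinct before the later 
job ignores file paths.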

My next step on this is to implement a failing unit test (and then temporarily 
ignore it) which demonstrates this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
