[ 
https://issues.apache.org/jira/browse/JENA-820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229736#comment-14229736
 ] 

ASF subversion and git services commented on JENA-820:
------------------------------------------------------

Commit ed71be184374e51b59fa921c7af56150399c6413 in jena's branch 
refs/heads/hadoop-rdf from [~rvesse]
[ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=ed71be1 ]

Improved fix for JENA-820

This commit ensures that the JENA-820 fix applies to all input formats,
not just line-based formats.  It also expands the test cases for blank
node divergence and identity to cover a wider range of formats.


> Blank Node output under Hadoop can cause identifiers to diverge in 
> multi-stage pipelines
> ----------------------------------------------------------------------------------------
>
>                 Key: JENA-820
>                 URL: https://issues.apache.org/jira/browse/JENA-820
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: RDF Tools for Hadoop
>            Reporter: Rob Vesse
>            Assignee: Rob Vesse
>             Fix For: Jena 2.12.2
>
>
> In writing up the documentation on the RDF Tools for Hadoop and enumerating 
> the possible issues that blank nodes imply I discovered an issue that I 
> hadn't previously considered.
> For a single job the input and output formats all ensure that blank nodes are 
> consistently given the same identifiers if they had the same syntactic ID and 
> were in the same file.  This is done even when a file is being read in 
> multiple chunks by multiple map tasks.  However by its nature each reduce 
> task will create an output file so potentially you can end up with blank 
> nodes spread over multiple files.
> However, if we then read these files into a subsequent job, the blank nodes 
> may now be spread across multiple files, so even though they were originally 
> the same node, our allocation policy will cause the identifiers to diverge 
> and become distinct blank nodes, which is incorrect behaviour.
> Since there is no clear universal fix for this what I am considering doing is 
> instead introducing a configuration setting that will allow the file path to 
> be ignored for the purpose of blank node identifier allocations within a job. 
>  This will mean that identifiers are purely allocated on the basis of the Job 
> ID and thus the same syntactic ID in any file will result in the same blank 
> node identifier.  As the user will hopefully have left this turned off for 
> the first job, even if we start with the same syntactic ID in different 
> files, the normal allocation policy for the first job should ensure unique 
> IDs for the later jobs.
> My next step on this is to implement a failing unit test (and then 
> temporarily ignore it) which demonstrates this issue.
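
The divergence described above, and the effect of the proposed setting, can
be illustrated with a minimal sketch. This is not Jena's implementation: the
class BlankNodeIdSketch, the allocate helper, the ignoreFilePath flag, and the
job IDs and file names are all hypothetical, and hashing the seed with a
name-based UUID is only an assumed stand-in for the real allocation policy.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical sketch, not part of Jena's API.
class BlankNodeIdSketch {

    // Derives a deterministic identifier from the job ID, optionally the
    // file path, and the syntactic blank node ID found in the input.
    static String allocate(String jobId, String filePath,
                           String syntacticId, boolean ignoreFilePath) {
        String seed = ignoreFilePath ? jobId : jobId + "|" + filePath;
        String key = seed + "|" + syntacticId;
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        // Job 1, default policy: the same syntactic ID in two different
        // input files denotes two different nodes, so the IDs must differ.
        String a = allocate("job_1", "a.nt", "b0", false);
        String b = allocate("job_1", "b.nt", "b0", false);
        System.out.println("job 1 keeps nodes distinct: " + !a.equals(b));

        // Suppose node 'a' was written by two reduce tasks into two files.
        // Job 2, default policy: the file path differs, so the IDs diverge.
        String d1 = allocate("job_2", "part-00000", a, false);
        String d2 = allocate("job_2", "part-00001", a, false);
        System.out.println("job 2 default policy diverges: " + !d1.equals(d2));

        // Job 2, file path ignored: the same label maps to the same ID.
        String g1 = allocate("job_2", "part-00000", a, true);
        String g2 = allocate("job_2", "part-00001", a, true);
        System.out.println("job 2 ignoring path preserves identity: " + g1.equals(g2));
    }
}
```

In this sketch the first job runs with the flag off, so identical syntactic
IDs in different source files stay distinct, and only the later job ignores
the file path, which matches the usage the description anticipates.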



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
