The documentation on Fedora Identifiers [1] states (under
Normalization) that the separator character ":" may occur as "%3A" or "%3a",
but should be normalized to a colon (":").  The only other PID normalisation
that occurs is upper-casing of hex digits in escaped octets.  In the code,
PID normalisation occurs anywhere that a PID object is instantiated from a
string, so these various forms are widely accepted.
 
In PIDs we don't seem to treat escaped octets as "representations" of the
underlying character.  Instead they are a syntactic means for users to
"include" characters not otherwise allowable, i.e. they are treated as
opaque, as three distinct characters.  The PID "is" the string of
characters, including the three characters that express each escaped octet
(normalized to upper case), and it is never decoded further.  Furthermore,
an escaped octet counts as three characters when checking the max length.
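
To make the normalization described above concrete, here is a minimal
sketch in Python.  It is not the actual Fedora code; the function name
normalize_pid is hypothetical, it does no syntax validation, and it assumes
that only the first ":" / "%3A" is the namespace separator (any later "%3A"
stays an opaque escaped octet, as in the normalized form changeme:one%3Atwo).

```python
import re

# An escaped octet is "%" followed by two hex digits.
ESCAPED_OCTET = re.compile(r"%[0-9A-Fa-f]{2}")
# The namespace separator, literal or escaped (after upper-casing).
SEPARATOR = re.compile(r":|%3A")


def normalize_pid(pid):
    """Hypothetical sketch of PID normalization (no validation)."""
    # Upper-case the hex digits of every escaped octet, e.g. %3a -> %3A.
    upper = ESCAPED_OCTET.sub(lambda m: m.group(0).upper(), pid)
    # Assumption: only the FIRST separator is rewritten to ":".
    # Other escaped octets are left as three opaque characters.
    return SEPARATOR.sub(":", upper, count=1)


forms = ["changeme:one%3Atwo", "changeme%3Aone%3Atwo", "changeme%3aone%3atwo"]
normalized = [normalize_pid(f) for f in forms]

# An escaped octet is never decoded, so it contributes three characters
# to the length of the PID.
length_escaped = len(normalize_pid("changeme:%58"))
length_plain = len(normalize_pid("changeme:X"))
```

Under these assumptions all three entries in forms normalize to the same
string, while changeme:%58 stays two characters longer than changeme:X.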
 
For instance the PIDs changeme:%58 and changeme:X are treated as distinct -
the %58 is not decoded to "X" (unlike URI normalisation, which would treat
%58 and X as semantically equivalent).
 
Or to put it another way: in a URI an escaped octet is a representation of a
single data character, whereas in a PID an escaped octet seems to be just
three (syntactically-valid) characters.
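
The difference between the two views can be sketched in a few lines of
Python.  This only illustrates the comparison rules as I understand them;
the variable names are made up, and urllib.parse.unquote stands in for the
percent-decoding step of URI comparison.

```python
from urllib.parse import unquote

pid_a = "changeme:%58"
pid_b = "changeme:X"

# PID view: escaped octets are opaque, so comparison is plain string
# equality and the two PIDs are different.
same_as_pids = pid_a == pid_b

# URI view: %58 represents the character "X", so the two strings compare
# equal once percent-escapes are decoded.
same_as_uris = unquote(pid_a) == unquote(pid_b)
```

Here same_as_pids is False and same_as_uris is True, which is exactly the
mismatch described above.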
 
If we allow non-normalized PIDs in REST API URLs, and wish to use these URIs
as persistent identifiers (FCREPO-650 [2]) then we potentially run into
problems.
 
For instance the following PIDs are legal:
 
1. changeme:one%3Atwo (normalized form)
2. changeme%3Aone%3Atwo
3. changeme%3aone%3atwo
 
URL-escaped, these would be:
1. changeme:one%253Atwo (the ":" may also be escaped, but will be treated as
equivalent from a URI perspective) 
2. changeme%253Aone%253Atwo
3. changeme%253aone%253atwo
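
As a quick check of the escaped forms listed above, the following Python
sketch encodes the three PIDs with urllib.parse.quote.  Leaving ":"
unescaped (safe=":") is my choice to match the list; it is not something
the REST API is known to require.

```python
from urllib.parse import quote, unquote

pids = [
    "changeme:one%3Atwo",    # normalized form
    "changeme%3Aone%3Atwo",
    "changeme%3aone%3atwo",
]

# Percent-encode each PID for use in a URL path segment.  The "%" of each
# escaped octet becomes "%25"; ":" is kept literal here.
escaped = [quote(pid, safe=":") for pid in pids]

# One round of URL decoding gives back the original PID strings, which
# are still three different strings from a URI point of view.
decoded = [unquote(e) for e in escaped]
distinct_uris = len(set(escaped))
```

The three escaped strings remain distinct (distinct_uris is 3), even
though PID normalization on the server maps all of them to the same
object.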
 
This would result in REST API URLs that are equivalent in action, but are
not semantically equivalent from a URI perspective (they are different
identifiers).
 
URLs with all of the above forms currently work fine in the new REST API.
 
Should we ensure that the REST API *only* accepts the normalized form of a
PID (and then correctly URL-encoded) as part of FCREPO-650?  Otherwise we
might risk users issuing semantically different (but functionally
equivalent) URIs for the same resources.
 
Possibly we have an issue with how the URI forms are generated also - the
documentation [1] states that info:fedora/ is prepended.  This means that
the PIDs changeme:%58 and changeme:X - which are distinct - will result in
URIs info:fedora/changeme:%58 and info:fedora/changeme:X - which are
semantically equivalent.  I haven't tested this yet, however.
 
As an aside - does anyone recall why the "%3A" representation of the
namespace separator was implemented (and therefore part of the
normalization) in the first place?  It shouldn't be necessary in SOAP API
calls (as ":" is legal in XML); and in REST API calls normal URL-encoding
should cope with any issues with the % character.  Though I note (see [3])
that URL decoding doesn't in fact seem to happen with the "LITE" APIs - so
possibly it's something to do with that?
 
Regards
Steve
 
 
[1] http://www.fedora-commons.org/confluence/display/FCR30/Fedora+Identifiers
[2] http://www.fedora-commons.org/jira/browse/FCREPO-650
[3] http://www.fedora-commons.org/jira/browse/FCREPO-627?focusedCommentId=15488&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_15488