[jira] [Commented] (OAK-5272) Expose BlobStore API to provide information whether blob id is content hashed

2018-03-01 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381740#comment-16381740
 ] 

Thomas Mueller commented on OAK-5272:

To make it easier to migrate to other hashing algorithms, and to allow 
identifiers that are not content hashes, I think it makes sense to further 
extend the API (maybe just the internal API), for example as follows:

{noformat}
/**
 * For this binary, returns the map of all known content hashes
 * and CRC codes, together with the hash algorithm used.
 * This can save an application from having to call getStream()
 * and calculate the CRC / content hash itself.
 * The returned map can be empty if the implementation would have to
 * calculate the values.
 * If not empty, then the map contains one entry for each CRC / content hash
 * already calculated.
 * The value is always hex-encoded (lowercase) without spaces.
 * For example, it could return a "CRC32" (if known), "SHA-1" (if known),
 * "SHA-256", and so on.
 */
Map<String, String> getKnownContentHashes();
{noformat}

For example Amazon S3 seems to calculate the MD5 and provide that as the ETag. 
While MD5 isn't secure, it can be used in the same way as the CRC32, to say 
whether two binaries are different for sure, or possibly the same.
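
A minimal sketch of how a caller might combine two such maps, assuming the map is keyed by algorithm name with hex-encoded values. The class name KnownHashComparison and the choice of which algorithms count as strong enough to prove equality are assumptions for illustration, not part of the proposal or of Oak:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper; not part of the Oak API. */
final class KnownHashComparison {

    enum Equality { EQUALS, DIFFERENT, UNKNOWN }

    // Assumption: only these algorithms are trusted to prove equality.
    // CRC32 and MD5 can only prove that two binaries differ.
    private static final Set<String> STRONG =
            new HashSet<>(Arrays.asList("SHA-256", "SHA-512"));

    static Equality compare(Map<String, String> a, Map<String, String> b) {
        boolean strongMatch = false;
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other == null) {
                continue; // this algorithm is not known for both binaries
            }
            if (!other.equals(e.getValue())) {
                return Equality.DIFFERENT; // any mismatch proves different content
            }
            if (STRONG.contains(e.getKey())) {
                strongMatch = true;
            }
        }
        // A match on a weak code alone only means "possibly the same".
        return strongMatch ? Equality.EQUALS : Equality.UNKNOWN;
    }
}
```

With this shape, two binaries that share only an MD5 (for example an S3 ETag) yield UNKNOWN when the values match and DIFFERENT when they do not.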

> Expose BlobStore API to provide information whether blob id is content hashed
>
> Key: OAK-5272
> URL: https://issues.apache.org/jira/browse/OAK-5272
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: blob
> Reporter: Amit Jain
> Priority: Major
> Fix For: 1.10
>
>
> As per discussion in OAK-5253 it's better to have some information from the 
> BlobStore(s) whether the blob id can be solely relied upon for comparison.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (OAK-5272) Expose BlobStore API to provide information whether blob id is content hashed

2018-02-28 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379987#comment-16379987
 ] 

Thomas Mueller commented on OAK-5272:

>  But blobs from different DS, is that even happening in reality?

[~alexander.klimetschek] I agree it's probably not common, but I think it can 
happen:
* when migrating from one DS to another (e.g. from FileDataStore to S3)
* when migrating from SHA-1 to SHA-256
* when using "multi-stage" DS (e.g. S3, plus Amazon Glacier)

While it is "rare", I think having a clear, fast solution would be better, so 
that we have few exceptional cases.


[jira] [Commented] (OAK-5272) Expose BlobStore API to provide information whether blob id is content hashed

2018-02-21 Thread Alexander Klimetschek (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372136#comment-16372136
 ] 

Alexander Klimetschek commented on OAK-5272:


IMO the better solution is to leave the equals() method to the 
blobstore/datastore implementation instead of the globally shared 
AbstractBlob.equals().

If two blobs come from the same DS, the DS knows whether they are the same 
based on the content id or not (i.e. could do OAK-5253). If they come from 
different DS, then use the logic as of today. But blobs from different DS, is 
that even happening in reality? I mean from actually different DS, not if one 
is an inline SegmentBlob and the other from a DS. IIUC, in case of a 
CompositeBlobStore all underlying DS must follow the same content hashing logic 
anyway.
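
A rough sketch of that delegation, under the assumption that each blob keeps a reference to the store it came from. All type names here (BlobIdComparator, ContentHashedStore, OpaqueIdStore, DelegatingBlob) are hypothetical and do not correspond to Oak classes:

```java
import java.util.Optional;

/** Hypothetical store-side hook; empty means "ids alone cannot decide". */
interface BlobIdComparator {
    Optional<Boolean> sameContent(String id1, String id2);
}

/** A store whose ids are content hashes can answer from the ids alone. */
final class ContentHashedStore implements BlobIdComparator {
    @Override
    public Optional<Boolean> sameContent(String id1, String id2) {
        return Optional.of(id1.equals(id2));
    }
}

/** A store with opaque ids can only confirm equality for identical ids. */
final class OpaqueIdStore implements BlobIdComparator {
    @Override
    public Optional<Boolean> sameContent(String id1, String id2) {
        return id1.equals(id2) ? Optional.of(true) : Optional.empty();
    }
}

final class DelegatingBlob {
    private final BlobIdComparator store;
    private final String blobId;

    DelegatingBlob(BlobIdComparator store, String blobId) {
        this.store = store;
        this.blobId = blobId;
    }

    /** Empty result: the caller falls back to the existing comparison logic. */
    Optional<Boolean> sameContentAs(DelegatingBlob other) {
        if (store != other.store) {
            return Optional.empty(); // blobs from different stores
        }
        return store.sameContent(blobId, other.blobId);
    }
}
```

An equals() implementation could consult sameContentAs() first and only run today's comparison when the result is empty.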


[jira] [Commented] (OAK-5272) Expose BlobStore API to provide information whether blob id is content hashed

2018-02-07 Thread Amit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355203#comment-16355203
 ] 

Amit Jain commented on OAK-5272:


[~tmueller] Thanks!

bq. what about the case where one blob has a SHA-1 content hash, and the other 
has a SHA-256 content hash?
Yeah, I missed that; in that case the method you added makes sense. We'll need 
to add the method to the concrete instances (BlobStoreBlob/SegmentBlob), or 
alternatively add it only to the AbstractBlob and then expose the required info 
from the concrete instances (either the BlobStore or other info directly).

Let me work on it, including tests, and then I can put out a patch.


[jira] [Commented] (OAK-5272) Expose BlobStore API to provide information whether blob id is content hashed

2018-02-07 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355158#comment-16355158
 ] 

Thomas Mueller commented on OAK-5272:

[~amitjain] what about the case where one blob has a SHA-1 content hash, and 
the other has a SHA-256 content hash?
The content hash is different, but the content could still be the same.

>  currently the BlobStore(s) are not aware of the Blob object.

Are blobs aware of the blob store? If yes, what about adding a method "compare 
content" to the blob, something like this:

{noformat}
public enum Equality { EQUALS, DIFFERENT, UNKNOWN }

public Equality compareContent(Blob other) {
    if (this == other) {
        return Equality.EQUALS;
    } else if (other == null) {
        return Equality.DIFFERENT;
    }
    if (length() != other.length()) {
        return Equality.DIFFERENT;
    }
    if (!blobStore.hasContentAdressableBlobIds()) {
        return Equality.UNKNOWN;
    }
    // TODO is strict type check needed, or is "instanceof" sufficient?
    if (other.getClass() == getClass()) {
        BlobStoreBlob otherBlob = (BlobStoreBlob) other;
        if (!otherBlob.blobStore.hasContentAdressableBlobIds()) {
            return Equality.UNKNOWN;
        }
        // TODO maybe blobId contains the length? in this case, truncate that part
        if (otherBlob.blobId.length() != blobId.length()) {
            return Equality.UNKNOWN;
        }
        return blobId.equals(otherBlob.blobId) ? Equality.EQUALS : Equality.DIFFERENT;
    }
    return Equality.UNKNOWN;
}
{noformat}

I know, many cases...

Your method is still needed, but we would need to extend the description a bit:

{noformat}
/**
 *
 * Will return true if blob ids are generated from content hash.
 * Content hashes of the same length can be used for equality checks
 * (content hashes of different length are generated with different
 * algorithms).
 *
 * @return true if blobs are content addressable
 */
boolean hasContentAdressableBlobIds();
{noformat}
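
A caller still has to resolve the UNKNOWN case somehow. The following is a sketch of one way to do that, assuming a byte-by-byte stream comparison is an acceptable fallback. ContentFallback and StreamSource are hypothetical names, and the Equality enum is repeated only to keep the example self-contained:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; not part of the Oak API. */
final class ContentFallback {

    enum Equality { EQUALS, DIFFERENT, UNKNOWN }

    /** Hypothetical supplier, so streams are only opened when needed. */
    interface StreamSource {
        InputStream open() throws IOException;
    }

    static boolean contentEquals(Equality quick, StreamSource a, StreamSource b)
            throws IOException {
        if (quick != Equality.UNKNOWN) {
            return quick == Equality.EQUALS; // decided without reading content
        }
        // Slow path: compare the two streams byte by byte.
        try (InputStream x = new BufferedInputStream(a.open());
             InputStream y = new BufferedInputStream(b.open())) {
            int bx;
            do {
                bx = x.read();
                if (bx != y.read()) {
                    return false;
                }
            } while (bx != -1);
            return true;
        }
    }
}
```

The point of the three-valued result is that the slow path runs only when the ids cannot decide.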


[jira] [Commented] (OAK-5272) Expose BlobStore API to provide information whether blob id is content hashed

2018-01-29 Thread Amit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343163#comment-16343163
 ] 

Amit Jain commented on OAK-5272:


[~mduerig], [~tmueller]

Would adding the method below to the BlobStore serve the purpose? 
This keeps the amount of required changes low. I have avoided pushing the 
blob equality method to the BlobStore because currently the BlobStore(s) are not 
aware of the Blob object. 

{code:java}
/**
 *
 * Will return true if blob ids are generated from content hash
 *
 * @return true if blobs are content addressable
 */
boolean hasContentAdressableBlobIds();
{code}
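
To illustrate what the flag would distinguish, here is a small sketch with two hypothetical id policies: one derives the id from a SHA-256 of the content, the other assigns a random id. IdPolicy, Sha256IdPolicy and RandomIdPolicy are made-up names for illustration. Only the method name hasContentAdressableBlobIds() comes from the proposal above:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

/** Hypothetical abstraction; not the Oak BlobStore interface. */
interface IdPolicy {
    String newBlobId(byte[] content);

    boolean hasContentAdressableBlobIds();
}

/** Ids are the lowercase hex SHA-256 of the content: equal content, equal id. */
final class Sha256IdPolicy implements IdPolicy {
    @Override
    public String newBlobId(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    @Override
    public boolean hasContentAdressableBlobIds() {
        return true;
    }
}

/** Ids are random: equal content may get different ids. */
final class RandomIdPolicy implements IdPolicy {
    @Override
    public String newBlobId(byte[] content) {
        return UUID.randomUUID().toString();
    }

    @Override
    public boolean hasContentAdressableBlobIds() {
        return false;
    }
}
```

Only with the first policy can a difference in ids be taken as a difference in content.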


[jira] [Commented] (OAK-5272) Expose BlobStore API to provide information whether blob id is content hashed

2016-12-12 Thread Michael Dürig (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15741552#comment-15741552
 ] 

Michael Dürig commented on OAK-5272:


Or maybe add APIs to delegate determining equality of blobs to the blob store!?