Hi Dan,

The WebContent library now tracks DOM changes and tries to scan the changed nodes efficiently. It uses the MutationObserver API in browsers that support it (Chrome/Chromium, Firefox, Safari) and falls back on Mutation Events (now deprecated) for the others (IE, Opera). I have updated the demo - http://sids-aquarius.github.io/webcontent/example/index.html. It adds media elements in a random order on user clicks. The library also keeps a cache of scanned URLs, so a URL in a mutated DOM node that has been scanned before can be looked up quickly.
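For reference, a minimal sketch of the observer wiring described above (the names `watchDom`, `scanNodes`, and `collectAddedNodes` are illustrative, not the actual webcontent.js API):

```javascript
// Flatten the addedNodes of a batch of mutation records into one array.
// A pure helper, so the same scanning path serves both MutationObserver
// records and nodes collected from the Mutation Events fallback.
function collectAddedNodes(records) {
  var nodes = [];
  records.forEach(function (record) {
    Array.prototype.forEach.call(record.addedNodes, function (node) {
      nodes.push(node);
    });
  });
  return nodes;
}

// Watch a root node for insertions; scanNodes is whatever callback
// scans the new nodes for media elements.
function watchDom(root, scanNodes) {
  if (typeof MutationObserver !== 'undefined') {
    var observer = new MutationObserver(function (records) {
      scanNodes(collectAddedNodes(records));
    });
    observer.observe(root, { childList: true, subtree: true });
  } else {
    // Deprecated Mutation Events fallback for older browsers.
    root.addEventListener('DOMNodeInserted', function (event) {
      scanNodes([event.target]);
    }, false);
  }
}
```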
Let me know what you think of the approach. I was wondering whether there should be a further fallback for browsers that support neither of the two - probably polling the DOM once in a while to look for changes. Meanwhile, I am starting some work on text scanning.

Thanks,
Siddharth

On Wed, May 1, 2013 at 10:52 PM, Siddharth Kothari <[email protected]> wrote:
> Hi Dan,
>
> Thanks for the feedback. :)
>
> Yes, tracking DOM changes has to be efficient. The DOM MutationObserver
> (https://developer.mozilla.org/en-US/docs/DOM/MutationObserver) seems
> like the way to go, but it doesn't enjoy support from all browsers,
> so a fallback might be needed.
>
> I think there might be a trade-off in doing the fingerprinting
> locally if it is computationally expensive, but I like the idea. I
> also found a benchmark for perceptual hashes of images -
> http://rihamark.nllk.net/ - but its last update was in 2010, so it may
> need some dusting off.
>
> This is how I am currently thinking about the fingerprinting task priorities:
>
> * List existing solutions and test them against a benchmark. We
>   might have to come up with different benchmarks for different media
>   types.
>   a. For images, we can use Rihamark. Even if it is broken, the basic
>      idea is still pretty good - generate attacks using resize, blur,
>      rotation, scale, and PNG and JPEG compression.
>   b. For audio and video, we can generate attacks using crops,
>      resizes, and cutting and jumbling parts.
>   c. For text, the benchmark would be simple to create.
>   I am not sure what a good data size would be; 100 documents seems
>   reasonable if we can't figure out a way to automate a. and b.
>   Otherwise, we can have on the order of 1000 documents for each
>   corpus. The documents can be split into two halves - one containing
>   remixes using the attacks, and the other being dissimilar.
>
> * Text fingerprinting using the shingling technique should be easy to
>   implement. We can have that working first.
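As a rough illustration of the shingling technique mentioned above (a sketch, not the eventual library API): w-shingling splits text into overlapping runs of w consecutive words and compares the resulting sets, for example with a Jaccard similarity.

```javascript
// Split text into overlapping shingles of w consecutive words,
// returned as a set (plain object used as a set, ES5-style).
function shingles(text, w) {
  var words = text.toLowerCase().split(/\s+/).filter(Boolean);
  var out = {};
  for (var i = 0; i + w <= words.length; i++) {
    out[words.slice(i, i + w).join(' ')] = true;
  }
  return out;
}

// Jaccard similarity of the two shingle sets: |A n B| / |A u B|.
// 1 means identical shingle sets, 0 means no shingle in common.
function similarity(a, b, w) {
  var sa = shingles(a, w), sb = shingles(b, w);
  var inter = 0, union = 0, key;
  for (key in sa) { union++; if (sb[key]) inter++; }
  for (key in sb) { if (!sa[key]) union++; }
  return union === 0 ? 0 : inter / union;
}
```

Because shingles overlap, cutting and jumbling parts of a document still leaves most shingles intact, which is what makes this attractive for the remix-detection benchmark described above.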
> The same can also help in the scanning of text from the DOM, and
> perhaps even in computing the hashes locally.
>
> * Depending on the benchmark results for images, we may go with an
>   existing library like pHash, or come up with a faster/simpler method
>   that works well for our case. If we decide to write a fingerprinting
>   algorithm for hashing images, I would prefer JavaScript since that
>   opens up the possibility of performing local computations.
>
> Let me know how this sounds.
>
> Thanks,
> Siddharth
>
>
> On Tue, Apr 30, 2013 at 10:57 PM, Dan Mills <[email protected]> wrote:
>> Hi Siddharth,
>>
>> Awesome - this is definitely in the right direction, I think.
>>
>> I just missed you on IRC; I'll be on and off today, so maybe I can catch
>> you later tonight.
>>
>> On DOM changes: performance would be a concern - and it would need to be
>> robust on client-side dynamic apps (which remain loaded and do a lot of
>> JS/DOM work over time, instead of loading new pages on user action). But
>> it is definitely an area to explore, and we should think about which APIs
>> are more / less appropriate (e.g. we could listen for an event that the
>> app is expected to emit on DOM changes, etc.).
>>
>> On text scanning: yes, agreed.
>>
>> On fingerprinting: sounds good. We might need to make the web content API
>> aware of fingerprinting for performance reasons in some situations, but
>> your start is good. (By "performance reasons" I mean: perhaps we can do
>> some fingerprinting client-side by dumping images into a canvas and doing
>> the math locally - that way we don't need to fetch the image server-side.)
>>
>> Also: it is likely that users would want to limit scanning to certain
>> classes/kinds of DOM nodes (e.g. all nodes with class "article", etc.).
>>
>> Great start!
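The client-side canvas idea Dan raises could look roughly like the following. The hashing here is an assumed stand-in (a simple average hash over grayscale values), not a method anyone in the thread has committed to; in a browser, the `gray` array would come from `CanvasRenderingContext2D.getImageData` after drawing a small resized copy of the image into a canvas.

```javascript
// Simple average hash over an array of grayscale pixel values (0-255):
// each output bit is 1 if the pixel is brighter than the mean. In the
// browser, the values would be derived from the RGBA data returned by
// canvas getImageData; here the function is kept pure for clarity.
function averageHash(gray) {
  var sum = gray.reduce(function (a, b) { return a + b; }, 0);
  var mean = sum / gray.length;
  return gray.map(function (v) { return v > mean ? '1' : '0'; }).join('');
}

// Hamming distance between two equal-length bit strings; a small
// distance suggests perceptually similar images.
function hammingDistance(a, b) {
  var d = 0;
  for (var i = 0; i < a.length; i++) if (a[i] !== b[i]) d++;
  return d;
}
```

The point of doing this client-side, as Dan notes, is that only the short hash would need to be sent to the server, instead of fetching the image server-side.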
>>
>> Dan
>>
>> On Tuesday, April 30, 2013 at 5:52 AM, Siddharth Kothari wrote:
>>
>> I built a basic WebContent library which can scan custom media and
>> add a solid border if the OpenHome service thinks it's CC-licensed
>> content. Readme and demo here -
>> http://sids-aquarius.github.io/webcontent/.
>>
>> The most glaring missing parts are:
>> * It doesn't track changes in the DOM. It should ideally scan for
>> content in the nodes that are changed.
>> * Add support for scanning text.
>> * The fingerprinting is currently an identity function, but the
>> WebContent library has nothing to do with that.
>>
>> A code review would be helpful. It's a small lib (< 50 lines) -
>> https://github.com/sids-aquarius/webcontent/blob/master/lib/webcontent.js.
>> I would like to know if I am thinking in the correct direction.
>>
>> I would also like to know about the challenges I should anticipate over
>> time. Imho, the WebContent project sounds a bit under-scoped. The only
>> challenge I can think of is making it robust enough to work in a variety
>> of settings.
>>
>> Dan - Will you be on IRC sometime today? I wanted to chat about the
>> fingerprinting library.
>>
>> Thanks,
>> Siddharth
>>
>> On Mon, Apr 29, 2013 at 3:05 AM, Siddharth Kothari
>> <[email protected]> wrote:
>>
>> Hi Dan,
>>
>> I have hosted my fork of Open Home here -
>> http://106.187.50.124:50124/. Is the schema for the DB laid out? I had
>> to create one, and change one of the arguments in the getItem() and
>> updateItem() calls, replacing HashKeyElement with the actual attribute
>> name, to get it working. Also, look out for my pull request (it
>> contains minor fixes).
>>
>> About the task:
>> I will wait until we discuss more about the Fingerprinting library.
>> Meanwhile, I will get started with the Web Content library.
>> 1. For a start, I am planning to build/use a basic HTML parser that
>> can fetch all the sources from the <img>/<video>/<audio> tags and send
>> them over as a JSON object to the Open Home server.
>> 2.
>> The Open Home server returns a JSON object indicating which of the
>> contents are CC-licensed.
>> 3. Inject some markup to indicate the contents that are CC-licensed.
>>
>> As far as fingerprinting is concerned, I will index SHA-1 hashes of a
>> few static content files (or I can hash the files I retrieve from a
>> user's Dropbox account).
>>
>> Let me know if you would like to amend any of the above.
>>
>> Thanks,
>> Siddharth
>>
>> On Fri, Apr 26, 2013 at 7:31 AM, Dan Mills <[email protected]> wrote:
>>
>> On Thursday, April 25, 2013 at 12:52 PM, Siddharth Kothari wrote:
>>
>> Hi everyone,
>>
>>
>> Hello!
>>
>> I am interested in a couple of projects - the CC Web Content API and the
>> Media Fingerprinting Library. I wanted to see if I understand how these
>> projects fit into the OpenHome project before starting some contributions.
>>
>> The way I envision OpenHome is as a central system where CC-licensed
>> content is indexed by its hashes. The CC Web Content API could be used
>> by sites that aggregate user content - say, GitHub, YouTube, or
>> SlideShare - to find remixes of existing CC-licensed content. The Media
>> Fingerprinting library helps with deduplication of content (it should
>> also work when content is cropped, clipped, blurred, or quoted in
>> parts). Am I understanding this correctly?
>>
>>
>> Roughly, yes. The DB needs to be laid out such that it can be queried
>> from a fingerprint alone (which is not like an MD5/SHA-1 hash). The
>> Fingerprinting project should aim to catch cropped, distorted, resized,
>> etc. files.
>>
>> I find the Fingerprinting project fascinating, but delving more into the
>> idea and looking at pHash.org, I realized it already implements
>> fingerprinting for image, audio, and video content and provides this as a
>> nice API - http://www.phash.org/docs/howto.html. Unless we find GPLv3 too
>> restrictive, I can't think of a good reason not to use it.
>> Perhaps pHash
>> can be extended to support text and compound media types (ppt, pdf). But
>> I think starting with pHash and supporting text using w-shingling can be
>> a pretty good start for the fingerprinting library. I would like to hear
>> more thoughts on this.
>>
>>
>> I have heard mixed reviews of pHash. I think a first step in the project
>> should be to come up with a set of tests and metrics, and try out pHash
>> as well as other solutions.
>>
>> The CC Web Content API project sounds appealing, since it is the glue
>> that binds the other parts together, and is perhaps crucial to a
>> successful implementation of the OpenHome project. Imo, this could
>> perhaps be meshed with the Fingerprinting project (if pHash is used as a
>> base). Essentially, the current Fingerprinting task is reduced to
>> exposing the pHash library via a nice API, and over time,
>> pHash/fingerprinting algorithms can be added/improved.
>>
>>
>> Yes, if you'd like to focus on the Web content API, then you can
>> abstract away the fingerprinting portion. Even straight-up SHA-1 would
>> work for a demo of the Web content API (it wouldn't catch modified
>> images, but it would catch the same file in other webpages).
>>
>> Let me know if I am making sense. Sorry if it's difficult to follow; we
>> can carry this conversation on to IRC. My nick is sids_aquarius.
>>
>>
>> Sounds good! I'm traveling until Sunday, but will try to drop by when
>> possible.
>>
>> Dan
>>
>> _______________________________________________
>> cc-devel mailing list
>> [email protected]
>> http://lists.ibiblio.org/mailman/listinfo/cc-devel
