Re: [cc-devel] Queries about OpenHome projects Web Content API, and Fingerprinting Library

Dan Mills Mon, 27 May 2013 15:16:05 -0700

Hi Siddharth,

Sorry we did not end up accepting your application - but I hope you will stay 
and help us hack on these ideas.


I think we can focus on newer browsers first where polling is not required. 
I've been playing around with using angularjs for more stateful client-side app 
development, and I think we'll be targeting newer browsers anyway. 

Dan


On Monday, May 20, 2013 at 1:33 PM, Siddharth Kothari wrote:

> Hi Dan,
> 
> The Webcontent library now tracks DOM changes and tries to scan the
> changed nodes efficiently. It uses the Mutation Observer API in
> browsers that support it (Chrome/Chromium, Firefox, Safari) and falls
> back on Mutation Events (now deprecated) in the others (IE, Opera).
> I have updated the demo link -
> http://sids-aquarius.github.io/webcontent/example/index.html. It adds
> media elements in a random order on user clicks. The webcontent
> library also keeps a cache of scanned URLs, so it can quickly look up
> whether a URL in a mutated DOM node has been scanned before.
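
A rough sketch of the scanned-URL cache described above. The names createScanCache and lookup are made up for illustration and are not taken from webcontent.js; lookup stands in for whatever call asks the service about a URL.

```javascript
// Hypothetical sketch: none of these names come from webcontent.js.
// Remembers the verdict for every URL already scanned, so a URL that
// reappears in a mutated DOM node is not looked up again.
function createScanCache(lookup) {
  const verdicts = new Map(); // url -> result returned by lookup(url)

  return {
    // Returns the cached verdict, calling lookup(url) only on a cache miss.
    scan(url) {
      if (!verdicts.has(url)) {
        verdicts.set(url, lookup(url));
      }
      return verdicts.get(url);
    },
    // Number of distinct URLs scanned so far.
    size() {
      return verdicts.size;
    },
  };
}
```

A Map keyed by URL keeps repeat lookups cheap; a real cache would probably also need some way to expire entries.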
> 
> Let me know what you think about the approach. I was wondering if
> there should be a further fallback for browsers that support neither
> of the two, probably polling the DOM once in a while to look for
> changes. Meanwhile, I am starting some work on text scanning.
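
One possible way to pick between the three strategies discussed here (Mutation Observer, Mutation Events, polling). This is only a sketch: pickChangeTracking is a hypothetical helper, and using addEventListener as a stand-in for Mutation Events support is an assumption, since those events are hard to feature-detect reliably.

```javascript
// Hypothetical helper; `win` stands in for the browser's window object.
function pickChangeTracking(win) {
  // Some WebKit-based browsers exposed the API under a vendor prefix.
  if (win.MutationObserver || win.WebKitMutationObserver) {
    return 'mutation-observer';
  }
  // Assumption: a browser with addEventListener is treated as able to
  // deliver Mutation Events such as DOMNodeInserted.
  if (typeof win.addEventListener === 'function') {
    return 'mutation-events';
  }
  // Last resort: re-scan the DOM on a timer.
  return 'polling';
}
```

Passing the window object in as a parameter keeps the decision easy to exercise outside a browser.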
> 
> Thanks,
> Siddharth
> 
> On Wed, May 1, 2013 at 10:52 PM, Siddharth Kothari
> <[email protected] (mailto:[email protected])> wrote:
> > Hi Dan,
> > 
> > Thanks for the feedback. :)
> > 
> > Yes, tracking DOM changes has to be efficient. The DOM mutation observer
> > https://developer.mozilla.org/en-US/docs/DOM/MutationObserver seems
> > like the way to go, but it isn't supported by all browsers,
> > so a fallback might be needed.
> > 
> > I think there might be a conflict of interest in doing fingerprinting
> > locally if it is computationally expensive, but I like the idea. I
> > also found a benchmark for perceptual hashes of images -
> > http://rihamark.nllk.net/, but its last update was in 2010, so we may
> > have to dust it off.
> > 
> > This is how I am currently thinking about the fingerprinting task 
> > priorities -
> > 
> > * List down existing solutions, and test them against a benchmark. We
> > might have to come up with different benchmarks for different media
> > types.
> > a. For images, we can use rihamark. If it is broken, the basic
> > idea is still pretty good - generate attacks using resize, blur,
> > rotation, scale, PNG and JPEG compression.
> > b. For audio and video, we can generate attacks using crops,
> > resize, cutting and jumbling parts.
> > c. For text, the benchmark would be simple to create.
> > I am not sure what a good data size would be. I would say 100 is a
> > good number if we can't figure out a way to automate a. and b. Otherwise,
> > we can have on the order of 1000 documents for each corpus. The
> > documents can be split into two halves - one containing remixes made
> > using the attacks, and the other being dissimilar.
> > 
> > * Text fingerprinting using the shingling technique should be easy to
> > implement. We can have that working first. The same can also help in
> > the scanning of text from the DOM, and perhaps even in computing the
> > hashes locally.
> > 
> > * Depending on the benchmark results of images, we may go with an
> > existing library like pHash, or come up with a faster/simpler method
> > which works well for our case. If we decide to write a fingerprinting
> > algorithm for hashing images, I would prefer javascript since that
> > provides for the possibility of performing local computations.
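
For the text item in the list above, a minimal sketch of w-shingling with Jaccard resemblance. The function names, the default w = 3 and the ASCII-only tokenizer are assumptions made for illustration; nothing here comes from an existing OpenHome library.

```javascript
// Splits text into lowercase words and returns the set of w-word shingles.
// Assumption: a simple ASCII tokenizer; real text would need Unicode handling.
function shingles(text, w = 3) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  const result = new Set();
  for (let i = 0; i + w <= words.length; i += 1) {
    result.add(words.slice(i, i + w).join(' '));
  }
  return result;
}

// Jaccard resemblance of the two shingle sets:
// (shared shingles) / (distinct shingles overall), in the range 0..1.
function resemblance(a, b, w = 3) {
  const sa = shingles(a, w);
  const sb = shingles(b, w);
  if (sa.size === 0 && sb.size === 0) return 1; // two empty texts match
  let shared = 0;
  for (const s of sa) {
    if (sb.has(s)) shared += 1;
  }
  return shared / (sa.size + sb.size - shared);
}
```

A text that quotes part of another shares some shingles with it, so the score lands between 0 and 1 instead of flipping to a complete mismatch the way a file hash would.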
> > 
> > Let me know how this sounds.
> > 
> > Thanks,
> > Siddharth
> > 
> > 
> > On Tue, Apr 30, 2013 at 10:57 PM, Dan Mills <[email protected] 
> > (mailto:[email protected])> wrote:
> > > Hi Siddharth,
> > > 
> > > Awesome - this is definitely in the right direction, I think.
> > > 
> > > I just missed you on irc, I'll be on and off today, maybe I can catch you
> > > later/tonight.
> > > 
> > > On DOM changes: performance would be a concern - and it would need to be
> > > robust on client-side dynamic apps (which remain loaded and do a lot of
> > > JS/DOM work over time, instead of loading new pages on user action). But
> > > definitely an area to explore, and we should think about what APIs are
> > > more / less appropriate (e.g. we could listen for an event that the app
> > > is expected to emit on DOM changes, etc).
> > > 
> > > On text scanning: yes, agreed.
> > > 
> > > On fingerprinting: sounds good. We might need to make the web content api
> > > aware of fingerprinting for performance reasons in some situations, but
> > > your start is good. (By "performance reasons" I mean: perhaps we can do
> > > some fingerprinting client-side by dumping images into canvas and doing
> > > the math locally--that way we don't need to fetch the image server-side).
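
To make "doing the math locally" concrete, here is a sketch of a very simple image fingerprint, an average hash. This is a swapped-in technique chosen for brevity, not pHash's method and not anything the thread settled on. It assumes the image has already been reduced to a small array of grayscale values, for example by drawing it into an 8x8 canvas and reading the pixels back with getImageData.

```javascript
// `gray` is an array of grayscale values, e.g. 64 values from an 8x8
// thumbnail. Each pixel becomes one bit: '1' if brighter than the mean.
function averageHash(gray) {
  const mean = gray.reduce((sum, v) => sum + v, 0) / gray.length;
  return gray.map((v) => (v > mean ? '1' : '0')).join('');
}

// Number of positions at which two equal-length hashes differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) throw new Error('hash lengths differ');
  let distance = 0;
  for (let i = 0; i < a.length; i += 1) {
    if (a[i] !== b[i]) distance += 1;
  }
  return distance;
}
```

Two images would then be treated as likely matches when the Hamming distance between their hashes is small; the threshold is something a benchmark would have to decide.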
> > > 
> > > Also: it is likely that users would want to limit scanning to certain
> > > classes/kinds of DOM nodes (e.g. all nodes with class "article", etc).
> > > 
> > > Great start!
> > > 
> > > Dan
> > > 
> > > On Tuesday, April 30, 2013 at 5:52 AM, Siddharth Kothari wrote:
> > > 
> > > I built a basic WebContent library which can scan custom media and
> > > add a solid border if the OpenHome service thinks it is CC-licensed
> > > content. Readme and demo here -
> > > http://sids-aquarius.github.io/webcontent/.
> > > 
> > > The most glaring missing parts are -
> > > * It doesn't track changes in the DOM. It should ideally scan for
> > > content in the nodes that are changed.
> > > * It doesn't support scanning text yet.
> > > * The fingerprinting is currently an identity function, but the
> > > WebContent library has nothing to do with it.
> > > 
> > > A code-review would be helpful. It's a small lib (< 50 lines) -
> > > https://github.com/sids-aquarius/webcontent/blob/master/lib/webcontent.js.
> > > I would like to know if I am thinking in the correct direction.
> > > 
> > > I would also like to know about challenges I should anticipate over
> > > time. Imho, the WebContent project sounds a bit under-scoped. The only
> > > challenge I can think of is making this robust enough to work in a
> > > variety of settings.
> > > 
> > > Dan - Would you be on IRC sometime today? I wanted to chat about the
> > > fingerprinting library.
> > > 
> > > Thanks,
> > > Siddharth
> > > 
> > > On Mon, Apr 29, 2013 at 3:05 AM, Siddharth Kothari
> > > <[email protected] (mailto:[email protected])> wrote:
> > > 
> > > Hi Dan,
> > > 
> > > I have hosted my fork of Open Home here -
> > > http://106.187.50.124:50124/. Is the schema for the DB laid out? I had to
> > > create one and change one of the arguments in the getItem() and
> > > updateItem() calls, replacing HashKeyElement with the actual attribute
> > > name to get it working. Also, look out for my pull request (it
> > > contains minor fixes).
> > > 
> > > About the task -
> > > I will wait till we discuss more about the Fingerprinting library.
> > > Meanwhile, I will get started with the Web Content library.
> > > 1. For a start, I am planning to build/use a basic html parser that
> > > can fetch all the sources from the <img>/<video>/<audio> tags and send
> > > them over as a json object to Open Home server.
> > > 2. The Open Home server returns a json object indicating which of the
> > > contents are CC-licensed.
> > > 3. Inject some markup to indicate the contents that are CC-licensed.
> > > As far as fingerprinting is concerned, I will index SHA-1 hashes of a
> > > few static content files (or I can hash the files I retrieve from a
> > > user's Dropbox account).
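
Step 1 above could look roughly like this. It is a sketch only: collectMediaSources, buildScanRequest and the "sources" field are hypothetical, since the thread does not give the JSON schema the Open Home server expects.

```javascript
// `doc` is anything with querySelectorAll, such as the browser's document.
// Returns the distinct src values of the matched media elements.
// Note: only the element's own src attribute is read, so nested <source>
// children of <video>/<audio> are not covered by this sketch.
function collectMediaSources(doc, selector = 'img, video, audio') {
  const seen = new Set();
  for (const el of doc.querySelectorAll(selector)) {
    const src = el.getAttribute('src');
    if (src) seen.add(src);
  }
  return Array.from(seen);
}

// Hypothetical request body; the real Open Home schema is not given here.
function buildScanRequest(doc, selector) {
  return JSON.stringify({ sources: collectMediaSources(doc, selector) });
}
```

The selector argument also leaves room to narrow the scan to part of a page, for example 'img.article'.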
> > > 
> > > Let me know if you would like to amend anything in the above.
> > > 
> > > Thanks,
> > > Siddharth
> > > 
> > > On Fri, Apr 26, 2013 at 7:31 AM, Dan Mills <[email protected] 
> > > (mailto:[email protected])> wrote:
> > > 
> > > On Thursday, April 25, 2013 at 12:52 PM, Siddharth Kothari wrote:
> > > 
> > > Hi everyone,
> > > 
> > > 
> > > Hello!
> > > 
> > > I am interested in a couple of projects - CC Web Content API, and Media
> > > Fingerprinting Library. I wanted to see if I understand how these projects
> > > fit in the OpenHome project before starting some contributions.
> > > 
> > > The way I envision OpenHome is as a central system where CC-licensed
> > > content will be indexed by its hashes. The CC Web Content API could be
> > > used by sites that aggregate user content (say github, youtube,
> > > slideshare) to detect remixes of existing CC-licensed content. The
> > > Media Fingerprinting library helps in detecting duplicated content
> > > (it should also work when content is cropped, clipped, blurred or quoted
> > > in parts). Am I understanding this correctly?
> > > 
> > > 
> > > Roughly, yes. The DB needs to be laid out such that it can be queried from
> > > fingerprint alone (which is not like a MD5/SHA-1 hash). The Fingerprinting
> > > project should aim to catch cropped, distorted, resized, etc. files.
> > > 
> > > I find the Fingerprinting project fascinating, but delving more into the
> > > idea and looking at pHash.org (http://pHash.org), I realized it already
> > > implements fingerprinting for image, audio, and video content and provides
> > > this as a nice API - http://www.phash.org/docs/howto.html. Unless we find
> > > GPLv3 too restrictive, I can't think of a good reason not to use it.
> > > Perhaps pHash can be extended to support text and compound media types
> > > (ppt, pdf). But I think starting with pHash and supporting text using
> > > w-shingling can be a pretty good start for the fingerprinting library. I
> > > would like to hear more thoughts on this.
> > > 
> > > 
> > > I have heard mixed reviews of pHash. I think a first step in the project
> > > should be to come up with a set of tests and metrics, and try out pHash as
> > > well as other solutions.
> > > 
> > > The CC Web Content API project sounds appealing, since it is the glue that
> > > binds other parts, and perhaps crucial to the successful implementation of
> > > the OpenHome project. Imo, this could perhaps be meshed with the
> > > Fingerprinting project (if pHash is used as a base). Essentially, the
> > > current Fingerprinting task is reduced to exposing the pHash library via a
> > > nice API. And over time, pHash/Fingerprinting algorithms can be
> > > added/improved.
> > > 
> > > 
> > > Yes, if you'd like to focus on the Web content API, then you can abstract
> > > away the fingerprinting portion. Even straight-up SHA-1 would work for a
> > > demo of the Web content API (it wouldn't catch modified images, but it
> > > would catch the same file in other webpages).
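
A small sketch of the straight-up SHA-1 demo, assuming a Node.js environment and its built-in crypto module. createExactIndex and its in-memory Map are hypothetical and only stand in for the real DB.

```javascript
const crypto = require('crypto');

// Hex-encoded SHA-1 digest of a string or Buffer.
function sha1Hex(data) {
  return crypto.createHash('sha1').update(data).digest('hex');
}

// Hypothetical in-memory index: SHA-1 hex digest -> license label.
function createExactIndex() {
  const byHash = new Map();
  return {
    add(data, license) {
      byHash.set(sha1Hex(data), license);
    },
    // Returns the license for byte-identical content, or null otherwise.
    lookup(data) {
      return byHash.get(sha1Hex(data)) || null;
    },
  };
}
```

Changing even one byte gives a different digest, which is why this finds the same file on other pages but misses resized or recompressed copies.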
> > > 
> > > Let me know if I am making sense. Sorry if it's difficult to follow, we
> > > can carry this conversation on IRC. My nick is sids_aquarius.
> > > 
> > > 
> > > Sounds good! I'm traveling until Sunday, but will try to drop by when
> > > possible.
> > > 
> > > Dan

_______________________________________________
cc-devel mailing list
[email protected]
http://lists.ibiblio.org/mailman/listinfo/cc-devel
