Hi Siddharth,

Awesome - this is definitely in the right direction, I think.
I just missed you on IRC. I'll be on and off today; maybe I can catch you
later/tonight.

On DOM changes: performance would be a concern, and it would need to be
robust on client-side dynamic apps (which remain loaded and do a lot of
JS/DOM work over time, instead of loading new pages on user action). But it
is definitely an area to explore, and we should think about which APIs are
more / less appropriate (e.g. we could listen for an event that the app is
expected to emit on DOM changes, etc).

On text scanning: yes, agreed.

On fingerprinting: sounds good. We might need to make the web content API
aware of fingerprinting for performance reasons in some situations, but your
start is good. (By "performance reasons" I mean: perhaps we can do some
fingerprinting client-side by dumping images into canvas and doing the math
locally, so that we don't need to fetch the image server-side.)

Also: users will likely want to limit scanning to certain classes/kinds of
DOM nodes (e.g. all nodes with class "article", etc).

Great start!

Dan

On Tuesday, April 30, 2013 at 5:52 AM, Siddharth Kothari wrote:

> I built a basic WebContent library which can scan custom media and
> add a solid border if the OpenHome service thinks it's CC-licensed
> content. Readme and demo here -
> http://sids-aquarius.github.io/webcontent/.
>
> The most glaring missing parts are -
> * It doesn't track changes in the DOM. It should ideally scan for
>   content in the nodes that are changed.
> * Add support for scanning text.
> * The fingerprinting is an identity function currently, but the
>   WebContent library has nothing to do with it.
>
> A code-review would be helpful. It's a small lib (< 50 lines) -
> https://github.com/sids-aquarius/webcontent/blob/master/lib/webcontent.js.
> I would like to know if I am thinking in the correct direction.
>
> I would also like to know about challenges I should anticipate over
> time. Imho, the WebContent project sounds a bit under-scoped. The only
> challenge I can think of is making this robust enough to work in a
> variety of settings.
>
> Dan - Would you be on IRC sometime today? I wanted to chat about the
> fingerprinting library.
>
> Thanks,
> Siddharth
>
> On Mon, Apr 29, 2013 at 3:05 AM, Siddharth Kothari wrote:
> > Hi Dan,
> >
> > I have hosted my fork of Open Home here -
> > http://106.187.50.124:50124/. Is the schema for the DB laid out? I had
> > to create one and change one of the arguments in the getItem() and
> > updateItem() calls, replacing HashKeyElement with the actual attribute
> > name to get it working. Also, look out for my pull request (it
> > contains minor fixes).
> >
> > About the task -
> > I will wait till we discuss more about the Fingerprinting library.
> > Meanwhile, I will get started with the Web Content library.
> > 1. For a start, I am planning to build/use a basic html parser that
> >    can fetch all the sources from the <img>/<video>/<audio> tags and
> >    send them over as a json object to the Open Home server.
> > 2. The Open Home server returns a json object indicating which of the
> >    contents are CC-licensed.
> > 3. Inject some markup to indicate the contents that are CC-licensed.
> > As far as fingerprinting is concerned, I will index SHA-1 hashes of
> > a few static content files (or I can hash the files I retrieve from a
> > user's Dropbox account).
> >
> > Let me know if you would like to make some amends in the above.
> >
> > Thanks,
> > Siddharth
> >
> > On Fri, Apr 26, 2013 at 7:31 AM, Dan Mills wrote:
> > > On Thursday, April 25, 2013 at 12:52 PM, Siddharth Kothari wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > > Hello!
> > >
> > > I am interested in a couple of projects - CC Web Content API, and
> > > Media Fingerprinting Library. I wanted to see if I understand how
> > > these projects fit in the OpenHome project before starting some
> > > contributions.
> > >
> > > The way I envision OpenHome is as a central system where the CC
> > > licensed contents will be indexed by their hashes. The CC Web Content
> > > API could be used by sites that aggregate user content, let's say:
> > > github, youtube, slideshare, to find out remixing of an existing CC
> > > licensed content. The Media Fingerprinting library helps in
> > > determining duplication of content (it should also work when a
> > > content is cropped, clipped, blurred or quoted in parts). Am I
> > > understanding this correctly?
> > >
> > >
> > > Roughly, yes. The DB needs to be laid out such that it can be queried
> > > from fingerprint alone (which is not like a MD5/SHA-1 hash). The
> > > Fingerprinting project should aim to catch cropped, distorted,
> > > resized, etc. files.
> > >
> > > I find the Fingerprinting project fascinating, but delving more into
> > > the idea and looking at pHash.org (http://pHash.org), I realized it
> > > already implements fingerprinting for image, audio, and video content
> > > and provides this as a nice API -
> > > http://www.phash.org/docs/howto.html. Unless we find GPLv3 too
> > > restrictive, I can't think of a good reason to not use this. Perhaps,
> > > pHash can be extended to support text and compound media types (ppt,
> > > pdf). But I think starting with pHash and supporting text using
> > > w-shingling can be a pretty good start for the fingerprinting
> > > library. I would like to hear more thoughts on this.
> > >
> > >
> > > I have heard mixed reviews of pHash. I think a first step in the
> > > project should be to come up with a set of tests and metrics, and try
> > > out pHash as well as other solutions.
> > >
> > > The CC Web Content API project sounds appealing, since it is the glue
> > > that binds other parts, and perhaps crucial to the successful
> > > implementation of the OpenHome project. Imo, this could perhaps be
> > > meshed with the Fingerprinting project (if pHash is used as a base).
> > > Essentially, the current Fingerprinting task is reduced to exposing
> > > the pHash library via a nice API. And over time,
> > > pHash/Fingerprinting algorithms can be added/improved.
> > >
> > >
> > > Yes, if you'd like to focus on the Web content API, then you can
> > > abstract away the fingerprinting portion. Even straight-up SHA-1
> > > would work for a demo of the Web content API (it wouldn't catch
> > > modified images, but it would catch the same file in other webpages).
> > >
> > > Let me know if I am making sense. Sorry if it's difficult to follow,
> > > we can carry this conversation on IRC. My nick is sids_aquarius.
> > >
> > >
> > > Sounds good! I'm traveling until Sunday, but will try to drop by when
> > > possible.
> > >
> > > Dan
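
For reference, the w-shingling idea mentioned in the thread for text
fingerprinting can be sketched roughly as below. This is only an
illustration, not code from the WebContent or OpenHome repositories: the
function names `shingles` and `resemblance`, the default window of 3 words,
and the whitespace/lowercase tokenization are all assumptions made for the
example.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word windows) in text.

    Tokenization here is a simplifying assumption: lowercase, split on
    whitespace. Texts shorter than w words yield one short shingle.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Jaccard similarity of the two texts' shingle sets, in [0, 1]."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    quoted = "he said the quick brown fox jumps over the fence"
    print(resemblance(original, quoted))
```

Unlike a SHA-1 of the whole text, the score degrades gradually when a
passage is quoted in part or lightly edited, which is the property the
thread asks of a fingerprint.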
_______________________________________________
cc-devel mailing list
[email protected]
http://lists.ibiblio.org/mailman/listinfo/cc-devel
