Re: [whatwg] How to handle multitrack media resources in HTML
On Sun, 10 Apr 2011, Silvia Pfeiffer wrote: On Fri, Apr 8, 2011 at 4:54 PM, Ian Hickson i...@hixie.ch wrote: What is a main media resource? e.g. consider youtubedoubler.com; what is the main resource? Or similarly, when watching the director's commentary track on a movie, is the commentary the main track, or the movie?

I don't think youtubedoubler.com is the main use case here. In the youtubedoubler.com use case, you have two independent videos that make sense by themselves, but are only coupled together by their timeline. The cases that I listed above, audio descriptions, sign language video, and dubbed audio tracks, make no sense by themselves. They are produced with a clear reference to one specific video and its details and could be delivered either as in-band tracks or as external files. From a developer and user point of view - and in analogy to the track element - it makes no sense to regard them as independent media resources. They all refer to a main resource - the original video.

I don't know which is the main use case; I wouldn't be surprised if sites like youtubedoubler.com had as many if not more viewers than those with sign language videos. In any case, we have to handle both. My point was just that there isn't a well-defined main media resource.

However, there are more similarities between audio, video and text tracks than one might think. For example, it is possible to want to have multiple video tracks and multiple text tracks rendered on top of a single video rendering area, and they may all be explicitly positioned just like positioned captions and they may all need to avoid each other. So, it could make sense to include them all in a single rendering approach.

One could say the same about div. It seems like a bit of a superficial similarity. Similarities between audio and video tracks and text tracks are only really interesting here if they're not also similarities that apply to other even more unrelated things.

Another example is that you may have an audio track whose captions differ from the captions of a related video element. Since the audio track has no visual display, its captions are not rendered, but the video's captions are rendered. Now, how are you going to make its captions available to the video's display area when the linked audio track is activated?

Do you have a concrete example of this? I'm not sure I really follow. Some things will inherently be harder by taking the approach of separate video and audio elements rather than the track approach. I don't really see how this particular example relates to the issue of audio/video tracks being treated similarly or differently than text tracks. I agree that the described behaviour might need some tweaks to handle properly, but I don't think those tweaks would involve making the handling of audio/video tracks and text tracks more similar to each other.

On Mon, 28 Mar 2011, Silvia Pfeiffer wrote: We haven't allowed caption tracks to start with a different startTimeOffset than the video, nor do we allow giving them a different playbackRate from the video.

It's relatively easy to do it for text tracks: you just take a text track and recreate it with different timings (something that can be done in a few lines of JavaScript given the API we expose). So there's no need for it to be explicit.

For synchronising video and audio, we should expose multiple tracks starting at different offsets because it is easy to achieve yet provides numerous opportunities for authors.
For example, it's not uncommon to want to compare two movies which have similar moments; showing such similarities would require either video editing or, if we allowed offsets, could be done merely by pointing to two movie files with appropriate offsets.

It is not any more difficult to change the startTime of a video element in JavaScript than it is to change the start time of a track resource. Also, I believe that your use case can more easily be satisfied with temporal media fragment URIs, which give not just the offset but the section from start to end that people are comparing.

I don't follow. However, note that at the moment the MediaController feature doesn't support arbitrary offsets of audio/video tracks.

Tracks in a multitrack resource (no matter if in-band or external files) are rather tightly authored to cover the exact same timeline, in my experience.

Sure. But it would be silly to only support one use case when with minimal effort we could support a vastly greater number of use cases, including many we have not yet considered. This is one of those situations where not supporting something actually requires more API complexity than supporting it. We are rarely faced with such an opportunity.

I don't want to solve use cases that we haven't thought about yet. I don't want to add
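[To make the "few lines of JavaScript" claim above concrete, here is a hedged sketch that copies an existing text track into a new one with shifted cue timings. It assumes addTextTrack() and the cue constructor as they eventually shipped (VTTCue, string-valued track modes); the shiftTrack helper name is purely illustrative, and the source track is assumed to be loaded (mode not "disabled") so that its cue list is available.]

{{{
// Recreate a text track with all cue timings shifted by `offset` seconds.
function shiftTrack(video, track, offset) {
  var shifted = video.addTextTrack(track.kind, track.label, track.language);
  for (var i = 0; i < track.cues.length; i++) {
    var cue = track.cues[i];
    shifted.addCue(new VTTCue(cue.startTime + offset,
                              cue.endTime + offset,
                              cue.text));
  }
  track.mode = "disabled";  // stop rendering the original
  shifted.mode = "showing"; // render the shifted copy instead
  return shifted;
}
}}}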
Re: [whatwg] How to handle multitrack media resources in HTML
Silvia Pfeiffer and Ian Hickson exchanged: Yes, and the same (lack of definition) goes for JavaScript manipulation. It'd be great if we had the tools for manipulating video and audio tracks (extract/insert frames, move audio snippets around). It would make A/V editing - or more creative uses - really easy in HTML5. That's a use case we should investigate in due course, but I think it's probably a bit early to go there. When we do get around to it, it would be nice, as well, to be able to create sounds (as from wave forms) from scratch, in the browser. Cheers David
Re: [whatwg] How to handle multitrack media resources in HTML
That actually was a quote from Jeroen and Ian, not me. :-) S. On Thu, Apr 21, 2011 at 10:31 AM, David Dailey ddai...@zoominternet.net wrote: Silvia Pfeiffer and Ian Hickson exchanged: Yes, and the same (lack of definition) goes for JavaScript manipulation. It'd be great if we had the tools for manipulating video and audio tracks (extract/insert frames, move audio snippets around). It would make A/V editing - or more creative uses - really easy in HTML5. That's a use case we should investigate in due course, but I think it's probably a bit early to go there. When we do get around to it, it would be nice, as well, to be able to create sounds (as from wave forms) from scratch, in the browser. Cheers David
Re: [whatwg] How to handle multitrack media resources in HTML
On Thu, Apr 21, 2011 at 12:31 PM, David Dailey ddai...@zoominternet.net wrote: When we do get around to it, it would be nice, as well, to be able to create sounds (as from wave forms) from scratch, in the browser. There's experimental work being done on this. For example: https://wiki.mozilla.org/Audio_Data_API http://chromium.googlecode.com/svn/trunk/samples/audio/specification/specification.html Chris. -- http://www.bluishcoder.co.nz
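[For a sense of what those experiments enable, here is a minimal sketch of creating a sound from a computed waveform, written against what later shipped from the Chromium specification linked above as the Web Audio API; this is an assumption about the eventual API, not what either experimental interface looked like at the time (Mozilla's Audio Data API used a different, mozWriteAudio-style interface).]

{{{
// Synthesize one second of a 440 Hz sine wave and play it.
var ctx = new AudioContext();
var rate = ctx.sampleRate;
var buffer = ctx.createBuffer(1, rate, rate); // 1 channel, 1 second of samples
var samples = buffer.getChannelData(0);
for (var i = 0; i < samples.length; i++) {
  samples[i] = Math.sin(2 * Math.PI * 440 * i / rate);
}
var source = ctx.createBufferSource();
source.buffer = buffer;
source.connect(ctx.destination);
source.start(0);
}}}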
Re: [whatwg] How to handle multitrack media resources in HTML
Hey Ian, all, Sorry for the slow response ..

There's a big difference between text tracks, audio tracks, and video tracks. While it makes sense, for instance, to have text tracks enabled but not showing, it makes no sense to do that with audio tracks.

Audio and video tracks require more data, hence it's less desirable to allow them to be enabled but not showing. If data wasn't an issue, it would be great if this were possible; it'd allow instant switching between multiple audio dubs, or camera angles.

I think we mean different things by active here. The hidden state for a text track is one where the UA isn't rendering the track but the UA is still firing all the events and so forth. I don't understand what the parallel would be for a video or audio track.

The parallel would be fetching / decoding the tracks but not showing them to the display (video) or speakers (audio). I agree that, implementation wise, this is much less useful than having an active but hidden state for text tracks. However, some people might want to manipulate hidden tracks with the audio data API, much like hidden text tracks can be manipulated with JavaScript.

Text tracks are discontinuous units of potentially overlapping textual data with position information and other metadata that can be styled with CSS and can be mutated from script. Audio and video tracks are continuous streams of immutable media data.

Video and audio tracks do not necessarily produce continuous output - it is perfectly legal to have gaps in either, e.g. segments that do not render. Both audio and video tracks can have metadata that affect their rendering: an audio track has volume metadata that attenuates its contribution to the overall mix-down, and a video track has a matrix that controls its rendering. The only thing preventing us from styling a video track with CSS is the lack of definition.

Yes, and the same (lack of definition) goes for JavaScript manipulation. It'd be great if we had the tools for manipulating video and audio tracks (extract/insert frames, move audio snippets around). It would make A/V editing - or more creative uses - really easy in HTML5. Kind regards, Jeroen
Re: [whatwg] How to handle multitrack media resources in HTML
On Apr 11, 2011, at 5:26 PM, Ian Hickson wrote: On Mon, 11 Apr 2011, Jeroen Wijering wrote: On Apr 8, 2011, at 8:54 AM, Ian Hickson wrote: There's a big difference between text tracks, audio tracks, and video tracks. While it makes sense, for instance, to have text tracks enabled but not showing, it makes no sense to do that with audio tracks.

Audio and video tracks require more data, hence it's less desirable to allow them to be enabled but not showing. If data wasn't an issue, it would be great if this were possible; it'd allow instant switching between multiple audio dubs, or camera angles.

I think we mean different things by active here. The hidden state for a text track is one where the UA isn't rendering the track but the UA is still firing all the events and so forth. I don't understand what the parallel would be for a video or audio track. Text tracks are discontinuous units of potentially overlapping textual data with position information and other metadata that can be styled with CSS and can be mutated from script. Audio and video tracks are continuous streams of immutable media data.

Video and audio tracks do not necessarily produce continuous output - it is perfectly legal to have gaps in either, e.g. segments that do not render. Both audio and video tracks can have metadata that affect their rendering: an audio track has volume metadata that attenuates its contribution to the overall mix-down, and a video track has a matrix that controls its rendering. The only thing preventing us from styling a video track with CSS is the lack of definition.

I don't really see what they have in common other than us using the word track to refer to both of them, and that's mostly just an artefact of the language.

Track is more than an artifact of the language; it is the commonly used term in the digital media industry for an independent stream of media samples in a container file. eric
Re: [whatwg] How to handle multitrack media resources in HTML
On Apr 8, 2011, at 8:54 AM, Ian Hickson wrote: *) Discoverability is indeed an issue, but this can be fixed by defining a common track API for signalling and enabling/disabling tracks:

{{{
interface Track {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
  attribute unsigned short mode;
};

interface HTMLMediaElement : HTMLElement {
  [...]
  readonly attribute Track[] tracks;
};
}}}

There's a big difference between text tracks, audio tracks, and video tracks. While it makes sense, for instance, to have text tracks enabled but not showing, it makes no sense to do that with audio tracks.

Audio and video tracks require more data, hence it's less desirable to allow them to be enabled but not showing. If data wasn't an issue, it would be great if this were possible; it'd allow instant switching between multiple audio dubs, or camera angles. In terms of the data model, I don't believe there are major differences between audio, text or video tracks. They all exist at the same level - one down from the main presentation layer. Toggling versus layering can be an option for all three kinds of tracks.

For example, multiple video tracks can be mixed together in one media element's display. Think about PiP, perspective side by side (Stevenote style) or a 3D grid (group chat, like Skype). Perhaps this should be supported instead of relying upon multiple video elements, manual positioning and APIs to knit things together. One would lose in terms of flexibility, but gain in terms of API complexity (it's still one video) and ease of implementation for HTML developers. - Jeroen
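[To make the proposal concrete, here is a hedged sketch of how a page might use the proposed Track interface above to switch camera angles. The Track interface and the tracks attribute are the proposal itself, not a shipping API, and the selectAngle helper and the track label are illustrative.]

{{{
// Show the video track whose label matches; turn the other video tracks off.
function selectAngle(video, label) {
  for (var i = 0; i < video.tracks.length; i++) {
    var track = video.tracks[i];
    if (track.kind != "video") continue;
    track.mode = (track.label == label) ? Track.SHOWING : Track.OFF;
  }
}
selectAngle(document.querySelector("video"), "wide angle");
}}}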
Re: [whatwg] How to handle multitrack media resources in HTML
On Apr 8, 2011, at 8:54 AM, Ian Hickson wrote: but should be linked to the main media resource through markup. What is a main media resource? e.g. consider youtubedoubler.com; what is the main resource? Or similarly, when watching the director's commentary track on a movie, is the commentary the main track, or the movie? In systems like MPEG TS and DASH, there's the notion of the system clock. This is the overarching resource to which all audio, meta, text and video tracks are synced. The clock has no video frames or audio samples by itself; it just acts as the wardrobe for all tracks. Perhaps it's worth investigating if this would be useful for media elements? - Jeroen
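[For comparison, the closest analogue in the HTML multitrack proposal under discussion is the MediaController, which likewise owns a timeline without carrying any samples of its own. A hedged sketch, assuming the MediaController constructor and the settable controller attribute as specified at the time; implementations were scarce.]

{{{
// One clock, two slaved media elements.
var controller = new MediaController();
document.getElementById("movie").controller = controller;      // video track
document.getElementById("commentary").controller = controller; // audio track
controller.play(); // both elements advance against the shared timeline
}}}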
Re: [whatwg] How to handle multitrack media resources in HTML
On Apr 10, 2011, at 12:36 PM, Mark Watson wrote: In the case of in-band tracks it may still be the case that they are retrieved independently over the network. This could happen two ways: - some file formats contain headers which enable precise navigation of the file, for example using HTTP byte ranges, so that the tracks could be retrieved independently. mp4 files would be an example. I don't know that anyone does this, though.

QuickTime has supported tracks with external media samples in .mov files for more than 15 years. This type of file is most commonly used during editing, but such files are occasionally found on the net.

- in the case of adaptive streaming based on a manifest, the different tracks may be in different files, even though they appear as in-band tracks from an HTML perspective. In these cases it *might* make sense to expose separate buffer and network states for the different in-band tracks in just the same way as out-of-band tracks.

I strongly disagree. Having different track APIs for different container formats will be extremely confusing for developers, and I don't think it will add anything. A UA that chooses to support non-self-contained media files should account for all samples when reporting readyState and networkState. eric
Re: [whatwg] How to handle multitrack media resources in HTML
Sent from my iPhone

On Apr 11, 2011, at 8:55 AM, Eric Carlson eric.carl...@apple.com wrote: On Apr 10, 2011, at 12:36 PM, Mark Watson wrote: In the case of in-band tracks it may still be the case that they are retrieved independently over the network. This could happen two ways: - some file formats contain headers which enable precise navigation of the file, for example using HTTP byte ranges, so that the tracks could be retrieved independently. mp4 files would be an example. I don't know that anyone does this, though.

QuickTime has supported tracks with external media samples in .mov files for more than 15 years. This type of file is most commonly used during editing, but such files are occasionally found on the net.

I was also thinking of a client which downloads the MOOV box and then uses the tables there to construct byte range requests for specific tracks.

- in the case of adaptive streaming based on a manifest, the different tracks may be in different files, even though they appear as in-band tracks from an HTML perspective. In these cases it *might* make sense to expose separate buffer and network states for the different in-band tracks in just the same way as out-of-band tracks.

I strongly disagree. Having different track APIs for different container formats will be extremely confusing for developers, and I don't think it will add anything. A UA that chooses to support non-self-contained media files should account for all samples when reporting readyState and networkState.

Fair enough. I did say 'might' :-) eric
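[As a hedged sketch of the client Mark describes - not something UAs do for you - a page could fetch the movie header with an HTTP byte-range request and then issue further range requests for just the tracks it wants. The 64 KB range is a placeholder, and parseMoov() is a hypothetical parser for the mp4 sample tables.]

{{{
// Fetch the first 64 KB, assumed here to cover the moov box.
var xhr = new XMLHttpRequest();
xhr.open("GET", "movie.mp4", true);
xhr.setRequestHeader("Range", "bytes=0-65535");
xhr.responseType = "arraybuffer";
xhr.onload = function () {
  if (xhr.status == 206) {   // 206 Partial Content
    parseMoov(xhr.response); // hypothetical: read the track tables, then
  }                          // request each wanted track's byte ranges
};
xhr.send();
}}}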
Re: [whatwg] How to handle multitrack media resources in HTML
On Fri, 8 Apr 2011, Jer Noble wrote: On Apr 7, 2011, at 11:54 PM, Ian Hickson wrote: The distinction between a master media element and a master media controller is, in my mind, mostly a distinction without a difference. However, a welcome addition to the media controller would be convenience APIs for the above properties (as well as playbackState, networkState, seekable, and buffered).

I'm not sure what networkState would mean in this context. playbackState, assuming you mean 'paused', is already exposed.

Sorry, by playbackState, I meant readyState. And I was suggesting that, much in the same way that you've provided .buffered and .seekable properties which expose the intersection of the slaved media elements' corresponding ranges, that a readyState property could similarly reflect the readyState values of all the slaved media elements. In this case, the MediaController's hypothetical readyState wouldn't flip to HAVE_ENOUGH_DATA until all the constituent media elements' ready states reached at least the same value.

So basically it would return the lowest possible value amongst the slaved elements? I guess we could expose such a convenience accessor, but what's the use case? It seems easy enough to implement manually in JS, so unless there's a compelling case, I'd be reluctant to add it.

Of course, this would imply that the load events fired by a media element (e.g. loadedmetadata, canplaythrough) were also fired by the MediaController, and I would support this change as well.

I don't see why it would imply that, but certainly we could add events like that to the controller. Again though, what's the use case?

Again, this would be just a convenience for authors, as this information is already available in other forms and could be relatively easily calculated on-the-fly in scripts. But UAs are likely going to have to do these calculations anyway to support things like autoplay, so adding explicit support for them in API form would not (imho) be unduly burdensome.

Autoplay is handled without having to do these calculations, as far as I can tell. I don't see any reason why the UA would need to do these calculations actually. If there are compelling use cases, though, I'm happy to add such accessors.

On Fri, 8 Apr 2011, Eric Winkelman wrote: On Friday, April 08, 2011, Ian Hickson wrote: On Thu, 17 Feb 2011, Eric Winkelman wrote: MPEG transport streams, as used for commercial TV, will often contain multiple types of metadata: content advisories, ad insertion opportunities, interactive TV application triggers, etc. If we were getting this information out-of-band we would, as you suggest, know how to deal with it. We would use multiple @kind=metadata tracks, with the correct handler associated with each track. In our case, however, this information is all coming in-band. There is information within the MPEG transport stream that identifies the types of metadata being carried. This lets the video player know, for example, that the stream has a particular track with application triggers, and another one with content advisories. To be consistent with the out-of-band tracks, we envision the player creating separate TimedTrack elements for each type of metadata, and adding the associated data as cues. But there isn't a clear way for the player to indicate the type of metadata it's putting into each of these TimedTrack cues. Which brings us to the mime types.
I have an event handler on the video tag that fires when the player creates a new metadata track, and this handler tries to figure out what to do with the track. Without a type on the track, I have to set another handler on the track that fires when the player creates a cue, and tries to figure out what to do from the cue. As there is no type on the cue either, I have to examine the cue location/text to see if it contains metadata I'm able to handle. This all works, but it requires event handlers on tracks that may have no interest to the application. On the player side, it depends on the player tagging the metadata in a consistent ad-hoc way, as well as requiring the player to create separate metadata tracks. (We also considered starting the cue's text with a mime type, but this has the same basic issues.) This is an interesting problem. What is the way that the MPEG streams identify these various metadata streams? Is it a MIME type? Some other identifier? Is this identifier separate from the track's label, or is it the track's label? The streams contain a Program Map Table (PMT) which contains a list of tuples (program id (PID) and a standard numeric type) for the program's tracks. This is how the user agent knows about this metadata and what is contained in it. We're envisioning that the combination of transport, e.g. MPEG-2 TS, and
Re: [whatwg] How to handle multitrack media resources in HTML
On Apr 11, 2011, at 5:26 PM, Ian Hickson wrote: On Fri, 8 Apr 2011, Jer Noble wrote: Sorry, by playbackState, I meant readyState. And I was suggesting that, much in the same way that you've provided .buffered and .seekable properties which expose the intersection of the slaved media elements' corresponding ranges, that a readyState property could similarly reflect the readyState values of all the slaved media elements. In this case, the MediaController's hypothetical readyState wouldn't flip to HAVE_ENOUGH_DATA until all the constituent media elements' ready states reached at least the same value.

So basically it would return the lowest possible value amongst the slaved elements? I guess we could expose such a convenience accessor, but what's the use case? It seems easy enough to implement manually in JS, so unless there's a compelling case, I'd be reluctant to add it.

Yes, this would be just a convenience, as I tried to make clear below. So I don't want to seem like I'm pushing this too hard. But since you asked...

Of course, this would imply that the load events fired by a media element (e.g. loadedmetadata, canplaythrough) were also fired by the MediaController, and I would support this change as well.

I don't see why it would imply that, but certainly we could add events like that to the controller. Again though, what's the use case?

The use case for the events is the same one as for the convenience property: without a convenience event, authors would have to add event listeners to every slave media element. So by imply, I simply meant that if the use case for the first was compelling enough to warrant new API, the second would be warranted as well.

Let's say, for example, an author wants to change the color of a play button when the media in a media group all reaches the HAVE_ENOUGH_DATA readyState.

Current API:

function init() {
  var mediaGroupElements = document.querySelectorAll("*[mediaGroup=group1]");
  for (var i = 0; i < mediaGroupElements.length; ++i)
    mediaGroupElements.item(i).addEventListener('canplaythrough', readyStateChangeListener, false);
}

function readyStateChangeListener(e) {
  var mediaGroupElements = document.querySelectorAll("*[mediaGroup=group1]");
  var ready = mediaGroupElements.length > 0;
  for (var i = 0; i < mediaGroupElements.length; ++i)
    if (mediaGroupElements.item(i).readyState < HAVE_ENOUGH_DATA)
      ready = false;
  if (ready)
    changePlayButtonColor();
}

Convenience API:

function init() {
  var controller = document.querySelector("*[mediaGroup=group1]").controller;
  controller.addEventListener('canplaythrough', changePlayButtonColor, true);
}

I think the convenience benefits are pretty obvious. Maybe not compelling enough, however. :)

Again, this would be just a convenience for authors, as this information is already available in other forms and could be relatively easily calculated on-the-fly in scripts. But UAs are likely going to have to do these calculations anyway to support things like autoplay, so adding explicit support for them in API form would not (imho) be unduly burdensome.

Autoplay is handled without having to do these calculations, as far as I can tell. I don't see any reason why the UA would need to do these calculations actually. If there are compelling use cases, though, I'm happy to add such accessors.

Well, how exactly is autoplay handled in a media group? Does the entire media group start playing when the first media element in a group with its autoplay attribute set reaches HAVE_ENOUGH_DATA? -Jer Jer Noble jer.no...@apple.com
Re: [whatwg] How to handle multitrack media resources in HTML
On Fri, Apr 8, 2011 at 4:54 PM, Ian Hickson i...@hixie.ch wrote: On Thu, 10 Feb 2011, Silvia Pfeiffer wrote: One particular issue that hasn't had much discussion here yet is the issue of how to deal with multitrack media resources or media resources that have associated synchronized audio and video resources. I'm concretely referring to such things as audio descriptions, sign language video, and dubbed audio tracks. We require an API that can expose such extra tracks to the user and to JavaScript. This should be independent of whether the tracks are actually inside the media resource or are given as separate resources,

I think there's a big difference between multiple tracks inside one resource and multiple tracks spread amongst multiple resources: in the former case, one would need a single set of network state APIs (load algorithm, ready state, network state, dimensions, buffering state, etc), whereas in the second case we'd need N sets of these APIs, one for each media resource.

Given that the current mechanism for exposing the load state of a media resource is a media element (video, audio), I think it makes sense to reuse these elements for loading each media resource even in a multitrack scenario. Thus I do not necessarily agree that exposing extra tracks should be done in a way that is independent of whether the tracks are in-band or out-of-band.

but should be linked to the main media resource through markup.

What is a main media resource? e.g. consider youtubedoubler.com; what is the main resource? Or similarly, when watching the director's commentary track on a movie, is the commentary the main track, or the movie?

I am bringing this up now because solutions may have an influence on the inner workings of TimedTrack and the track element, so before we have any implementations of track, we should be very certain that we are happy with the way in which it works - in particular that track continues to stay an empty element.

I don't really see why this would be related to text tracks. Those have their own status framework, and interact directly with a media element. Looking again at the youtubedoubler.com example, one could envisage both sides having text tracks. They wouldn't be joint tracks.

I don't think youtubedoubler.com is the main use case here. In the youtubedoubler.com use case, you have two independent videos that make sense by themselves, but are only coupled together by their timeline. The cases that I listed above, audio descriptions, sign language video, and dubbed audio tracks, make no sense by themselves. They are produced with a clear reference to one specific video and its details and could be delivered either as in-band tracks or as external files. From a developer and user point of view - and in analogy to the track element - it makes no sense to regard them as independent media resources. They all refer to a main resource - the original video.

On Mon, 14 Feb 2011, Jeroen Wijering wrote: In terms of solutions, I lean much towards the manifest approach. The other approaches are options that each add more elements to HTML5, which: * Won't work for situations outside of HTML5. * Postpone, and perhaps clash with, the addition of manifests.

Manifests, and indeed any solution that relies on a single media element, would make it very difficult to render multiple video tracks independently (e.g. side by side vs picture-in-picture). That's not to say that manifests shouldn't work, but I think we'd need another solution as well.
*) The CSS styling issue can be fixed by making a conceptual change to CSS and text tracks. Instead of styling text tracks, a single text rendering area for each video element can be exposed and styled. Any text tracks that are enabled push data into it, which is automatically styled according to the video.textStyle/etc rules.

This wouldn't work well with positioned captions.

*) Discoverability is indeed an issue, but this can be fixed by defining a common track API for signalling and enabling/disabling tracks:

{{{
interface Track {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
  attribute unsigned short mode;
};

interface HTMLMediaElement : HTMLElement {
  [...]
  readonly attribute Track[] tracks;
};
}}}

There's a big difference between text tracks, audio tracks, and video tracks. While it makes sense, for instance, to have text tracks enabled but not showing, it makes no sense to do that with audio tracks. Similarly, video tracks need their own display area, but text tracks need a video track's display area. A single video area can display one video (multiple overlapping videos being achieved by multiple playback areas), but multiple audio and text tracks can be mixed together without any
Re: [whatwg] How to handle multitrack media resources in HTML
On Apr 7, 2011, at 11:54 PM, Ian Hickson wrote: On Thu, 10 Feb 2011, Silvia Pfeiffer wrote: One particular issue that hasn't had much discussion here yet is the issue of how to deal with multitrack media resources or media resources that have associated synchronized audio and video resources. I'm concretely referring to such things as audio descriptions, sign language video, and dubbed audio tracks. We require an API that can expose such extra tracks to the user and to JavaScript. This should be independent of whether the tracks are actually inside the media resource or are given as separate resources,

I think there's a big difference between multiple tracks inside one resource and multiple tracks spread amongst multiple resources: in the former case, one would need a single set of network state APIs (load algorithm, ready state, network state, dimensions, buffering state, etc), whereas in the second case we'd need N sets of these APIs, one for each media resource. Given that the current mechanism for exposing the load state of a media resource is a media element (video, audio), I think it makes sense to reuse these elements for loading each media resource even in a multitrack scenario. Thus I do not necessarily agree that exposing extra tracks should be done in a way that is independent of whether the tracks are in-band or out-of-band.

In the case of in-band tracks it may still be the case that they are retrieved independently over the network. This could happen two ways: - some file formats contain headers which enable precise navigation of the file, for example using HTTP byte ranges, so that the tracks could be retrieved independently. mp4 files would be an example. I don't know that anyone does this, though. - in the case of adaptive streaming based on a manifest, the different tracks may be in different files, even though they appear as in-band tracks from an HTML perspective. In these cases it *might* make sense to expose separate buffer and network states for the different in-band tracks in just the same way as out-of-band tracks.

In fact the distinction between in-band and out-of-band tracks is mainly how you discover them: out-of-band tracks the author is assumed to know about by some means of their own; in-band tracks can be discovered by loading the metadata part of a single initial resource. ...Mark
Re: [whatwg] How to handle multitrack media resources in HTML
On Thu, 10 Feb 2011, Silvia Pfeiffer wrote: One particular issue that hasn't had much discussion here yet is the issue of how to deal with multitrack media resources or media resources that have associated synchronized audio and video resources. I'm concretely referring to such things as audio descriptions, sign language video, and dubbed audio tracks. We require an API that can expose such extra tracks to the user and to JavaScript. This should be independent of whether the tracks are actually inside the media resource or are given as separate resources,

I think there's a big difference between multiple tracks inside one resource and multiple tracks spread amongst multiple resources: in the former case, one would need a single set of network state APIs (load algorithm, ready state, network state, dimensions, buffering state, etc), whereas in the second case we'd need N sets of these APIs, one for each media resource. Given that the current mechanism for exposing the load state of a media resource is a media element (video, audio), I think it makes sense to reuse these elements for loading each media resource even in a multitrack scenario. Thus I do not necessarily agree that exposing extra tracks should be done in a way that is independent of whether the tracks are in-band or out-of-band.

but should be linked to the main media resource through markup.

What is a main media resource? e.g. consider youtubedoubler.com; what is the main resource? Or similarly, when watching the director's commentary track on a movie, is the commentary the main track, or the movie?

I am bringing this up now because solutions may have an influence on the inner workings of TimedTrack and the track element, so before we have any implementations of track, we should be very certain that we are happy with the way in which it works - in particular that track continues to stay an empty element.

I don't really see why this would be related to text tracks. Those have their own status framework, and interact directly with a media element. Looking again at the youtubedoubler.com example, one could envisage both sides having text tracks. They wouldn't be joint tracks.

On Mon, 14 Feb 2011, Jeroen Wijering wrote: In terms of solutions, I lean much towards the manifest approach. The other approaches are options that each add more elements to HTML5, which: * Won't work for situations outside of HTML5. * Postpone, and perhaps clash with, the addition of manifests.

Manifests, and indeed any solution that relies on a single media element, would make it very difficult to render multiple video tracks independently (e.g. side by side vs picture-in-picture). That's not to say that manifests shouldn't work, but I think we'd need another solution as well.

*) The CSS styling issue can be fixed by making a conceptual change to CSS and text tracks. Instead of styling text tracks, a single text rendering area for each video element can be exposed and styled. Any text tracks that are enabled push data into it, which is automatically styled according to the video.textStyle/etc rules.

This wouldn't work well with positioned captions.
*) Discoverability is indeed an issue, but this can be fixed by defining a common track API for signalling and enabling/disabling tracks:

{{{
interface Track {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
  attribute unsigned short mode;
};

interface HTMLMediaElement : HTMLElement {
  [...]
  readonly attribute Track[] tracks;
};
}}}

There's a big difference between text tracks, audio tracks, and video tracks. While it makes sense, for instance, to have text tracks enabled but not showing, it makes no sense to do that with audio tracks. Similarly, video tracks need their own display area, but text tracks need a video track's display area. A single video area can display one video (multiple overlapping videos being achieved by multiple playback areas), but multiple audio and text tracks can be mixed together without any difficulty (mixing in one audio channel, or positioning over one video display area, respectively). So I'm not sure a single tracks API makes sense.

On Wed, 16 Feb 2011, Eric Winkelman wrote: We're working with multitrack MPEG transport streams, and have an implementation of the TimedTrack interface integrating with in-band metadata tracks. Our prototype uses the Metadata Cues to synchronize a JavaScript application with a video stream using the stream's embedded EISS signaling. This approach is working very well so far. The biggest issue we've faced is that there isn't an obvious way to tell the browser application what type of information is contained within the metadata track/cues. The Cues can contain
Re: [whatwg] How to handle multitrack media resources in HTML
On Apr 7, 2011, at 11:54 PM, Ian Hickson wrote: The distinction between a master media element and a master media controller is, in my mind, mostly a distinction without a difference. However, a welcome addition to the media controller would be convenience APIs for the above properties (as well as playbackState, networkState, seekable, and buffered).

I'm not sure what networkState would mean in this context. playbackState, assuming you mean 'paused', is already exposed.

Sorry, by playbackState, I meant readyState. And I was suggesting that, much in the same way that you've provided .buffered and .seekable properties which expose the intersection of the slaved media elements' corresponding ranges, that a readyState property could similarly reflect the readyState values of all the slaved media elements. In this case, the MediaController's hypothetical readyState wouldn't flip to HAVE_ENOUGH_DATA until all the constituent media elements' ready states reached at least the same value.

Of course, this would imply that the load events fired by a media element (e.g. loadedmetadata, canplaythrough) were also fired by the MediaController, and I would support this change as well.

Again, this would be just a convenience for authors, as this information is already available in other forms and could be relatively easily calculated on-the-fly in scripts. But UAs are likely going to have to do these calculations anyway to support things like autoplay, so adding explicit support for them in API form would not (imho) be unduly burdensome. -Jer Jer Noble jer.no...@apple.com
Re: [whatwg] How to handle multitrack media resources in HTML
On Friday, April 08, 2011, Ian Hickson wrote: On Thu, 17 Feb 2011, Eric Winkelman wrote: MPEG transport streams, as used for commercial TV, will often contain multiple types of metadata: content advisories, ad insertion opportunities, interactive TV application triggers, etc. If we were getting this information out-of-band we would, as you suggest, know how to deal with it. We would use multiple @kind=metadata tracks, with the correct handler associated with each track. In our case, however, this information is all coming in-band. There is information within the MPEG transport stream that identifies the types of metadata being carried. This lets the video player know, for example, that the stream has a particular track with application triggers, and another one with content advisories. To be consistent with the out-of-band tracks, we envision the player creating separate TimedTrack elements for each type of metadata, and adding the associated data as cues. But there isn't a clear way for the player to indicate the type of metadata it's putting into each of these TimedTrack cues. Which brings us to the mime types. I have an event handler on the video tag that fires when the player creates a new metadata track, and this handler tries to figure out what to do with the track. Without a type on the track, I have to set another handler on the track that fires when the player creates a cue, and tries to figure out what to do from the cue. As there is no type on the cue either, I have to examine the cue location/text to see if it contains metadata I'm able to handle. This all works, but it requires event handlers on tracks that may have no interest to the application. On the player side, it depends on the player tagging the metadata in a consistent ad-hoc way, as well as requiring the player to create separate metadata tracks. (We also considered starting the cue's text with a mime type, but this has the same basic issues.) This is an interesting problem. What is the way that the MPEG streams identify these various metadata streams? Is it a MIME type? Some other identifier? Is this identifier separate from the track's label, or is it the track's label? The streams contain a Program Map Table (PMT) which contains a list of tuples (program id (PID) and a standard numeric type) for the program's tracks. This is how the user agent knows about this metadata and what is contained in it. We're envisioning that the combination of transport, e.g. MPEG-2 TS, and PMT type would be used by the UA to select a MIME type. We're proposing that this MIME type would be the track's label. We think it would be better if there were a type attribute for the track to use instead of the label, but using the label would work. Thanks, Eric --- e.winkel...@cablelabs.com
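[A hedged sketch of how a page might consume the convention Eric proposes, assuming the UA exposes each metadata track with the derived MIME type as its label and fires addtrack events as tracks appear; the MIME type strings and the handler functions are illustrative, not registered types.]

{{{
var video = document.querySelector("video");
var handlers = {
  "application/x-eiss": handleAppTriggers,           // hypothetical types
  "application/x-content-advisory": handleAdvisories
};
video.textTracks.addEventListener("addtrack", function (e) {
  var track = e.track;
  if (track.kind == "metadata" && handlers[track.label]) {
    track.oncuechange = handlers[track.label]; // dispatch without inspecting cues
  }
});
}}}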
Re: [whatwg] How to handle multitrack media resources in HTML
Silvia, MPEG transport streams, as used for commercial TV, will often contain multiple types of metadata: content advisories, ad insertion opportunities, interactive TV application triggers, etc. If we were getting this information out-of-band we would, as you suggest, know how to deal with it. We would use multiple @kind=metadata tracks, with the correct handler associated with each track. In our case, however, this information is all coming in-band. There is information within the MPEG transport stream that identifies the types of metadata being carried. This lets the video player know, for example, that the stream has a particular track with application triggers, and another one with content advisories. To be consistent with the out-of-band tracks, we envision the player creating separate TimedTrack elements for each type of metadata, and adding the associated data as cues. But there isn't a clear way for the player to indicate the type of metadata it's putting into each of these TimedTrack cues. Which brings us to the mime types. I have an event handler on the video tag that fires when the player creates a new metadata track, and this handler tries to figure out what to do with the track. Without a type on the track, I have to set another handler on the track that fires when the player creates a cue, and tries to figure out what to do from the cue. As there is no type on the cue either, I have to examine the cue location/text to see if it contains metadata I'm able to handle. This all works, but it requires event handlers on tracks that may have no interest to the application. On the player side, it depends on the player tagging the metadata in a consistent ad-hoc way, as well as requiring the player to create separate metadata tracks. (We also considered starting the cue's text with a mime type, but this has the same basic issues.) Clear as mud, right? Thanks, Eric Winkelman --- CableLabs -Original Message- From: Silvia Pfeiffer [mailto:silviapfeiff...@gmail.com] Sent: Wednesday, February 16, 2011 1:34 PM To: Eric Winkelman Cc: WHAT Working Group Subject: Re: [whatwg] How to handle multitrack media resources in HTML Hi Eric, I'm curious: if you are using @kind=metadata - which is not generically applicable, but only has application-specific data in it - then this implies that the web page knows what type of data is in the track's cues and knows how to parse it. Why do you need a mime type on the cues then? Is it because MPEG has metadata cue tracks that can contain different types of structured content? Can you clarify? Cheers, Silvia. On Thu, Feb 17, 2011 at 6:44 AM, Eric Winkelman e.winkel...@cablelabs.com wrote: Silvia, all, We're working with multitrack MPEG transport streams, and have an implementation of the TimedTrack interface integrating with in-band metadata tracks. Our prototype uses the Metadata Cues to synchronize a JavaScript application with a video stream using the stream's embedded EISS signaling. This approach is working very well so far. The biggest issue we've faced is that there isn't an obvious way to tell the browser application what type of information is contained within the metadata track/cues. The Cues can contain arbitrary text, but neither the Cue, nor the associated TimedTrack, has functionality for specifying the format/meaning of that text. Our current implementation uses the Cue's @identifier for a MIME type, and puts the associated metadata into the Cue's text field using XML. 
This works, but requires the JavaScript browser application to examine the cues to see if they contain information it understands. It also requires the video player to follow this convention for Metadata TimedTracks. Adding a @type attribute to the Cues would certainly help, though it would still require the browser application to examine individual cues to see if they were useful. An alternate approach would be to add a @type attribute to the track tag/TimedTrack that would specify the mime type for the associated cues. This would allow a browser application to determine from the TimedTrack whether or not it needed to process the associated cues. Eric --- Eric Winkelman CableLabs -Original Message- From: whatwg-boun...@lists.whatwg.org [mailto:whatwg- boun...@lists.whatwg.org] On Behalf Of Silvia Pfeiffer Sent: Wednesday, February 09, 2011 5:41 PM To: WHAT Working Group Subject: [whatwg] How to handle multitrack media resources in HTML Hi all, One particular issue that hasn't had much discussion here yet is the issue of how to deal with multitrack media resources or media resources that have associated synchronized audio and video resources. I'm concretely referring to such things as audio descriptions, sign language video, and dubbed audio tracks. We require an API that can expose such extra tracks to the user
Re: [whatwg] How to handle multitrack media resources in HTML
Hi Eric, That is an interesting use case. I had not considered that there were any metadata tracks inside media resources that could be exposed, too. The first thing that we would need is the commitment of browser vendors to actually parse those metadata tracks and expose them to the Web page through the TimedTrack mechanism. Since I don't know about the types of tracks you are referring to, let me ask you some dumb questions. Are the metadata tracks that you are referring to encoded in the same way into MP4 files as caption tracks? If so, is there a mime type on that track? Or how do you identify what the different tracks contain? Is there other software that pulls out all of these tracks and does it in a generic way (i.e. not application-specific)? I am asking because the @type attribute is just one way to let JavaScript know about the content of the track. We also have the @label attribute, which may actually be more appropriate in this case, since it's data that is not meant for the browser to display, but to hand on to JavaScript. I'm trying to find a way in which this will work with the framework that we have created. Another thing that we haven't talked about yet is how to handle header-type meta data for WebVTT files, which is a similar problem to what you are proposing. It might be an idea to add a field for meta information to the API, but I am not sure yet if that is a good idea. Cheers, Silvia. On Fri, Feb 18, 2011 at 3:55 AM, Eric Winkelman e.winkel...@cablelabs.com wrote: Silvia, MPEG transport streams, as used for commercial TV, will often contain multiple types of metadata: content advisories, ad insertion opportunities, interactive TV application triggers, etc. If we were getting this information out-of-band we would, as you suggest, know how to deal with it. We would use multiple @kind=metadata tracks, with the correct handler associated with each track. In our case, however, this information is all coming in-band. There is information within the MPEG transport stream that identifies the types of metadata being carried. This lets the video player know, for example, that the stream has a particular track with application triggers, and another one with content advisories. To be consistent with the out-of-band tracks, we envision the player creating separate TimedTrack elements for each type of metadata, and adding the associated data as cues. But there isn't a clear way for the player to indicate the type of metadata it's putting into each of these TimedTrack cues. Which brings us to the mime types. I have an event handler on the video tag that fires when the player creates a new metadata track, and this handler tries to figure out what to do with the track. Without a type on the track, I have to set another handler on the track that fires when the player creates a cue, and tries to figure out what to do from the cue. As there is no type on the cue either, I have to examine the cue location/text to see if it contains metadata I'm able to handle. This all works, but it requires event handlers on tracks that may have no interest to the application. On the player side, it depends on the player tagging the metadata in a consistent ad-hoc way, as well as requiring the player to create separate metadata tracks. (We also considered starting the cue's text with a mime type, but this has the same basic issues.) Clear as mud, right? 
Thanks, Eric Winkelman --- CableLabs -Original Message- From: Silvia Pfeiffer [mailto:silviapfeiff...@gmail.com] Sent: Wednesday, February 16, 2011 1:34 PM To: Eric Winkelman Cc: WHAT Working Group Subject: Re: [whatwg] How to handle multitrack media resources in HTML Hi Eric, I'm curious: if you are using @kind=metadata - which is not generically applicable, but only has application-specific data in it - then this implies that the web page knows what type of data is in the track's cues and knows how to parse it. Why do you need a mime type on the cues then? Is it because MPEG has metadata cue tracks that can contain different types of structured content? Can you clarify? Cheers, Silvia. On Thu, Feb 17, 2011 at 6:44 AM, Eric Winkelman e.winkel...@cablelabs.com wrote: Silvia, all, We're working with multitrack MPEG transport streams, and have an implementation of the TimedTrack interface integrating with in-band metadata tracks. Our prototype uses the Metadata Cues to synchronize a JavaScript application with a video stream using the stream's embedded EISS signaling. This approach is working very well so far. The biggest issue we've faced is that there isn't an obvious way to tell the browser application what type of information is contained within the metadata track/cues. The Cues can contain arbitrary text, but neither the Cue, nor the associated TimedTrack, has functionality for specifying the format/meaning of that text. Our
Re: [whatwg] How to handle multitrack media resources in HTML
Silvia, all, We're working with multitrack MPEG transport streams, and have an implementation of the TimedTrack interface integrating with in-band metadata tracks. Our prototype uses the Metadata Cues to synchronize a JavaScript application with a video stream using the stream's embedded EISS signaling. This approach is working very well so far. The biggest issue we've faced is that there isn't an obvious way to tell the browser application what type of information is contained within the metadata track/cues. The Cues can contain arbitrary text, but neither the Cue, nor the associated TimedTrack, has functionality for specifying the format/meaning of that text. Our current implementation uses the Cue's @identifier for a MIME type, and puts the associated metadata into the Cue's text field using XML. This works, but requires the JavaScript browser application to examine the cues to see if they contain information it understands. It also requires the video player to follow this convention for Metadata TimedTracks. Adding a @type attribute to the Cues would certainly help, though it would still require the browser application to examine individual cues to see if they were useful. An alternate approach would be to add a @type attribute to the track tag/TimedTrack that would specify the mime type for the associated cues. This would allow a browser application to determine from the TimedTrack whether or not it needed to process the associated cues. Eric --- Eric Winkelman CableLabs -Original Message- From: whatwg-boun...@lists.whatwg.org [mailto:whatwg- boun...@lists.whatwg.org] On Behalf Of Silvia Pfeiffer Sent: Wednesday, February 09, 2011 5:41 PM To: WHAT Working Group Subject: [whatwg] How to handle multitrack media resources in HTML Hi all, One particular issue that hasn't had much discussion here yet is the issue of how to deal with multitrack media resources or media resources that have associated synchronized audio and video resources. I'm concretely referring to such things as audio descriptions, sign language video, and dubbed audio tracks. We require an API that can expose such extra tracks to the user and to JavaScript. This should be independent of whether the tracks are actually inside the media resource or are given as separate resources, but should be linked to the main media resource through markup. I am bringing this up now because solutions may have an influence on the inner workings of TimedTrack and the track element, so before we have any implementations of track, we should be very certain that we are happy with the way in which it works - in particular that track continues to stay an empty element. We've had some preliminary discussions about this in the W3C Accessibility Task Force and the alternatives that we could think about are captured in http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_API . This may not be the complete list of possible solutions, but it provides ideas for the different approaches that can be taken. I'd like to see what people's opinions are about them. Note there are also discussion threads about this at the W3C both in the Accessibility TF [1] and the HTML Working Group [2], but I am curious about input from the wider community. So check out http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_API and share your opinions. Cheers, Silvia. [1] http://lists.w3.org/Archives/Public/public-html-a11y/2011Feb/0057.html [2] http://lists.w3.org/Archives/Public/public-html/2011Feb/0205.html
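[A hedged sketch of the convention described above, from the consuming side: check each active cue's identifier for a MIME type the application understands, then parse the cue text as XML. The type string and the showAdvisory handler are illustrative, and the cue identifier is assumed to be exposed as cue.id.]

{{{
// `track` is assumed to be a metadata TimedTrack exposed by the player.
track.oncuechange = function () {
  for (var i = 0; i < this.activeCues.length; i++) {
    var cue = this.activeCues[i];
    if (cue.id != "application/x-content-advisory") continue; // the @identifier
    var doc = new DOMParser().parseFromString(cue.text, "application/xml");
    showAdvisory(doc.documentElement); // hypothetical application handler
  }
};
}}}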
Re: [whatwg] How to handle multitrack media resources in HTML
Hi Eric, I'm curious: if you are using @kind=metadata - which is not generically applicable, but only has application-specific data in it - then this implies that the web page knows what type of data is in the track's cues and knows how to parse it. Why do you need a mime type on the cues then? Is it because MPEG has metadata cue tracks that can contain different types of structured content? Can you clarify? Cheers, Silvia. On Thu, Feb 17, 2011 at 6:44 AM, Eric Winkelman e.winkel...@cablelabs.com wrote: Silvia, all, We're working with multitrack MPEG transport streams, and have an implementation of the TimedTrack interface integrating with in-band metadata tracks. Our prototype uses the Metadata Cues to synchronize a JavaScript application with a video stream using the stream's embedded EISS signaling. This approach is working very well so far. The biggest issue we've faced is that there isn't an obvious way to tell the browser application what type of information is contained within the metadata track/cues. The Cues can contain arbitrary text, but neither the Cue, nor the associated TimedTrack, has functionality for specifying the format/meaning of that text. Our current implementation uses the Cue's @identifier for a MIME type, and puts the associated metadata into the Cue's text field using XML. This works, but requires the JavaScript browser application to examine the cues to see if they contain information it understands. It also requires the video player to follow this convention for Metadata TimedTracks. Adding a @type attribute to the Cues would certainly help, though it would still require the browser application to examine individual cues to see if they were useful. An alternate approach would be to add a @type attribute to the track tag/TimedTrack that would specify the mime type for the associated cues. This would allow a browser application to determine from the TimedTrack whether or not it needed to process the associated cues. Eric --- Eric Winkelman CableLabs -Original Message- From: whatwg-boun...@lists.whatwg.org [mailto:whatwg- boun...@lists.whatwg.org] On Behalf Of Silvia Pfeiffer Sent: Wednesday, February 09, 2011 5:41 PM To: WHAT Working Group Subject: [whatwg] How to handle multitrack media resources in HTML Hi all, One particular issue that hasn't had much discussion here yet is the issue of how to deal with multitrack media resources or media resources that have associated synchronized audio and video resources. I'm concretely referring to such things as audio descriptions, sign language video, and dubbed audio tracks. We require an API that can expose such extra tracks to the user and to JavaScript. This should be independent of whether the tracks are actually inside the media resource or are given as separate resources, but should be linked to the main media resource through markup. I am bringing this up now because solutions may have an influence on the inner workings of TimedTrack and the track element, so before we have any implementations of track, we should be very certain that we are happy with the way in which it works - in particular that track continues to stay an empty element. We've had some preliminary discussions about this in the W3C Accessibility Task Force and the alternatives that we could think about are captured in http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_API . This may not be the complete list of possible solutions, but it provides ideas for the different approaches that can be taken. 
I'd like to see what people's opinions are about them. Note there are also discussion threads about this at the W3C both in the Accessibility TF [1] and the HTML Working Group [2], but I am curious about input from the wider community. So check out http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_API and share your opinions. Cheers, Silvia. [1] http://lists.w3.org/Archives/Public/public-html-a11y/2011Feb/0057.html [2] http://lists.w3.org/Archives/Public/public-html/2011Feb/0205.html
Re: [whatwg] How to handle multitrack media resources in HTML
Hello Silvia, all, First, thanks for the Multitrack wiki page. Very helpful for those who are not subscribed to the various lists. I also phrased below comments as feedback to this page: http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_API

USE CASE

The use case is spot on; this is an issue that blocks HTML5 video from being chosen over a solution like Flash. An elaborate list of tracks is important, to correctly scope the conditions / resolutions:

1. Tracks targeting device capabilities:
   * Different containers / codecs / profiles
   * Multiview (3D) or surround sound
   * Playback rights and/or decryption possibilities
2. Tracks targeting content customization:
   * Alternate viewing angles or alternate music scores
   * Director's comments or storyboard video
3. Tracks targeting accessibility:
   * Dubbed audio or text subtitles
   * Audio descriptions or closed captions
   * Tracks cleared from cursing / nudity / violence
4. Tracks targeting the interface:
   * Chapter lists, bookmarks, timed annotations, midroll hints..
   * .. and any other type of scripting cues

Note I included the HTML5 text tracks. I believe there are four kinds of tracks, all an inherent part of a media presentation. These types designate the output of the track, not its encoded representation:

* audio (producing sound)
* metadata (producing scripting cues)
* text (producing rendered text)
* video (producing images)

In this taxonomy, the HTML5 subtitles and captions track kinds are text, the descriptions kind is audio, and the chapters and metadata kinds are metadata.

REQUIREMENTS

The requirements are elaborate, but do note they span beyond HTML5. Everything that plays back audio/video needs multitrack support:

* Broad- and narrowcasting playback devices of any kind
* Native desktop, mobile and set-top applications/apps
* Devices that play media standalone (media players, picture frames, AirPlay)

Also, on e.g. the iPhone and Android devices, playback of video is triggered by HTML5, but subsequently detached from it. Think about the custom fullscreen controls, the obscuring of all HTML, and events/cues that are deliberately ignored or not sent (such as play() in iOS). I wonder whether this is a temporary state or something that will remain and should be provisioned for. With this in mind, I think an additional requirement is that there should be a full solution outside the scope of HTML5. HTML5 has unique capabilities like customization of the layout (CSS) and interaction (JavaScript), but it must not be required.

SIDE CONDITIONS

In the side conditions, I'm not sure about the relative volume of audio or positioning of video. Automation by default might work better and requires no parameters. For audio, blending can be done through a ducking mechanism (like the JW Player does). For video, blending can be done through an alpha channel. At a later stage, an API/heuristics for PIP support and gain control can be added.

SOLUTIONS

In terms of solutions, I lean much towards the manifest approach. The other approaches are options that each add more elements to HTML5, which:

* Won't work for situations outside of HTML5.
* Postpone, and perhaps clash with, the addition of manifests.

Without a manifest, there'll probably be no adaptive streaming, which renders HTML5 video much less useful. At the same time, standardization around manifests (DASH) is largely wrapping up.

EXAMPLE

Here's some code on the manifest approach.
First the HTML5 side:

<video id="v1" poster="video.png" controls>
  <source src="manifest.xml" type="video/mpeg-dash" />
</video>

Second the manifest side:

<MPD mediaPresentationDuration="PT645S" type="OnDemand">
  <BaseURL>http://cdn.example.com/myVideo/</BaseURL>
  <Period>
    <Group mimeType="video/webm" lang="en">
      <Representation sourceURL="video-1600.webm" />
    </Group>
    <Group mimeType="video/mp4; codecs=avc1.42E00C,mp4a.40.2" lang="en">
      <Representation sourceURL="video-1600.mp4" />
    </Group>
    <Group mimeType="text/vtt" lang="en">
      <Accessibility type="CC" />
      <Representation sourceURL="captions.vtt" />
    </Group>
  </Period>
</MPD>

(I should look more into accessibility parameters, but there is support for signalling captions, audio descriptions, sign language etc.) Note that this approach moves the text track outside of HTML5, making it accessible for other clients as well. Both codecs are also in the manifest - this is just one of the device capability selectors of DASH clients.

DISADVANTAGES

The two listed disadvantages for the manifest approach in the wiki page are lack of CSS and discoverability:

*) The CSS styling issue can be fixed by making a conceptual change to CSS and text tracks. Instead of styling text tracks, a single text rendering area for each video element can be exposed and styled. Any text tracks that are enabled push data into it, which is automatically styled according to the video.textStyle/etc rules.

*)