Hi Gustavo,

I'm not sure what you mean by "non-overlapping" features exactly. For a maxbins 
of 700, the thing to avoid IMO is returning the first (or random) 700 features 
because this does not divide the segment up into "bins". If all 700 are in the 
first bin (which might conceivably be represented by a single pixel on the 
client), the client might still only be able to show one/a few, whilst the rest 
of the pixels show nothing.

Each procedure should divide the segment into 700 bins, and sort features into 
them by position. Some features might fall into only one bin, others might fall 
into more than one and may or may not be overlapping. A filter should then be 
applied to choose which features in each bin should be returned. Such a filter 
might be based on score (e.g. highest, lowest, sum, mean/median/mode averages), 
or some other factor. You could also create a new feature representing all 
those in the bin (e.g. a feature count).
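
A rough sketch of that procedure in Python, using "highest score" as the filter. The Feature record, the function names and the 1-based inclusive coordinates are my own assumptions for illustration, not anything taken from mydas or the DAS spec:

```python
from collections import namedtuple

# Hypothetical feature record: 1-based inclusive start/end plus a score.
Feature = namedtuple("Feature", ["name", "start", "end", "score"])

def bin_of(pos, seg_start, seg_end, maxbins):
    """0-based bin containing base `pos` (integer maths, no early rounding)."""
    length = seg_end - seg_start + 1
    return (pos - seg_start) * maxbins // length

def best_per_bin(features, seg_start, seg_end, maxbins):
    """Map bin index -> highest-scoring feature touching that bin."""
    best = {}
    for f in features:
        if f.end < seg_start or f.start > seg_end:
            continue  # feature lies entirely outside the segment
        first = bin_of(max(f.start, seg_start), seg_start, seg_end, maxbins)
        last = bin_of(min(f.end, seg_end), seg_start, seg_end, maxbins)
        for b in range(first, last + 1):
            if b not in best or f.score > best[b].score:
                best[b] = f
    return best

features = [
    Feature("a", 1, 10, 5.0),
    Feature("b", 5, 5, 9.0),
    Feature("c", 8, 25, 3.0),
]
chosen = best_per_bin(features, 1, 100, 10)
```

Here feature "c" covers three bins but only wins two of them, because "b" outscores it in the first bin. Other filters (lowest, mean, a count per bin) would slot into the same loop.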

In my mind there are two complicating factors:

1. How to decide which features to return if they are of different sizes and 
cover a different number of bins.

Perhaps it makes sense if a feature is selected for one bin that it should be 
selected for all the other bins it covers, for example, but what if the 
algorithm is using highest score and the feature selected for the first bin has 
a substantially lower score than another feature in the second bin?

2. Rounding. If you're tired or losing interest it's probably best to stop 
reading now...

Consider segment X:12345,98765 (86421 bases) and a maxbins of 1000. Assuming 
bins are of equal size, each is 86.421 bases long. Obviously it's not possible 
to express fractions of a base in DAS, so it is important that the server and 
the client interpret this in the same way. Firstly it's important not to round 
the bin size at the beginning, which would create an error in the total length 
or number of bins. So the first bin is >= 12345 and <= 12431.421, and the 
second is > 12431.421 and <= 12517.842. Which bin(s) does X:12431 fall into? You 
might be tempted to say "easy, it's in bin 1". But would your answer change if 
it was a feature at X:12400,12431 [which really means X:12400,12431.99999], or 
a feature at X:12431,12500? Basically what I am getting at is, do we count an 
end position as being 12431.0 or 12431.9999999? I believe Ensembl does the 
former but am not 100% sure and this is probably not strictly speaking accurate.
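
To make the two readings concrete, here is a small Python sketch. The function name and the end_as_extent flag are made up for illustration, and I am assuming half-open bins [lo, hi) counted from the segment start, which differs slightly from the ">" and "<=" wording above but only at exact boundary values. It uses integer arithmetic so the bin size is never rounded:

```python
def bins_covered(start, end, seg_start, seg_end, maxbins, end_as_extent):
    """Return (first_bin, last_bin), 0-based, for a feature start..end.

    end_as_extent=False treats the end as the point `end`.0;
    end_as_extent=True treats the end base as covering up to `end` + 1.
    """
    length = seg_end - seg_start + 1
    first = (start - seg_start) * maxbins // length
    if end_as_extent:
        # Ceiling division, then step back one bin: the last bin that
        # contains any point strictly below end + 1.
        numerator = (end + 1 - seg_start) * maxbins
        last = -(-numerator // length) - 1
    else:
        last = (end - seg_start) * maxbins // length
    return first, last

SEG = (12345, 98765, 1000)  # segment start, segment end, maxbins

as_point = bins_covered(12400, 12431, *SEG, end_as_extent=False)
as_extent = bins_covered(12400, 12431, *SEG, end_as_extent=True)
```

With the point reading, X:12400,12431 stays entirely in the first bin. With the extent reading it spills into the second, because the boundary at 12431.421 falls inside the last base. Whichever convention is picked, the client and server need to pick the same one.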

Sorry about the complicated numbers... 

Cheers,
Andy

On 6 Jul 2011, at 14:53, Gustavo Salazar wrote:

> Hey guys,
> 
> One of the topics in the workshop was the idea of having a set of strategies 
> for maxbins, and we said we would discuss it here... so this is my call to 
> hear ideas about it. I might have some spare time soon, and if we get a 
> couple of good strategies I can implement them in mydas as part of its core, 
> so a datasource provider can choose to use one of the predefined strategies 
> or define a particular algorithm if they wish. 
> 
> I suppose the easiest maxbins strategy is to return X random 
> non-overlapping features in the segment. 
> 
> Any other Ideas?
> 
> Regards,
> 
> Gustavo.
> 
> 
> _______________________________________________
> DAS mailing list
> [email protected]
> http://lists.open-bio.org/mailman/listinfo/das

