Approaching Big Data! :-) Interesting....!
Q: Is each post-binning row a bin, meaning there are only ~86 rows per bin?
(Am I interpreting that correctly?)
If so, that's good news; materializing any one given bin shouldn't be a
problem for our runtime, so maybe we can indeed get this to work in the
short term.
Sorry for the hassles w/this.....!
Cheers,
Mike
On 2/22/16 7:09 AM, Yiran Wang wrote:
Mike,
The original dataset has 31,132,597 rows of records. After binning it
into a 1-minute time-bin dataset, it has 363,466 rows of records.
Thanks,
Yiran
On Sun, Feb 21, 2016 at 2:55 PM, Mike Carey <[email protected]> wrote:
ARGH!!!! This is what we would like you to *not* have to do. Sorry...
Our aim is to be the Big Data antidote....
@Yingyi: Maybe you could take a quick peek at the query issue and
see if there is any low-hanging hope there?
@Yiran: How big are your windows, typically? (Number of data
points.)
Cheers,
Mike
On 2/21/16 2:42 PM, Yiran Wang wrote:
Thank you Mike for your update and suggestions! And thank you Ian
again for working with me.
A little update from my end:
I have been working on query (1) over the weekend to find a
work-around. I tried to simultaneously calculate the stdev on the
new copy of the list of values with the outliers removed, which is
in essence the same as query (2), so I ran into the same problem
that the query did not compile.
What I did instead was to export the entire dataset with the outliers
removed into Excel and calculate the stdev there. However, the
dataset is now 363,466 x 200+ in dimension. Though that does not
exceed the row x col limit in Excel, my computer does not have
enough memory to do anything useful without crashing, so I've been
breaking the dataset into smaller parts and working on each separately.
Yiran
On Sun, Feb 21, 2016 at 2:10 PM, Mike Carey <[email protected]> wrote:
Ian,
Thanks for working with Yiran on this! I think there is "good"
and bad news w.r.t. these queries:
- The bad news is that they go beyond what we are likely to optimize
at all well at present, as they go beyond what typical DB aggregate
functions like min/max/avg/count/sum do. (I would try forming the
groups and then doing these things on the groups, but saying them in
AQL will be tricky and may lead to queries that hit edge cases in the
optimizer. For some of these my thought was to try using a positional
variable within a group...? A sketch of that idea appears below.)
- The "good" news (only for AsterixDB) is that this is exactly the
sort of inspiration that we are looking for in terms of understanding
how to better support query-based analytics in real use cases (and
this is a very real one!).
To quote a short paper I reviewed just this AM on SQL queries kind of
like these: "Percentage queries are more complex than their
conventional counterparts and introduce new challenges for
optimization." (The paper didn't have an applicable solution for us,
sadly.)
A more general facility that I wish we could offer would be to do
grouping in AsterixDB but then have the ability to pass a group to
(e.g.) R and then get results back for the group. When groups are
small-ish (like Yiran's windows) that would be pretty cool - then one
could do all sorts of advanced things per group.
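For what it's worth, here is a minimal, untested sketch of that
positional-variable idea applied to the 5%/95% trim asked about in
question 1 below. It assumes AQL positional variables ("for ... at ...")
are 1-based and that a sorted subquery can be iterated this way;
neither has been verified against the optimizer, and the function name
is made up:

declare function trim-5-95($values) {
  // Sort the values, then keep only those whose 1-based rank falls
  // strictly above the bottom 5% and within the bottom 95%.
  let $c := count($values)
  for $v at $p in (for $w in $values order by $w return $w)
  where 20 * $p > $c
    and 20 * $p <= 19 * $c
  return $v
}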
Cheers,
Mike
On 2/21/16 12:35 AM, Ian Maxon wrote:
Yiran and I came up with possible answers for these...
For 1), a function could be used that looks something like this:
declare function minmax($x) {
  // Population stdev via E[x^2] - E[x]^2; keep values within 2 stdev of the mean.
  let $mean := avg($x)
  let $stdv := (avg(for $z in $x return $z * $z) - $mean * $mean) ^ (0.5)
  for $y in $x
  where $y < $mean + (2 * $stdv)
    and $y > $mean - (2 * $stdv)
  return $y
}
And then applied to return a new copy of the list of values, removing
ones that are outside of 2 stdev.
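For instance (an untested sketch, reusing field names from Yiran's
schema below), it might be applied per time bin like this:

for $i in dataset HRM
group by $sid := $i.sid, $timebin :=
    interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M")) with $i
return {
  "sid": $sid,
  "timebin": $timebin,
  "bpm-kept": minmax(for $ii in $i return $ii.bpm)
}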
For 2), we also came up with a potential solution, but the query fails
to compile (filed as https://issues.apache.org/jira/browse/ASTERIXDB-1308).
Any thoughts on these queries would be welcome :) 1) especially seems
inefficient to do as a function.
- Ian
On Fri, Feb 19, 2016 at 3:37 PM, Yiran Wang <[email protected]> wrote:
Hi Asterix team,
I have two queries I'm struggling with. I'm hoping you could provide a
direction for me. Thanks in advance!
Here is what the data structure looks like:
create type HRMType as closed {
  row_id: int32,
  sid: int32,
  date: date,
  day: int32,
  time: time,
  bpm: int32,
  RR: float
};

create dataset HRM(HRMType) primary key row_id;
Previously, I used the interval-bin function to calculate the standard
deviation of bpm for each time bin:
for $i in dataset HRM
group by $sid := $i.sid, $gdate := $i.date, $gday := $i.day,
    $timebin := interval-bin($i.time, time("00:00:00"), day-time-duration("PT1M"))
    with $i
return {
  "sid": $sid,
  "gdate": $gdate,
  "gday": $gday,
  "timebin": $timebin,
  "stdv": (avg(for $ii in $i return $ii.RR * $ii.RR)
           - avg(for $ii in $i return $ii.RR) * avg(for $ii in $i return $ii.RR)) ^ (0.5)
};
Now I have two things I am hoping to do but need help with:
1. For each 1-min time bin, remove the bpm values that are above the
top 5% or below the bottom 5%. I thought about applying the min/max
function a few times to achieve this, but realized that it was not a
good idea because the number of instances in each time bin is not
always the same. So for each 1-min time bin, we do need to calculate
the 5% and 95% thresholds and remove instances accordingly, which I
don't know how to do.
2. After removing the outliers of bpm for each 1-min time bin,
calculate a median absolute deviation (MAD) for each 1-min time bin
(as another measure of variation besides the standard deviation):
MAD = median(abs(x - median(x))). I'm not sure how to write a query
to do the median function in Asterix (one possible sketch appears
after this list).
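One way the median (and MAD on top of it) might be expressed is
sketched below. This is untested: it assumes 1-based positional
variables and that user-defined functions can call each other, and the
helper names are made up. The middle-rank test (2 * $p >= $c and
2 * $p <= $c + 2) selects the single middle element when the count is
odd and the two middle elements when it is even, and avg() combines
them:

declare function my-median($values) {
  // Average the one or two middle elements of the sorted list
  // (positions $p assumed 1-based).
  let $c := count($values)
  return avg(
    for $v at $p in (for $w in $values order by $w return $w)
    where 2 * $p >= $c and 2 * $p <= $c + 2
    return $v
  )
}

declare function my-mad($values) {
  // MAD = median(abs(x - median(x)))
  let $m := my-median($values)
  return my-median(for $v in $values return abs($v - $m))
}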
Thank you so much in advance. Let me know if my questions are clear.
Yiran
--
Best,
Yiran
--
Best,
Yiran
--
Best,
Yiran