Re: [Bioc-devel] Behavior of select function in AnnotationDbi

Hervé Pagès Sun, 22 Nov 2015 04:09:52 -0800

Hi Jim,

I understand the convenience of having an output that is parallel to
the input, especially in the context of the primary use case which is
mapping/cbinding ids to a rectangular object. However for that use
case, "as close to parallel as possible" is not good enough. IMO the
user needs a tool that guarantees parallelism.


mapIds() seems to be such tool. It seems to produce an output that is
always parallel to the input, at least by default (i.e. when the
'multiVals' argument is not specified). BTW this is such an important
property that it would deserve to be explicitly stated somewhere in
the man page.

So the user already has mapIds() for mapping/cbinding ids to a
rectangular object.

mapIds() builds on top of select() and achieves parallelism by adding a
mechanism to resolve ambiguous mappings (via the 'multiVals' argument).

In contrast, select() currently doesn't provide such mechanism. Instead
it systematically prints a message about the cardinality of the mapping
(1:1, 1:many, many:1, or many:many). This is not very useful. What's
really important is whether the output is parallel to the input or not.
If it is (i.e. if the mapping is 1:1 or many:1), then everything is fine
and no message or warning is needed. If it's not (i.e. if the mapping is
1:many or many:many), then it makes sense to warn the user. A warning
seems more appropriate than a message for this because it's about a
potential gotcha with the output. Also, unlike a message, a warning is
delayed and printed at the bottom of the output which makes it less
likely to be missed, especially if the output fits more than 1 screen.

Anyway, the current select() is in this gray area where it sometimes
returns something that I can use for mapping/cbinding ids to my
rectangular object, and sometimes returns something that I can't use
for that (unless I transform it first). There is no way to predict that.
It's only a question of luck. If I'm out of luck, then I use mapIds()
(and the warning should tell me that), which is what I should have used
in the 1st place. In other words, if I care about parallelism, then
select() is the wrong tool.

This is why I was suggesting that a simple and straightforward approach
would be to just give up on parallelism for select() i.e. make it behave
like SQL SELECT and biomaRt::getBM(), stop worrying about the output not
being parallel to the input, and make it clear in the man page that the
user should not have such expectation (if parallel output is important
to him/her, then s/he should use mapIds()). With this approach select()
is a little bit lower level than it is right now and mapIds() is the
higher level tool that builds on top of it to achieve parallelism. This
low-level select() is still extremely useful and adds a lot of value
over letting the users write their own SQL queries. It is the same
level of usefulness that biomaRt::getBM() adds for querying Ensembl
(compared to typing SQL queries in an MySQL client to query the Ensembl
MySQL db). With biomaRt::getBM() and with this low-level select(), you
don't need to know any SQL. All you need to know is that the output
is not guaranteed to be parallel to the input.

Another approach is the exact opposite one: make select() reliable for
the mapping/cbinding use case i.e. make it produce an output that is
always parallel to the input by default. That means it would also need
a mechanism to resolve ambiguous mappings like mapIds() does (e.g.
via a 'multiVals' argument).

Right now select() sits in the middle of these 2 approaches.

If select()'s behavior cannot be changed, maybe we can try to make it
a little less noisy i.e. have warnings (and not messages) only when
the output is not parallel to the input. Also how about having a
warning that suggests the use of mapIds()? Something like "Hey, your
output is not parallel to your input! Use mapIds() if that matters
to you."

Thanks,
H.


On 11/21/2015 08:28 AM, James W. MacDonald wrote:

Hi Hervé,

I think you make a valid point, but I am not sure it conforms to either
the historical expectations for the annotation packages, nor the
expected use case. In other words, the annotation packages have, since
the beginning, returned something that was as close to parallel to the
input as possible. This started with the env-based packages, where e.g.,
mget() returned things parallel to the input, rather than randomly, as
the SQL model would have done. This was actually true for biomaRt back
in the day when you could use either the MySQL or CURL interfaces. The
SQL interface returned something parallel to input, and it was a bit of
a change when Steffen went to just using the CURL interface, and people
had to learn to add the query key into the return in order to re-sort.

So from a historical perspective it would be a real change to go from
returning something as close to parallel to the input to a SQL type
model. I agree that makes sense for those who are versed in SQL, but if
we were really trying to help people familiar with SQL, we would just be
giving them the RSQLite DB and telling them to have at it, no? The whole
rationale behind wrapping the DB in a package is to shield the end user
from having to understand SQL, which includes knowing that SQL doesn't
enforce the order of the returned data.

 From a use case perspective, the only use case that I am familiar with
is the one where people have a rectangular array of data, where the rows
are genes or whatnot, and they want to append additional columns to the
array, where the added columns conform to the existing array. This is
not always possible to do in the case of 1:many mappings, obviously, but
that is what mapIds() is for - there is an implicit contract there that
you can input a set of keys, and get out an ordered result that you can
then tack onto the side of your array of data and go forward. I think
that is a really useful thing to be able to do. It is obviously quite
easy to take two matrices that are ordered differently, reorder one, and
then bind together. But if you don't have to do that, then it's an
improvement, and quite helpful for naive users who might not have
considered that the returned result might not actually be parallel to
their input.

And being the one who convinced Marc to put the warnings back in, I will
have to respectively disagree about whether or not it is useful to emit
the warnings. Well, I agree it isn't useful at all for those of us who
already know that the results might not be parallel. But that isn't the
reason for a warning is it? To warn those who already understand what is
going on? My argument is (and was) that a warning is intended to tell
people that something that they /might not have expected/ just happened.
I don't think it is unreasonable to think that a significant proportion
of naive end users would get tripped up by that if the warning weren't
there. They still might be, but at least we gave them a chance.

Anyway, it took like three lines of code to make it work consistently
for many:many mappings, so that's what I did.

Best,

Jim



On Fri, Nov 20, 2015 at 6:32 PM, Hervé Pagès <hpa...@fredhutch.org
<mailto:hpa...@fredhutch.org>> wrote:

    On 11/20/2015 03:21 PM, Hervé Pagès wrote:

        Hi Jim,

        I think we should choose the biomaRt model, that is, duplicated are
        allowed but silently ignored.

        Note that this is also the SQL model. When you do

            SELECT * FROM ... WHERE key IN c('key1', 'key2', ...)

                                         ^
    I meant:

          SELECT * FROM ... WHERE key IN ('key1', 'key2', ...)

    No c() (too much R lately...)

    H.



        duplicated keys don't generate duplicates in the output.

        Also note that, like SELECT, even if the keys supplied to
        biomaRt::getBM() (via the 'values' arg) don't contain duplicates
        and if all the mappings are 1-to-1, biomaRt::getBM() is not
        guarantee to preserve order.

        Generally speaking having duplicates in the input produce duplicates
        in the output is useful in vectorized operations when the output
        is expected to be parallel to the input. Vectorized operations also
        need to propagate NAs and to preserve order. However, like SELECT
        and biomaRt::getBM(), select() cannot produce an output that is
        parallel to the input *in general*.

        It seems that the current philosophy for select() is to emit a note
        or a warning every time the output is not parallel to the input.
        Personally I find this too noisy and not that useful.

        Thanks,
        H.


        On 11/20/2015 02:30 PM, James W. MacDonald wrote:

            There is an inconsistency in how select() works in
            AnnotationDbi when a
            user passes in duplicated keys to be mapped, depending on if the
            mapping is
            1:1 or 1:many. It's easiest to show using an example.

                select(org.Hs.eg.db, rep("1", 3), "SYMBOL")

            'select()' returned many:1 mapping between keys and columns
                ENTREZID SYMBOL
            1        1   A1BG
            2        1   A1BG
            3        1   A1BG

                select(org.Hs.eg.db, rep("1", 3), "GO")

            'select()' returned many:many mapping between keys and columns
                ENTREZID         GO EVIDENCE ONTOLOGY
            1        1 GO:0003674       ND       MF
            2        1 GO:0003674       ND       MF
            3        1 GO:0003674       ND       MF

            This is obviously a bug. A single query for that ID results
            in this:

                select(org.Hs.eg.db, "1", "GO")

            'select()' returned 1:many mapping between keys and columns
                ENTREZID         GO EVIDENCE ONTOLOGY
            1        1 GO:0003674       ND       MF
            2        1 GO:0005576      IDA       CC
            3        1 GO:0005615      IDA       CC
            4        1 GO:0008150       ND       BP
            5        1 GO:0070062      IDA       CC
            6        1 GO:0072562      IDA       CC

            So the returned results are completely borked.

            However, the question I have is what should be returned? To
            be consistent
            with the first example, it should be the output expected for
            a single
            key,
            repeated three times, which I have patched AnnotationDbi to do:

                select(org.Hs.eg.db, rep("1", 3), "GO")

            'select()' returned many:many mapping between keys and columns
                 ENTREZID         GO EVIDENCE ONTOLOGY
            1         1 GO:0003674       ND       MF
            2         1 GO:0005576      IDA       CC
            3         1 GO:0005615      IDA       CC
            4         1 GO:0008150       ND       BP
            5         1 GO:0070062      IDA       CC
            6         1 GO:0072562      IDA       CC
            7         1 GO:0003674       ND       MF
            8         1 GO:0005576      IDA       CC
            9         1 GO:0005615      IDA       CC
            10        1 GO:0008150       ND       BP
            11        1 GO:0070062      IDA       CC
            12        1 GO:0072562      IDA       CC
            13        1 GO:0003674       ND       MF
            14        1 GO:0005576      IDA       CC
            15        1 GO:0005615      IDA       CC
            16        1 GO:0008150       ND       BP
            17        1 GO:0070062      IDA       CC
            18        1 GO:0072562      IDA       CC

            So, two questions.


                 1. Should duplicate keys be allowed, or should
            duplicates be removed
                 before querying the database, preferably with a message
            saying
            that dups
                 were removed?
                 2. If the answer to #1 is yes, then to be consistent, I
            will just
            commit
                 the patch I have made to both devel and release.

            Best,

            Jim





    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>




--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Behavior of select function in AnnotationDbi

Reply via email to