Re: [R] ROCR predictions

Assa Yeroslaviz Thu, 19 Aug 2010 02:04:33 -0700

Hello everybody,

Yes, I'm sorry. I can see it is not so easy to understand.
I'll try to explain a bit more. The experiment compares two (protein
domain) databases to find out whether the results found in one are
comparable to those in the second DB.
The first column lists the various inputs in the DB, the second lists the
various domains for each gene. The p-value column gives the probability
that the domain was found by chance. Column four (Expected) gives the
expected domain, and column five (Is Expected) states whether the domain
found matches it.
The TP, TN, FP and FN counts were calculated many times, each time with a
different p-value (from p=1,...,p=10E-12) as the threshold.


The goal of this calculation was to find the optimal p-value, with a maximum
of TP and a minimum of FP.
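
One common way to turn "maximum TP, minimum FP" into a single number is Youden's J (TPR minus FPR). The sketch below is only an illustration of that criterion, not the method used in this thread. The toy vectors are rows copied from the table quoted later in the thread, and `cutoffs` is an assumed grid of one threshold per decade.

```r
# Toy data: p-values and Is.Expected flags copied from rows of the table
# quoted later in this thread (not the full data set).
p_value     <- c(1.15e-05, 9.40e-08, 3.83e-06, 6.25e-12, 0.004666118,
                 0.0173, 0.009, 0.3975, 0.0219, 0.002, 1.40e-08)
is_expected <- c(TRUE, TRUE, TRUE, TRUE, TRUE,
                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Assumed grid of cutoffs: one per decade from 1 down to 1e-12.
cutoffs <- 10^(0:-12)

# True and false positive rate when every hit with p <= cutoff is accepted.
tpr <- sapply(cutoffs, function(thr) mean(p_value[is_expected]  <= thr))
fpr <- sapply(cutoffs, function(thr) mean(p_value[!is_expected] <= thr))

# Youden's J weighs a high TPR and a low FPR equally.
youden <- tpr - fpr
best   <- which.max(youden)   # on ties, the first (most lenient) cutoff wins

print(data.frame(cutoff = cutoffs, TPR = tpr, FPR = fpr, J = youden))
print(cutoffs[best])
```

Other criteria (for example a fixed maximum FPR) would give a different "optimal" cutoff, so the choice of J here is an assumption about what optimal means.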

To do so, I thought about using the column of p-values as my predictions and
the values in the column Is.Expected (TRUE, FALSE) as my labels.
This is how I calculated my first ROC curve:
> library(ROCR)
> pValue <- read.delim(file = "p=1.txt", as.is = TRUE)
> desc1 <- pValue[["p.value"]]
> # Is.Expected was recoded beforehand: TRUE = 0, FALSE = 1
> label1 <- pValue[["Is.Expected"]]

> pred <- prediction(desc1, label1)
> perf <- performance(pred, "tpr", "fpr")
> plot(perf, colorize = TRUE)
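
With numeric labels, ROCR's prediction() takes the larger label as the positive class and reads larger prediction values as evidence for it. With the recoding above (TRUE = 0, FALSE = 1) the FALSE rows therefore become the positive class. An alternative, sketched here as an assumption and not as the approach used in this message, keeps TRUE as the positive class and uses -log10(p) as the score. The toy vectors are six rows copied from the table quoted later in the thread.

```r
# Toy data: six rows copied from the table quoted later in this thread.
p_value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975)
is_expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Small p-values are mapped to large scores, so that a large score
# speaks for the positive class.
score <- -log10(p_value)
label <- as.integer(is_expected)   # 1 = expected (positive), 0 = not expected

# The ROCR part only runs when the package is installed.
if (requireNamespace("ROCR", quietly = TRUE)) {
  library(ROCR)
  pred <- prediction(score, label)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, colorize = TRUE)
}
```

Both orientations describe the same ranking of hits; they differ only in which class the "true positive" axis refers to.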

My questions are as follows:
1. Am I right in thinking that the p-values here are the
predictions?
I know you said I need to decide that for myself, but I'm not sure. If they
are, then I will have the same predictions for each and every ROCR
calculation. Will that make any difference to the prediction?
2. How can I calculate the results for the other p-value thresholds? Do I
need to do each separately, or is there a way of combining them?
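
A minimal base R sketch for question 2, under the assumption that each stricter list is simply the p=1 list restricted to hits with a p-value at or below the cutoff (the thread does not confirm how the ten lists were produced). If that holds, one pass over the p=1 list gives the counts for every threshold. The toy vectors are ten rows copied from the table quoted later in the thread, and `cutoffs` is an assumed grid.

```r
# Toy data: the first ten rows of the table quoted later in this thread.
p_value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975,
                 0.0219, 6.25e-12, 0.002, 0.0059)
is_expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE,
                 FALSE, TRUE, FALSE, FALSE)

# Assumed grid of cutoffs: one per decade from 1 down to 1e-12.
cutoffs <- 10^(0:-12)

# A hit is "called" when its p-value is at or below the cutoff.
counts <- t(sapply(cutoffs, function(thr) {
  called <- p_value <= thr
  c(cutoff = thr,
    TP = sum(called & is_expected),
    FP = sum(called & !is_expected),
    FN = sum(!called & is_expected),
    TN = sum(!called & !is_expected))
}))
counts <- as.data.frame(counts)

# Rates for plotting a ROC curve by hand (FPR on x, TPR on y).
counts$TPR <- counts$TP / (counts$TP + counts$FN)
counts$FPR <- counts$FP / (counts$FP + counts$TN)
print(counts)
```

Each row of `counts` corresponds to one of the separate threshold experiments, so the ROC points for all thresholds come from a single data set.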

I hope you can still help me with some hints or further advice.

Thanks

Assa

On Wed, Aug 18, 2010 at 07:55, Claudia Beleites <cbelei...@units.it> wrote:

> Dear Assa,
>
> you need to call prediction with continuous predictions and a _binary_ true
> class label.
>
> You are the only one who can tell whether the p-values are actually
> predictions and what the class labels are. For the list readers, p is just
> the name of some variable, and you didn't even vaguely say what you are
> trying to classify, nor did you offer any explanation of what the columns are.
>
> The only information we get from your table is that p-value has small and
> continuous values. From what I see the p-values could also be fitting errors
> of the predictions (e.g. expressed as a probability that the similarity to
> the predicted class is random).
>
> Claudia
>
> Assa Yeroslaviz wrote:
>
>> Dear Claudia,
>>
>> Thank you for your fast answer.
>> Here again is the data table as an example.
>>
>> Protein ID   Pfam Domain      p-value      Expected         Is Expected  True Positive  False Negative  False Positive  True Negative
>> NP_000011.2  APH              1.15E-05     APH              TRUE         1              0               0               0
>> NP_000011.2  MutS_V           0.0173       APH              FALSE        0              0               1               0
>> NP_000062.1  CBS              9.40E-08     CBS              TRUE         1              0               0               0
>> NP_000066.1  APH              3.83E-06     APH              TRUE         1              0               0               0
>> NP_000066.1  CobU             0.009        APH              FALSE        0              0               1               0
>> NP_000066.1  FeoA             0.3975       APH              FALSE        0              0               1               0
>> NP_000066.1  Phage_integr_N   0.0219       APH              FALSE        0              0               1               0
>> NP_000161.2  Beta_elim_lyase  6.25E-12     Beta_elim_lyase  TRUE         1              0               0               0
>> NP_000161.2  Glyco_hydro_6    0.002        Beta_elim_lyase  FALSE        0              0               1               0
>> NP_000161.2  SurE             0.0059       Beta_elim_lyase  FALSE        0              0               1               0
>> NP_000161.2  SapB_2           0.0547       Beta_elim_lyase  FALSE        0              0               1               0
>> NP_000161.2  Runt             0.1034       Beta_elim_lyase  FALSE        0              0               1               0
>> NP_000204.3  EGF              0.004666118  EGF              TRUE         1              0               0               0
>> NP_000229.1  PAS              3.13E-06     PAS              TRUE         1              0               0               0
>> NP_000229.1  zf-CCCH          0.2067       PAS              FALSE        0              1               1               0
>> NP_000229.1  E_raikovi_mat    0.0206       PAS              FALSE        0              0               0               0
>> NP_000388.2  NAD_binding_1    8.21E-24     NAD_binding_1    TRUE         1              0               0               0
>> NP_000388.2  ABM              1.40E-08     NAD_binding_1    FALSE        0              0               1               0
>> NP_000483.3  MMR_HSR1         1.98E-05     MMR_HSR1         TRUE         1              0               0               0
>> NP_000483.3  DEAD             2.30E-05     MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  APS_kinase       1.80E-09     MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  CbiA             0.0003       MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  CoaE             1.28E-07     MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  FMN_red          4.61E-08     MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  Fn_bind          0.3855       MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  Invas_SpaK       0.2431       MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  PEP-utilizers    0.127        MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  NIR_SIR_ferr     0.1661       MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  AAA              0.0031       MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  DUF448           0.0021       MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  CBF_beta         0.1201       MMR_HSR1         FALSE        0              0               1               0
>> NP_000483.3  zf-C3HC4         0.0959       MMR_HSR1         FALSE        0              0               1               0
>> NP_000560.5  ig               5.69E-39     ig               TRUE         1              0               0               0
>> NP_000704.1  Epimerase        4.40E-21     Epimerase        TRUE         1              0               0               0
>> NP_000704.1  Lipase_GDSL      6.63E-11     Epimerase        FALSE        0              0               1               0
>>
>> ...
>>
>> This is a shortened list from one of the 10 lists I have for different
>> p-values.
>>
>> As you can see, I have separate p-value experiments and probably need to
>> calculate a separate ROC for each of them. But I don't know how to calculate
>> these characteristics for the p-values.
>> How do I assign the predictions to each of the single p-value experiments?
>>
>> I would appreciate any help
>>
>> Thanks
>> Assa
>>
>>
>> On Tue, Aug 17, 2010 at 12:55, Claudia Beleites <cbelei...@units.it> wrote:
>>
>>    Dear Assa,
>>
>>
>>
>>        I am having a problem building a ROC curve with my data using
>>        the ROCR package.
>>
>>        I have 10 lists of proteins such as attached (proteinlist.xls).
>>        each of the
>>
>>    your file didn't make it to the list.
>>
>>
>>
>>        lists was calculated with a different p-value.
>>        The goal is to find the optimal p-value for the highest number
>>        of true positives as well as the lowest number of false
>>        positives.
>>
>>
>>        As far as I understood the explanations from the vignette of
>>        ROCR, my data of TP and FP are the labels of the prediction
>>        function. But I don't know how to assign the right predictions
>>        to these labels.
>>
>>
>>    I assume the p-values are different cutoffs that you use for
>>    "hardening" (= making yes/no predictions) from some soft (=
>>    continuous class membership) output of your classifier.
>>
>>    Usually, ROCR calculates the curves as a function of the
>>    cutoff/threshold itself from the continuous predictions. If you have
>>    these soft predictions, let ROCR do the calculation for you.
>>
>>    If you don't have them, ROCR can calculate your characteristics
>>    (sens, spec, precision, recall, whatever) for each of the p-values.
>>    While you could combine the results "by hand" into a
>>    ROCR-performance object and let ROCR do the plotting, it is then
>>    probably easier if you plot directly yourself.
>>
>>    Don't be shy to look into the prediction and performance objects, I
>>    find them pretty obvious. Maybe start with the objects produced by
>>    the examples.
>>
>>    Also, note ROCR works with binary validation data only. If your data
>>    has more than one class, you need to make two-class-problems first
>>    (e.g. protein xy ./. not protein xy).
>>
>>
>>
>>        BTW, is there a way of finding the optimum in the curve? I mean
>>        to find the exact value in the ROC curve (see sheet 2 in the
>>        excel file for the ROC curve).
>>
>>
>>    Someone asked for optimum on ROC a couple of months ago, RSiteSearch
>>    on the mailing list with ROC and optimal or optimum should get you
>>    answers.
>>
>>
>>
>>        I would like to thank for any help in advance
>>
>>    You're welcome.
>>
>>    Claudia
>>
>>    --
>>    Claudia Beleites
>>    Dipartimento dei Materiali e delle Risorse Naturali
>>    Università degli Studi di Trieste
>>    Via Alfonso Valerio 6/a
>>    I-34127 Trieste
>>
>>    phone: +39 0 40 5 58-37 68
>>    email: cbelei...@units.it
>>
>>
>>
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
