Re: [R] ROCR predictions

2010-08-19 Thread Assa Yeroslaviz
Hello everybody,

Yes, I'm sorry. I can see it is not so easy to understand.
I'll try to explain a bit more. The experiment was used to compare two
(protein domain) databases and find out whether or not the results found
in one are comparable to those in the second DB.
The first column lists the various inputs in the DB; the second
lists the various domains for each gene. The p-value column gives the
probability that the domain in column four (Expected) was found by chance.
Column five lists the expected values.
The calculation of TP, TN, FP and FN was made many times, each time with a
different p-value (from p=1,...,p=10E-12) as the threshold for calculating
the various values of TP, TN, etc.

The goal of this calculation was to find the optimal p-value with a maximum
of TP and a minimum of FP.

To do so, I thought about making the column of p-values my predictions and
the values in the column Is.Expected (TRUE, FALSE) my labels.
This is how I calculated my first ROC curve:
 library(ROCR)

 pValue <- read.delim(file = "p=1.txt", as.is = TRUE)
 desc1  <- pValue[["p.value"]]
 # labels after changing the values to TRUE = 0, FALSE = 1
 label1 <- pValue[["Is.Expected"]]

 pred <- prediction(desc1, label1)
 perf <- performance(pred, "tpr", "fpr")
 plot(perf, colorize = TRUE)

My questions are as follows:
1. Am I right in thinking that the p-values here are the predictions?
I know you said I need to decide it for myself, but I'm not sure. If they
are, then I will have the same predictions for each and every ROCR
calculation. Will it make any difference to the prediction?
2. How can I calculate the other p-value thresholds? Do I need to do each
separately, or is there a way of combining them?

I hope you can still help me with some hints or further advice.

Thanks

Assa

On Wed, Aug 18, 2010 at 07:55, Claudia Beleites cbelei...@units.it wrote:

 Dear Assa,

 you need to call prediction with continuous predictions and a _binary_ true
 class label.

 You are the only one who can tell whether the p-values are actually
 predictions  and what the class labels are. For the list readers p is just
 the name of whatever variable, and you didn't even vaguely say what you try
 to classify, nor did you offer any explanation of what the columns are.

 The only information we get from your table is that p-value has small and
 continuous values. From what I see the p-values could also be fitting errors
 of the predictions (e.g. expressed as a probability that the similarity to
 the predicted class is random).

 Claudia


Re: [R] ROCR predictions

2010-08-19 Thread Frank Harrell


At the heart of this you have a problem in incomplete conditioning. 
You are computing things like Prob(X > x) when you know X=x.  Working 
with a statistician who is well versed in probability models will 
undoubtedly help.


Frank

Frank E Harrell Jr   Professor and Chairman        School of Medicine
                     Department of Biostatistics   Vanderbilt University


[R] ROCR predictions

2010-08-17 Thread Assa Yeroslaviz
Hi everybody,

I am having a problem building a ROC curve with my data using the ROCR
package.

I have 10 lists of proteins such as the one attached (proteinlist.xls). Each
of the lists was calculated with a different p-value.
The goal is to find the optimal p-value, giving the highest number of true
positives together with the lowest number of false positives.

As far as I understood the explanations from the vignette of ROCR, my data
of TP and FP are the labels of the prediction function. But I don't know how
to assign the right predictions to these labels.

BTW, Is there a way of finding the optimum in the curve? I mean to find the
exact value in the ROC curve (see sheet 2 in the excel file for the ROC
curve).

Thank you in advance for any help.

Assa
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROCR predictions

2010-08-17 Thread Claudia Beleites

Dear Assa,



> I am having a problem building a ROC curve with my data using the ROCR
> package.
>
> I have 10 lists of proteins such as attached (proteinlist.xls). each of the

your file didn't make it to the list.



> lists was calculated with a different p-value.
> The goal is to find the optimal p-value for the highest number of true
> positives as well as lowest number of false positives.
>
> As far as I understood the explanations from the vignette of ROCR, my data
> of TP and FP are the labels of the prediction function. But I don't know how
> to assign the right predictions to these labels.


I assume the p-values are different cutoffs that you use for hardening (= 
making yes/no predictions) from some soft (= continuous class membership) output 
of your classifier.


Usually, ROCR calculates the curves as a function of the cutoff/threshold itself 
from the continuous predictions. If you have these soft predictions, let ROCR do 
the calculation for you.
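
As a minimal sketch of letting ROCR sweep the threshold itself (an illustration, not code from this thread): the p-values and Is Expected flags below are copied from the first seven rows of the table posted in this thread. The -log10 transform is an assumption made here so that larger scores mean "more likely expected", which is the direction ROCR assumes by default.

```r
library(ROCR)

# p-value and "Is Expected" columns, first seven rows of the posted table
p.value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is.expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# ROCR treats larger predictions as evidence for the positive class,
# so turn small p-values into large scores (assumption of this sketch)
score  <- -log10(p.value)
labels <- as.integer(is.expected)   # 1 = expected domain, 0 = not expected

# One call covers every possible cutoff; no per-threshold runs are needed
pred <- prediction(score, labels)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE)

# Area under the curve as a single summary number
auc <- performance(pred, "auc")@y.values[[1]]
print(auc)
```

On these seven rows every expected domain has a smaller p-value than every unexpected one, so the curve is trivially perfect; the full lists will look different.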


If you don't have them, ROCR can calculate your characteristics (sens, spec, 
precision, recall, whatever) for each of the p-values. While you could combine 
the results by hand into a ROCR-performance object and let ROCR do the 
plotting, it is then probably easier if you plot directly yourself.
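
A sketch of the "plot directly yourself" route in base R, without ROCR. The data are again the first seven rows of the posted table; the vector `cutoffs` and the helper `roc.point` are hypothetical names chosen for this illustration, and the cutoff values stand in for the ten p-value thresholds that were actually used.

```r
# p-value and "Is Expected" columns, first seven rows of the posted table
p.value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975, 0.0219)
is.expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Hypothetical thresholds; replace with the p-values used for the lists
cutoffs <- c(1, 1e-2, 1e-4, 1e-6, 1e-8)

# Harden the predictions at one cutoff and count the outcomes
roc.point <- function(cutoff) {
  called <- p.value <= cutoff               # hits accepted at this cutoff
  tp <- sum(called & is.expected)
  fp <- sum(called & !is.expected)
  c(cutoff = cutoff, tp = tp, fp = fp,
    tpr = tp / sum(is.expected),            # sensitivity
    fpr = fp / sum(!is.expected))           # 1 - specificity
}

roc <- as.data.frame(t(sapply(cutoffs, roc.point)))
print(roc)

# One point per threshold, joined in threshold order
plot(roc$fpr, roc$tpr, type = "b", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False positive rate", ylab = "True positive rate")
```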


Don't be shy to look into the prediction and performance objects, I find them 
pretty obvious. Maybe start with the objects produced by the examples.
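
For instance, the objects can be inspected along these lines. This sketch uses the `ROCR.simple` example data set that ships with the package rather than the data from this thread.

```r
library(ROCR)

# Example data shipped with ROCR: a list with $predictions and $labels
data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred, "tpr", "fpr")

# Both are S4 objects; list their slots and look at the structure
print(slotNames(pred))
print(slotNames(perf))
str(pred, max.level = 2)

# Cutoffs and the TP/FP counts at each cutoff are stored as lists
# with one element per run (here a single run)
print(head(pred@cutoffs[[1]]))
print(head(pred@tp[[1]]))
print(head(pred@fp[[1]]))
```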


Also, note ROCR works with binary validation data only. If your data has more 
than one class, you need to make two-class-problems first (e.g. protein xy ./. 
not protein xy).
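
A small base R sketch of such a one-versus-rest recoding. The `expected` vector holds a few values copied from the Expected column of the table posted in this thread; the variable names are chosen here for illustration only.

```r
# "Expected" domain for a handful of hits from the posted table
expected <- c("APH", "APH", "CBS", "APH", "Beta_elim_lyase", "EGF", "PAS")

# Two-class problem for a single class: 1 = APH, 0 = anything else
is.aph <- as.integer(expected == "APH")
print(table(expected, is.aph))

# The same recoding for every class at once: one 0/1 column per class
one.vs.rest <- sapply(unique(expected),
                      function(cl) as.integer(expected == cl))
print(one.vs.rest)
```

Each column of `one.vs.rest` could then serve as the binary label vector for a separate ROC analysis.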




> BTW, Is there a way of finding the optimum in the curve? I mean to find the
> exact value in the ROC curve (see sheet 2 in the excel file for the ROC
> curve).


Someone asked for optimum on ROC a couple of months ago, RSiteSearch on the 
mailing list with ROC and optimal or optimum should get you answers.
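
One common criterion, shown here only as an example and not necessarily what the earlier list discussion recommended, is Youden's J: the cutoff that maximises TPR minus FPR. The sketch below reads the cutoffs back from the ROCR performance object; the ten rows are copied from the table posted in this thread, and the -log10 scoring is an assumption of this sketch.

```r
library(ROCR)

# p-value and "Is Expected" for ten rows of the posted table
p.value <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009,
             0.3975, 0.0219, 1.98e-05, 2.30e-05, 1.80e-09)
is.expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE,
                 FALSE, FALSE, TRUE, FALSE, FALSE)

# Larger score = stronger evidence for "expected" (assumption of this sketch)
pred <- prediction(-log10(p.value), as.integer(is.expected))
perf <- performance(pred, "tpr", "fpr")

fpr    <- perf@x.values[[1]]
tpr    <- perf@y.values[[1]]
cutoff <- perf@alpha.values[[1]]     # on the -log10 scale, first entry is Inf

# Youden's J = TPR - FPR; take the cutoff where it is largest
best   <- which.max(tpr - fpr)
best.p <- 10^(-cutoff[best])         # back to the p-value scale
print(c(p.value = best.p, tpr = tpr[best], fpr = fpr[best]))
```

Other criteria (for example, the point closest to the top-left corner, or one weighted by the relative cost of false positives) can pick a different cutoff, so the choice should follow from what the lists are used for.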




> I would like to thank for any help in advance

You're welcome.

Claudia

--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbelei...@units.it



Re: [R] ROCR predictions

2010-08-17 Thread Assa Yeroslaviz
Dear Claudia,

thank you for your fast answer.
I add again the table of the data as an example.

Protein ID   Pfam Domain      p-value      Expected         Is Expected  True Positive  False Negative  False Positive  True Negative
NP_11.2      APH              1.15E-05     APH              TRUE         1              0               0               0
NP_11.2      MutS_V           0.0173       APH              FALSE        0              0               1               0
NP_62.1      CBS              9.40E-08     CBS              TRUE         1              0               0               0
NP_66.1      APH              3.83E-06     APH              TRUE         1              0               0               0
NP_66.1      CobU             0.009        APH              FALSE        0              0               1               0
NP_66.1      FeoA             0.3975       APH              FALSE        0              0               1               0
NP_66.1      Phage_integr_N   0.0219       APH              FALSE        0              0               1               0
NP_000161.2  Beta_elim_lyase  6.25E-12     Beta_elim_lyase  TRUE         1              0               0               0
NP_000161.2  Glyco_hydro_6    0.002        Beta_elim_lyase  FALSE        0              0               1               0
NP_000161.2  SurE             0.0059       Beta_elim_lyase  FALSE        0              0               1               0
NP_000161.2  SapB_2           0.0547       Beta_elim_lyase  FALSE        0              0               1               0
NP_000161.2  Runt             0.1034       Beta_elim_lyase  FALSE        0              0               1               0
NP_000204.3  EGF              0.004666118  EGF              TRUE         1              0               0               0
NP_000229.1  PAS              3.13E-06     PAS              TRUE         1              0               0               0
NP_000229.1  zf-CCCH          0.2067       PAS              FALSE        0              1               1               0
NP_000229.1  E_raikovi_mat    0.0206       PAS              FALSE        0              0               0               0
NP_000388.2  NAD_binding_1    8.21E-24     NAD_binding_1    TRUE         1              0               0               0
NP_000388.2  ABM              1.40E-08     NAD_binding_1    FALSE        0              0               1               0
NP_000483.3  MMR_HSR1         1.98E-05     MMR_HSR1         TRUE         1              0               0               0
NP_000483.3  DEAD             2.30E-05     MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  APS_kinase       1.80E-09     MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  CbiA             0.0003       MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  CoaE             1.28E-07     MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  FMN_red          4.61E-08     MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  Fn_bind          0.3855       MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  Invas_SpaK       0.2431       MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  PEP-utilizers    0.127        MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  NIR_SIR_ferr     0.1661       MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  AAA              0.0031       MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  DUF448           0.0021       MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  CBF_beta         0.1201       MMR_HSR1         FALSE        0              0               1               0
NP_000483.3  zf-C3HC4         0.0959       MMR_HSR1         FALSE        0              0               1               0
NP_000560.5  ig               5.69E-39     ig               TRUE         1              0               0               0
NP_000704.1  Epimerase        4.40E-21     Epimerase        TRUE         1              0               0               0
NP_000704.1  Lipase_GDSL      6.63E-11     Epimerase        FALSE        0              0               1               0
 ...

This is a shortened list from one of the 10 lists I have for different
p-values.

As you can see, I have separate p-value experiments and probably need to
calculate a separate ROC for each of them. But I don't know how to calculate
these characteristics for the p-values.
How do I assign the predictions to each of the single p-value experiments?

I would appreciate any help

Thanks
Assa





Re: [R] ROCR predictions

2010-08-17 Thread Claudia Beleites

Dear Assa,

you need to call prediction with continuous predictions and a _binary_ true 
class label.


You are the only one who can tell whether the p-values are actually predictions 
 and what the class labels are. For the list readers p is just the name of 
whatever variable, and you didn't even vaguely say what you try to classify, nor 
did you offer any explanation of what the columns are.


The only information we get from your table is that p-value has small and 
continuous values. From what I see the p-values could also be fitting errors of 
the predictions (e.g. expressed as a probability that the similarity to the 
predicted class is random).


Claudia
