Re: [spctools-discuss] Very low number of identifications Comet+PeptideProphet

Valeriia Vasylieva Wed, 19 Mar 2025 20:48:33 -0700

Hi Jimmy,

Thank you for your response!


I worked with kind of samples like b1369p080_sample_01_a . There are 14 of 
them, each divided to fractions _a and _b. I merged the fractions and did 
search on 14 raw files.
I did not take into account that those are high-res MS/MS. Thank you for a 
valuable notice and suggested parameters. 
The fact that you also identified just a few high-scoring PSMs suggests 
that maybe my input parameters were not far from yours.I attach comet.params 
file. I will appreciate your comments on it.

Valeriia

# comet_version 2021.01 rev. 0
# Comet MS/MS search engine parameters file.
# Everything following the '#' symbol is treated as a comment.

database_name = 
decoy_search = 1                       # 0=no (default), 1=concatenated 
search, 2=separate search
peff_format = 0                        # 0=no (normal fasta, default), 
1=PEFF PSI-MOD, 2=PEFF Unimod
peff_obo =                             # path to PSI Mod or Unimod OBO file

num_threads = 0                        # 0=poll CPU to set num threads; 
else specify num threads directly (max 128)

#
# masses
#
peptide_mass_tolerance = 20.00
peptide_mass_units = 2                 # 0=amu, 1=mmu, 2=ppm
mass_type_parent = 1                   # 0=average masses, 1=monoisotopic 
masses
mass_type_fragment = 1                 # 0=average masses, 1=monoisotopic 
masses
precursor_tolerance_type = 1           # 0=MH+ (default), 1=precursor m/z; 
only valid for amu/mmu tolerances
isotope_error = 1                      # 0=off, 1=0/1 (C13 error), 2=0/1/2, 
3=0/1/2/3, 4=-8/-4/0/4/8 (for +4/+8 labeling)

#
# search enzyme
#
search_enzyme_number = 1               # choose from list at end of this 
params file
search_enzyme2_number = 0              # second enzyme; set to 0 if no 
second enzyme
num_enzyme_termini = 2                 # 1 (semi-digested), 2 (fully 
digested, default), 8 C-term unspecific , 9 N-term unspecific
allowed_missed_cleavage = 2            # maximum value is 5; for enzyme 
search

#
# Up to 9 variable modifications are supported
# format:  <mass> <residues> <0=variable/else binary> 
<max_mods_per_peptide> <term_distance> <n/c-term> <required> <neutral_loss>
#     e.g. 79.966331 STY 0 3 -1 0 0 97.976896
#
variable_mod01 = 15.9949 M 0 3 -1 0 0 0.0
variable_mod02 = 0.0 X 0 3 -1 0 0 0.0
variable_mod03 = 0.0 X 0 3 -1 0 0 0.0
variable_mod04 = 0.0 X 0 3 -1 0 0 0.0
variable_mod05 = 0.0 X 0 3 -1 0 0 0.0
variable_mod06 = 0.0 X 0 3 -1 0 0 0.0
variable_mod07 = 0.0 X 0 3 -1 0 0 0.0
variable_mod08 = 0.0 X 0 3 -1 0 0 0.0
variable_mod09 = 0.0 X 0 3 -1 0 0 0.0
max_variable_mods_in_peptide = 5
require_variable_mod = 0

#
# fragment ions
#
# ion trap ms/ms:  1.0005 tolerance, 0.4 offset (mono masses), 
theoretical_fragment_ions = 1
# high res ms/ms:    0.02 tolerance, 0.0 offset (mono masses), 
theoretical_fragment_ions = 0, spectrum_batch_size = 15000
#

fragment_bin_tol = 1.0005              # binning to use on fragment ions
fragment_bin_offset = 0.0              # offset position to start the 
binning (0.0 to 1.0)
theoretical_fragment_ions = 0          # 0=use flanking peaks, 1=M peak only
use_A_ions = 0
use_B_ions = 1
use_C_ions = 0
use_X_ions = 0
use_Y_ions = 1
use_Z_ions = 0
use_Z1_ions = 0
use_NL_ions = 1                        # 0=no, 1=yes to consider NH3/H2O 
neutral loss peaks

#
# output
#
output_sqtfile = 0                     # 0=no, 1=yes  write sqt file
output_txtfile = 0                     # 0=no, 1=yes  write tab-delimited 
txt file
output_pepxmlfile = 1                  # 0=no, 1=yes  write pepXML file
output_mzidentmlfile = 0               # 0=no, 1=yes  write mzIdentML file
output_percolatorfile = 0              # 0=no, 1=yes  write Percolator pin 
file
print_expect_score = 1                 # 0=no, 1=yes to replace Sp with 
expect in out & sqt
num_output_lines = 5                   # num peptide results to show

sample_enzyme_number = 1               # Sample enzyme which is possibly 
different than the one applied to the search.
                                       # Used to calculate NTT & NMC in 
pepXML output (default=1 for trypsin).

#
# mzXML parameters
#
scan_range = 0 0                       # start and end scan range to 
search; either entry can be set independently
precursor_charge = 0 0                 # precursor charge range to analyze; 
does not override any existing charge; 0 as 1st entry ignores parameter
override_charge = 0                    # 0=no, 1=override precursor charge 
states, 2=ignore precursor charges outside precursor_charge range, 3=see 
online
ms_level = 2                           # MS level to analyze, valid are 
levels 2 (default) or 3
activation_method = ALL               # activation method; used if 
activation method set; allowed ALL, CID, ECD, ETD, ETD+SA, PQD, HCD, IRMPD, 
SID

#
# misc parameters
#
digest_mass_range = 500.0 6000.0       # MH+ peptide mass range to analyze
peptide_length_range = 7 30            # minimum and maximum peptide length 
to analyze (default 1 63; max length 63)
num_results = 50                      # number of search hits to store 
internally
max_duplicate_proteins = 20            # maximum number of additional 
duplicate protein names to report for each peptide ID; -1 reports all 
duplicates
max_fragment_charge = 3                # set maximum fragment charge state 
to analyze (allowed max 5)
max_precursor_charge = 6               # set maximum precursor charge state 
to analyze (allowed max 9)
nucleotide_reading_frame = 0           # 0=proteinDB, 1-6, 7=forward three, 
8=reverse three, 9=all six
clip_nterm_methionine = 1              # 0=leave sequences as-is; 1=also 
consider sequence w/o N-term methionine
spectrum_batch_size = 30000            # max. # of spectra to search at a 
time; 0 to search the entire scan range in one loop
decoy_prefix = DECOY_                  # decoy entries are denoted by this 
string which is pre-pended to each protein accession
equal_I_and_L = 1                      # 0=treat I and L as different; 
1=treat I and L as same
output_suffix =                        # add a suffix to output base names 
i.e. suffix "-C" generates base-C.pep.xml from base.mzXML input
mass_offsets =                         # one or more mass offsets to search 
(values substracted from deconvoluted precursor mass)
precursor_NL_ions =                    # one or more precursor neutral loss 
masses, will be added to xcorr analysis

#
# spectral processing
#
minimum_peaks = 10                     # required minimum number of peaks 
in spectrum to search (default 10)
minimum_intensity = 0                  # minimum intensity value to read in
remove_precursor_peak = 0              # 0=no, 1=yes, 2=all charge reduced 
precursor peaks (for ETD), 3=phosphate neutral loss peaks
remove_precursor_tolerance = 1.5       # +- Da tolerance for precursor 
removal
clear_mz_range = 0.0 0.0               # for iTRAQ/TMT type data; will 
clear out all peaks in the specified m/z range

#
# additional modifications
#

add_Cterm_peptide = 0.0
add_Nterm_peptide = 0.0
add_Cterm_protein = 0.0
add_Nterm_protein = 0.0

add_G_glycine = 0.0000                 # added to G - avg.  57.0513, mono. 
 57.02146
add_A_alanine = 0.0000                 # added to A - avg.  71.0779, mono. 
 71.03711
add_S_serine = 0.0000                  # added to S - avg.  87.0773, mono. 
 87.03203
add_P_proline = 0.0000                 # added to P - avg.  97.1152, mono. 
 97.05276
add_V_valine = 0.0000                  # added to V - avg.  99.1311, mono. 
 99.06841
add_T_threonine = 0.0000               # added to T - avg. 101.1038, mono. 
101.04768
add_C_cysteine = 57.021464             # added to C - avg. 103.1429, mono. 
103.00918
add_L_leucine = 0.0000                 # added to L - avg. 113.1576, mono. 
113.08406
add_I_isoleucine = 0.0000              # added to I - avg. 113.1576, mono. 
113.08406
add_N_asparagine = 0.0000              # added to N - avg. 114.1026, mono. 
114.04293
add_D_aspartic_acid = 0.0000           # added to D - avg. 115.0874, mono. 
115.02694
add_Q_glutamine = 0.0000               # added to Q - avg. 128.1292, mono. 
128.05858
add_K_lysine = 0.0000                  # added to K - avg. 128.1723, mono. 
128.09496
add_E_glutamic_acid = 0.0000           # added to E - avg. 129.1140, mono. 
129.04259
add_M_methionine = 0.0000              # added to M - avg. 131.1961, mono. 
131.04048
add_H_histidine = 0.0000               # added to H - avg. 137.1393, mono. 
137.05891
add_F_phenylalanine = 0.0000           # added to F - avg. 147.1739, mono. 
147.06841
add_U_selenocysteine = 0.0000          # added to U - avg. 150.0379, mono. 
150.95363
add_R_arginine = 0.0000                # added to R - avg. 156.1857, mono. 
156.10111
add_Y_tyrosine = 0.0000                # added to Y - avg. 163.0633, mono. 
163.06333
add_W_tryptophan = 0.0000              # added to W - avg. 186.0793, mono. 
186.07931
add_O_pyrrolysine = 0.0000             # added to O - avg. 237.2982, mono 
 237.14773
add_B_user_amino_acid = 0.0000         # added to B - avg.   0.0000, mono. 
  0.00000
add_J_user_amino_acid = 0.0000         # added to J - avg.   0.0000, mono. 
  0.00000
add_X_user_amino_acid = 0.0000         # added to X - avg.   0.0000, mono. 
  0.00000
add_Z_user_amino_acid = 0.0000         # added to Z - avg.   0.0000, mono. 
  0.00000

#
# COMET_ENZYME_INFO _must_ be at the end of this parameters file
#
[COMET_ENZYME_INFO]
0.  Cut_everywhere         0      -           -
1.  Trypsin                1      KR          P
2.  Trypsin/P              1      KR          -
3.  Lys_C                  1      K           P
4.  Lys_N                  0      K           -
5.  Arg_C                  1      R           P
6.  Asp_N                  0      D           -
7.  CNBr                   1      M           -
8.  Glu_C                  1      DE          P
9.  PepsinA                1      FL          P
10. Chymotrypsin           1      FWYL        P
11. No_cut                 1      @           @



суббота, 15 марта 2025 г. в 00:08:09 UTC+2, Jimmy Eng: 

> Valeriia,
>
> That PXD003594 experiment has 76 raw files associated with.  Were there a 
> subset of raw files that you analyzed here or does your analysis include 
> all 76 runs?    Just for a quick test, I downloaded 4 raw files from that 
> experiment and searched it with Comet against the UniProt human database.  
> Here's a very basic summary:
>
>    -    b1369p080_sample_01_a.raw:  high res MS/MS, almost no IDs (less 
>    than 100 positive PSMs)
>    - b1369p601_DMSO_G1_B2_S10.RAW:  ion trap MS/MS, ~6000 PSMs at 1% 
>    error rate
>    -  b1369p601_GDC_G2_B3_S23.RAW:  ion trap MS/MS, ~6000 PSMs at 1% 
>    error rate
>    -           b1369p65_PP4_R.RAW:  ion trap MS/MS, ~500 PSMs at 1% error 
>    rate
>
> So there are runs with thousands of good PSM IDs in them.  Note that in 
> the 4 raw files that I sampled, there was a mix of high-res and low-res 
> MS/MS spectra so hopefully you adjusted the fragment ion settings 
> appropriately for each raw file using my suggested parameter settings shown 
> below.  If the fragment ion settings aren't the issue, feel free to 
> follow-up including attaching the contents of your comet.params file.
>
> high-res:
>    fragment_bin_tol = 0.02 
>    fragment_bin_offset = 0.0
>    theoretical_fragment_ions = 0
>
> low-res: 
>    fragment_bin_tol = 1.0005
>    fragment_bin_offset = 0.4
>    theoretical_fragment_ions = 1
>
> Jimmy
>
> On Thu, Mar 13, 2025 at 5:02 PM Valeriia Vasylieva <[email protected]> 
> wrote:
>
>> Hi.
>> I run Comet and Peptideprophet on two public datasets with TDC with 
>> UniProt. I calculated the q-value in Python based on fval distribution and 
>> filtered data with a threshold 1%. Like that:
>> #df - PeptideProphet outputdf = df.sort_values(by='fval', 
>> ascending=False).reset_index(drop=True) 
>> # Calculate cumulative counts of targets and decoys df['cum_targets'] = 
>> (df['database'] == 'T').cumsum() df['cum_decoys'] = (df['database'] == 
>> 'D').cumsum() # Calculate FDR df['FDR'] = df['cum_decoys'] / 
>> df['cum_targets'] # cumulative minimum from bottom to top df['q-value'] = 
>> df['FDR'][::-1].cummin()[::-1] 
>>
>> Lower you see the proportion of PSMs annotated as targets and decoys 
>> which passed the value threshold or not.  One of the datasets (PXD03594) 
>> has a very low number of identifications. It also has a wide distribution 
>> of decoys (on the graph the raw files are plotted together). Could anyone 
>> suggest what could have happened here? I used default parameters, just 
>> changed peptide length from 7 to 30 aa, and peptide mass range 
>> 500.0-6000.0, and also enabled Methioning clipping. 
>> Thanks!
>>
>> [image: Capture9.PNG][image: Capture8.PNG]
>>
>>
>> [image: Capture6.PNG][image: Capture7.PNG]
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "spctools-discuss" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion visit 
>> https://groups.google.com/d/msgid/spctools-discuss/e8201f2e-0fd1-4c36-ac34-dbeada186d13n%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/spctools-discuss/e8201f2e-0fd1-4c36-ac34-dbeada186d13n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"spctools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/spctools-discuss/b91c825d-8f23-4ffc-afcc-947cda2f79e3n%40googlegroups.com.

Re: [spctools-discuss] Very low number of identifications Comet+PeptideProphet

Reply via email to