No worries, a corrected Mascot2XML will be in the next TPP release.
Brian
On Wed, Nov 11, 2009 at 3:21 PM, Simon Michnowicz <
simon.michnow...@gmail.com> wrote:
>
>
> Unfortunately we have no control over what goes in the FASTA
> databases! Matrix Science's pepXML generation code escapes the XML
Unfortunately we have no control over what goes in the FASTA
databases! Matrix Science's pepXML generation code escapes the XML
if ($thisScript->param($urlParams{'prot_desc'})) {
    $prot_desc = &noXmlTag(&mustGetProteinDescription($protein_list[0], \%fastaTitles));
}
Where no
I'll add the substitutions to the getdb.* scripts in the TPP src/util directory.
> Should a substitution be added to the IPI retrieval utility scripts in
> the TPP distribution so that the problem doesn't show its face if
> they are being used?
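
For anyone preparing a database by hand, the angle-bracket to square-bracket substitution could look roughly like the sketch below. It is an illustration in Python only, not the code in the getdb.* scripts; the function names and the example header are hypothetical.

```python
def sanitize_fasta_header(line):
    """Replace angle brackets with square brackets on a FASTA header line.

    Hypothetical helper: only lines starting with '>' are treated as headers.
    """
    if line.startswith(">"):
        # Keep the leading '>' that marks the header; rewrite the rest.
        return ">" + line[1:].replace("<", "[").replace(">", "]")
    # Sequence lines are returned unchanged.
    return line


def sanitize_fasta(lines):
    """Apply the header substitution to every line of a FASTA file."""
    return [sanitize_fasta_header(line) for line in lines]


if __name__ == "__main__":
    # Made-up entry, not taken from any real database.
    example = [">EXAMPLE1 Gene_Symbol=- <x> protein", "MKTAYIAKQR"]
    for out in sanitize_fasta(example):
        print(out)
```

Doing this before the search keeps the FASTA and the search results literally identical, at the cost of diverging from the database as distributed.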
Well, I'll go ahead and modify the mascot converter to emit proper XML for
proteins with reserved XML characters, but it does sound like folks would do
well to make that <> / [] substitution upstream from the search engines.
The fact that the EBI IPI site does the substitution confirms my suspicion
Unfortunately the offending entries are present in commonly used
public DBs. We recently bumped into exactly this problem, as there are
4 entries containing angle brackets in the IPI human v3.66 FASTA file:
IPI00465120 Gene_Symbol=- 3-HSD 1 protein
IPI00816409 Gene_Symbol=- V1 protein (Fragment)
IPI00816761 Ge
Yes, one would want to escape everything properly - happily there's a
library call for that. And certainly it's only right to emit valid XML.
But I do think that it might be wisest to sidestep the whole mess - it's
valid FASTA but also unconventional (based on many years of TPP not bumping
into t
What about the other reserved characters in XML that are valid in FASTA?
"
'
&
Not escaping could also break downstream software - especially with &
which should always begin an escape sequence. :(
-Matt
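
To make the point about all five reserved characters concrete, here is a minimal sketch using Python's standard library. It is not the Mascot2XML fix (that code is not shown in this thread); the function names and the sample description are made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > by default; quotes are passed as extra entities.
_QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
_QUOTE_REVERSE = {"&quot;": '"', "&apos;": "'"}


def escape_description(desc):
    """Escape all five XML reserved characters in a protein description."""
    return escape(desc, _QUOTE_ENTITIES)


def unescape_description(text):
    """Reverse escape_description, e.g. before comparing against the FASTA."""
    return unescape(text, _QUOTE_REVERSE)


if __name__ == "__main__":
    # Hypothetical description, not a real database entry.
    raw = "a <b> & \"c\" 'd'"
    escaped = escape_description(raw)
    print(escaped)
    print(unescape_description(escaped) == raw)
```

Because `&` is replaced first when escaping and last when unescaping, the round trip is lossless, so downstream tools that unescape before comparing can still match the original FASTA description.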
Brian Pratt wrote:
> Granted, this is a defect - but that's still an unfortunate choice o
Granted, this is a defect - but that's still an unfortunate choice of
characters. Even with the correction I can imagine this tripping up other
software downstream since the properly escaped XML would no longer match the
FASTA on a literal basis. I don't suppose your users could be induced to
use