Re: xerces translates entity characters(>...) automatically but I don't want to

Alberto Massari Tue, 21 Oct 2008 04:13:32 -0700

Hi Nahid,

there is no standalone tutorial for XMLFormatter, but you can look at how DOMWriterImpl (in 2.8) and DOMLSSerializerImpl (in 3.0) use it to understand its features.
In short:


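// Note: operator<< and formatBuf expect XMLCh strings; the narrow
// literals below are shorthand for the transcoded text.
LocalFileFormatTarget fileTarget("c:\\tmp\\tmp.xml");
XMLFormatter formatter("utf-8", &fileTarget);

// NoEscapes writes markup verbatim; AttrEscapes escapes an attribute value.
formatter << XMLFormatter::NoEscapes << "<root attr=\""
          << XMLFormatter::AttrEscapes << "value"
          << XMLFormatter::NoEscapes << "\">";

// CharEscapes escapes character data (element content).
formatter.formatBuf("literal", 7, XMLFormatter::CharEscapes);
formatter << XMLFormatter::NoEscapes << "</root>";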

Alberto

Nahid wrote:

Thank you, Alberto.
Could you point me to a tutorial for XMLFormatter?
-Nahid


Alberto Massari wrote:
Hi Nahid,
if you are directly writing the text to a final XML file, you can use an XMLFormatter object that takes care of the conversion in a more efficient manner (instead of scanning the input string 5 times and then reallocating it as you are doing now). Depending on the type of filtering you are doing, you could achieve better performance by using a grep-like tool, if what you are filtering is not XML-aware.
Alberto
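
To illustrate the point about scanning the input five times: a minimal sketch of a single-pass alternative in plain standard C++, independent of Xerces. The name escapeOnePass is hypothetical and is not part of any library; it is an assumption made for this example only. It builds the output in one left-to-right scan instead of running five find/replace passes over the same string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces): escapes the five XML special
// characters in a single left-to-right scan of the input.
std::string escapeOnePass(const std::string& in) {
  std::string out;
  // Reserve a little extra room up front; this is only a rough guess
  // meant to reduce reallocations while appending.
  out.reserve(in.size() + in.size() / 8);
  for (std::size_t i = 0; i < in.size(); ++i) {
    switch (in[i]) {
      case '&':  out += "&amp;";  break;
      case '<':  out += "&lt;";   break;
      case '>':  out += "&gt;";   break;
      case '"':  out += "&quot;"; break;
      case '\'': out += "&apos;"; break;
      default:   out += in[i];    break;
    }
  }
  return out;
}
```

Calling escapeOnePass("abc & cde") yields "abc &amp; cde". Because each input character is examined exactly once, the ordering problem of the multi-pass version (having to replace "&" before everything else) does not arise.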

Nahid wrote:
Hi Alberto,
Thanks for your reply. Actually, my program is a filter, so I'm just taking one XML file and outputting another.
So it would be better if I could output the text as it was, that is, &amp; instead of &.

Right now I'm using this function to convert it back before writing the output.

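#include <string>

// Escapes the five XML special characters. "&" must be handled first so
// that the ampersands introduced by later replacements are not re-escaped.
std::string escape(std::string str) {
  const std::string a[] = {"&", "<", ">", "\"", "'"};
  const std::string b[] = {"&amp;", "&lt;", "&gt;", "&quot;", "&apos;"};
  for (int i = 0; i < 5; i++) {
    std::size_t pos = 0;
    while ((pos = str.find(a[i], pos)) != std::string::npos) {
      str.replace(pos, a[i].length(), b[i]);
      pos += b[i].length();  // skip past the text just inserted
    }
  }
  return str;
}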

But the problem is, as I said before, that I have to parse a huge file (around 100 GB to 1 TB), so doing the conversion twice is costly (this function increases the running time by 10% :( ).

So it would be better if I could tell SAX2XMLReader not to convert &amp; to &, which would save the double conversion time.

Thank you again
-Nahid


Alberto Massari wrote:
Hi Nahid,
an XML document cannot contain a & character, as it has a special meaning (beginning of an entity reference); why do you need to see the raw text instead of its meaning?
Alberto

Nahid wrote:
Hi,
Before posting, I searched for a solution but couldn't find one. Maybe it has a trivial solution.
I'm using SAX2XMLReader to parse a huge XML file that contains entity references (&gt;, &lt;, etc.).
For example, "<title>abc &amp; cde</title>" is converted to "<title>abc & cde</title>". I don't want this automatic conversion; I just want the actual text.
Do you have any idea how I can do it?
Thanks
Regards
-Nahid
