Re: Need help with strings
Hi Ariel,

Regina Henschel wrote:
> Hi Ariel,
>
> thanks for your hints. It seems that the class OUString has the needed
> methods. But I need some time to test it.

It is still about the file trunk\main\starmath\source\smdetect.cxx

Problem in detail: The existing code has

    const sal_uInt16 nReadSize(4095);
    sal_Char aBuffer[nReadSize+1];
    pStrm->Seek( STREAM_SEEK_TO_BEGIN );
    const sal_uLong nBytesRead(pStrm->Read( aBuffer, nReadSize ));
    aBuffer[nBytesRead + 1] = 0;

If the stream is actually UTF-8 encoded, then

    OUString sFragment(aBuffer,nBytesRead,RTL_TEXTENCODING_UTF8);

gives a correct OUString, and my further ideas work. But if the stream is
actually UTF-16, the conversion fails. I can detect that the first two
elements of aBuffer are a BOM, but I don't know how to get an OUString
from aBuffer in that case. This

    OUString sFragment(aBuffer,nBytesRead,RTL_TEXTENCODING_UNICODE);

does not work.

Kind regards
Regina
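For the UTF-16 case described above, one possible approach is to skip the BOM and assemble the 16-bit code units by hand. The sketch below uses only standard C++ so that it compiles on its own; `decodeUtf16WithBom` is a hypothetical helper, not an AOO function. In smdetect.cxx the `sal_Char` buffer would need a cast to `const unsigned char*`, and the resulting code units could presumably be copied into an OUString through the constructor that takes a `const sal_Unicode*` and a length; that last step is an assumption and is not shown. The sketch works from the byte count alone, so it does not depend on the terminating 0 byte (in the quoted code, index nBytesRead + 1 can point one past the end of aBuffer when all 4095 bytes are read).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of AOO: turns raw bytes that start with a
// UTF-16 BOM into UTF-16 code units. Returns an empty string if the buffer
// does not start with a UTF-16 BOM.
std::u16string decodeUtf16WithBom(const unsigned char* pBytes, std::size_t nBytes)
{
    std::u16string aResult;
    if (nBytes < 2)
        return aResult;

    bool bLittleEndian;
    if (pBytes[0] == 0xFF && pBytes[1] == 0xFE)
        bLittleEndian = true;    // BOM FF FE: least significant byte first
    else if (pBytes[0] == 0xFE && pBytes[1] == 0xFF)
        bLittleEndian = false;   // BOM FE FF: most significant byte first
    else
        return aResult;          // no UTF-16 BOM found

    // Skip the BOM and combine byte pairs; a trailing odd byte is ignored.
    for (std::size_t i = 2; i + 1 < nBytes; i += 2)
    {
        const std::uint16_t nLow  = bLittleEndian ? pBytes[i] : pBytes[i + 1];
        const std::uint16_t nHigh = bLittleEndian ? pBytes[i + 1] : pBytes[i];
        aResult.push_back(static_cast<char16_t>((nHigh << 8) | nLow));
    }
    return aResult;
}
```

Because the buffer is filled with at most 4095 bytes, the read can end in the middle of a code unit or a surrogate pair. For a pure type detection that only searches for a few ASCII tag names this should not matter, but it would matter for a full conversion.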
Re: Need help with strings
Hi Dennis,

Dennis E. Hamilton wrote:
> Hi!
>
> You are digging into my favorite subject.

We talked about your interest, but with my small spare time, my progress
is slow.

> I am assuming you are talking about strings within the MathML and that
> it is in some form of XML. In that case:
>
> If it is XML, the encoding can be specified in the XML prologue.
> Sniffing for this prologue will determine such things as whether UTF8
> or UTF16, and big-endian or little-endian. If single-byte, that will
> usually mean some kind of code page which has a subset of ASCII as a
> common subset of a larger encoding, such as Western European. In that
> case, one can read the content of the prefix to see what it says,
> because it should be in a simple, pure ASCII form. Even if it is a
> double-byte character encoding, such as Shift-JIS, the prologue only
> needs the single-byte portions that are the same as ASCII.

MathML is XML. But because formulas are seldom used as stand-alone files,
users who e.g. copy and paste a formula from a website do not get a
complete file but only a fragment. Such fragments can be used via
Tools > Import Formula in module Math. That worked in OO1.1.5 (where
users needed to choose the filter themselves), and it works in LO, but it
currently fails in AOO.

> The default, however, depends on the MIME type of the XML file.
> Text/xml and application/xml have different defaults. Also, MIME types
> can have parameters that specify character sets.

If no BOM and no encoding is given, UTF-8 can be assumed. (I would need
to search for the correct reference for MathML 2, but see
http://www.w3.org/TR/2009/WD-MathML3-20090604/chapter6.html#world-int-transf-flavors,
last sentence.)

> The way Windows manages this also includes using a Unicode prefix on
> UTF8 (big-endian, I think). These are not uniformly used across
> platforms.

They are not even uniform across MS applications. The "Math Input
Control" produces UTF-16 and "Word" produces UTF-8. The parser can handle
both. I have tested it already.

> Internally, because ODF and AOO are Unicode based, it is necessary to
> translate all arriving text into Unicode for internal storage and use
> by the application. To do otherwise, lies madness. There are
> difficulties with this, because Unicode allows local specializations.
> This comes up in craziness around Symbol fonts that do not have common
> Unicode correspondence. (Bullets in AOO have this disease.)

There is no problem in this aspect. I only need to examine the input
stream to see whether it can be used with the smath filter.

> I have probably provided more information than you require. I love
> this subject.

Me too.

> I have not looked at your code.

No need to spend your time on it now. I have attached it only to show
what kind of methods I need.

Kind regards
Regina
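The rule discussed above (use the BOM if there is one, otherwise assume UTF-8) can be sketched in a few lines. This is only an illustration in standard C++; `StreamEncoding`, `BomResult` and `sniffBom` are hypothetical names and do not exist in AOO. The byte values are the standard Unicode BOM encodings: EF BB BF for UTF-8, FF FE for UTF-16 little-endian, FE FF for UTF-16 big-endian.

```cpp
#include <cstddef>

// Hypothetical names for illustration only.
enum class StreamEncoding { Utf8, Utf16LittleEndian, Utf16BigEndian };

struct BomResult
{
    StreamEncoding eEncoding;   // encoding indicated by the BOM, or the default
    std::size_t    nBomLength;  // number of bytes to skip before the content
};

// Looks at the first bytes of a buffer and reports the encoding that the
// BOM indicates. Without a BOM the function falls back to UTF-8.
BomResult sniffBom(const unsigned char* pBytes, std::size_t nBytes)
{
    if (nBytes >= 3 && pBytes[0] == 0xEF && pBytes[1] == 0xBB && pBytes[2] == 0xBF)
        return { StreamEncoding::Utf8, 3 };
    if (nBytes >= 2 && pBytes[0] == 0xFF && pBytes[1] == 0xFE)
        return { StreamEncoding::Utf16LittleEndian, 2 };
    if (nBytes >= 2 && pBytes[0] == 0xFE && pBytes[1] == 0xFF)
        return { StreamEncoding::Utf16BigEndian, 2 };
    return { StreamEncoding::Utf8, 0 };   // no BOM: assume UTF-8
}
```

The sketch ignores UTF-32; a UTF-32 little-endian BOM (FF FE 00 00) would be reported as UTF-16 little-endian. It also ignores an encoding named in an XML declaration, which would have to be checked separately.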
Re: Need help with strings
Hi Ariel,

thanks for your hints. It seems that the class OUString has the needed
methods. But I need some time to test it.

Kind regards
Regina
RE: Need help with strings
Hi!

You are digging into my favorite subject.

I am assuming you are talking about strings within the MathML and that it
is in some form of XML. In that case:

If it is XML, the encoding can be specified in the XML prologue. Sniffing
for this prologue will determine such things as whether UTF8 or UTF16,
and big-endian or little-endian. If single-byte, that will usually mean
some kind of code page which has a subset of ASCII as a common subset of
a larger encoding, such as Western European. In that case, one can read
the content of the prefix to see what it says, because it should be in a
simple, pure ASCII form. Even if it is a double-byte character encoding,
such as Shift-JIS, the prologue only needs the single-byte portions that
are the same as ASCII.

The default, however, depends on the MIME type of the XML file. Text/xml
and application/xml have different defaults. Also, MIME types can have
parameters that specify character sets.

The way Windows manages this also includes using a Unicode prefix on UTF8
(big-endian, I think). These are not uniformly used across platforms.

Internally, because ODF and AOO are Unicode based, it is necessary to
translate all arriving text into Unicode for internal storage and use by
the application. To do otherwise, lies madness. There are difficulties
with this, because Unicode allows local specializations. This comes up in
craziness around Symbol fonts that do not have common Unicode
correspondence. (Bullets in AOO have this disease.)

I have probably provided more information than you require. I love this
subject.

I have not looked at your code.

 - Dennis

PS: The default representation of XML inside OOXML is UTF16 as I recall.
I could be mistaken.

-----Original Message-----
From: Regina Henschel [mailto:rb.hensc...@t-online.de]
Sent: Wednesday, April 8, 2015 12:02
To: AOO dev
Subject: Need help with strings

Hi all,

I'm going to improve the MathML type detection. Currently there are files
that could be opened or imported fine if the type detection allowed it.
https://bz.apache.org/ooo/show_bug.cgi?id=126230

I have attached a C++ file to show what I want to do.
The problem is that MathML does not need to be encoded in utf-8 but can
have any other encoding. For example, MS Windows "Math Input Control"
exports formulas in utf-16.

So my question is, which kind of string can I use that is able to
detect/use utf-16 and has the needed methods similar to the C++ string
methods find, rfind, insert, substring, clear, erase? Does AOO have such
a kind of string?

It is possible to get the encoding from the MathML file or set default
utf-8, in case that information is needed to instantiate a string object.

Kind regards
Regina
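The prologue sniffing described above can be sketched as follows for encodings whose first bytes are ASCII-compatible. The code is standard C++ and only an illustration; `declaredXmlEncoding` is a hypothetical helper, not an AOO or XML parser API, and it does no validation of the declaration beyond locating the quoted value after the word `encoding`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the value of the encoding pseudo-attribute of
// an XML declaration at the very start of rPrefix, or an empty string if
// there is no declaration or no encoding in it.
std::string declaredXmlEncoding(const std::string& rPrefix)
{
    // The declaration must be the first thing in the document.
    if (rPrefix.compare(0, 5, "<?xml") != 0)
        return std::string();

    const std::size_t nEnd = rPrefix.find("?>");
    if (nEnd == std::string::npos)
        return std::string();
    const std::string aDecl = rPrefix.substr(0, nEnd);

    std::size_t nPos = aDecl.find("encoding");
    if (nPos == std::string::npos)
        return std::string();

    // The value is enclosed in single or double quotes.
    nPos = aDecl.find_first_of("\"'", nPos);
    if (nPos == std::string::npos)
        return std::string();
    const char cQuote = aDecl[nPos];
    const std::size_t nClose = aDecl.find(cQuote, nPos + 1);
    if (nClose == std::string::npos)
        return std::string();

    return aDecl.substr(nPos + 1, nClose - nPos - 1);
}
```

A UTF-16 stream has to be recognized before this step (for example by its BOM), because its declaration is not readable as single bytes. An empty result means that the caller has to fall back to a default or to some other detection.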
Re: Need help with strings
Hello Regina,

On Wed, Apr 08, 2015 at 09:02:06PM +0200, Regina Henschel wrote:
> Hi all,
>
> I'm going to improve the MathML type detection. Currently there are files
> that could be opened or imported fine if the type detection allowed it.
> https://bz.apache.org/ooo/show_bug.cgi?id=126230
>
> I have attached a C++ file to show what I want to do.
> The problem is that MathML does not need to be encoded in utf-8 but can
> have any other encoding. For example, MS Windows "Math Input Control"
> exports formulas in utf-16.
>
> So my question is, which kind of string can I use that is able to
> detect/use utf-16 and has the needed methods similar to the C++ string
> methods find, rfind, insert, substring, clear, erase? Does AOO have such
> a kind of string?

You can use OpenOffice's rtl string and string buffer classes, together
with the lower-level text conversion from
https://www.openoffice.org/api/docs/cpp/ref/names/o-textcvt.h.html

> It is possible to get the encoding from the MathML file or set default
> utf-8, in case that information is needed to instantiate a string
> object.

If the file has no information about its encoding, you will have to
perform some kind of encoding detection; see Writer's ASCII filter for
example:

bool SwIoSystem::IsDetectableText
main/sw/source/filter/basflt/iodetect.cxx

used in

sal_uLong SwASCIIParser::ReadChars()
main/sw/source/filter/ascii/parasc.cxx

Searching for rtl_convertTextToUnicode in OpenGrok might give other
useful hints.

Regards
-- 
Ariel Constenla-Haile
La Plata, Argentina
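As a rough illustration of what "some kind of encoding detection" can mean for a MathML fragment without BOM and without XML declaration, the sketch below looks at the first two bytes and uses the fact that the fragment is expected to start with `<`. This is not what SwIoSystem::IsDetectableText does; it is a much simpler heuristic in the spirit of the autodetection appendix of the XML specification. `GuessedEncoding` and `guessFromFirstTag` are hypothetical names, and the code is standard C++ rather than AOO code.

```cpp
#include <cstddef>

// Hypothetical names for illustration only.
enum class GuessedEncoding
{
    Unknown,
    Utf16LittleEndian,
    Utf16BigEndian,
    AsciiCompatible   // UTF-8 or another encoding with ASCII as a subset
};

// Guesses the encoding family of markup that starts directly with '<'.
GuessedEncoding guessFromFirstTag(const unsigned char* pBytes, std::size_t nBytes)
{
    if (nBytes < 2)
        return GuessedEncoding::Unknown;
    if (pBytes[0] == '<' && pBytes[1] == 0x00)
        return GuessedEncoding::Utf16LittleEndian;   // 3C 00
    if (pBytes[0] == 0x00 && pBytes[1] == '<')
        return GuessedEncoding::Utf16BigEndian;      // 00 3C
    if (pBytes[0] == '<')
        return GuessedEncoding::AsciiCompatible;     // '<' plus a non-zero byte
    return GuessedEncoding::Unknown;
}
```

Leading whitespace or a BOM in front of the first `<` is not handled here; a real detection would have to skip or evaluate those bytes first.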