Re: Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)
Mark,

Thank you so much
Re: Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)
On 12/28/17 2:42 PM, Mark Wieder via use-livecode wrote:
> On 12/28/2017 09:45 AM, Mark Waddingham via use-livecode wrote:
>> 3) The char chunks are old-style (pre-5.5) byte indices not codeunit indices
>>
>> The crux of the problem Paul is having comes down to (3) which has some background to explain.
>
> OMG! This is what Mr. Waddingham comes up with while on break!?

He can't help it, it's the only way his brain works... For which we are all grateful.

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
Re: Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)
On 12/28/2017 09:45 AM, Mark Waddingham via use-livecode wrote:
> 3) The char chunks are old-style (pre-5.5) byte indices not codeunit indices
>
> The crux of the problem Paul is having comes down to (3) which has some background to explain.

OMG! This is what Mr. Waddingham comes up with while on break!?

--
Mark Wieder
ahsoftw...@gmail.com
Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)
On 2017-12-19 19:43, Mark Waddingham via use-livecode wrote:
> I'm pretty sure it would be possible to write a handler which takes the styledText array of a field in 6.7.11 and a list of old indices, returning a list of new char indices... Would that help?

Paul expressed an interest in how this might work - and he provided some more background:

-*-

Our main application, HyperRESEARCH, a tool for academics and others doing qualitative research, relies completely on chunk ranges. It is essentially a bookmarking tool where users can select some content from a document, the character position (chunk) is grabbed and the user gives it a text label and HyperRESEARCH remembers that label "Early Childhood Behavior X" points to char S to T of document "ABC". All documents, native text, unicode (utf8 or utf16), rtf, docx, odt, etc. are read into a LiveCode field, from which the selection is made and the chunk obtained. HyperRESEARCH saves a "Study" file that contains a LOT of these labels, chunks and document names.

As part of our migration from LC464, which is what the current release of HyperRESEARCH is based on, we needed a way to convert a character range created under LC4.6.4 to a range under LC6.7.11 that points to the exact same string for the same file. Curry Kenworthy, whose libraries we license for reading MS-Word and Open Office documents, built a library based on an algorithm I came up with to send the original LC464 ranges to a helper application using sockets or IPC. The helper application retrieves the strings associated with each chunk, strips white space and passes the string back to the LC6.7.11 version of the main app, which then finds the whitespace-stripped strings in the same file loaded under LC6.7.11 with an indexing mechanism to adjust the positions for the stripped whitespace. It is a bit complicated, but it works reliably.

-*-

From this I infer the following:

1) The study file is a list of triples - label, char chunk, document filename

2) When using the study file, the original document is loaded into a field, and the char chunks are used to display labels which the user can jump to.

3) The char chunks are old-style (pre-5.5) byte indices not codeunit indices

The crux of the problem Paul is having comes down to (3) which has some background to explain.

Before 7.0, the field was the only part of the engine which naturally handled Unicode. In these older versions the field stored text as a mixed sequence of style runs of either single bytes (native text) or double bytes (unicode text).

Between 5.5 and 7.0, indices used when referencing chars in fields corresponded to codeunits - this meant that the indices were independent of the encoding of the runs in the field. In this case char N referred to the Nth codeunit in the field, whether the text up to that point was all unicode, all native or a mixture of both.

Before 5.5, indices used when referencing chars in fields corresponded to bytes - this meant that you had to take into account the encoding of the runs in the field. In this case, char N referred to the Nth byte in the field. So if your field had:

   abcXYZabc (where XYZ are two byte unicode chars)

Then char 4 would refer to the first byte of the X unicode char and *not* the two bytes it would have actually taken up.

Now, importantly, the internal structure of the field did not change between 4.6.4 and 5.5, just how the 'char' chunk worked - in 6.7.11, the internal structure of the field is still the mixed runs of unicode/native bytes just as it was in 4.6.4 - the only difference is what happens if you reference char X to Y of the field.

So solving this problem comes down to finding a means to 'get at' the internal encoding style runs of a field in 6.7.11.
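
As a rough illustration of the two indexing schemes described above, here is a small model written in Python rather than LiveCode script. It is only a sketch under stated assumptions: the field is represented as a list of (text, is_unicode) runs, how those runs would be obtained from a real 6.7.11 field is left out, and the names byte_to_codeunit and map_byte_range are hypothetical, not engine or library calls.

```python
from typing import List, Tuple

# Assumed model of a field: a list of style runs, each either
# native (1 byte per char) or unicode (2 bytes per UTF-16 codeunit).
Run = Tuple[str, bool]  # (text, is_unicode)


def byte_to_codeunit(runs: List[Run], byte_index: int) -> int:
    """Map a 1-based pre-5.5 byte index to a 1-based codeunit index."""
    bytes_seen = 0   # bytes covered by the runs already walked
    units_seen = 0   # codeunits covered by the runs already walked
    for text, is_unicode in runs:
        width = 2 if is_unicode else 1
        # Count UTF-16 codeunits for unicode runs, chars for native runs.
        run_units = len(text.encode("utf-16-le")) // 2 if is_unicode else len(text)
        run_bytes = run_units * width
        if byte_index <= bytes_seen + run_bytes:
            offset = byte_index - bytes_seen - 1   # 0-based byte offset in run
            return units_seen + offset // width + 1
        bytes_seen += run_bytes
        units_seen += run_units
    raise IndexError("byte index lies beyond the end of the field")


def map_byte_range(runs: List[Run], byte_from: int, byte_to: int) -> Tuple[int, int]:
    """Map an old byte range to the codeunit range covering the same text."""
    return byte_to_codeunit(runs, byte_from), byte_to_codeunit(runs, byte_to)


# The abcXYZabc example, with three Cyrillic letters standing in for XYZ.
field_runs = [("abc", False), ("ЛОР", True), ("abc", False)]
print(map_byte_range(field_runs, 4, 9))    # bytes 4..9 are the unicode run
print(map_byte_range(field_runs, 10, 12))  # bytes 10..12 are the final "abc"
```

In this model, old-style bytes 4 to 9 (the three two-byte chars) come out as codeunits 4 to 6, and bytes 10 to 12 come out as codeunits 7 to 9. A byte index that falls on the second byte of a two-byte char is mapped to the char containing it, which is a design choice of this sketch rather than anything stated in the thread.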

We want a handler:

   mapByteRangeToCharRange(pFieldId, pByteFrom, pByteTo)

Returning a pair pCharFrom, pCharTo - where pByteFrom, pByteTo are a char X to Y range from 4.6.4 and pCharFrom, pCharTo are a char X to Y range *for the same range* in 6.7.11.

-*-

Before going into the details, an easy way to see the internal mixed encoding of a field containing unicode in 6.7.11 is to put some text which is a mixture of native text and unicode text in a field and then look at its 'text' property. Putting:

   Лорем ипсум Lorem ipsum dolor sit amet, pr долор сит амет, вел татион игнота сцрибентур еи. Вих еа феугиат doctus necessitatibus ассентиор пхилосопхиа. Феугаитconsulatu disputando comprehensam вивендум вис ет, мел еррем малорум ат. Хас но видерер лобортис, suscipit detraxit interesset eum аппетере инсоленс салутатус усу не. Еи дуо лудус яуаеяуе, ет елитр цорпора пер.

Into a 6.7.11 field and then doing 'put the text of field 1' gives:

   ? ? Lorem ipsum dolor sit amet, pr ? ??? , ??? ?? ?? ?? ??. ??? ?? ??? doctus necessitatibus ? ???. ???consulatu disputando comprehensam ??? ??, ??? ? ??? ??. ??? ??