Re: Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)

2017-12-28 Thread Paul Dupuis via use-livecode
Mark,

Thank you so much


On 12/28/2017 12:45 PM, Mark Waddingham via use-livecode wrote:
> On 2017-12-19 19:43, Mark Waddingham via use-livecode wrote:
>> I'm pretty sure it would be possible to write a handler which takes
>> the styledText array of a field in 6.7.11 and a list of old indices,
>> returning a list of new char indices... Would that help?
>
> Paul expressed an interest in how this might work - and he provided
> some more background:
>
> -*-
>
> Our main application, HyperRESEARCH, a tool for academics and others
> doing qualitative research, relies completely on chunk ranges. It is
> essentially a bookmarking tool where users can select some content from
> a document, the character position (chunk) is grabbed and the user gives
> it a text label and HyperRESEARCH remembers that label "Early Childhood
> Behavior X" points to char S to T of document "ABC". All documents,
> native text, unicode (utf8 or utf16), rtf, docx, odt, etc. are read into
> a LiveCode field, from which the selection is made and the chunk
> obtained. HyperRESEARCH saves a "Study" file that contains a LOT of
> these labels and chunks and document names.
>
> As part of our migration from LC464, which is what the current release
> of HyperRESEARCH is based on, we needed a way to convert a character
> range created under LC4.6.4 to a range under LC6.7.11 that points to the
> exact same string for the same file. Curry Kenworthy, whose libraries we
> license for reading MS-Word and Open Office documents, built a library
> based on an algorithm I came up with to send the original LC464 ranges
> to a helper application using sockets or IPC. The helper application
> retrieves the strings associated with each chunk, strips white space and
> passes the string back to the LC6.7.11 version of the main app, which
> then finds the whitespace stripped strings in the same file loaded under
> LC6.7.11 with an indexing mechanism to adjust the positions for the
> stripped whitespace. It is a bit complicated, but it works reliably.
>
> -*-
>
> From this I infer the following:
>
> 1) The study file is a list of triples - label, char chunk, document
> filename
>
> 2) When using the study file, the original document is loaded into a
> field, and the char chunks are used to display labels which the user
> can jump to.
>
> 3) The char chunks are old-style (pre-5.5) byte indices not codeunit
> indices
>
> The crux of the problem Paul is having comes down to (3) which has
> some background to explain.
>
> Before 7.0, the field was the only part of the engine which naturally
> handled Unicode. In these older versions the field stored text as a
> mixed sequence of style runs of either single bytes (native text) or
> double bytes (unicode text).
>
> Between 5.5 and 7.0, indices used when referencing chars in fields
> corresponded to codeunits - this meant that the indices were
> independent of the encoding of the runs in the field. In this case
> char N referred to the Nth codeunit in the field, whether the text up
> to that point was all unicode, all native or a mixture of both.
>
> Before 5.5, indices used when referencing chars in fields
> corresponded to bytes - this meant that you had to take into account
> the encoding of the runs in the field. In this case, char N referred
> to the Nth byte in the field. So if your field had:
>
>  abcXYZabc (where XYZ are two byte unicode chars)
>
> Then char 4 would refer to the first byte of the X unicode char and
> *not* to both of the two bytes that X actually occupies.
>
> Now, importantly, the internal structure of the field did not change
> between 4.6.4 and 5.5, just how the 'char' chunk worked - in 6.7.11,
> the internal structure of the field is still the mixed runs of
> unicode/native bytes just as it was in 4.6.4 - the only difference is
> what happens if you reference char X to Y of the field.
>
> So solving this problem comes down to finding a means to 'get at' the
> internal encoding style runs of a field in 6.7.11. We want a handler:
>
>   mapByteRangeToCharRange(pFieldId, pByteFrom, pByteTo)
>
> Returning a pair pCharFrom, pCharTo - where pByteFrom, pByteTo are a
> char X to Y range from 4.6.4 and pCharFrom, pCharTo are a char X to Y
> range *for the same range* in 6.7.11.
>
> -*-
>
> Before going into the details, an easy way to see the internal mixed
> encoding of a field containing unicode in 6.7.11 is to put some text
> which is a mixture of native text and unicode text in a field and then
> look at its 'text' property. Putting:
>
> Лорем ипсум Lorem ipsum dolor sit amet, pr долор сит амет, вел татион
> игнота сцрибентур еи. Вих еа феугиат doctus necessitatibus ассентиор
> пхилосопхиа. Феугаитconsulatu disputando comprehensam  вивендум вис
> ет, мел еррем малорум ат. Хас но видерер лобортис, suscipit detraxit
> interesset eum аппетере инсоленс салутатус усу не. Еи дуо лудус
> яуаеяуе, ет елитр цорпора пер.
>
> Into a 6.7.11 field and then doing 'put the text of field 1' gives:
>
> ? ???

Re: Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)

2017-12-28 Thread J. Landman Gay via use-livecode

On 12/28/17 2:42 PM, Mark Wieder via use-livecode wrote:
> On 12/28/2017 09:45 AM, Mark Waddingham via use-livecode wrote:
>> 3) The char chunks are old-style (pre-5.5) byte indices not codeunit
>> indices
>>
>> The crux of the problem Paul is having comes down to (3) which has
>> some background to explain.
>
> OMG! This is what Mr. Waddingham comes up with while on break!?


He can't help it, it's the only way his brain works... For which we are 
all grateful.


--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software   | http://www.hyperactivesw.com

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)

2017-12-28 Thread Mark Wieder via use-livecode

On 12/28/2017 09:45 AM, Mark Waddingham via use-livecode wrote:
> 3) The char chunks are old-style (pre-5.5) byte indices not codeunit
> indices
>
> The crux of the problem Paul is having comes down to (3) which has some
> background to explain.

OMG! This is what Mr. Waddingham comes up with while on break!?

--
 Mark Wieder
 ahsoftw...@gmail.com


Mapping Field Text Ranges (was Re: Interprocess Communication (IPC) under OSX)

2017-12-28 Thread Mark Waddingham via use-livecode

On 2017-12-19 19:43, Mark Waddingham via use-livecode wrote:
> I'm pretty sure it would be possible to write a handler which takes
> the styledText array of a field in 6.7.11 and a list of old indices,
> returning a list of new char indices... Would that help?

Paul expressed an interest in how this might work - and he provided some 
more background:


-*-

Our main application, HyperRESEARCH, a tool for academics and others
doing qualitative research, relies completely on chunk ranges. It is
essentially a bookmarking tool where users can select some content from
a document, the character position (chunk) is grabbed and the user gives
it a text label and HyperRESEARCH remembers that label "Early Childhood
Behavior X" points to char S to T of document "ABC". All documents,
native text, unicode (utf8 or utf16), rtf, docx, odt, etc. are read into
a LiveCode field, from which the selection is made and the chunk
obtained. HyperRESEARCH saves a "Study" file that contains a LOT of
these labels and chunks and document names.

As part of our migration from LC464, which is what the current release
of HyperRESEARCH is based on, we needed a way to convert a character
range created under LC4.6.4 to a range under LC6.7.11 that points to the
exact same string for the same file. Curry Kenworthy, whose libraries we
license for reading MS-Word and Open Office documents, built a library
based on an algorithm I came up with to send the original LC464 ranges
to a helper application using sockets or IPC. The helper application
retrieves the strings associated with each chunk, strips white space and
passes the string back to the LC6.7.11 version of the main app, which
then finds the whitespace stripped strings in the same file loaded under
LC6.7.11 with an indexing mechanism to adjust the positions for the
stripped whitespace. It is a bit complicated, but it works reliably.

-*-

From this I infer the following:

1) The study file is a list of triples - label, char chunk, document 
filename


2) When using the study file, the original document is loaded into a 
field, and the char chunks are used to display labels which the user can 
jump to.


3) The char chunks are old-style (pre-5.5) byte indices not codeunit
indices


The crux of the problem Paul is having comes down to (3) which has some 
background to explain.


Before 7.0, the field was the only part of the engine which naturally
handled Unicode. In these older versions the field stored text as a
mixed sequence of style runs of either single bytes (native text) or
double bytes (unicode text).


Between 5.5 and 7.0, indices used when referencing chars in fields
corresponded to codeunits - this meant that the indices were
independent of the encoding of the runs in the field. In this case char
N referred to the Nth codeunit in the field, whether the text up to that
point was all unicode, all native or a mixture of both.


Before 5.5, indices used when referencing chars in fields corresponded
to bytes - this meant that you had to take into account the encoding of 
the runs in the field. In this case, char N referred to the Nth byte in 
the field. So if your field had:


 abcXYZabc (where XYZ are two byte unicode chars)

Then char 4 would refer to the first byte of the X unicode char and
*not* to both of the two bytes that X actually occupies.
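
To make the difference concrete, here is a small sketch in Python (not LiveCode) that models the field as a list of (encoding, text) runs and works out which bytes each codeunit occupies. The run list and the helper name byte_spans are illustrative assumptions, not engine internals; X, Y and Z are ASCII placeholders standing in for two-byte unicode chars, and every char is assumed to be a single codeunit.

```python
# Hypothetical model of a pre-7.0 field: an ordered list of (encoding, text) runs.
# "native" runs take 1 byte per char, "unicode" runs take 2 bytes per codeunit.
BYTES_PER_UNIT = {"native": 1, "unicode": 2}

def byte_spans(runs):
    """Return one (first_byte, last_byte) pair per codeunit, 1-based inclusive."""
    spans = []
    next_byte = 1
    for encoding, text in runs:
        width = BYTES_PER_UNIT[encoding]
        for _ in text:  # assumes one codeunit per char (no surrogate pairs)
            spans.append((next_byte, next_byte + width - 1))
            next_byte += width
    return spans

# abcXYZabc, where XYZ stand for two-byte unicode chars
runs = [("native", "abc"), ("unicode", "XYZ"), ("native", "abc")]
spans = byte_spans(runs)

for index, span in enumerate(spans, start=1):
    print("codeunit", index, "occupies bytes", span)
```

Under this model the field has 9 codeunits but 12 bytes, so a pre-5.5 'char 4' (a byte index) lands on the first byte of X, while a 5.5+ 'char 4' (a codeunit index) covers both of X's bytes.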


Now, importantly, the internal structure of the field did not change 
between 4.6.4 and 5.5, just how the 'char' chunk worked - in 6.7.11, the 
internal structure of the field is still the mixed runs of 
unicode/native bytes just as it was in 4.6.4 - the only difference is 
what happens if you reference char X to Y of the field.


So solving this problem comes down to finding a means to 'get at' the 
internal encoding style runs of a field in 6.7.11. We want a handler:


  mapByteRangeToCharRange(pFieldId, pByteFrom, pByteTo)

Returning a pair pCharFrom, pCharTo - where pByteFrom, pByteTo are a 
char X to Y range from 4.6.4 and pCharFrom, pCharTo are a char X to Y 
range *for the same range* in 6.7.11.
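
As a rough illustration of what such a handler has to compute, here is a sketch in Python rather than LiveCode. It assumes the field's encoding runs are already available as a list of (encoding, text) pairs, which is a stand-in for whatever the field actually exposes; the function name mirrors the handler above, but the run model, the 1-based inclusive indexing and the error handling are all assumptions made for the example.

```python
# Hypothetical run model: "native" = 1 byte per char, "unicode" = 2 bytes per codeunit.
BYTES_PER_UNIT = {"native": 1, "unicode": 2}

def map_byte_range_to_char_range(runs, byte_from, byte_to):
    """Map a 1-based inclusive byte range to a 1-based inclusive codeunit range."""

    def byte_to_char(byte_index):
        if byte_index < 1:
            raise ValueError("byte index must be >= 1")
        run_first_byte = 1  # byte index at which the current run starts
        run_first_char = 1  # codeunit index at which the current run starts
        for encoding, text in runs:
            width = BYTES_PER_UNIT[encoding]
            run_bytes = len(text) * width  # assumes one codeunit per char
            if byte_index < run_first_byte + run_bytes:
                # The byte lies in this run: step forward by whole codeunits.
                return run_first_char + (byte_index - run_first_byte) // width
            run_first_byte += run_bytes
            run_first_char += len(text)
        raise ValueError("byte index past end of field")

    return byte_to_char(byte_from), byte_to_char(byte_to)

# abcXYZabc, where XYZ stand for two-byte unicode chars
runs = [("native", "abc"), ("unicode", "XYZ"), ("native", "abc")]
print(map_byte_range_to_char_range(runs, 4, 9))    # the bytes of XYZ
print(map_byte_range_to_char_range(runs, 10, 12))  # the trailing abc
```

In this sketch a byte index that falls in the middle of a two-byte char maps to the codeunit containing it, which is one reasonable choice for old ranges that start or end on either byte of a unicode char.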


-*-

Before going into the details, an easy way to see the internal mixed
encoding of a field containing unicode in 6.7.11 is to put some text
which is a mixture of native text and unicode text in a field and then
look at its 'text' property. Putting:


Лорем ипсум Lorem ipsum dolor sit amet, pr долор сит амет, вел татион 
игнота сцрибентур еи. Вих еа феугиат doctus necessitatibus ассентиор 
пхилосопхиа. Феугаитconsulatu disputando comprehensam  вивендум вис ет, 
мел еррем малорум ат. Хас но видерер лобортис, suscipit detraxit 
interesset eum аппетере инсоленс салутатус усу не. Еи дуо лудус яуаеяуе, 
ет елитр цорпора пер.


Into a 6.7.11 field and then doing 'put the text of field 1' gives:

? ? Lorem ipsum dolor sit amet, pr ? ??? , ??? ?? 
?? ?? ??. ??? ?? ??? doctus necessitatibus ? 
???. ???consulatu disputando comprehensam   ??? ??, 
??? ? ??? ??. ??? ??