Sivakatirswami <[EMAIL PROTECTED]> wrote:
>
> This is a "yes it's doable" or "No forget it" query
> 
> I am facing a challenge with some very large QuarkXPress documents that need
> to be repurposed for "simple" text distribution, Web distribution, or
> MC eBook distribution, where we want to remove all non-cross-platform
> "diacriticalized" characters,
> 
> e.g. 
>  "jñâna" should be come "jnana" and
> Íiva becomes Siva
> 
> etc. 
> 
> the current "Batch" finds and replace extension we are using for Quark
> Express "chokes" on large files (10-23 megabyte files)  where we are trying
> to get it to find some 45 plus different characters or strings and replace
> them with 45 cross platform characters/strings (I should learn the proper
> name for the different Key sets)

The character chooser stack should help you at least learn
what works cross-platform and what doesn't.  If you're really
interested in esoterica like the ISO character names, I've got
the original tables the character chooser was built with, which
have this information in them.  I could email them to you if you
want.

> and the projected number of instances will
> reach 5,000-6,000 changes... conceivably, if we can do it, I can work with
> smaller files.

My first reaction is: Big files!  Lots of changes!

(snip)

> a) Can you do this on binary "native" application files like a QuarkXPress
> document, without corrupting the original document? I envision:
> 
> put URL ("binary:whatever path") into tBlob
> 
> repeat here: replace "old char-string" with "new char-string"

I would guess no.  It depends on the exact format, though.  If
it were something like HTML or RTF, this should work OK, but not
if the data has markup in binary format, because you'll end up
changing some of the markup when you only want to change the
data.
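
For the text-based case the whole job is only a few lines.
Something like this (untested; the file names and the two
sample replacements are just placeholders, and "binfile:"
rather than "file:" keeps the engine from translating line
endings behind your back):

  put URL "binfile:cleanme.txt" into tData
  replace "ñ" with "n" in tData
  replace "Í" with "S" in tData
  put tData into URL "binfile:cleanme-out.txt"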

> b) If the answer is yes, then is there any performance gain in loading the
> strings into a custom property array rather than simply loading them into a
> field with a tab or comma separator? Us low-level scripters struggle with
> some of the higher-level options (we just don't know what they are...)

Loading them into a variable and doing the work there would
be the most efficient way.  If the files are close in size
to available RAM, you'll want to do the work in a loop with
read/replace/write, working on a MB or two at a time.
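
Sketched out (untested, and assuming single-character
replacements: a multi-character search string could straddle
a chunk boundary and be missed):

  on cleanFile tInPath, tOutPath
    open file tInPath for binary read
    open file tOutPath for binary write
    repeat
      read from file tInPath for 1048576 -- 1 MB at a time
      if it is empty then exit repeat
      put it into tChunk
      replace "ñ" with "n" in tChunk
      replace "Í" with "S" in tChunk
      -- ...the rest of the 45 replacements go here
      write tChunk to file tOutPath
    end repeat
    close file tInPath
    close file tOutPath
  end cleanFile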

> c) What is the best way to "handle the data": read in a portion, run the
> replace on it, and write out to a new variable or file? Or can MC "crunch"
> the entire file in RAM, all 6000 changes, and simply write the file back
> out to disk?

It can, if you've got the RAM ;-)
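
If you do, you could drive all 45 replacements from a little
table, one old/new pair per line, tab-separated.  Untested,
and the file names are made up:

  put URL "file:pairs.txt" into tPairs
  put URL "binfile:bigfile" into tData
  set the itemDelimiter to tab
  repeat for each line tPair in tPairs
    replace (item 1 of tPair) with (item 2 of tPair) in tData
  end repeat
  put tData into URL "binfile:bigfile-clean"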

> d) The question is: can you simply enter the strings into a field, or is a
> charToNum conversion needed first?

You don't want to do that: putting stuff in fields is much
slower and takes a lot more RAM than just dealing with it
in variables.  And if possible, avoid using charToNum or
anything that works a character at a time.  The replace
command is *much* faster.
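   One place numToChar does earn its keep is specifying a
character you can't easily type into a script.  Assuming Mac
Roman encoding (where 150 is the code for "ñ"), something like

  replace numToChar(150) with "n" in tData

is a one-time lookup, not a character-at-a-time loop.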
   Scott


> 
> Hinduism Today
> 
> Sivakatirswami
> Editor's Assistant/Production Manager
> [EMAIL PROTECTED] 
> www.HinduismToday.com, www.HimalayanAcademy.com,
> www.Gurudeva.org, www.hindu.org

