Hi Brandon,

On Tue, 14 Feb 2012 12:12:07 -0500
Brandon McCaig <bamcc...@gmail.com> wrote:

> Hello:
> 
> tl;dr: A program indended to split a large MS SQL script into
> smaller ones appears to be losing data. See below for the script.
> I guess that maybe I'm making a silly mistake with newlines or
> file encoding or something..
> 
> My colleague is working with a relatively large generated SQL
> script that Visual Studio doesn't seem to be able to handle. The
> script is approximately 370 MB[1] in UTF-16 Little Endian (the
> default encoding of whatever generated the script; likely SQL
> Server Management Studio). Converted to UTF-8 (using Vim) it's
> unsurprisingly about half of that[2].
> 
> There are various ways to solve this. My colleagues decided to
> manually split the file up into manageable pieces, but that
> seemed tedious to me. I opted to hack up a Perl program to split
> it up for me. Initially (perhaps with insufficient thought) I
> decided to split it up into "batches". I don't really know the
> details, but MS SQL has a batch keyword, by default 'GO', that it
> uses to split scripts up. I don't really know what it means or
> why it's used, but I know that it is necessary at times.. I
> figured this was the most reliable boundary to split on so the
> scripts would still have a hope of being executable without
> errors. Anyway, within 30 minutes or so I had written a Perl
> program to split on this boundary (there are probably easier
> ways, in hindsight, albeit from Windows those ways might not be
> as easy as on Unix-like platforms).
> 
> That resulted in some 12000+ files, which was obviously not very
> practical to be manually executed by a human. It might suffice to
> automatically execute the batches one by one, but I have
> practically no experience communicating with DBMSes with Perl (or
> with that much SQL at a time) and I wouldn't be able to convince
> my colleagues to give me a chance to try. It could also
> potentially be dangerous if I get something wrong so I don't know
> if I'd really want to rush such a solution anyway, but I digress.
> 
> In order to cut the number of files down in size I added a size
> parameter to the program and modified it to bundle "batches" into
> files up to that maximum size each. The program appears to work
> at first glance, producing approximately the right number of
> files of approximately the right sizes, and terminating on batch
> boundaries (by default, a line containing only 'GO'). I felt
> triumphant, but realized that I should try to validate that the
> split scripts are exactly equivalent to the single, extra large
> script. I decided to calculate SHA1 and MD5 hashes for the
> original SQL script and a concatenation of the split SQL files.
> AFAIK, those hashes should match if my Perl program and
> concatenation were done properly.  Unfortuantely, the SHA1 does
> not match.
> 
> I used MSYS cat to do the concatenation, and MSYS sha1sum to
> calculate the hashes:
> 
> C:\Users\bamccaig\src\projectX>cat data/large.sql | sha1sum
> [sha1sum of large.sql spit out]
> C:\Users\bamccaig\src\projectX>cat data/out*.sql | sha1sum
> [sha1sum of the concatenation of data/out_0000.sql - 
> data/out_0016.sql]
> 
> These hashes don't match and now I'm stuck trying to figure out
> why or abandoning this program and leaving my colleague to do it
> by hand. Even if we don't use this program to solve this problem
> I would like to find my mistake(s) and correct them.
> 

What I would do is write another program (possibly in Perl) to compare the
differences and report immediately the location within the two files, and the
context before it and after it. It may be a \r\n vs. \n problem or it may be
something more subtle. Make sure you open the files in a "binmode $fh, 1" mode
so it will read binary data. It may also be variations on Unicode encodings,
which would be harmless.

Regards,

        Shlomi Fish

[SNIPPED]

-- 
-----------------------------------------------------------------
Shlomi Fish       http://www.shlomifish.org/
My Public Domain Photos - http://www.flickr.com/photos/shlomif/

Tel Aviv, a functional definition: free parking space‐free space.
    — Shachar Shemesh ( http://blog.shemesh.biz/?p=435 )

Please reply to list if it's a mailing list post - http://shlom.in/reply .

--
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/


Reply via email to