Re: binary vs. text?

2006-12-12 Thread Chris Sheffield
Thanks to everyone for your suggestions. I'll probably go with this  
one from Mark, even though Sarah's was very good. Since my files will  
have to be in a certain format anyway, it's easy for me just to  
verify the data.


Chris


On Dec 11, 2006, at 4:03 PM, Mark Schonewille wrote:

When you import a file, you always want to do something with its
contents. Just check whether the text content fits the
destination. If not, it might be a binary file and you may need to
handle it differently.


--
Chris Sheffield
Read Naturally
The Fluency Company
http://www.readnaturally.com
--


___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution


Re: binary vs. text?

2006-12-11 Thread Dar Scott


On Dec 11, 2006, at 3:09 PM, Chris Sheffield wrote:

Does anyone have a sure fire way to determine if a file is binary  
or text?


I have need to create an import utility that will import data from  
a text file (csv, tab-delimited, etc) into a database, but I'd like  
to check the file before doing anything else just to make sure it  
is in fact text and not binary.


In general, there is no way.

However, all is not lost.

A text file is a special case of a binary file consisting of a  
sequence of characters whose representations are binary.


For very short files, it is hard to tell.  However, if you have some  
idea of the pattern you are expecting you can increase your  
confidence that some file is binary or text.


Many file formats have magic words and header data that indicate the  
type.  These provide a hint and an additional check can provide some  
confidence.  For example, a magic word plus a required element can  
identify a .png file, that is, check to see whether it starts with  
this: format("\211PNG\r\n\032\n\000\000\000\015IHDR").
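As a rough illustration of the magic-word check, here is a minimal sketch in Python rather than Revolution. The function name looks_like_png is made up for this example; the byte values are the ones from the format() string above, written in hex instead of octal.

```python
# Signature from the post: \211 P N G \r \n \032 \n, followed by the
# length (13) and the type of the mandatory first chunk, IHDR.
PNG_PREFIX = b"\x89PNG\r\n\x1a\n" + b"\x00\x00\x00\x0d" + b"IHDR"


def looks_like_png(data: bytes) -> bool:
    """Return True if data starts with the PNG signature plus the IHDR header."""
    return data.startswith(PNG_PREFIX)
```

Only the first 16 bytes of the file need to be read for this test, so it stays cheap even for large files.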


Unicode files often have BOM markers at the start, but they are not  
required in some cases and the BOM shouldn't be there in others.  I  
have a function I use to differentiate among Unicode files, but that  
already assumes I know they are unicode and even then it has trouble  
with some perverse files.  (It does get it right more often than  
Microsoft programs do.)  UTF-8 files also have other limitations  
among the characters, so that can help.
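A sketch of both hints, in Python rather than Revolution and not the function mentioned above: look for a BOM, and separately test whether the bytes are well-formed UTF-8. The names sniff_bom and is_valid_utf8 are hypothetical. Note that a UTF-16 LE file whose first character after the BOM is NUL would be reported as UTF-32 LE here, which is one of the perverse cases.

```python
import codecs

# Longer BOMs are tested first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A missing BOM proves nothing, so a None result from sniff_bom should be read as "unknown", not as "binary".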


Text files should have certain patterns.  For example, if the file is  
ASCII and is comma-delimited or tab-delimited, there are some  
indicators.  You should see only certain control characters.  You  
should see the expected delimiter.  You should see either CR or LF or  
both.  All characters have codes less than 128.  You might want to  
require the same number of delimiters per line.
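The checks listed above might be sketched like this, in Python rather than Revolution. The function name and the choice of allowed control characters (tab, LF, CR) are assumptions for the example, and the delimiter count is naive: it does not understand quoted fields that contain the delimiter.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, LF, CR


def looks_like_delimited_ascii(data: bytes, delimiter: bytes = b",") -> bool:
    """Apply the pattern checks: ASCII only, few control chars, steady delimiter count."""
    if not data:
        return False
    for byte in data:
        if byte >= 128:  # all codes must be less than 128
            return False
        if byte < 32 and byte not in ALLOWED_CONTROLS:
            return False
    # bytes.splitlines() accepts CR, LF, or CRLF line endings.
    lines = [line for line in data.splitlines() if line]
    if not lines:
        return False
    counts = {line.count(delimiter) for line in lines}
    # Every non-empty line must hold the same, non-zero number of delimiters.
    return len(counts) == 1 and counts.pop() > 0
```

For a tab-delimited file, pass delimiter=b"\t".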


So, given some specified pattern of what you expect in binary or  
text, you should be able to differentiate.


However, an alternate approach would be to parse the file and if the  
file does not pass, then reject it no matter the form of the data.


Dar



Re: binary vs. text?

2006-12-11 Thread Mark Schonewille

Chris,

When you import a file, you always want to do something with its
contents. Just check whether the text content fits the
destination. If not, it might be a binary file and you may need to
handle it differently.


There is another way. You could estimate the percentage of
spaces, returns, and alphanumeric characters in a normal text file
(nearly 100%) and in a binary file (significantly less). If the
actual percentage is lower than some threshold value,
assume it is a binary file. If the actual percentage is higher,
assume it is a text file. If the actual percentage is approximately
equal to the threshold value, ask the user.


You can store a copy of (a part of) the data in another variable, use
replaceText to remove all non-alphanumeric characters and calculate
the percentage. If you have a really large file, you don't need to
analyse the entire file.
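A minimal sketch of this percentage idea, in Python rather than Revolution. The threshold of 0.75, the margin of 0.05, and the 4096-byte sample size are made-up values for illustration only, and treating an empty file as text is also an assumption.

```python
import string

# Bytes counted as "texty": ASCII letters, digits, space, CR, and LF.
TEXT_BYTES = frozenset((string.ascii_letters + string.digits + " \r\n").encode("ascii"))


def text_ratio(data: bytes, sample_size: int = 4096) -> float:
    """Fraction of sampled bytes that are spaces, returns, or alphanumeric."""
    sample = data[:sample_size]  # a large file need not be analysed entirely
    if not sample:
        return 1.0  # assumption: an empty file is treated as text
    return sum(b in TEXT_BYTES for b in sample) / len(sample)


def classify(data: bytes, threshold: float = 0.75, margin: float = 0.05) -> str:
    """Return "text", "binary", or "ask user" when the ratio is near the threshold."""
    ratio = text_ratio(data)
    if ratio > threshold + margin:
        return "text"
    if ratio < threshold - margin:
        return "binary"
    return "ask user"
```

Delimiters and punctuation lower the ratio, so for short comma-delimited lines it may make sense to count the delimiter as a text character too or to pick a lower threshold.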


Best,

Mark

--

Economy-x-Talk
Consultancy and Software Engineering
http://economy-x-talk.com
http://www.salery.biz

Get your store on-line within minutes with Salery Web Store software.  
Download at http://www.salery.biz


On 11 Dec 2006, at 23:09, Chris Sheffield wrote:

Does anyone have a sure fire way to determine if a file is binary  
or text?


I have need to create an import utility that will import data from  
a text file (csv, tab-delimited, etc) into a database, but I'd like  
to check the file before doing anything else just to make sure it  
is in fact text and not binary.


Any thoughts?

Thanks,
Chris




Re: binary vs. text?

2006-12-11 Thread Dave Cragg


On 11 Dec 2006, at 22:21, Chris Sheffield wrote:

Thanks, Sarah. Very cool idea. Seems to work for me. Can anyone
think of any cases where this might fail?


I think this would only work if you can be sure that the line endings  
in the text files are unix style (numToChar(10)). If, for example,  
the file had crlf as line endings, they would be converted when  
opening as a file, but not as a binfile. The comparison would fail  
even though it was a text file. (Not tested, so please give it a try.)
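To make the failure case concrete, here is a small Python model of the comparison (not Revolution code, and equally untested against Revolution itself). It assumes that a text-mode read converts CRLF and lone CR to LF, as described in this thread; the function names are hypothetical.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF, as a text-mode read is assumed to."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def same_as_text_read(data: bytes) -> bool:
    """The proposed test: is the raw data identical to its text-mode form?"""
    return data == normalize_newlines(data)
```

Under that assumption, same_as_text_read(b"a,b\r\n1,2\r\n") is False even though the data is plain text, which is exactly the concern with crlf line endings.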


Cheers
Dave





Re: binary vs. text?

2006-12-11 Thread Ruslan Zasukhin
On 12/12/06 12:55 AM, "Mark Schonewille" <[EMAIL PROTECTED]>
wrote:

> Ruslan,
> 
> You can't do that, because there are about a dozen different unicode
> signatures and some streams have no unicode signature at all.

Hi Mark,

We do this for Valentina studio. :-)

Of course there can be a file without a signature. It is optional ...


-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]




Re: binary vs. text?

2006-12-11 Thread Mark Schonewille

Ruslan,

You can't do that, because there are about a dozen different unicode  
signatures and some streams have no unicode signature at all.


Best,

Mark

--

Economy-x-Talk
Consultancy and Software Engineering
http://economy-x-talk.com
http://www.salery.biz

Get your store on-line within minutes with Salery Web Store software.  
Download at http://www.salery.biz


On 11 Dec 2006, at 23:15, Ruslan Zasukhin wrote:




If a file follows Unicode rules, it has a special signature at the start.

I have not heard of any such way to determine the kind of file.

--
Best regards,

Ruslan Zasukhin




Re: binary vs. text?

2006-12-11 Thread Mark Schonewille

Hi Sarah,

You can't do that. Crlf is translated by Rev into linefeed on DOS, and cr
is also translated into linefeed on Mac OS 9, which means that a
text file is never equal to its binary equivalent.


Best,

Mark

--

Economy-x-Talk
Consultancy and Software Engineering
http://economy-x-talk.com
http://www.salery.biz

Get your store on-line within minutes with Salery Web Store software.  
Download at http://www.salery.biz


On 11 Dec 2006, at 23:14, Sarah Reichelt wrote:


Just a guess here but how about reading the file twice: once as file:
and once as binfile:
If the two are identical, then I assume it's text only.

I have no idea if this will work, but it's worth a try :-)
Sarah



Re: binary vs. text?

2006-12-11 Thread Geir A. Myrestrand

Sarah Reichelt wrote:

On 12/12/06, Chris Sheffield <[EMAIL PROTECTED]> wrote:

Does anyone have a sure fire way to determine if a file is binary or
text?

I have need to create an import utility that will import data from a
text file (csv, tab-delimited, etc) into a database, but I'd like to
check the file before doing anything else just to make sure it is in
fact text and not binary.



Just a guess here but how about reading the file twice: once as file:
and once as binfile:
If the two are identical, then I assume it's text only.

I have no idea if this will work, but it's worth a try :-)
Sarah


The solution won't scale, and it also depends on whether it would handle
multi-byte characters (used for Kanji, for example) and certain Unicode
formats that use the NULL character.


If this is on a particular platform, then there are various ways to do
this. On Linux/UNIX systems, for example, you can run the `file` command
in order to classify a file.


--

Geir A. Myrestrand


Re: binary vs. text?

2006-12-11 Thread Chris Sheffield
Thanks, Eric. I shouldn't have a problem with Sarah's method since  
these files will not be downloaded from the internet at all.


On Dec 11, 2006, at 3:23 PM, Eric Chatonet wrote:


Hi Chris,

Sarah's answer sounds good (I mean fully reliable) but may prove
unusable with large files you download from the internet.
We use another method, based on statistics, that checks the
charToNum of the characters.
On Windows, it appears that checking 60 chars and finding more than
3 chars whose charToNum is less than 9 or greater than 175
(a sign of binary data) gives an "almost" fully reliable result ;-)


On 11 Dec 2006, at 23:09, Chris Sheffield wrote:

Does anyone have a sure fire way to determine if a file is binary  
or text?


I have need to create an import utility that will import data from  
a text file (csv, tab-delimited, etc) into a database, but I'd  
like to check the file before doing anything else just to make  
sure it is in fact text and not binary.


Any thoughts?



Best Regards from Paris,
Eric Chatonet
-- 


http://www.sosmartsoftware.com/[EMAIL PROTECTED]/




--
Chris Sheffield
Read Naturally
The Fluency Company
http://www.readnaturally.com
--




Re: binary vs. text?

2006-12-11 Thread Geir A. Myrestrand

Chris Sheffield wrote:

Does anyone have a sure fire way to determine if a file is binary or text?


There is no such thing as a sure-fire way to determine whether it is text
or not, unless "text" can be clearly defined.


I have need to create an import utility that will import data from a 
text file (csv, tab-delimited, etc) into a database, but I'd like to 
check the file before doing anything else just to make sure it is in 
fact text and not binary.


Any thoughts?


I would just assume it is text, and then handle "wrong" input 
gracefully. Trying to parse the contents first to verify it would just 
add overhead, and is unnecessary if you do the former.
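In the same general, non-Revolution spirit, a sketch in Python of "assume text and handle wrong input gracefully". The function name import_rows and the expected_fields parameter are hypothetical, and the data is assumed to be plain ASCII with a fixed number of fields per row.

```python
import csv
import io


def import_rows(data: bytes, delimiter: str = ",", expected_fields: int = 3):
    """Parse data as delimited text; raise ValueError if it does not fit."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError as exc:
        raise ValueError("file is not plain ASCII text") from exc
    try:
        # newline="" leaves line endings untouched so the csv module handles them.
        rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    except csv.Error as exc:
        raise ValueError(f"file could not be parsed: {exc}") from exc
    for number, row in enumerate(rows, start=1):
        if len(row) != expected_fields:
            raise ValueError(
                f"row {number}: expected {expected_fields} fields, got {len(row)}"
            )
    return rows
```

The caller catches ValueError and reports the file as unusable, whether it is binary or merely malformed text, so no separate pre-check is needed.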


Consider my input as general, and not Revolution specific --my exposure 
to Revolution is modest, at least at this point...


--

Geir A. Myrestrand


Re: binary vs. text?

2006-12-11 Thread Eric Chatonet

Hi Chris,

Sarah's answer sounds good (I mean fully reliable) but may prove
unusable with large files you download from the internet.
We use another method, based on statistics, that checks the
charToNum of the characters.
On Windows, it appears that checking 60 chars and finding more than 3
chars whose charToNum is less than 9 or greater than 175 (a sign of
binary data) gives an "almost" fully reliable result ;-)
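A sketch of this statistical check in Python rather than Revolution. The numbers (60 chars, more than 3 suspects, codes below 9 or above 175) come from the description above; the function name looks_binary is made up, and reading "more than 3" as "binary" is an interpretation.

```python
def looks_binary(data: bytes, sample_size: int = 60, max_suspect: int = 3) -> bool:
    """Return True if the first sample_size bytes hold too many unusual codes."""
    sample = data[:sample_size]
    # A byte is suspect when its code is less than 9 or greater than 175.
    suspect = sum(1 for b in sample if b < 9 or b > 175)
    return suspect > max_suspect
```

Because only the first 60 bytes are inspected, the cost does not depend on the file size.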


On 11 Dec 2006, at 23:09, Chris Sheffield wrote:

Does anyone have a sure fire way to determine if a file is binary  
or text?


I have need to create an import utility that will import data from  
a text file (csv, tab-delimited, etc) into a database, but I'd like  
to check the file before doing anything else just to make sure it  
is in fact text and not binary.


Any thoughts?



Best Regards from Paris,
Eric Chatonet
 
--

http://www.sosmartsoftware.com/[EMAIL PROTECTED]/




Re: binary vs. text?

2006-12-11 Thread Ruslan Zasukhin
On 12/12/06 12:15 AM, "Ruslan Zasukhin" <[EMAIL PROTECTED]> wrote:

> On 12/12/06 12:09 AM, "Chris Sheffield" <[EMAIL PROTECTED]> wrote:
> 
> Hi Chris,
> 
>> Does anyone have a sure fire way to determine if a file is binary or
>> text?
>> 
>> I have need to create an import utility that will import data from a
>> text file (csv, tab-delimited, etc) into a database, but I'd like to
>> check the file before doing anything else just to make sure it is in
>> fact text and not binary.
>> 
>> Any thoughts?
> 
> If file follow unicode rules, it have special signature on start
> 
> I did not hear about such ability define kind of file..

Well, I think you can try the following:

1) Check for a Unicode signature.

If you find it, this is a Unicode text file.

2) ELSE

you can scan the whole file, or part of it,
byte by byte to see if you meet a ZERO byte.

If YES, it's a binary file.
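The two steps above can be sketched as follows, in Python rather than Revolution. The function name, the 65536-byte scan limit, and the "probably text" label for the fall-through case are assumptions for this example, and only the five common BOMs are checked.

```python
import codecs

# Common Unicode signatures (BOMs); this is not an exhaustive list.
UNICODE_SIGNATURES = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def classify_by_signature_then_nul(data: bytes, scan_limit: int = 65536) -> str:
    """Step 1: look for a Unicode signature. Step 2: otherwise look for a zero byte."""
    if data.startswith(UNICODE_SIGNATURES):
        return "unicode text"
    if b"\x00" in data[:scan_limit]:
        return "binary"
    return "probably text"
```

The signature test has to come first, because UTF-16 and UTF-32 text normally contains zero bytes and would otherwise be classed as binary.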


-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]




Re: binary vs. text?

2006-12-11 Thread Chris Sheffield
Thanks, Sarah. Very cool idea. Seems to work for me. Can anyone think
of any cases where this might fail?


On Dec 11, 2006, at 3:14 PM, Sarah Reichelt wrote:


On 12/12/06, Chris Sheffield <[EMAIL PROTECTED]> wrote:

Does anyone have a sure fire way to determine if a file is binary or
text?

I have need to create an import utility that will import data from a
text file (csv, tab-delimited, etc) into a database, but I'd like to
check the file before doing anything else just to make sure it is in
fact text and not binary.



Just a guess here but how about reading the file twice: once as file:
and once as binfile:
If the two are identical, then I assume it's text only.

I have no idea if this will work, but it's worth a try :-)
Sarah


--
Chris Sheffield
Read Naturally
The Fluency Company
http://www.readnaturally.com
--




Re: binary vs. text?

2006-12-11 Thread Ruslan Zasukhin
On 12/12/06 12:09 AM, "Chris Sheffield" <[EMAIL PROTECTED]> wrote:

Hi Chris,

> Does anyone have a sure fire way to determine if a file is binary or
> text?
> 
> I have need to create an import utility that will import data from a
> text file (csv, tab-delimited, etc) into a database, but I'd like to
> check the file before doing anything else just to make sure it is in
> fact text and not binary.
> 
> Any thoughts?

If a file follows Unicode rules, it has a special signature at the start.

I have not heard of any such way to determine the kind of file.

-- 
Best regards,

Ruslan Zasukhin
VP Engineering and New Technology
Paradigma Software, Inc

Valentina - Joining Worlds of Information
http://www.paradigmasoft.com

[I feel the need: the need for speed]




Re: binary vs. text?

2006-12-11 Thread Sarah Reichelt

On 12/12/06, Chris Sheffield <[EMAIL PROTECTED]> wrote:

Does anyone have a sure fire way to determine if a file is binary or
text?

I have need to create an import utility that will import data from a
text file (csv, tab-delimited, etc) into a database, but I'd like to
check the file before doing anything else just to make sure it is in
fact text and not binary.



Just a guess here but how about reading the file twice: once as file:
and once as binfile:
If the two are identical, then I assume it's text only.

I have no idea if this will work, but it's worth a try :-)
Sarah


binary vs. text?

2006-12-11 Thread Chris Sheffield
Does anyone have a sure fire way to determine if a file is binary or  
text?


I have need to create an import utility that will import data from a  
text file (csv, tab-delimited, etc) into a database, but I'd like to  
check the file before doing anything else just to make sure it is in  
fact text and not binary.


Any thoughts?

Thanks,
Chris


--
Chris Sheffield
Read Naturally
The Fluency Company
http://www.readnaturally.com
--


