Re: [fossil-users] Fossil interprets plain-text file as a binary file

2017-03-28 Thread Ross Berteig



On 3/27/2017 6:21 PM, Richard Hipp wrote:
> On 3/27/17, Ross Berteig wrote:
>> I believe that a line is too long if it is more than about 8191 ASCII
>> characters, a restriction based on the size of the buffer used in the
>> diff engine.
>
> Technically, that restriction is due to the way hashes are computed on
> individual lines during the diff.  For diffing, the file is broken up
> into individual lines, and every line is given a 32-bit hash that
> helps to speed up locating the differences.  The lower 13 bits of the
> hash are the length of the line in bytes.  The upper 19 bits are the
> actual hash.


Interesting. I didn't read further into the code than the definition of 
LENGTH_MASK and the comment that describes it in diff.c. I did wonder 
slightly at the name of that symbol, but it was described as the length 
of a line so I just ran with it. In lookslike.c we have 
UTF16_LENGTH_MASK which is described by the comment as being the same 
quantity expressed for UTF16 chars.


But the comment and definition don't seem to agree. Richard, take a look at
https://www.fossil-scm.org/index.html/artifact?name=3ac38fafa91d274c=220-226
Line 225 would compute the width of UTF16_LENGTH_MASK to be 13-2-1 or 10 
bits, and get 1023 for UTF16_LENGTH_MASK itself. But the comment says 4096.
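
The arithmetic in question can be checked with a small sketch. It assumes
only what this thread states: a 13-bit length field, a two-byte UTF-16 code
unit and a one-byte char. The names are illustrative stand-ins, not copied
from lookslike.c, and the two groupings are guesses at how the expression on
that line might be read.

```python
# Illustrative only: these names are stand-ins, not the definitions in lookslike.c.
LENGTH_MASK_BITS = 13   # width of the length field, per Richard's description
WCHAR_BYTES = 2         # assumed size of a UTF-16 code unit
CHAR_BYTES = 1          # size of a char


def mask(bits):
    """Largest value that fits in an unsigned field `bits` wide."""
    return (1 << bits) - 1


# Reading 1: subtract both sizes, 13 - 2 - 1 = 10 bits.
flat_bits = LENGTH_MASK_BITS - WCHAR_BYTES - CHAR_BYTES

# Reading 2: subtract the size difference, 13 - (2 - 1) = 12 bits.
grouped_bits = LENGTH_MASK_BITS - (WCHAR_BYTES - CHAR_BYTES)

print(mask(LENGTH_MASK_BITS))  # 8191
print(mask(flat_bits))         # 1023
print(mask(grouped_bits))      # 4095
```

Under the second reading the limit is 4095 UTF-16 characters, one short of
the 4096 in the comment. 4095 two-byte characters occupy 8190 bytes, which
stays within the 8191-byte limit.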


Either the code, the comment, or I am confused here. Since I'm poking 
at test cases for this stuff, I'll see if I can add one that probes the 
UTF16 line length question.


--
Ross Berteig   r...@cheshireeng.com
Cheshire Engineering Corp.   http://www.CheshireEng.com/
+1 626 303 1602

___
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users


Re: [fossil-users] Fossil interprets plain-text file as a binary file

2017-03-28 Thread Ross Berteig


On 3/27/2017 6:40 PM, Byron Sanchez wrote:
> That was it!
>
> I ran the command and received the output:
>
> Starts with UTF-8 BOM: no
> Starts with UTF-16 BOM: no
> Looks like UTF-8: no
> Has flag LOOK_NUL: yes
> Has flag LOOK_CR: no
> Has flag LOOK_LONE_CR: no
> Has flag LOOK_LF: yes
> Has flag LOOK_LONE_LF: yes
> Has flag LOOK_CRLF: no
> Has flag LOOK_LONG: no
> Has flag LOOK_INVALID: no
> Has flag LOOK_ODD: no
> Has flag LOOK_SHORT: no
>
> I deleted the null characters. I didn't have to address any of the
> other flags in my case, just the null characters. After that, fossil
> recognized the file as plain text again.


Unexpected NUL characters in a field of normal text will definitely 
cause fossil to treat a file as binary.
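
For anyone hunting stray NUL bytes in a large file, here is a minimal
sketch that reports where they are. It is a standalone helper, not part of
fossil; the function name and the sample data are made up for illustration.

```python
def find_nul_bytes(data):
    """Return (line_number, column) pairs, both 1-based, for every NUL byte."""
    positions = []
    line, line_start = 1, 0
    for offset, byte in enumerate(data):
        if byte == 0x00:
            positions.append((line, offset - line_start + 1))
        elif byte == 0x0A:          # LF ends the current line
            line += 1
            line_start = offset + 1
    return positions


# Made-up sample: one NUL inside line 2, one alone on line 3.
sample = b"* Heading\nsome\x00text\n\x00\n"
print(find_nul_bytes(sample))  # [(2, 5), (3, 1)]
```

For a real file, read it in binary mode, for example with
open(path, "rb").read(), and pass the bytes to find_nul_bytes.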


If you really do need to store a NUL byte in a text file (as some sort 
of delimiter, perhaps) fossil permits the over-long two byte UTF-8 
encoding 0xC0 0x80 even though that is a technical violation of the 
UTF-8 specification. Allowing that particular over-long encoding is a 
common extension of UTF-8.
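
A minimal sketch of that substitution, assuming the file is otherwise valid
UTF-8 (in which case the byte 0xC0 never occurs, so the mapping can be
reversed safely). The helper names are hypothetical, and the last few lines
only show that a strict decoder, here Python's, rejects the over-long form.

```python
OVERLONG_NUL = b"\xc0\x80"   # two-byte over-long form of U+0000


def hide_nuls(data):
    """Replace each raw NUL byte with the over-long two-byte encoding."""
    return data.replace(b"\x00", OVERLONG_NUL)


def restore_nuls(data):
    """Inverse of hide_nuls: turn 0xC0 0x80 back into a raw NUL byte."""
    return data.replace(OVERLONG_NUL, b"\x00")


record = b"key\x00value\n"
stored = hide_nuls(record)
print(stored)                          # b'key\xc0\x80value\n'
print(restore_nuls(stored) == record)  # True

# A strict decoder such as Python's rejects the over-long form.
try:
    stored.decode("utf-8")
except UnicodeDecodeError as error:
    print("strict UTF-8 decode failed:", error.reason)
```

Any other tool that reads the file needs to know about the convention, since
strict UTF-8 readers will report the sequence as an error.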


The other flags just indicate that you have normal *nix line endings 
rather than CR LF endings used by DOS and Windows (and many many older 
systems) or the CR only endings used by older Macs.
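
To see which of those line-ending styles a file actually contains, a rough
sketch like the following can help. It is not fossil's detection code; it
simply counts CR LF pairs, LF bytes with no CR before them, and CR bytes
with no LF after them. The function name and samples are illustrative.

```python
def count_line_endings(data):
    """Count CR LF pairs, lone LF and lone CR bytes in `data`."""
    counts = {"crlf": 0, "lone_lf": 0, "lone_cr": 0}
    i = 0
    while i < len(data):
        if data[i] == 0x0D:                      # CR
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                counts["crlf"] += 1
                i += 1                           # skip the LF of the pair
            else:
                counts["lone_cr"] += 1
        elif data[i] == 0x0A:                    # LF not preceded by CR
            counts["lone_lf"] += 1
        i += 1
    return counts


print(count_line_endings(b"unix\nunix\n"))       # {'crlf': 0, 'lone_lf': 2, 'lone_cr': 0}
print(count_line_endings(b"dos\r\nold mac\r"))   # {'crlf': 1, 'lone_lf': 0, 'lone_cr': 1}
```

A file with only *nix endings, like the one in the output quoted above,
would show a non-zero lone_lf count and zero for the other two.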


--

Ross Berteig   r...@cheshireeng.com
Cheshire Engineering Corp.   http://www.CheshireEng.com/
+1 626 303 1602



Re: [fossil-users] Fossil interprets plain-text file as a binary file

2017-03-27 Thread Byron Sanchez
That was it!

I ran the command and received the output:

Starts with UTF-8 BOM: no
Starts with UTF-16 BOM: no
Looks like UTF-8: no
Has flag LOOK_NUL: yes
Has flag LOOK_CR: no
Has flag LOOK_LONE_CR: no
Has flag LOOK_LF: yes
Has flag LOOK_LONE_LF: yes
Has flag LOOK_CRLF: no
Has flag LOOK_LONG: no
Has flag LOOK_INVALID: no
Has flag LOOK_ODD: no
Has flag LOOK_SHORT: no

I deleted the null characters. I didn't have to address any of the other
flags in my case, just the null characters. After that, fossil recognized
the file as plain text again.

Thank you for the help!


On Mon, Mar 27, 2017 at 9:09 PM, Ross Berteig wrote:

>
> On 3/27/2017 5:44 PM, Byron Sanchez wrote:
>
>> I'm tracking several plain-text files in a repository. These are emacs
>> org-mode files.
>>
>> Fossil sees most of the files in this repo as normal plain-text files and
>> as such, they can be diffed via the fossil web interface.
>>
>> Recently, however, fossil has started interpreting one of these org-mode
>> files as a binary file. Now, fossil prompts with its binary-file warning
>> each time I update the file. In addition, this file can no longer be diffed
>> in the web interface, since fossil believes it to be a binary file.
>>
>> I'm wondering what steps I should take to debug this, or if there are any
>> common causes for this sort of thing? Very long lines perhaps or possibly
>> unicode characters?
>>
>
> Try the command "fossil test-looks-like-utf" to see the conditions that
> fossil tests for your file. That should help narrow down what to look for
> in the file that caused it to suddenly smell binary. It usually decides a
> file is binary if it has a line that is "too long", or has a NUL byte and
> is not UTF-16.
>
> I believe that a line is too long if it is more than about 8191 ASCII
> characters, a restriction based on the size of the buffer used in the diff
> engine.
>
> The other thing that can happen is to accidentally save a text file in an
> encoding other than UTF-8, with some character not included in the base
> 7-bit ASCII set. In my experience this was usually some accented letter
> from LATIN1, or a symbol such as 'µ' or '°'. Your editor will likely calmly
> edit and save the file, everything looks fine, but the saved file has bytes
> that make an invalid UTF-8 sequence. That does have a different warning
> message than binary data (likely "invalid UTF-8") so isn't your problem
> with this file.
>
>
>> The file in question is about 3.3 megabytes in size, and as far as I am
>> aware, a normal plain-text org-mode file.
>>
>> Any ideas would be very appreciated!
>>
>>
> --
> Ross Berteig   r...@cheshireeng.com
> Cheshire Engineering Corp.   http://www.CheshireEng.com/
> +1 626 303 1602


Re: [fossil-users] Fossil interprets plain-text file as a binary file

2017-03-27 Thread Richard Hipp
On 3/27/17, Ross Berteig wrote:
>
> I believe that a line is too long if it is more than about 8191 ASCII
> characters, a restriction based on the size of the buffer used in the
> diff engine.

Technically, that restriction is due to the way hashes are computed on
individual lines during the diff.  For diffing, the file is broken up
into individual lines, and every line is given a 32-bit hash that
helps to speed up locating the differences.  The lower 13 bits of the
hash are the length of the line in bytes.  The upper 19 bits are the
actual hash.
-- 
D. Richard Hipp
d...@sqlite.org
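
The bit layout described above can be sketched as follows. This is an
illustration of packing a length into the low 13 bits and a hash into the
upper 19 bits of a 32-bit value; toy_hash is a placeholder invented for the
example and is not the hash function fossil uses, and the other names are
likewise hypothetical.

```python
LENGTH_BITS = 13
LENGTH_MASK = (1 << LENGTH_BITS) - 1     # 8191


def toy_hash(line):
    """Placeholder 19-bit hash; NOT the function Fossil uses."""
    value = 0
    for byte in line:
        value = (value * 31 + byte) & 0x7FFFF   # keep 19 bits
    return value


def pack(line):
    """Combine hash (upper 19 bits) and byte length (lower 13 bits)."""
    if len(line) > LENGTH_MASK:
        raise ValueError("line too long for a 13-bit length field")
    return (toy_hash(line) << LENGTH_BITS) | len(line)


def unpack_length(packed):
    """Recover the line length from the low 13 bits."""
    return packed & LENGTH_MASK


def unpack_hash(packed):
    """Recover the hash from the upper 19 bits."""
    return packed >> LENGTH_BITS


packed = pack(b"* TODO write report")
print(unpack_length(packed))      # 19
print(packed < 2**32)             # True
```

A 13-bit field cannot represent a length above 8191, which is why a longer
line cannot be handled this way.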


Re: [fossil-users] Fossil interprets plain-text file as a binary file

2017-03-27 Thread Ross Berteig


On 3/27/2017 5:44 PM, Byron Sanchez wrote:
> I'm tracking several plain-text files in a repository. These are emacs
> org-mode files.
>
> Fossil sees most of the files in this repo as normal plain-text files
> and as such, they can be diffed via the fossil web interface.
>
> Recently, however, fossil has started interpreting one of these
> org-mode files as a binary file. Now, fossil prompts with its
> binary-file warning each time I update the file. In addition, this
> file can no longer be diffed in the web interface, since fossil
> believes it to be a binary file.
>
> I'm wondering what steps I should take to debug this, or if there are
> any common causes for this sort of thing? Very long lines perhaps or
> possibly unicode characters?


Try the command "fossil test-looks-like-utf" to see the conditions that 
fossil tests for your file. That should help narrow down what to look 
for in the file that caused it to suddenly smell binary. It usually 
decides a file is binary if it has a line that is "too long", or has a 
NUL byte and is not UTF-16.


I believe that a line is too long if it is more than about 8191 ASCII 
characters, a restriction based on the size of the buffer used in the 
diff engine.
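
A quick way to check a file for over-long lines is a sketch like this one.
The 8191 figure is the approximate limit given above; whether fossil counts
the line terminator toward it is not something this sketch tries to settle,
so lines near the limit deserve a closer look. The names are illustrative.

```python
MAX_LINE_BYTES = 8191   # limit mentioned in this thread; treat as approximate


def find_long_lines(data, limit=MAX_LINE_BYTES):
    """Return (line_number, length_in_bytes) for lines longer than `limit`."""
    too_long = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        if len(line) > limit:
            too_long.append((number, len(line)))
    return too_long


# Made-up sample: line 2 is 9000 bytes long.
sample = b"short line\n" + b"x" * 9000 + b"\nanother short line\n"
print(find_long_lines(sample))   # [(2, 9000)]
```

Lengths are measured in bytes, so a line of multi-byte UTF-8 characters
reaches the limit with fewer characters than a line of plain ASCII.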


The other thing that can happen is to accidentally save a text file in 
an encoding other than UTF-8, with some character not included in the 
base 7-bit ASCII set. In my experience this was usually some accented 
letter from LATIN1, or a symbol such as 'µ' or '°'. Your editor will 
likely calmly edit and save the file, everything looks fine, but the 
saved file has bytes that make an invalid UTF-8 sequence. That does have 
a different warning message than binary data (likely "invalid UTF-8") so 
isn't your problem with this file.
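
To illustrate the encoding mix-up, here is a small sketch that finds the
first byte that breaks UTF-8 decoding. It uses Python's strict decoder as
the judge, which is an assumption of the example and not necessarily the
same check fossil performs; the helper name is made up.

```python
def first_invalid_utf8(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start
    return None


# The same text saved in two encodings.
good = "5 µm at 20 °C\n".encode("utf-8")
bad = "5 µm at 20 °C\n".encode("latin-1")

print(first_invalid_utf8(good))  # None
print(first_invalid_utf8(bad))   # 2
```

In the LATIN1 copy the 'µ' is the single byte 0xB5, which cannot start a
UTF-8 sequence, so decoding fails at offset 2.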


> The file in question is about 3.3 megabytes in size, and as far as I
> am aware, a normal plain-text org-mode file.
>
> Any ideas would be very appreciated!



--
Ross Berteig   r...@cheshireeng.com
Cheshire Engineering Corp.   http://www.CheshireEng.com/
+1 626 303 1602



Re: [fossil-users] Fossil interprets plain-text file as a binary file

2017-03-27 Thread Scott Robison
On Mar 27, 2017 6:44 PM, "Byron Sanchez" wrote:

> Recently, however, fossil has started interpreting one of these org-mode
> files as a binary file. Now, fossil prompts with its binary-file warning
> each time I update the file. In addition, this file can no longer be diffed
> in the web interface, since fossil believes it to be a binary file.
>
> I'm wondering what steps I should take to debug this, or if there are any
> common causes for this sort of thing? Very long lines perhaps or possibly
> unicode characters?


Long lines, invalid unicode sequences, or many control codes.

What type of data is it? Source code, poetry?


[fossil-users] Fossil interprets plain-text file as a binary file

2017-03-27 Thread Byron Sanchez
I'm tracking several plain-text files in a repository. These are emacs
org-mode files.

Fossil sees most of the files in this repo as normal plain-text files and
as such, they can be diffed via the fossil web interface.

Recently, however, fossil has started interpreting one of these org-mode
files as a binary file. Now, fossil prompts with its binary-file warning
each time I update the file. In addition, this file can no longer be diffed
in the web interface, since fossil believes it to be a binary file.

I'm wondering what steps I should take to debug this, or if there are any
common causes for this sort of thing? Very long lines perhaps or possibly
unicode characters?

The file in question is about 3.3 megabytes in size, and as far as I am
aware, a normal plain-text org-mode file.

Any ideas would be very appreciated!

Thanks,

Byron Sanchez