Re: bug in busybox sed with non-ascii chars

2014-05-07 Thread Rich Felker
On Mon, May 05, 2014 at 08:08:32PM +0100, Sam Liddicott wrote:
> One of the advantages of utf-8 encoding was that it was easy to re-sync
> after an invalid sequence.
> 
> It's a bit of a waste to then not do that. Minus points for musl.

An application can resync, although the C multibyte interfaces are not
really designed to be used this way (and you have to be careful if the
locale's encoding might be state-dependent, e.g. some legacy CJK
encodings). However, the implementation cannot silently resync behind
your back. Doing so introduces serious bugs, some of which may be
security-relevant, since you either silently miss seeing some bytes
from the input when processing input via conversion to wide
characters, or some invalid sequences appear to the application as
valid. Either possibility is dangerous. In particular, it's wrong for
the regex "." to match anything that's an illegal sequence, and wrong
for "^.*$" to match a line containing any illegal sequences (since the
"." can't match it).
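
A minimal sketch of the application-side alternative: check the input explicitly instead of relying on the library to skip bad bytes. The helper name is hypothetical, and it assumes the POSIX iconv utility is installed. It only reports whether the whole stream is valid UTF-8; it does not resync.

```shell
#!/bin/sh
# Hypothetical helper, not from the thread: succeed only if stdin is
# entirely valid UTF-8. iconv exits with a non-zero status at the first
# illegal or incomplete sequence rather than skipping over it.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Example: report the result for the byte pair from the test case.
if printf '(\246\246)\n' | is_valid_utf8; then
    echo "valid UTF-8"
else
    echo "not valid UTF-8"
fi
```

A caller can then decide what to do with invalid input (reject it, or process it as bytes) before it ever reaches sed or regexec.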

> Can you not run sed with LANG=C or LANG=POSIX?

That's not what they're doing, but it's not a solution anyway. ISO C
leaves the character encoding of the C locale implementation-defined,
and the Rationale text from the 1995 amendments to C explicitly allows
for the possibility that the C locale's character encoding has
multibyte characters (e.g. is UTF-8).

musl presently does not support byte-based characters at all, only
UTF-8. This conforms to the current versions of ISO C and POSIX, but
the Austin Group has adopted, as a future requirement, that the C
locale be "8-bit clean", which musl will probably support at some
point.

Rich
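
For reference, a sketch of what the LANG=C suggestion looks like in practice. It forces the C locale for the sed call only and prints the result as hex so the raw bytes stay visible. The helper name is hypothetical. As explained above, whether this helps is implementation-defined: it should work where the C locale is byte-based, and would not on a libc whose C locale is UTF-8 only.

```shell
#!/bin/sh
# Sketch: run the substitution from the test case with the locale forced
# to C, then dump the output as one hex string.
# Hypothetical helper name; printf octal escapes stand in for $'\246\246'.
sed_in_c_locale_hex() {
    printf '(AA)\n(\246\246)\n' |
        LC_ALL=C sed 's/$/,/' |
        od -An -tx1 |
        tr -d ' \t\n'
    echo
}

sed_in_c_locale_hex
```

LC_ALL is used rather than LANG because it overrides every other locale variable that might be set in the environment.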
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: bug in busybox sed with non-ascii chars

2014-05-05 Thread Sam Liddicott
One of the advantages of utf-8 encoding was that it was easy to re-sync
after an invalid sequence.

It's a bit of a waste to then not do that. Minus points for musl.

Can you not run sed with LANG=C or LANG=POSIX?

Sam
On 4 May 2014 15:57, "Rich Felker" wrote:

> On Sun, May 04, 2014 at 04:44:10PM +0200, Denys Vlasenko wrote:
> > On Sat, May 3, 2014 at 5:07 PM, Rich Felker wrote:
> > >> Let's refuse to find end of line if there is a non UTF-8 sequence
> > >> inside that line?
> > >> Sounds wrong to me...
> > >
> > > sed (also regcomp and regexec) requires text input. Byte streams with
> > > illegal sequences are not text. Actually since the regex is not trying
> > > to match the illegal sequence, just the end-of-line, it would
> > > theoretically be possible to make this work (and it will once we
> > > overhaul the regex implementation to work with byte-based DFA's rather
> > > than character-based ones), but that doesn't change the fact that it's
> > > an invalid test.
> >
> > Language lawyering is less important than real world usage.
>
> Indeed it's nice to support additional real-world usage when doing so
> does not harm any other usage. But we're not talking about real-world
> usage here. We're talking about a buggy configure test.
>
> I'd love to improve or even rewrite the regex engine but that's a lot
> of work and lower priority than a number of other things on the musl
> roadmap.
>
> Rich

Re: bug in busybox sed with non-ascii chars

2014-05-04 Thread Rich Felker
On Sun, May 04, 2014 at 04:44:10PM +0200, Denys Vlasenko wrote:
> On Sat, May 3, 2014 at 5:07 PM, Rich Felker wrote:
> >> Let's refuse to find end of line if there is a non UTF-8 sequence inside
> >> that line?
> >> Sounds wrong to me...
> >
> > sed (also regcomp and regexec) requires text input. Byte streams with
> > illegal sequences are not text. Actually since the regex is not trying
> > to match the illegal sequence, just the end-of-line, it would
> > theoretically be possible to make this work (and it will once we
> > overhaul the regex implementation to work with byte-based DFA's rather
> > than character-based ones), but that doesn't change the fact that it's
> > an invalid test.
> 
> Language lawyering is less important than real world usage.

Indeed it's nice to support additional real-world usage when doing so
does not harm any other usage. But we're not talking about real-world
usage here. We're talking about a buggy configure test.

I'd love to improve or even rewrite the regex engine but that's a lot
of work and lower priority than a number of other things on the musl
roadmap.

Rich


Re: bug in busybox sed with non-ascii chars

2014-05-04 Thread Denys Vlasenko
On Sat, May 3, 2014 at 5:07 PM, Rich Felker wrote:
>> Let's refuse to find end of line if there is a non UTF-8 sequence inside that
>> line?
>> Sounds wrong to me...
>
> sed (also regcomp and regexec) requires text input. Byte streams with
> illegal sequences are not text. Actually since the regex is not trying
> to match the illegal sequence, just the end-of-line, it would
> theoretically be possible to make this work (and it will once we
> overhaul the regex implementation to work with byte-based DFA's rather
> than character-based ones), but that doesn't change the fact that it's
> an invalid test.

Language lawyering is less important than real world usage.

Adding a char to each line of text is quite a reasonable thing to do.

Having occasional UTF-8 violations in text files is not rare either.
Linux kernel source code has 57 instances of it in *.c and *.h files.
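
One way to look for such files is sketched below. This is not necessarily how the figure above was obtained. The helper name is hypothetical, and the sketch assumes the iconv utility is available and that file names contain no newlines.

```shell
#!/bin/sh
# Sketch: print every *.c and *.h file under a directory whose contents
# are not valid UTF-8. Hypothetical helper name.
list_non_utf8_files() {
    find "$1" -type f \( -name '*.c' -o -name '*.h' \) |
    while IFS= read -r file; do
        # iconv fails on the first illegal or incomplete sequence.
        iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1 ||
            printf '%s\n' "$file"
    done
}
```

Something like `list_non_utf8_files path/to/tree | wc -l` then gives a count of affected files (files, not individual bad sequences).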


Re: bug in busybox sed with non-ascii chars

2014-05-03 Thread Rich Felker
On Sat, May 03, 2014 at 03:17:49PM +0200, Denys Vlasenko wrote:
> On Saturday 03 May 2014 05:10, Rich Felker wrote:
> > On Wed, Apr 30, 2014 at 10:31:00AM +0200, Natanael Copa wrote:
> > > Hi,
> > > 
> > > I came across a bug (or feature) in busybox sed when trying to build 
> > > firefox-29.
> > > 
> > > Testcase based on what firefox's configure script does:
> > > 
> > > ASCII='AA'
> > > NONASCII=$'\246\246'
> > > 
> > > echo -e "($ASCII)\n($NONASCII)" | busybox sed 's/$/,/'
> > 
> > The above script is invalid; \246\246 is an illegal sequence and thus
> > is rejected by regexec. It will work only on non-UTF-8 systems/locales
> > (which musl does not support).
> 
> Let's refuse to find end of line if there is a non UTF-8 sequence inside that
> line?
> Sounds wrong to me...

sed (also regcomp and regexec) requires text input. Byte streams with
illegal sequences are not text. Actually since the regex is not trying
to match the illegal sequence, just the end-of-line, it would
theoretically be possible to make this work (and it will once we
overhaul the regex implementation to work with byte-based DFA's rather
than character-based ones), but that doesn't change the fact that it's
an invalid test.

Rich


Re: bug in busybox sed with non-ascii chars

2014-05-03 Thread Denys Vlasenko
On Saturday 03 May 2014 05:10, Rich Felker wrote:
> On Wed, Apr 30, 2014 at 10:31:00AM +0200, Natanael Copa wrote:
> > Hi,
> > 
> > I came across a bug (or feature) in busybox sed when trying to build 
> > firefox-29.
> > 
> > Testcase based on what firefox's configure script does:
> > 
> > ASCII='AA'
> > NONASCII=$'\246\246'
> > 
> > echo -e "($ASCII)\n($NONASCII)" | busybox sed 's/$/,/'
> 
> The above script is invalid; \246\246 is an illegal sequence and thus
> is rejected by regexec. It will work only on non-UTF-8 systems/locales
> (which musl does not support).

Let's refuse to find end of line if there is a non UTF-8 sequence inside that
line?
Sounds wrong to me...


Re: bug in busybox sed with non-ascii chars

2014-05-02 Thread Rich Felker
On Wed, Apr 30, 2014 at 10:31:00AM +0200, Natanael Copa wrote:
> Hi,
> 
> I came across a bug (or feature) in busybox sed when trying to build 
> firefox-29.
> 
> Testcase based on what firefox's configure script does:
> 
> ASCII='AA'
> NONASCII=$'\246\246'
> 
> echo -e "($ASCII)\n($NONASCII)" | busybox sed 's/$/,/'

The above script is invalid; \246\246 is an illegal sequence and thus
is rejected by regexec. It will work only on non-UTF-8 systems/locales
(which musl does not support).

Please file a bug with Firefox.

Rich


P.S. I think you got my response to this on #musl but it's nice to
have the resolution here for the record anyway.


Re: bug in busybox sed with non-ascii chars

2014-05-01 Thread Natanael Copa
On Fri, 2 May 2014 07:34:57 +0200
Denys Vlasenko  wrote:

> On Wednesday 30 April 2014 10:31, Natanael Copa wrote:
> > Hi,
> > 
> > I came across a bug (or feature) in busybox sed when trying to build 
> > firefox-29.
> > 
> > Testcase based on what firefox's configure script does:
> > 
> > ASCII='AA'
> > NONASCII=$'\246\246'
> > 
> > echo -e "($ASCII)\n($NONASCII)" | busybox sed 's/$/,/'
> > 
> > 
> > Expected result is a comma (,) after both lines. Actual result is that
> > the line with non-ascii does not get any comma.
> 
> Can't reproduce with uclibc-based busybox:
> 
> ASCII='AA'
> NONASCII=$'\246\246'
> # GNU sed version 4.1.5
> echo -e "($ASCII)\n($NONASCII)" | /usr/bin/sed 's/$/,/' | hexdump -C
> echo -e "($ASCII)\n($NONASCII)" | ./busybox sed 's/$/,/' | hexdump -C
> 
> Result:
> 
>   28 41 41 29 2c 0a 28 a6  a6 29 2c 0a  |(AA),.(..),.|
> 000c
>   28 41 41 29 2c 0a 28 a6  a6 29 2c 0a  |(AA),.(..),.|
> 000c
> 
> 
> > With gnu sed both lines get a trailing comma.
> > 
> > BusyBox v1.22.1 compiled against musl libc.
> > 
> > Ideas?
> 
> (1) Post your .config

See below.

> (2) Does the same happen if built against glibc?

no. So this smells like a bug in musl libc.

The config:
#
# Automatically generated make config: don't edit
# Busybox version: 1.22.0
# Thu Jan  2 13:04:57 2014
#
CONFIG_HAVE_DOT_CONFIG=y

#
# Busybox Settings
#

#
# General Configuration
#
CONFIG_DESKTOP=y
CONFIG_EXTRA_COMPAT=y
# CONFIG_INCLUDE_SUSv2 is not set
# CONFIG_USE_PORTABLE_CODE is not set
CONFIG_PLATFORM_LINUX=y
CONFIG_FEATURE_BUFFERS_USE_MALLOC=y
# CONFIG_FEATURE_BUFFERS_GO_ON_STACK is not set
# CONFIG_FEATURE_BUFFERS_GO_IN_BSS is not set
CONFIG_SHOW_USAGE=y
CONFIG_FEATURE_VERBOSE_USAGE=y
CONFIG_FEATURE_COMPRESS_USAGE=y
CONFIG_FEATURE_INSTALLER=y
# CONFIG_INSTALL_NO_USR is not set
CONFIG_LOCALE_SUPPORT=y
CONFIG_UNICODE_SUPPORT=y
CONFIG_UNICODE_USING_LOCALE=y
CONFIG_FEATURE_CHECK_UNICODE_IN_ENV=y
CONFIG_SUBST_WCHAR=63
CONFIG_LAST_SUPPORTED_WCHAR=767
CONFIG_UNICODE_COMBINING_WCHARS=y
CONFIG_UNICODE_WIDE_WCHARS=y
# CONFIG_UNICODE_BIDI_SUPPORT is not set
# CONFIG_UNICODE_NEUTRAL_TABLE is not set
CONFIG_UNICODE_PRESERVE_BROKEN=y
CONFIG_LONG_OPTS=y
CONFIG_FEATURE_DEVPTS=y
# CONFIG_FEATURE_CLEAN_UP is not set
CONFIG_FEATURE_UTMP=y
CONFIG_FEATURE_WTMP=y
CONFIG_FEATURE_PIDFILE=y
CONFIG_PID_FILE_PATH="/var/run"
CONFIG_FEATURE_SUID=y
# CONFIG_FEATURE_SUID_CONFIG is not set
# CONFIG_FEATURE_SUID_CONFIG_QUIET is not set
# CONFIG_SELINUX is not set
# CONFIG_FEATURE_PREFER_APPLETS is not set
CONFIG_BUSYBOX_EXEC_PATH="/bin/busybox"
CONFIG_FEATURE_SYSLOG=y
# CONFIG_FEATURE_HAVE_RPC is not set

#
# Build Options
#
# CONFIG_STATIC is not set
CONFIG_PIE=y
# CONFIG_NOMMU is not set
# CONFIG_BUILD_LIBBUSYBOX is not set
# CONFIG_FEATURE_INDIVIDUAL is not set
# CONFIG_FEATURE_SHARED_BUSYBOX is not set
CONFIG_LFS=y
CONFIG_CROSS_COMPILER_PREFIX=""
CONFIG_SYSROOT=""
CONFIG_EXTRA_CFLAGS=""
CONFIG_EXTRA_LDFLAGS=""
CONFIG_EXTRA_LDLIBS=""

#
# Debugging Options
#
# CONFIG_DEBUG is not set
# CONFIG_DEBUG_PESSIMIZE is not set
# CONFIG_WERROR is not set
CONFIG_NO_DEBUG_LIB=y
# CONFIG_DMALLOC is not set
# CONFIG_EFENCE is not set

#
# Installation Options ("make install" behavior)
#
# CONFIG_INSTALL_APPLET_SYMLINKS is not set
# CONFIG_INSTALL_APPLET_HARDLINKS is not set
# CONFIG_INSTALL_APPLET_SCRIPT_WRAPPERS is not set
CONFIG_INSTALL_APPLET_DONT=y
# CONFIG_INSTALL_SH_APPLET_SYMLINK is not set
# CONFIG_INSTALL_SH_APPLET_HARDLINK is not set
# CONFIG_INSTALL_SH_APPLET_SCRIPT_WRAPPER is not set
CONFIG_PREFIX="/home/ncopa/aports/main/busybox/pkg/busybox"

#
# Busybox Library Tuning
#
# CONFIG_FEATURE_SYSTEMD is not set
CONFIG_FEATURE_RTMINMAX=y
CONFIG_PASSWORD_MINLEN=6
CONFIG_MD5_SMALL=0
CONFIG_SHA3_SMALL=0
CONFIG_FEATURE_FAST_TOP=y
# CONFIG_FEATURE_ETC_NETWORKS is not set
CONFIG_FEATURE_USE_TERMIOS=y
CONFIG_FEATURE_EDITING=y
CONFIG_FEATURE_EDITING_MAX_LEN=1024
CONFIG_FEATURE_EDITING_VI=y
CONFIG_FEATURE_EDITING_HISTORY=255
CONFIG_FEATURE_EDITING_SAVEHISTORY=y
# CONFIG_FEATURE_EDITING_SAVE_ON_EXIT is not set
CONFIG_FEATURE_REVERSE_SEARCH=y
CONFIG_FEATURE_TAB_COMPLETION=y
CONFIG_FEATURE_USERNAME_COMPLETION=y
CONFIG_FEATURE_EDITING_FANCY_PROMPT=y
CONFIG_FEATURE_EDITING_ASK_TERMINAL=y
CONFIG_FEATURE_NON_POSIX_CP=y
# CONFIG_FEATURE_VERBOSE_CP_MESSAGE is not set
CONFIG_FEATURE_COPYBUF_KB=16
CONFIG_FEATURE_SKIP_ROOTFS=y
CONFIG_MONOTONIC_SYSCALL=y
# CONFIG_IOCTL_HEX2STR_ERROR is not set
CONFIG_FEATURE_HWIB=y

#
# Applets
#

#
# Archival Utilities
#
# CONFIG_FEATURE_SEAMLESS_XZ is not set
CONFIG_FEATURE_SEAMLESS_LZMA=y
CONFIG_FEATURE_SEAMLESS_BZ2=y
CONFIG_FEATURE_SEAMLESS_GZ=y
CONFIG_FEATURE_SEAMLESS_Z=y
# CONFIG_AR is not set
# CONFIG_FEATURE_AR_LONG_FILENAMES is not set
# CONFIG_FEATURE_AR_CREATE is not set
# CONFIG_UNCOMPRESS is not set
CONFIG_GUNZIP=y
CONFIG_BUNZIP2=y
CONFIG_UNLZMA=y
CONFIG_FEATURE_LZMA_FAST=y
CONFIG_LZMA=y
CONFIG_UNXZ=y
# CONFIG_XZ is not set
CONFIG_BZIP2=y
CONFIG_CPIO=y
CONFIG_FEATURE_CPIO_O=y
C

Re: bug in busybox sed with non-ascii chars

2014-05-01 Thread Denys Vlasenko
On Wednesday 30 April 2014 10:31, Natanael Copa wrote:
> Hi,
> 
> I came across a bug (or feature) in busybox sed when trying to build 
> firefox-29.
> 
> Testcase based on what firefox's configure script does:
> 
> ASCII='AA'
> NONASCII=$'\246\246'
> 
> echo -e "($ASCII)\n($NONASCII)" | busybox sed 's/$/,/'
> 
> 
> Expected result is a comma (,) after both lines. Actual result is that
> the line with non-ascii does not get any comma.

Can't reproduce with uclibc-based busybox:

ASCII='AA'
NONASCII=$'\246\246'
# GNU sed version 4.1.5
echo -e "($ASCII)\n($NONASCII)" | /usr/bin/sed 's/$/,/' | hexdump -C
echo -e "($ASCII)\n($NONASCII)" | ./busybox sed 's/$/,/' | hexdump -C

Result:

  28 41 41 29 2c 0a 28 a6  a6 29 2c 0a  |(AA),.(..),.|
000c
  28 41 41 29 2c 0a 28 a6  a6 29 2c 0a  |(AA),.(..),.|
000c


> With gnu sed both lines get a trailing comma.
> 
> BusyBox v1.22.1 compiled against musl libc.
> 
> Ideas?

(1) Post your .config
(2) Does the same happen if built against glibc?


bug in busybox sed with non-ascii chars

2014-04-30 Thread Natanael Copa
Hi,

I came across a bug (or feature) in busybox sed when trying to build firefox-29.

Testcase based on what firefox's configure script does:

ASCII='AA'
NONASCII=$'\246\246'

echo -e "($ASCII)\n($NONASCII)" | busybox sed 's/$/,/'


Expected result is a comma (,) after both lines. Actual result is that
the line with non-ascii does not get any comma.

With gnu sed both lines get a trailing comma.

BusyBox v1.22.1 compiled against musl libc.

Ideas?

-nc
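
A portable form of the test case, as a rough sketch: the `$'...'` quoting and `echo -e` above are bash features, so this version uses printf octal escapes instead and counts the lines that end in a comma. The helper name is hypothetical. The sed command to test is passed as arguments, so the same helper can be pointed at GNU sed or at `busybox sed`.

```shell
#!/bin/sh
# Sketch: feed the two test lines to the given sed command and report
# whether both lines gained a trailing comma.
# Usage: check_sed_eol sed
#        check_sed_eol busybox sed
check_sed_eol() {
    # grep runs in the C locale so that it compares raw bytes; "|| true"
    # keeps a zero count from being treated as a failure.
    count=$(printf '(AA)\n(\246\246)\n' | "$@" 's/$/,/' |
        LC_ALL=C grep -c ',$' || true)
    if [ "$count" -eq 2 ]; then
        echo "ok"
    else
        echo "missing comma"
    fi
}
```

The result depends on the sed, the libc, and the locale in effect, which is the point of the report.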