Re: Proposition: use a hashtable instead of bsearch to locate applets

2014-05-30 Thread Bartosz Gołaszewski
2014-05-29 18:56 GMT+02:00 Laurent Bercot ska-dietl...@skarnet.org:
 Does this provided a noticeable performance increase? Do you have
 benchmarks?

I've checked the times just by looking up all the applets in a loop
and measuring the time using gettimeofday() - the results are: ~220
microseconds for bsearch and ~150 microseconds for hashtable on my
linux laptop. Is it significant? I think so. Is it noticeable?
Probably not. The idea came to me, when thinking about unifying the
hashtables used in busybox. I guess you're right and it isn't really
worth the size increase.

Best regards,
Bartosz Golaszewski
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Proposition: use a hashtable instead of bsearch to locate applets

2014-05-30 Thread Laurent Bercot


On 05/30/2014 10:16 AM, Bartosz Gołaszewski wrote:

I've checked the times just by looking up all the applets in a loop
and measuring the time using gettimeofday() - the results are: ~220
microseconds for bsearch and ~150 microseconds for hashtable on my
linux laptop. Is it significant? I think so. Is it noticeable?
Probably not. The idea came to me, when thinking about unifying the
hashtables used in busybox. I guess you're right and it isn't really
worth the size increase.


 I think it would definitely be a worthwhile improvement if all busybox
is doing was looking up the applets in a loop. ;)
 A more realistic test, however, would be to fork+exec a trivial applet
(true, for instance) in a loop, and measure the times with bsearch vs.
with the hash table. And I'm certain you'll find that the difference
becomes quite insignificant.

--
 Laurent
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Re: Issues removing files with certain characters in their names.

2014-05-30 Thread Rich Felker
On Thu, May 29, 2014 at 06:41:17PM -0400, Joshua Judson Rosen wrote:
 But why is ls able to match the files when rm is not able to remove them?

I have no idea. Have you tried running them under strace and seeing
where the failure occurs?

 Is it perhaps because ls is not actually doing any operations on the files
 themselves (not even a stat?), and just reporting the dirent-d_name strings
 that it got from readdir()? In which case ls -l * would fail on the same
 files even when ls * doesn't?

I doubt it.

 Or is there something deeper whereby stat() succeeds but unlink() fails?

I'm guessing it's a failure at in the userspace code for rm, not the
syscall. Again strace could help you confirm this and possibly
determine where rm's idea of the filename is getting corrupted.

Rich
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Issues removing files with certain characters in their names.

2014-05-30 Thread Harald Becker
Hi Joshua !

On 29-05-2014 18:41 Joshua Judson Rosen jro...@harvestai.com
wrote:

But why is ls able to match the files when rm is not able to
remove them?

The problem happens on the opposite direction you expect. It is
not the create/open/unlink which modify the file name it is the
directory scan. ls doesn't really match the filename. It just
scans the directory, filters with given file name masks, then
display what it got. But on some unusual file names the directory
scan gives names which does not exactly match the name stored on
file system. They can be displayed, but used for an remove or
move operation the stat/unlink fails. This usually happens
when the names contain control or unprintable characters
(e.g. a byte with all zero) which get removed by the
kernel/file system driver.

Is it perhaps because ls is not actually doing any operations on
the files themselves (not even a stat?), and just reporting the
dirent-d_name strings that it got from readdir()? In which case
ls -l * would fail on the same files even when ls * doesn't?

Correct.

Or is there something deeper whereby stat() succeeds but
unlink() fails?

No. Never got this.

--
Harald
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Proposition: use a hashtable instead of bsearch to locate applets

2014-05-30 Thread Rich Felker
On Fri, May 30, 2014 at 12:13:41PM +0100, Laurent Bercot wrote:
 
 On 05/30/2014 10:16 AM, Bartosz Gołaszewski wrote:
 I've checked the times just by looking up all the applets in a loop
 and measuring the time using gettimeofday() - the results are: ~220
 microseconds for bsearch and ~150 microseconds for hashtable on my
 linux laptop. Is it significant? I think so. Is it noticeable?
 Probably not. The idea came to me, when thinking about unifying the
 hashtables used in busybox. I guess you're right and it isn't really
 worth the size increase.
 
  I think it would definitely be a worthwhile improvement if all busybox
 is doing was looking up the applets in a loop. ;)
  A more realistic test, however, would be to fork+exec a trivial applet
 (true, for instance) in a loop, and measure the times with bsearch vs.
 with the hash table. And I'm certain you'll find that the difference
 becomes quite insignificant.

The lowest time I've ever seen for exec (not even counting fork;
measured from immediately before the exec syscall to entry into main)
is over 200µs, and can easily exceed 1ms with dynamic linking. So even
if the program did nothing but applet lookup and exit, I think we'd be
looking at a performance increase of ~30% at best.

Realistically, as soon as you throw in some actual work and other
syscalls that actually do something, I suspect the difference to be
less than 5%.

If there is a desire to change the way applet lookup is done, I would
suggest trying to make it optimal in both size and performance. 150µs
is not impressive at all. A perfect hash constructed at build time for
the list of busybox applet-name strings should make it possible to do
the applet lookup in ~1µs with trivial code/table size (all constant
in .text).

Rich
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Re: Issues removing files with certain characters in their names.

2014-05-30 Thread Rich Felker
On Fri, May 30, 2014 at 08:06:24PM +0200, Harald Becker wrote:
 Hi Joshua !
 
 On 29-05-2014 18:41 Joshua Judson Rosen jro...@harvestai.com
 wrote:
 
 But why is ls able to match the files when rm is not able to
 remove them?
 
 The problem happens on the opposite direction you expect. It is
 not the create/open/unlink which modify the file name it is the
 directory scan. ls doesn't really match the filename. It just
 scans the directory, filters with given file name masks, then
 display what it got.

ls never performas any filtering. If ls is given a directory name, it
just reads the entries with readdir(); no 'matching' needs to take
place at all. If it's given a glob pattern, that glob never reaches
ls; it's expanded by the shell. If you quote the glob to actually pass
it to ls, ls will not process it as a glob pattern; it's treated as a
literal filename containing *'s or ?'s. So I don't understand what
you're claiming happens.

 But on some unusual file names the directory
 scan gives names which does not exactly match the name stored on
 file system. They can be displayed, but used for an remove or
 move operation the stat/unlink fails. This usually happens
 when the names contain control or unprintable characters
 (e.g. a byte with all zero) which get removed by the
 kernel/file system driver.

Printability has nothing to do with processing the filename. And a
zero byte fundamentally cannot be in a filename (the filename in the
directory table consists of those characters up to, and not including,
any zero byte stored).

Rich
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Proposition: use a hashtable instead of bsearch to locate applets

2014-05-30 Thread walter harms


Am 30.05.2014 20:07, schrieb Rich Felker:
 On Fri, May 30, 2014 at 12:13:41PM +0100, Laurent Bercot wrote:

 On 05/30/2014 10:16 AM, Bartosz Gołaszewski wrote:
 I've checked the times just by looking up all the applets in a loop
 and measuring the time using gettimeofday() - the results are: ~220
 microseconds for bsearch and ~150 microseconds for hashtable on my
 linux laptop. Is it significant? I think so. Is it noticeable?
 Probably not. The idea came to me, when thinking about unifying the
 hashtables used in busybox. I guess you're right and it isn't really
 worth the size increase.

  I think it would definitely be a worthwhile improvement if all busybox
 is doing was looking up the applets in a loop. ;)
  A more realistic test, however, would be to fork+exec a trivial applet
 (true, for instance) in a loop, and measure the times with bsearch vs.
 with the hash table. And I'm certain you'll find that the difference
 becomes quite insignificant.
 
 The lowest time I've ever seen for exec (not even counting fork;
 measured from immediately before the exec syscall to entry into main)
 is over 200µs, and can easily exceed 1ms with dynamic linking. So even
 if the program did nothing but applet lookup and exit, I think we'd be
 looking at a performance increase of ~30% at best.
 
 Realistically, as soon as you throw in some actual work and other
 syscalls that actually do something, I suspect the difference to be
 less than 5%.
 
 If there is a desire to change the way applet lookup is done, I would
 suggest trying to make it optimal in both size and performance. 150µs
 is not impressive at all. A perfect hash constructed at build time for
 the list of busybox applet-name strings should make it possible to do
 the applet lookup in ~1µs with trivial code/table size (all constant
 in .text).
 

We should keep in mind that busybox is also about size reduction, so we need
a more flexible code that can be used on other places as well (e.g. shell).
From my experience i would not expect to much most times the hogs are somewhere
else.  Glibc has some hash functions (no idea, if POSIX or GNU or ...)
using them should not make a notable size difference.

re,
 wh
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Re: Proposition: use a hashtable instead of bsearch to locate applets

2014-05-30 Thread Isaac Dunham
On Fri, May 30, 2014 at 08:19:29PM +0200, walter harms wrote:
 
 
 Am 30.05.2014 20:07, schrieb Rich Felker:
  On Fri, May 30, 2014 at 12:13:41PM +0100, Laurent Bercot wrote:
 
  On 05/30/2014 10:16 AM, Bartosz Gołaszewski wrote:
  I've checked the times just by looking up all the applets in a loop
  and measuring the time using gettimeofday() - the results are: ~220
  microseconds for bsearch and ~150 microseconds for hashtable on my
  linux laptop. Is it significant? I think so. Is it noticeable?
  Probably not. The idea came to me, when thinking about unifying the
  hashtables used in busybox. I guess you're right and it isn't really
  worth the size increase.
 
   I think it would definitely be a worthwhile improvement if all busybox
  is doing was looking up the applets in a loop. ;)
   A more realistic test, however, would be to fork+exec a trivial applet
  (true, for instance) in a loop, and measure the times with bsearch vs.
  with the hash table. And I'm certain you'll find that the difference
  becomes quite insignificant.
  
  The lowest time I've ever seen for exec (not even counting fork;
  measured from immediately before the exec syscall to entry into main)
  is over 200µs, and can easily exceed 1ms with dynamic linking. So even
  if the program did nothing but applet lookup and exit, I think we'd be
  looking at a performance increase of ~30% at best.
  
  Realistically, as soon as you throw in some actual work and other
  syscalls that actually do something, I suspect the difference to be
  less than 5%.
  
  If there is a desire to change the way applet lookup is done, I would
  suggest trying to make it optimal in both size and performance. 150µs
  is not impressive at all. A perfect hash constructed at build time for
  the list of busybox applet-name strings should make it possible to do
  the applet lookup in ~1µs with trivial code/table size (all constant
  in .text).
  
 
 We should keep in mind that busybox is also about size reduction, so we need
 a more flexible code that can be used on other places as well (e.g. shell).
 From my experience i would not expect to much most times the hogs are 
 somewhere
 else.  Glibc has some hash functions (no idea, if POSIX or GNU or ...)
 using them should not make a notable size difference.

I'm guessing you mean hcreate/hsearch/hdestroy in search.h, which are
POSIX.
I do note that there are a few limitations therein:
-uses malloc
-you must correctly estimate the size of the table at the start; 
it cannot grow, and you cannot remove entries.

I'd think this would not be ideal for the startup code.
HTH,
Isaac Dunham
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Re: Issues removing files with certain characters in their names.

2014-05-30 Thread Harald Becker
Hi Rich !

[To Rich: I have permanent mail failures on dal...@libc.org:
domain has no valid mail exchangers, answer is from GMX mail
server]


On 30-05-2014 14:12 Rich Felker dal...@libc.org wrote:

ls never performas any filtering.

Sorry, yes. The name filtering is in the glob, not ls. My
mistake. So ls just displays the names from readdir. Beside that
mistake with filtering it's as I told.

Printability has nothing to do with processing the filename. And
a zero byte fundamentally cannot be in a filename (the filename
in the directory table consists of those characters up to, and
not including, any zero byte stored).

This is true on a Unix system, but have you looked what Windows
system do? Mapping of names is from to charset code pages is done
in so different ways, many of such mappings may produce unusual
characters. Those unusual characters can produce unusual
effects in further translations, ending up in illegal names when
translated to Unix file names.

With UTF-8 this is all getting better, but before name mangling
of foreign characters was a hell.


... you called it a user error when mounted incorrect. Yes, using
foreign characters in file names is the worst user error. All the
mentioned problems happen after a combination of translation
between different charsets. It is not a single point of failure
producing those problems. The chain of translations lead
sometimes to situations where file names from readdir let stat
fail with same name. All other name confusion can easily be
solved with pattern usage, but when stat fail with names from
readdir you are in trouble. Again, it should not happen, but it
happens - so call it shit.

And remember: All those translations are part of file system
drivers, not part of user space software. Most such userspace
software just passes names just straight through, without
modifications. So tell, who made the error?

--
Harald
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Proposition: use a hashtable instead of bsearch to locate applets

2014-05-30 Thread Harald Becker
Hi all !

[hash tables in Busybox]

Those tables may speed up the applet start, but always will
increase the size of the program. Nevertheless how small an
effective you can get your hash table, Busybox needs a table with
the full applet names for several purposes, like listing applets
in help text and installing symlinks. So beside minimalistic
speedup you will not receive any benefits. I don't consider
startup speed to be so critical.

--
Harald
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Issues removing files with certain characters in their names.

2014-05-30 Thread Harald Becker
Hi Rich !

My statement was imprecise; of course to support users still
stuck on legacy locales, nl_langinfo(CODESET) should be
consulted.

How do you determine the correct code set of a foreign file
system on an external drive? How can you tell if all systems
which accessed this drive has handled translations in the correct
way?

  and not only unzip may produce such results. Think of
 using an USB stick at an Windows machine, then carry that over
 to an Linux machine.

The filenames are stored in UCS-2. No problem.

UCS-2 with different code page translations from an 8 bit
charset. Translations which leave name mapping in inconsistent
state when further translations occur.

If you mount it incorrectly, then this is user error.

Correct, all those trouble arrives due to anybody having an
incorrect setup. This will ripple trough and may produce trouble
on other ends.

All programs are not affected. Only programs which read
filenames as byte strings from foreign sources (such as the
directory table of a zip file) are affected.

... but how do you know the code page the zip archive uses. How
do you know you need to do translations? I'm unsure if the archiv
contains this information, so it needs to be provided by a much
more error prone user.

--
Harald
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: Proposition: use a hashtable instead of bsearch to locate applets

2014-05-30 Thread Ralf Friedl

Rich Felker wrote:

On Fri, May 30, 2014 at 12:13:41PM +0100, Laurent Bercot wrote:

On 05/30/2014 10:16 AM, Bartosz Gołaszewski wrote:

I've checked the times just by looking up all the applets in a loop
and measuring the time using gettimeofday() - the results are: ~220
microseconds for bsearch and ~150 microseconds for hashtable on my
linux laptop. Is it significant? I think so. Is it noticeable?
Probably not. The idea came to me, when thinking about unifying the
hashtables used in busybox. I guess you're right and it isn't really
worth the size increase.

  I think it would definitely be a worthwhile improvement if all busybox
is doing was looking up the applets in a loop. ;)
  A more realistic test, however, would be to fork+exec a trivial applet
(true, for instance) in a loop, and measure the times with bsearch vs.
with the hash table. And I'm certain you'll find that the difference
becomes quite insignificant.

If there is a desire to change the way applet lookup is done, I would
suggest trying to make it optimal in both size and performance. 150µs
is not impressive at all. A perfect hash constructed at build time for
the list of busybox applet-name strings should make it possible to do
the applet lookup in ~1µs with trivial code/table size (all constant
in .text).
The original number was 220µs for looking up *all* the applets in a 
loop and measuring the time. So this is the time for an unspecified 
number of applets (maybe allyesconfig) on an unspecified machine. So if 
we assume that at least 50 applets were defined, the time for one applet 
is 4µs or less.


I just built busybox with 373 applets, the time to call the lookup 
100 times for APPLET_NAME(0) is 100ms on an AMD 4GHz CPU. An own 
binary search implementation saves a few ms, probably because the 
comparison function is inline which saves a call. It also adds a few 
bytes. I choose APPLET_NAME(0) because it is the worst case for the 
binary search. The best case is APPLET_NAME(NUM_APPLETS/2), which only 
needs 18ms for 100 calls, or 13ms with own binary search.


As busybox has a focus on size and the difference is small I think it 
doesn't make sense to change anything here.

___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Never returning inttab line

2014-05-30 Thread Felipe de Andrade Neves Lavratti
Recently I started working in a system where there is a never returning
call in the busybox inttab file, the line is under the ::sysinit directive.

What are the consequences of this? Respawns are going to work properly?
​​

​Thanks.​

-- 
Skype: felipeanl
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Re: busybox nslookup slow on x86_64

2014-05-30 Thread Denys Vlasenko
On Thursday 29 May 2014 20:47, Rich Felker wrote:
 On Wed, May 28, 2014 at 04:34:23PM +0200, Denys Vlasenko wrote:
  On Tue, May 27, 2014 at 9:34 AM, muddyboot emu...@gmail.com wrote:
   Hi, I found nslookup resolve very slow on x86_64 system , it cost 5 
   seconds or longer almost everytime.
  
   Tested OS: Debian 7 x86_64 with kernel 3.2.5   LFS x86_64 with kernel 
   3.12
  
   No IPv6 enabled in kernel config.
   DNS server works fine
   nslookup program from bind-9.7 works fine
   nslookup from busybox test on i686 system OK
  
   target busybox version: 1.17.4、1.20.2、1.21.1、1.22.1
  
   Any response for this problem is great appreciated.
  
  bbox nslookup uses libc to perform the lookup.
  
  glibc maintainers known to be quite.. er.. stubborn
  about how DNS should work.
  
  For example, they insist that IPv6 DNS requests must be sent
  even if the machine has no IPv6 support in kernel
  (let alone a more typical case where machine
  has no IPv6 connectivity).
  
  Your DNS server does not respond to IPv6 requests,
  but glibc waits for them.
 
 Unless the caller requested AI_ADDRCONFIG or requested AF_INET
 explicitly as opposed to AF_INET6, it's required to do this. And I
 don't think it's a bug. It may be useful to know all DNS results even
 if some of them (v6) won't be used for your current client setup. The
 bug is in whatever broken nameserver is ignoring  requests rather
 than properly looking them up and returning a result.

Well, in many cases users have no power over that nameserver.

Forcing them to suffer instead of giving them ways to disable
 requests is arrogant.
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox

Re: Issues removing files with certain characters in their names.

2014-05-30 Thread Rich Felker
On Fri, May 30, 2014 at 09:18:19PM +0200, Harald Becker wrote:
 Hi Rich !
 
 My statement was imprecise; of course to support users still
 stuck on legacy locales, nl_langinfo(CODESET) should be
 consulted.
 
 How do you determine the correct code set of a foreign file
 system on an external drive? How can you tell if all systems
 which accessed this drive has handled translations in the correct
 way?

All modern filesystems used on external devices (fat32, ntfs, udf,
...) use Unicode-based encodings for filenames, so the foreign
encoding is known and fixed.

   and not only unzip may produce such results. Think of
  using an USB stick at an Windows machine, then carry that over
  to an Linux machine.
 
 The filenames are stored in UCS-2. No problem.
 
 UCS-2 with different code page translations from an 8 bit
 charset. Translations which leave name mapping in inconsistent
 state when further translations occur.

I don't follow what you think the problem is.

 If you mount it incorrectly, then this is user error.
 
 Correct, all those trouble arrives due to anybody having an
 incorrect setup. This will ripple trough and may produce trouble
 on other ends.

All modern Linux-based systems use the utf8 option by default when
mounting filesystems that don't store filenames as pure byte strings
but in a Unicode-based form. You have to be rolling your own or else
actively breaking your system's default setup to get this wrong.

 All programs are not affected. Only programs which read
 filenames as byte strings from foreign sources (such as the
 directory table of a zip file) are affected.
 
  but how do you know the code page the zip archive uses. How
 do you know you need to do translations? I'm unsure if the archiv
 contains this information, so it needs to be provided by a much
 more error prone user.

When encountering such an archive, the unzip utility could simply exit
with an error when there are non-ASCII names unless the user specifies
the encoding. To be less error-prone, it could print the names as
interpreted in several different encodings as part of the error
message, to help the user identify which one is correct. IMO it should
also automatically assume UTF-8 and suppress the error condition if
the names all decode as valid UTF-8, since the probability of
meaningful non-UTF-8 text decoding successful as UTF-8 is negligible.

Rich
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox


Re: busybox nslookup slow on x86_64

2014-05-30 Thread Rich Felker
On Sat, May 31, 2014 at 02:02:35AM +0200, Denys Vlasenko wrote:
 On Thursday 29 May 2014 20:47, Rich Felker wrote:
  On Wed, May 28, 2014 at 04:34:23PM +0200, Denys Vlasenko wrote:
   On Tue, May 27, 2014 at 9:34 AM, muddyboot emu...@gmail.com wrote:
Hi, I found nslookup resolve very slow on x86_64 system , it cost 5 
seconds or longer almost everytime.
   
Tested OS: Debian 7 x86_64 with kernel 3.2.5   LFS x86_64 with kernel 
3.12
   
No IPv6 enabled in kernel config.
DNS server works fine
nslookup program from bind-9.7 works fine
nslookup from busybox test on i686 system OK
   
target busybox version: 1.17.4、1.20.2、1.21.1、1.22.1
   
Any response for this problem is great appreciated.
   
   bbox nslookup uses libc to perform the lookup.
   
   glibc maintainers known to be quite.. er.. stubborn
   about how DNS should work.
   
   For example, they insist that IPv6 DNS requests must be sent
   even if the machine has no IPv6 support in kernel
   (let alone a more typical case where machine
   has no IPv6 connectivity).
   
   Your DNS server does not respond to IPv6 requests,
   but glibc waits for them.
  
  Unless the caller requested AI_ADDRCONFIG or requested AF_INET
  explicitly as opposed to AF_INET6, it's required to do this. And I
  don't think it's a bug. It may be useful to know all DNS results even
  if some of them (v6) won't be used for your current client setup. The
  bug is in whatever broken nameserver is ignoring  requests rather
  than properly looking them up and returning a result.
 
 Well, in many cases users have no power over that nameserver.
 
 Forcing them to suffer instead of giving them ways to disable
  requests is arrogant.

As I said an option in nslookup to do this would be nice. I believe
glibc also has an option in resolv.conf (mentioned earlier in this
thread, IIRC) to disable the  requests globally. But I don't think
it makes sense to complain about glibc doing both lookups by default,
since that is the behavior specified/required by the standard. IMO
this is one of the few cases of glibc doing the right thing (properly
supporting modern usage rather than focusing on bug-compatibility for
broken legacy setups).

Rich
___
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox