Re: XDL_FAST_HASH can be very slow

2014-12-22 Thread Patrick Reynolds
I have been working with Peff on this and have more results to share.

For background, xdl_hash_record is a hashing function, producing an
unsigned long from an input string terminated by either a newline or
the end of the mmap'd file.

The original xdl_hash_record is essentially DJB hash, which does a
multiplication, load, and xor for each byte of the input.  Commit
6942efc introduces an "XDL_FAST_HASH" version of the same function
that is clearly inspired by the DJB hash, but it does only one
multiplication, load, and xor for each 8 bytes of input -- i.e., fewer
loads, but also a lot less bit mixing.  Less mixing means far more
collisions, leading to the performance problems with evil-icons.  It's
not clear to me if the XDL_FAST_HASH version intended to match the
output of the DJB hash function, but it doesn't at all.

Peff has been experimenting with using two modern hash functions, FNV
and Murmur3.  In theory, these should produce fewer collisions than
DJB, but in his measurements, they didn't run diff any faster than
plain DJB.  They do fix the evil-icons problem.

I have implemented two simpler possibilities, both of which fix the
problems diffing the evil-icons repository:

1. An XDL_FAST_HASH implementation that matches the output of the DJB
hash exactly.  Its performance is basically the same as DJB, because
the only thing is does differently is load 8 bytes at a time instead
of 1.  It does all the same ALU operations as DJB.

2. Using (hash % prime_number) instead of (hash & ((1

[PATCH v2] unblock and unignore SIGPIPE

2014-09-18 Thread Patrick Reynolds
Blocked and ignored signals -- but not caught signals -- are inherited
across exec.  Some callers with sloppy signal-handling behavior can call
git with SIGPIPE blocked or ignored, even non-deterministically.  When
SIGPIPE is blocked or ignored, several git commands can run indefinitely,
ignoring EPIPE returns from write() calls, even when the process that
called them has gone away.  Our specific case involved a pipe of git
diff-tree output to a script that reads a limited amount of diff data.

In an ideal world, git would never be called with SIGPIPE blocked or
ignored.  But in the real world, several real potential callers, including
Perl, Apache, and Unicorn, sometimes spawn subprocesses with SIGPIPE
ignored.  It is easier and more productive to harden git against this
mistake than to clean it up in every potential parent process.

Signed-off-by: Patrick Reynolds 
Signed-off-by: Junio C Hamano 
---
1. Merged Junio's work from pu: moved restore_sigpipe_to_default into
git.c and restyled the new tests.
2. Moved the new tests into t0005.  This meant switching back to `git
diff` as our data generator, as the sample repo in t0005 doesn't have any
files for `git ls-files` to output.
3. Squashed.

 git.c  | 22 ++
 t/t0005-signals.sh | 22 ++
 2 files changed, 44 insertions(+)

diff --git a/git.c b/git.c
index 210f1ae..0f03d56 100644
--- a/git.c
+++ b/git.c
@@ -593,6 +593,26 @@ static int run_argv(int *argcp, const char ***argv)
return done_alias;
 }
 
+/*
+ * Many parts of Git have subprograms communicate via pipe, expect the
+ * upstream of the pipe to die with SIGPIPE and the downstream process
+ * even knows to check and handle EPIPE correctly.  Some third-party
+ * programs that ignore or block SIGPIPE for their own reason forget
+ * to restore SIGPIPE handling to the default before spawning Git and
+ * break this carefully orchestrated machinery.
+ *
+ * Restore the way SIGPIPE is handled to default, which is what we
+ * expect.
+ */
+static void restore_sigpipe_to_default(void)
+{
+   sigset_t unblock;
+
+   sigemptyset(&unblock);
+   sigaddset(&unblock, SIGPIPE);
+   sigprocmask(SIG_UNBLOCK, &unblock, NULL);
+   signal(SIGPIPE, SIG_DFL);
+}
 
 int main(int argc, char **av)
 {
@@ -612,6 +632,8 @@ int main(int argc, char **av)
 */
sanitize_stdfds();
 
+   restore_sigpipe_to_default();
+
git_setup_gettext();
 
trace_command_performance(argv);
diff --git a/t/t0005-signals.sh b/t/t0005-signals.sh
index 981437b..638a355 100755
--- a/t/t0005-signals.sh
+++ b/t/t0005-signals.sh
@@ -27,4 +27,26 @@ test_expect_success !MINGW 'signals are propagated using 
shell convention' '
test_expect_code 143 git sigterm
 '
 
+large_git () {
+   for i in $(test_seq 1 100)
+   do
+   git diff --cached --binary || return
+   done
+}
+
+test_expect_success 'create blob' '
+   test-genrandom foo 16384 >file &&
+   git add file
+'
+
+test_expect_success 'a constipated git dies with SIGPIPE' '
+   OUT=$( ((large_git; echo $? 1>&3) | :) 3>&1 )
+   test "$OUT" -eq 141
+'
+
+test_expect_success 'a constipated git dies with SIGPIPE even if parent 
ignores it' '
+   OUT=$( ((trap "" PIPE; large_git; echo $? 1>&3) | :) 3>&1 )
+   test "$OUT" -eq 141
+'
+
 test_done
-- 
2.0.1

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unblock and unignore SIGPIPE

2014-09-18 Thread Patrick Reynolds
On Wed, Sep 17, 2014 at 3:11 AM, Jeff King  wrote:
> Would we want to call it from external C commands, too? For the most
> part, git.c is the entry point for running git commands, and any
> sanitizing it does will be inherited by sub-commands. But it _is_ still
> legal to call dashed commands individually, and even required in some
> cases (e.g., git-upload-pack for ssh clients).

git-upload-pack is protected pretty well from SIGPIPE shenanigans,
because its stdout all goes through write_or_die, as of cdf4fb8.  We
did, long ago, have some EPIPE problems with upload-pack and SSH
clients, but it all predates cdf4fb8.

So I think it's redundant to unblock SIGPIPE in git-upload-pack.

I'll tidy up as Junio recommended, recheck the tests, and submit
an updated patch shortly.

--Patrick
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] unblock and unignore SIGPIPE

2014-08-21 Thread Patrick Reynolds
On Sun, Aug 17, 2014 at 8:14 PM, Eric Wong  wrote:
> But unicorn would ignore SIGPIPE it if Ruby did not; relying on SIGPIPE
> while doing any multiplexed I/O doesn't work well.

Exactly.  Callers block SIGPIPE for their own legitimate reasons, but they
don't consistently unblock it before spawning a git subprocess that needs
the default SIGPIPE behavior.  Easier to fix it in git than in every potential
caller.

--Patrick
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] unblock and unignore SIGPIPE

2014-08-14 Thread Patrick Reynolds
Blocked and ignored signals -- but not caught signals -- are inherited
across exec.  Some callers with sloppy signal-handling behavior can call
git with SIGPIPE blocked or ignored, even non-deterministically.  When
SIGPIPE is blocked or ignored, several git commands can run indefinitely,
ignoring EPIPE returns from write() calls, even when the process that
called them has gone away.  Our specific case involved a pipe of git
diff-tree output to a script that reads a limited amount of diff data.

In an ideal world, git would never be called with SIGPIPE blocked or
ignored.  But in the real world, several real potential callers, including
Perl, Apache, and Unicorn, sometimes spawn subprocesses with SIGPIPE
ignored.  It is easier and more productive to harden git against this
mistake than to clean it up in every potential parent process.

Signed-off-by: Patrick Reynolds 
---
 cache.h|  1 +
 git.c  |  5 +
 setup.c| 11 +++
 t/t0012-sigpipe.sh | 27 +++
 4 files changed, 44 insertions(+)
 create mode 100755 t/t0012-sigpipe.sh

diff --git a/cache.h b/cache.h
index fcb511d..0a89fc1 100644
--- a/cache.h
+++ b/cache.h
@@ -463,6 +463,7 @@ extern int set_git_dir_init(const char *git_dir, const char 
*real_git_dir, int);
 extern int init_db(const char *template_dir, unsigned int flags);
 
 extern void sanitize_stdfds(void);
+extern void sanitize_signals(void);
 extern int daemonize(void);
 
 #define alloc_nr(x) (((x)+16)*3/2)
diff --git a/git.c b/git.c
index 9c49519..d6b221b 100644
--- a/git.c
+++ b/git.c
@@ -611,6 +611,11 @@ int main(int argc, char **av)
 */
sanitize_stdfds();
 
+   /*
+* Make sure we aren't ignoring or blocking SIGPIPE.
+*/
+   sanitize_signals();
+
git_setup_gettext();
 
trace_command_performance(argv);
diff --git a/setup.c b/setup.c
index 0a22f8b..7aa4b01 100644
--- a/setup.c
+++ b/setup.c
@@ -865,3 +865,14 @@ int daemonize(void)
return 0;
 #endif
 }
+
+/* un-ignore and un-block SIGPIPE */
+void sanitize_signals(void)
+{
+   sigset_t unblock;
+
+   sigemptyset(&unblock);
+   sigaddset(&unblock, SIGPIPE);
+   sigprocmask(SIG_UNBLOCK, &unblock, NULL);
+   signal(SIGPIPE, SIG_DFL);
+}
diff --git a/t/t0012-sigpipe.sh b/t/t0012-sigpipe.sh
new file mode 100755
index 000..213cde3
--- /dev/null
+++ b/t/t0012-sigpipe.sh
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+test_description='check handling of SIGPIPE'
+. ./test-lib.sh
+
+test_expect_success 'create blob' '
+   test-genrandom foo 16384 >file &&
+   git add file
+'
+
+large_git () {
+   for i in $(test_seq 1 100); do
+   git diff --staged --binary || return $?
+   done
+}
+
+test_expect_success 'git dies with SIGPIPE' '
+   OUT=$( ((large_git; echo $? 1>&3) | true) 3>&1 )
+   test "$OUT" -eq 141
+'
+
+test_expect_success 'git dies with SIGPIPE even if parent ignores it' '
+   OUT=$( ((trap "" PIPE; large_git; echo $? 1>&3) | true) 3>&1 )
+   test "$OUT" -eq 141
+'
+
+test_done
-- 
2.0.1

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] use a hashmap to make remotes faster

2014-07-29 Thread Patrick Reynolds
Remotes are stored as an array, so looking one up or adding one without
duplication is an O(n) operation.  Reading an entire config file full of
remotes is O(n^2) in the number of remotes.  For a repository with tens of
thousands of remotes, the running time can hit multiple minutes.

Hash tables are way faster.  So we add a hashmap from remote name to
struct remote and use it for all lookups.  The time to add a new remote to
a repo that already has 50,000 remotes drops from ~2 minutes to < 1
second.

We retain the old array of remotes so iterators proceed in config-file
order.

Signed-off-by: Patrick Reynolds 
---
mark function with no arguments as taking void

 remote.c | 63 ++-
 remote.h |  3 +++
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/remote.c b/remote.c
index a0701f6..52533e4 100644
--- a/remote.c
+++ b/remote.c
@@ -42,6 +42,7 @@ struct rewrites {
 static struct remote **remotes;
 static int remotes_alloc;
 static int remotes_nr;
+static struct hashmap remotes_hash;
 
 static struct branch **branches;
 static int branches_alloc;
@@ -136,26 +137,51 @@ static void add_url_alias(struct remote *remote, const 
char *url)
add_pushurl_alias(remote, url);
 }
 
+struct remotes_hash_key {
+   const char *str;
+   int len;
+};
+
+static int remotes_hash_cmp(const struct remote *a, const struct remote *b, 
const struct remotes_hash_key *key)
+{
+   if (key)
+   return strncmp(a->name, key->str, key->len) || 
a->name[key->len];
+   else
+   return strcmp(a->name, b->name);
+}
+
+static inline void init_remotes_hash(void)
+{
+   if (!remotes_hash.cmpfn)
+   hashmap_init(&remotes_hash, (hashmap_cmp_fn)remotes_hash_cmp, 
0);
+}
+
 static struct remote *make_remote(const char *name, int len)
 {
-   struct remote *ret;
-   int i;
+   struct remote *ret, *replaced;
+   struct remotes_hash_key lookup;
+   struct hashmap_entry lookup_entry;
 
-   for (i = 0; i < remotes_nr; i++) {
-   if (len ? (!strncmp(name, remotes[i]->name, len) &&
-  !remotes[i]->name[len]) :
-   !strcmp(name, remotes[i]->name))
-   return remotes[i];
-   }
+   if (!len)
+   len = strlen(name);
+
+   init_remotes_hash();
+   lookup.str = name;
+   lookup.len = len;
+   hashmap_entry_init(&lookup_entry, memhash(name, len));
+
+   if ((ret = hashmap_get(&remotes_hash, &lookup_entry, &lookup)) != NULL)
+   return ret;
 
ret = xcalloc(1, sizeof(struct remote));
ret->prune = -1;  /* unspecified */
ALLOC_GROW(remotes, remotes_nr + 1, remotes_alloc);
remotes[remotes_nr++] = ret;
-   if (len)
-   ret->name = xstrndup(name, len);
-   else
-   ret->name = xstrdup(name);
+   ret->name = xstrndup(name, len);
+
+   hashmap_entry_init(ret, lookup_entry.hash);
+   replaced = hashmap_put(&remotes_hash, ret);
+   assert(replaced == NULL);  /* no previous entry overwritten */
return ret;
 }
 
@@ -722,13 +748,16 @@ struct remote *pushremote_get(const char *name)
 
 int remote_is_configured(const char *name)
 {
-   int i;
+   struct remotes_hash_key lookup;
+   struct hashmap_entry lookup_entry;
read_config();
 
-   for (i = 0; i < remotes_nr; i++)
-   if (!strcmp(name, remotes[i]->name))
-   return 1;
-   return 0;
+   init_remotes_hash();
+   lookup.str = name;
+   lookup.len = strlen(name);
+   hashmap_entry_init(&lookup_entry, memhash(name, lookup.len));
+
+   return hashmap_get(&remotes_hash, &lookup_entry, &lookup) != NULL;
 }
 
 int for_each_remote(each_remote_fn fn, void *priv)
diff --git a/remote.h b/remote.h
index 917d383..81cb5ff 100644
--- a/remote.h
+++ b/remote.h
@@ -1,6 +1,7 @@
 #ifndef REMOTE_H
 #define REMOTE_H
 
+#include "hashmap.h"
 #include "parse-options.h"
 
 enum {
@@ -10,6 +11,8 @@ enum {
 };
 
 struct remote {
+   struct hashmap_entry ent;  /* must be first */
+
const char *name;
int origin;
 
-- 
2.0.0.rc4