Re: [Bug-wget] Segmentation fault with current development version of wget

2013-05-01 Thread Gijs van Tulder
on the opt.method would be 'restored' to POST by RESTORE_POST_DATA. Regards, Gijs Op 01-05-13 22:16 schreef Giuseppe Scrivano: hi Gijs, Gijs van Tulder gvtul...@gmail.com writes: Giuseppe Scrivano wrote: what about this patch? Any comment? Another suggestion: why not save the original

[Bug-wget] Remaining reference to opt.post_data (WARC in src/http.c)

2013-04-21 Thread Gijs van Tulder
a simple fix. See the attached patch. Regards, Gijs From d2e6e16b3062cc0e6b3c13fd04e3654ed2dbdb6e Mon Sep 17 00:00:00 2001 From: Gijs van Tulder gvtul...@gmail.com Date: Sun, 21 Apr 2013 22:36:50 +0200 Subject: [PATCH] Remove old reference to opt.post_data. --- src/ChangeLog |5 + src/http.c

[Bug-wget] Standards fix for metadata records in WARC files

2013-04-12 Thread Gijs van Tulder
/resources/warc-implementation-guidelines-v1 commit b54fb8feb9dfb2a111d15f1b759de61217d5251e Author: Gijs van Tulder gvtul...@gmail.com Date: Fri Apr 12 23:37:45 2013 +0200 warc: Follow the guidelines for metadata records Do not use the same UUID for the manifest and arguments records

Re: [Bug-wget] wget 1.14 possibly writing off-spec warc.gz files

2013-03-30 Thread Gijs van Tulder
/private/wgzip.c#314 diff --git a/src/ChangeLog b/src/ChangeLog index 8e1213f..65d636d 100644 --- a/src/ChangeLog +++ b/src/ChangeLog @@ -1,3 +1,8 @@ +2013-03-31 Gijs van Tulder gvtul...@gmail.com + + * warc.c: Correctly write the field length in the skip length field + of .warc.gz files

Re: [Bug-wget] [PATCH] Invalid Content-Length header in WARC files, on some platforms

2012-11-24 Thread Gijs van Tulder
Ruehsen fixed this in his version of the patch.) The attached patch uses number_to_string to fix the printf in warc_write_cdx_record. Regards, Gijs From 21fc9f0dd9c71e2dc3aea29be4e16f14620d12a5 Mon Sep 17 00:00:00 2001 From: Gijs van Tulder gvtul...@gmail.com Date: Sat, 24 Nov 2012 12:44:14
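Editor's sketch: the snippet above says the patch uses number_to_string to fix the printf in warc_write_cdx_record. Presumably the underlying issue is that the width of off_t differs between platforms, so a fixed conversion specifier can disagree with the argument type. The block below is not wget's number_to_string; it is a different, standard C technique (a cast to intmax_t with PRIdMAX), and the helper name format_offset is hypothetical.

```c
#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper (not part of wget): format an off_t as decimal text.
   Casting to intmax_t and printing with PRIdMAX keeps the conversion
   specifier and the argument type in agreement, whatever width off_t has
   on the current platform.  Returns what snprintf returns. */
static int
format_offset (char *buf, size_t size, off_t value)
{
  return snprintf (buf, size, "%" PRIdMAX, (intmax_t) value);
}
```

A caller would pass a buffer large enough for the longest expected value, for example `char buf[32]; format_offset (buf, sizeof buf, length);`, and write the resulting text into the header.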

[Bug-wget] Invalid Content-Length header in WARC files, on some platforms

2012-11-12 Thread Gijs van Tulder
. Regards, Gijs commit 66c0595f5440b36afb7307d4cab3d6430254183b Author: Gijs van Tulder gvtul...@gmail.com Date: Mon Nov 12 22:03:30 2012 +0100 Fix for invalid WARC Content-Length header on some platforms. diff --git a/src/ChangeLog b/src/ChangeLog index ec78fe8..3901d94 100644 --- a/src/ChangeLog

[Bug-wget] Segfault with WARC + CDX

2012-05-30 Thread Gijs van Tulder
(rec_existing->url, url) == 0) The attached patch makes this change. The deduplication works better. Regards, Gijs From 807b98d7d9289765c9f210336d2dbf294d663f99 Mon Sep 17 00:00:00 2001 From: Gijs van Tulder gvtul...@gmail.com Date: Wed, 30 May 2012 23:00:04 +0200 Subject: [PATCH] warc: Fix segfault

[Bug-wget] Combining --output-document with --recursive

2012-05-24 Thread Gijs van Tulder
Hi, There's a problem if you combine --output-document with --recursive or --page-requisites. --output-document breaks the recursion. First you get a warning: WARNING: combining -O with -r or -p will mean that all downloaded content will be placed in the single file you specified. That

Re: [Bug-wget] Regular expression matching

2012-04-10 Thread Gijs van Tulder
:28:11 + @@ -1,3 +1,8 @@ +2012-04-11 Gijs van Tulder gvtul...@gmail.com + + * bootstrap.conf (gnulib_modules): Include module `regex'. + * configure.ac: Check for PCRE library. + 2012-03-25 Ray Satiro raysat...@yahoo.com * configure.ac: Fix build under mingw when OpenSSL is used

[Bug-wget] Regular expression matching

2012-04-04 Thread Gijs van Tulder
Hi, Here is a patch that adds the --acceptregex and --rejectregex options. With these options it would be possible to do two things: 1. You can match complete urls, instead of just the directory prefix or the file name suffix (which you can do with --accept and --include-directories). 2. You
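Editor's sketch: matching a complete URL against a regular expression, as described in point 1 above, can be illustrated with the POSIX regex API (regcomp, regexec, regfree). This is not the code from the patch; the helper name url_matches and the example pattern are assumptions made for illustration.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not wget's code): return true if the POSIX extended
   regular expression PATTERN matches somewhere in URL.  Anchor the pattern
   with ^ and $ to require that it covers the complete URL. */
static bool
url_matches (const char *pattern, const char *url)
{
  regex_t re;
  bool matched;

  /* REG_NOSUB: only a yes/no answer is needed, not the match offsets. */
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return false;               /* Invalid pattern: treat as no match. */

  matched = regexec (&re, url, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}
```

For example, `url_matches ("^https?://example\\.com/.*\\.html$", url)` tests the scheme, host, and suffix in one expression, which a directory prefix or file name suffix alone cannot do. A real implementation would likely compile the pattern once and reuse it for every URL rather than recompiling on each call.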

Re: [Bug-wget] Regular expression matching

2012-04-04 Thread Gijs van Tulder
Ángel González wrote: I really like PCRE, but I think the default should be POSIX regex Certainly. (I'm not sure if it's even worth adding the PCRE option. Matching URLs can't be that hard, can it?) How are the interactions between --{accept,reject}regex and --{accept,reject}? The regex

[Bug-wget] Fix for crash on invalid STYLE tag

2012-04-01 Thread Gijs van Tulder
18:13:27 + +++ src/ChangeLog 2012-04-01 20:35:28 + @@ -1,3 +1,7 @@ +2012-04-01 Gijs van Tulder gvtul...@gmail.com (tiny change) + + * html-url.c: Prevent crash on incomplete STYLE tag. + 2012-03-29 From: Tim Ruehsen tim.rueh...@gmx.de (tiny change) * utils.c (library): Include sys

[Bug-wget] Fix: Large files in WARC

2012-01-31 Thread Gijs van Tulder
-31 23:16:33 + @@ -1,3 +1,9 @@ +2012-02-01 Gijs van Tulder gvtul...@gmail.com + + * warc.c: Fix large file support with ftello, fseeko. + * warc.h: Fix large file support. + * http.c: Fix large file support. + 2012-01-27 Gijs van Tulder gvtul...@gmail.com * retr.c (fd_read_body
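Editor's sketch: the ChangeLog entry above mentions ftello and fseeko. These POSIX functions take and return off_t instead of long, so they can address positions beyond 2 GB on platforms where long is 32 bits. The block below is a generic illustration, not the code from warc.c; the helper name stream_size is hypothetical, and the _FILE_OFFSET_BITS definition is an assumption about how a 64-bit off_t is requested on systems that need it.

```c
#define _FILE_OFFSET_BITS 64    /* Request a 64-bit off_t where it matters. */
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper (not wget's code): return the size of an open stream,
   or -1 on error.  The original file position is restored before returning. */
static off_t
stream_size (FILE *fp)
{
  off_t start, end;

  start = ftello (fp);          /* Remember where we are. */
  if (start < 0)
    return -1;
  if (fseeko (fp, 0, SEEK_END) != 0)
    return -1;
  end = ftello (fp);            /* Offset of the end is the size. */
  if (fseeko (fp, start, SEEK_SET) != 0)
    return -1;
  return end;
}
```

The same pattern with ftell and fseek would truncate or fail for files larger than LONG_MAX bytes.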

[Bug-wget] Two fixes: Memory leak with chunked responses / Chunked responses and WARC files

2012-01-27 Thread Gijs van Tulder
+1,8 @@ +2012-01-27 Gijs van Tulder gvtul...@gmail.com + + * retr.c (fd_read_body): Fix a memory leak with chunked responses. + * http.c (skip_short_body): Fix the same memory leak. + 2012-01-09 Gijs van Tulder gvtul...@gmail.com * init.c: Disable WARC compression if zlib is disabled

Re: [Bug-wget] Cannot compile current bzr trunk: undefined reference to `gzwrite' / `gzclose' / `gzdopen'

2012-01-09 Thread Gijs van Tulder
regards, Thank you for a wonderful utility, -- Evgeniy === modified file 'ChangeLog' --- ChangeLog 2011-12-12 20:30:39 + +++ ChangeLog 2012-01-09 13:40:01 + @@ -1,3 +1,7 @@ +2012-01-09 Gijs van Tulder gvtul...@gmail.com + + * configure.ac: Always try to use libz, even without SSL

Re: [Bug-wget] WARC, new version

2011-11-04 Thread Gijs van Tulder
lovely. I am going to push it soon with some small adjustments. That's good to hear. There's one other small adjustment that you may want to make, see the attached patch. One of the WARC functions uses the basename function, which causes problems on OS X. Including libgen.h and strdup-ing

[Bug-wget] Memory leak when using GnuTLS

2011-10-31 Thread Gijs van Tulder
Hi, I think there is a memory leak in the GnuTLS part of wget. When downloading multiple files from an HTTPS server, wget with GnuTLS uses a lot of memory. Perhaps an explanation for this can be found in src/http.c. The gethttp function calls ssl_init for each download: /* Initialize the SSL

Re: [Bug-wget] WARC, new version

2011-10-30 Thread Gijs van Tulder
Hi David, David H. Lipman wrote: I have seen WARC mentioned but have not seen a definition. WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web resources. It is used for making archives of web sites. The Internet Archive, for example, uses it as the file format for

Re: [Bug-wget] WARC output

2011-09-26 Thread Gijs van Tulder
can you please send a complete diff against the current development tree version? Here's the diff of the WARC additions (1.9MB zipped) to revision 2565: http://dl.dropbox.com/u/365100/wget_warc-20110926-complete.patch.bz2 Thanks, Gijs

Re: [Bug-wget] WARC output

2011-09-25 Thread Gijs van Tulder
Hi. It's been a while since we've discussed the WARC addition to Wget. Is there anything I can help with? Gijs

Re: [Bug-wget] WARC output

2011-08-10 Thread Gijs van Tulder
Giuseppe Scrivano writes: The implementation makes use of the open source WARC Tools library (Apache License 2.0): http://code.google.com/p/warc-tools/ how much code is really needed from that library? I wonder if we can avoid this dependency at all. The library comes with some

[Bug-wget] WARC output

2011-08-09 Thread Gijs van Tulder
makes use of the open source WARC Tools library (Apache License 2.0): http://code.google.com/p/warc-tools/ I look forward to your response. Kind regards, Gijs van Tulder