[Bug-wget] [Bug-Wget] Issues with Metalink support

2014-03-18 Thread Darshit Shah
I was trying to download a large ISO (>4GB) through a metalink file.

The first thing that struck me was: the file is first downloaded to
/tmp and then moved to its final location.

Is there any specific reason for this? I understand that downloading
partial files to /tmp, stitching them together and then moving them to
the actual download location might be good for I/O. But a very common
case in such scenarios is when /tmp is not large enough to store the
file while I have enough capacity in the location where I'm trying to
store it.

Wget tried to download the file to /tmp, where it failed when it ran
out of free space. As a result, Wget crashed instead of exiting
gracefully. The failure itself should be considered a bug, since I
never asked Wget to save anything to /tmp; the download location I
selected has enough disk space.

We know that origin/master does not fail even with very large files.
Even if the disk is out of capacity, Wget manages to exit gracefully.
I'll try to dive into the code once I get time, but if anyone has any
ideas in the meantime, it would be greatly appreciated!

-- 
Thanking You,
Darshit Shah



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-03-19 Thread Giuseppe Scrivano
Darshit Shah  writes:

> I was trying to download a large ISO (>4GB) through a metalink file.
>
> The first thing that struck me was: The file is first downloaded to
> /tmp and then moved to the location.
>
> Is there any specific reason for this? I understand that downloading
> partial files to /tmp , stitching them and then moving them to the
> actual download location might be good on I/O. But, a very popular
> case in such scenarios is when /tmp is not large enough to store the
> file, but I have enough capacity in the location where I'm trying to
> store it.

I am not familiar with the metalink code either, and I don't know
whether there is a better reason for this behaviour; but if the only
reason is an optimization, then we shouldn't do this.

Regards,
Giuseppe



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-03-19 Thread Darshit Shah
On Wed, Mar 19, 2014 at 8:47 AM, Giuseppe Scrivano  wrote:
> Darshit Shah  writes:
>
>> I was trying to download a large ISO (>4GB) through a metalink file.
>>
>> The first thing that struck me was: The file is first downloaded to
>> /tmp and then moved to the location.
>>
>> Is there any specific reason for this? I understand that downloading
>> partial files to /tmp , stitching them and then moving them to the
>> actual download location might be good on I/O. But, a very popular
>> case in such scenarios is when /tmp is not large enough to store the
>> file, but I have enough capacity in the location where I'm trying to
>> store it.
>
> I am not familiar with the metalink code as well and I don't know if
> there is any better reason for this behaviour, but if the reason is only
> because of an optimization then we shouldn't do this.
>
Yeah, I agree. This part needs changing. Also, we need to test the
stability of Wget. Can't have it segfaulting on every failure.

The progress output is very complex and unreadable too. I think I'll
make a Wiki page based on all the things that need changing in
parallel-wget.



-- 
Thanking You,
Darshit Shah



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-05 Thread L Walsh



Darshit Shah wrote:

I was trying to download a large ISO (>4GB) through a metalink file.

The first thing that struck me was: The file is first downloaded to
/tmp and then moved to the location.

Is there any specific reason for this?


Sorry for the long delay in answering this, but I thought I would
mention a specific reason why this is done on Windows (one that may
apply to Linux to varying degrees, depending on the filesystem type
used and file-system activity).

To answer the question, there is a reason, but
its importance would be specific to each user's use case.

It is consistent with how some files from the internet are
downloaded, copied or extracted on Windows.

I.e. IE will download things to a tmp dir (usually
under the user's home dir on Windows), then
move it into place when it is done.  This prevents partly
transferred files from appearing in the destination.

Downloading this way can also *allow* for allocating
sufficient contiguous space at the destination in one
allocation, and then copying the file
into place -- this allows for less fragmentation at the
final destination.  This is more true with larger
files and slower downloads that might stretch over several
minutes or more.  Other activity on the disk
is likely, and if writes occur, they might land in the
middle of where the downloaded file _could_ have had
contiguous space.

So putting a file that is likely to be fragmented as it
is downloaded (due to other processes running) into
a 'tmp' location allows knowing the full size
and allocating the full amount for the file so it can
be contiguous on disk.

It can't allocate the full amount for the file at
the destination until it has the whole thing locally, since
if the download is interrupted, the destination would contain
a file that looks to be the right size, but would have
an incomplete download in it.

Anyway -- the behavior of copying to a tmp location first is a useful
feature to have -- IF you have the space.  It would be
a "nice" (not required) feature if there were an option for
how to do this (i.e. store the file directly on download, or
use a tmpdir and then move (or copy) the file into the
final location).

Always going direct is safest if the user is tight on disk space,
but it has the deficit of often causing more disk fragmentation.

(FWIW, I don't really care one way or the other, but wanted
to tell you why it might be useful)...

Cheers!
Linda



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-05 Thread Random Coder
On Sat, Apr 5, 2014 at 4:09 PM, L Walsh  wrote:

> I.e. IE will download things to a tmp dir (usually
> under the user's home dir on windows), then
> move it into place when it is done.  This prevents partly
> transferred files from appearing in the destination.
>

IE does not download to a tmp folder.  For instance, I just downloaded a
file to a folder, and I can watch the file grow in the destination folder.
IE uses a ".partial" extension for the file as it downloads it, renaming
the file to its proper name when it's done.  Chrome and Firefox behave
similarly, just using a different extension for the partial file.

] dir ubuntu-12.04.4-desktop-amd64.iso.scwpnys.partial
04/05/2014  04:57 PM        13,115,224 ubuntu-12.04.4-desktop-amd64.iso.scwpnys.partial

] dir ubuntu-12.04.4-desktop-amd64.iso.scwpnys.partial
 Volume in drive C is OS
04/05/2014  04:57 PM        14,163,800 ubuntu-12.04.4-desktop-amd64.iso.scwpnys.partial

I'm not convinced trying to pre-optimize for disk fragmentation is useful
here.  If the user is concerned about such things, they're free to copy the
download after it's done and delete the original.  Or run a defragmenter.


Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-05 Thread L Walsh



Random Coder wrote:
On Sat, Apr 5, 2014 at 4:09 PM, L Walsh wrote:


I.e. IE will download things to a tmp dir (usually
under the user's home dir on windows), then
move it into place when it is done.  This prevents partly
transferred files from appearing in the destination.


IE does not download to a tmp folder.

---

It depends on timing, what version of IE, and probably
the phase of the moon, but here's an abbreviated trace of me downloading
the linux kernel into C:\tmp\download.  I annotate what's going on
between the trace lines... you can see almost 50% of the file was downloaded
into a tmp file, then it switched to the final destination and only
wrote 1 MB chunks instead of the previous 4-12 KB chunks.

6:17:13,IEXPLORE,CreateFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N","Desired Access: 
Read Attributes, OpenResult: Opened"
6:17:13,IEXPLORE,CreateFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Desired 
Access: Generic Write, Read Attributes, OpenResult: Created"
6:17:13,IEXPLORE,SetAllocationInformationFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","AllocationSize: 
78,399,152"
6:17:13,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
0, Length: 704, Priority: Normal"
6:17:13,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
704, Length: 1,944"
6:17:13,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
2,648, Length: 8,192"
6:17:13,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
10,840, Length: 4,096"

...
6:17:23,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
36,207,192, Length: 4,096"
6:17:23,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
36,211,288, Length: 4,096"
6:17:23,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
36,215,384, Length: 16,384"


I've typed in the save pathname now:
6:17:23,explorer,808","CreateFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz","Desired Access: Read Attributes, 
OpenResult: Opened"

6:17:23,explorer,808","CloseFile",OK ,"C:\tmp\download\linux-3.14.tar.xz",""
6:17:23,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
36,231,768, Length: 4,096"

...
6:17:23,IEXPLORE,WriteFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
36,461,144, Length: 4,096"

...

opens "partial file in same directory":

6:17:23,IEXPLORE,CreateFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Desired Access: Generic 
Write, OpenResult: Opened"


copies from 1st tmp to final location tmp, but in 1MB increments
6:17:23,IEXPLORE,ReadFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
0, Length: 1,048,576, Priority: Normal"
6:17:23,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 0, Length: 
1,048,576, Priority: Normal"

...
6:17:23,IEXPLORE,ReadFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
35,651,584, Length: 817,752"
6:17:23,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 35,651,584, 
Length: 817,752"
6:17:23,IEXPLORE,ReadFile","END OF 
FILE","C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Offset: 
36,469,336, Length: 1,048,576"


deletes first tmp, and now saves directly to "partial" at destination:
6:17:23,IEXPLORE,SetDispositionInformationFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz","Delete: 
True"
6:17:23,IEXPLORE,CloseFile",OK 
,"C:AppData\Local\MS\Win\\BNZE234N\linux-3.14.tar[1].xz",""


only 1M writes:
6:17:23,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 36,469,336, 
Length: 1,048,576"
6:17:24,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 37,517,912, 
Length: 1,048,576"
6:17:24,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 38,566,488, 
Length: 1,048,576"
6:17:24,explorer,QueryDirectory",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Filter: 
linux-3.14.tar.xz.w5aj0r5.partial, 1: linux-3.14.tar.xz.w5aj0r5.partial"



final output being created:
6:17:24,explorer,CreateFile",OK ,"C:\tmp\download\linux-3.14.tar.xz","Desired 
Access: Read Attributes, Read Control, OpenResult: Opened"


more writes to partial:
6:17:25,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 40,663,640, 
Length: 1,048,576"
6:17:25,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 41,712,216, 
Length: 1,048,576"
6:17:25,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offset: 42,760,792, 
Length: 1,048,576"
6:17:25,IEXPLORE,WriteFile",OK 
,"C:\tmp\download\linux-3.14.tar.xz.w5aj0r5.partial","Offs

Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-06 Thread Steven M. Schweda
   I haven't been following this thread closely, but the following
statement caught my eye:

> Even utilities like winzip and 7zip will extract file to the user's tmp
> dir before copying or moving them into the final location.

   Really?  I can't speak for those programs, but I can say that
Info-ZIP UnZip definitely does not use "the user's tmp dir" when
extracting an archive member.  Extraction is done directly to the actual
destination directory.

   When creating an archive, Info-ZIP Zip does create a temporary
archive file, but it normally uses the actual destination directory for
this temporary archive file, and then renames it to the user-specified
name if it's created successfully.  With Zip, a user can use the
-b/--temp-path option to specify explicitly a different temporary
archive directory, and then Zip will do a copy+delete operation if a
rename operation fails.  However, this is useful only if the archive
destination is something like a WORM drive, where a rename operation is
not available.  In general, a rename operation won't work between
devices, and copy+delete is much slower than rename, so using a
different directory for the temporary archive is generally a bad idea.

   In some cases, on some operating systems (VMS, for example), UnZip
can pre-allocate disk space when extracting an archive member.  It's
not generally done, because the methods used tend to be OS-specific.

   I'll let you decide what Wget should be doing, but I'd be careful
about faulty analogies to other programs.



   Steven M. Schweda   sms@antinode-info
   382 South Warwick Street(+1) 651-699-9818
   Saint Paul  MN  55105-2547



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-06 Thread Ángel González

On 06/04/14 01:09, L Walsh wrote:

Sorry for the long delay answering this but I thought
I would mention a specific reason that such is done
on windows (that may apply to linux in various degrees
depending on filesystem type used and file-system activity).

To answer the question, there is a reason, but
its importance would be specific to each user's use case.

It is consistent with how some files from the internet are
downloaded, copied or extracted on windows.

I.e. IE will download things to a tmp dir (usually
under the user's home dir on windows), then
move it into place when it is done.  This prevents partly
transferred files from appearing in the destination.

Downloading this way can, also, *allow* for allocating
sufficient contiguous space at the destination in 1
allocation, and then copying the file
into place -- this allows for less fragmentation at the
final destination.  This is more true with larger
files and slower downloads that might stretch over several
or more minutes.  Other activity on the disk
is likely and if writes occur, they might happen in the
middle of where the downloaded file _could_ have had
contiguous space.

So putting a file that is likely to be fragmented as it
is downloaded due to other processes running, into
a 'tmp' location, can allow for knowing the full size
and allocating the full amount for the file so it can
be contiguous on disk.

If %TEMP% is in the same drive as the final folder, you still
have fragmentation.


It can't allocate the full amount for the file at
the destination until it has the whole thing locally, since
if the download is interrupted, the destination would contain
a file that looks to be the right size, but would have
an incomplete download in it.

It's possible -- with some filesystems -- via the Linux-specific fallocate()
syscall, but that's hardly portable :)
From Vista onwards, SetFileInformationByHandle() with a FILE_ALLOCATION_INFO
structure seems able to do that as well.
I would make it fail gracefully for the EOLed versions, but it seems perfectly
fine to use.
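
To illustrate, a minimal sketch of that call (the function name, path
handling and fallback are mine, purely illustrative -- not wget code):

#include <windows.h>

/* Hedged sketch only: preallocate SIZE bytes for PATH on Vista or
   later.  Pre-Vista, SetFileInformationByHandle() does not exist,
   so a real build would resolve it via GetProcAddress() and simply
   skip preallocation when it is absent. */
static int
preallocate_win32 (const char *path, LONGLONG size)
{
  HANDLE h = CreateFileA (path, GENERIC_WRITE, 0, NULL,
                          OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
  if (h == INVALID_HANDLE_VALUE)
    return -1;

  FILE_ALLOCATION_INFO info;
  info.AllocationSize.QuadPart = size;

  /* Reserves the space without changing the visible file size. */
  BOOL ok = SetFileInformationByHandle (h, FileAllocationInfo,
                                        &info, sizeof info);
  CloseHandle (h);
  return ok ? 0 : -1;
}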


Anyway -- the behavior of copying it to a tmp is a useful
feature to have -- IF you have the space.  It would be
a "nice" (not required) feature if there was an option on
how to do this (i.e. store file directly on download, or
use a tmpdir and then move (or copy) the file into the
final location).

Always going direct is safest if user is tight on diskspace,
but has the deficit of often causing more disk fragmentation.

Not if you do something like calling posix_fallocate(3)
(though it does change the file size)


(FWIW, I don't really care one way or the other, but wanted
to tell you why it might be useful)...

Cheers!
Linda


If you don't want to download with the final filename, I vote for
downloading at the same folder with another extension and
renaming.

I don't think wget should care about fragmentation, though.

Looking a bit at the available options, and trying to get the best of
both sides, I think we should download with the file in place, trying
to preallocate the blocks (fallocate, SetFileInformationByHandle)
when possible, but not worrying too much if it can't.

Cheers



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-07 Thread Darshit Shah
I completely agree with Angel; I don't think Wget should bother about
fragmentation. It would be nice to reduce as much fragmentation as
possible, but actually going out of the way to do this doesn't seem very
right to me.
As Random Coder states, users who care about fragmentation should either
manually copy the file, run a defragmentation tool, or better yet use
a smarter file system that handles this for them. We *may* state this in the
man/info pages to ensure that everyone knows exactly what is going on.

Wget could, in theory, use fallocate() for Linux, posix_fallocate() for
other POSIX-compliant systems and SetFileInformationByHandle() (is this
available on older versions of Windows?) for Windows systems. That isn't
going too far out of the way, and it ensures Wget plays well on each
system. However, it would lead to way too many code paths and ifdef
statements, so personally speaking, I'd rather we use only
posix_fallocate() on POSIX systems and the Windows syscalls on Windows.
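
Roughly, I'm imagining a single best-effort helper along these lines (a
sketch only; the name and the WINDOWS guard are illustrative, not actual
Wget code):

#include <fcntl.h>      /* posix_fallocate */
#include <sys/types.h>

/* Best-effort preallocation of LEN bytes for FD.  Failure is
   deliberately non-fatal: Wget would just fall back to growing
   the file as data arrives, exactly as it does today. */
static void
preallocate_space (int fd, off_t len)
{
#ifdef WINDOWS
  /* Call SetFileInformationByHandle (FileAllocationInfo) here;
     Vista and later only, so skip silently on older systems. */
  (void) fd;
  (void) len;
#else
  /* posix_fallocate() returns an error number instead of setting
     errno; zero means the space is reserved. */
  (void) posix_fallocate (fd, 0, len);
#endif
}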



On Sun, Apr 6, 2014 at 9:41 PM, Ángel González  wrote:

>  On 06/04/14 01:09, L Walsh wrote:
>
> Sorry for the long delay answering this but I thought
> I would mention a specific reason that such is done
> on windows (that may apply to linux in various degrees
> depending on filesystem type used and file-system activity).
>
> To answer the question, there is a reason, but
> its importance would be specific to each user's use case.
>
> It is consistent with how some files from the internet are
> downloaded, copied or extracted on windows.
>
> I.e. IE will download things to a tmp dir (usually
> under the user's home dir on windows), then
> move it into place when it is done.  This prevents partly
> transferred files from appearing in the destination.
>
> Downloading this way can, also, *allow* for allocating
> sufficient contiguous space at the destination in 1
> allocation, and then copying the file
> into place -- this allows for less fragmentation at the
> final destination.  This is more true with larger
> files and slower downloads that might stretch over several
> or more minutes.  Other activity on the disk
> is likely and if writes occur, they might happen in the
> middle of where the downloaded file _could_ have had
> contiguous space.
>
> So putting a file that is likely to be fragmented as it
> is downloaded due to other processes running, into
> a 'tmp' location, can allow for knowing the full size
> and allocating the full amount for the file so it can
> be contiguous on disk.
>
> If %TEMP% is in the same drive as the final folder, you still
> have fragmentation.
>
>
>  It can't allocate the full amount for the file at
> the destination until it has the whole thing locally, since
> if the download is interrupted, the destination would contain
> a file that looks to be the right size, but would have
> an incomplete download in it.
>
> It's possible -- with some filesystems -- via the Linux-specific fallocate()
> syscall, but that's hardly portable :)
> From Vista onwards, SetFileInformationByHandle() with a FILE_ALLOCATION_INFO
> structure seems able to do that as well.
> I would make it fail gracefully for the EOLed versions, but it seems perfectly
> fine to use.
>
>
> Anyway -- the behavior of copying it to a tmp is a useful
> feature to have -- IF you have the space.  It would be
> a "nice" (not required) feature if there was an option on
> how to do this (i.e. store file directly on download, or
> use a tmpdir and then move (or copy) the file into the
> final location).
>
> Always going direct is safest if user is tight on diskspace,
> but has the deficit of often causing more disk fragmentation.
>
> Not if you do something like calling posix_fallocate(3)
> (though it does change the file size)
>
>
> (FWIW, I don't really care one way or the other, but wanted
> to tell you why it might be useful)...
>
> Cheers!
> Linda
>
>
> If you don't want to download with the final filename, I vote for
> downloading at the same folder with another extension and
> renaming.
>
> I don't think wget should care about fragmentation, though.
>
> Looking a bit at the available options, and trying to get the best of
> both sides, I think we should download with the file in place, trying
> to preallocate the blocks (fallocate, SetFileInformationByHandle)
> when possible, but not worrying too much if it can't.
>
> Cheers
>
>


-- 
Thanking You,
Darshit Shah


Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-07 Thread L Walsh



Darshit Shah wrote:
Wget could, in theory, use fallocate() for linux, posix_fallocate() for 
other posix-compliant systems and SetFileInformationByHandle (is this 
available on older versions of Windows?) for Windows systems. It isn't 
going out of the way by a large extent but ensures Wget plays well on 
each system. However, this is going to lead to way too many code paths 
and ifdef statements, and personally speaking, I'd rather we use only 
posix_fallocate() everywhere and the Windows SysCalls for Windows.


Hey, that'd be fine with me -- OR, if the length is not known,
allocating 1 MB chunks at a time and truncating at the final
write.  If performance were an issue, I'd fork off the truncation
in the background -- I do something similar in a file util that can
delete duplicates; the deletions I do with async I/O in the
background so they won't slow down the primary function.
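
i.e. roughly this (a sketch only; names made up, error handling elided):

#include <fcntl.h>      /* posix_fallocate */
#include <unistd.h>     /* ftruncate */
#include <sys/types.h>

#define CHUNK (1024 * 1024)     /* reserve 1 MB at a time */

/* Unknown total length: keep the reservation one chunk ahead of
   the write position so writes land in pre-reserved space. */
static void
reserve_ahead (int fd, off_t written, off_t *reserved)
{
  if (written + CHUNK > *reserved)
    {
      (void) posix_fallocate (fd, *reserved, CHUNK);
      *reserved += CHUNK;
    }
}

/* At the final write: trim the over-reservation to the true size. */
static void
finish_file (int fd, off_t final_size)
{
  (void) ftruncate (fd, final_size);
}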

I don't usually have a problem with fragmentation on Linux,
as I run xfs, which will do some pre-allocation for you (more in recent
kernels with its "speculative preallocation"), AND for those who
have degenerate use cases or who are anal-retentive (*cough*) there
is a file-system reorganizer that can be run when needed or as a nightly
cronjob...  So this isn't really a problem for me -- I was answering
the question because MS took preventative measures to try to slow
down disk fragmentation, as NTFS (and FAT for that matter)
will suffer when it gets bad, like many file systems.  Most don't protect
themselves to the extremes that xfs does to prevent it.

But a sane middle ground, like using the posix pre-alloc calls
and such, seems reasonable -- as does preallocating
larger spaces when downloading large files.

I.e. Probably don't want to allocate a meg for each little
1k file on a mirror, but if you see the file size is large (size known),
or have downloaded a meg or more, then preallocation w/a truncate
starts to make some sense...

I was just speaking up to answer the question you posed, about
why someone might copy to one place then another... it wasn't meant
to create a problem so much as to give some insight into why it might be done.





Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-07 Thread Darshit Shah
On Mon, Apr 7, 2014 at 4:21 PM, L Walsh  wrote:
>
>
> Darshit Shah wrote:
>>
>> Wget could, in theory, use fallocate() for linux, posix_fallocate() for
>> other posix-compliant systems and SetFileInformationByHandle (is this
>> available on older versions of Windows?) for Windows systems. It isn't going
>> out of the way by a large extent but ensures Wget plays well on each system.
>> However, this is going to lead to way too many code paths and ifdef
>> statements, and personally speaking, I'd rather we use only
>> posix_fallocate() everywhere and the Windows SysCalls for Windows.
>
> 
> Hey, that'd be fine with me -- OR if the length is not known,
> then allocating 1Meg chunks at a time and truncating at the final
> write.  If performance was an issue, I'd fork off the truncation
> in background -- I do something similar in a file util that can
> delete duplicates, the deletions I do with async i/o in the
> background so they won't slow down the primary function.
>
> I don't usually have a problem with fragmentation on linux
> as I run xfs and will do some pre-allocation for you (more in recent
> kernels with its "speculative preallocation"), AND for those who
> have degenerate use cases or who are anal-retentive (*cough*) there
> is a file-system reorganizer that can be run when needed or on a nightly
> cronjob...  So this isn't really a problem for me -- I was answering
> the question because MS took preventative measures to try to slow
> down disk fragmentation, as NTFS (and FAT for that matter)
> will suffer when it gets bad like many file systems.  Most don't protect
> themselves to the extremes that xfs does to prevent it.
>
> But a sane middle ground like using posix pre-alloc calls
> and such seem like a reasonable middle ground -- or preallocating
> larger spaces when downloading large files
>
> I.e. Probably don't want to allocate a meg for each little
> 1k file on a mirror, but if you see the file size is large (size known),
> or have downloaded a meg or more, then preallocation w/a truncate
> starts to make some sense...
>
I *think* we might be going far from the original issue. Wget as it is
right now on origin/master seems to work perfectly. We could probably
improve or optimize it, but that calls for a separate discussion.
The issue at hand is how parallel-wget works. Now, one thing we must
remember in this use-case is that we *always* know the file size. If we
don't, Wget should automatically fall back to a non-parallel download of
the file.

Armed with the knowledge that we know the file size, I believe the
right way is to allocate the complete block with a .tmp/.swp or
similar extension and then rename(2) the completed download. This is
important since, when downloading a single file in multiple parts, you
want to be able to write to different locations of the file at random.
Continuing such downloads would otherwise be a problem. The guys from
metalink, curl, etc. have a better idea about such download scenarios
than we do and could probably suggest some easier alternatives.
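
In rough C, the flow I'm imagining is something like this (a sketch
only; the names and the elided thread plumbing are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* Size known up front: reserve it once, let every worker pwrite()
   its own byte range (no shared file position, no locking), and
   rename() into place only after all chunks arrive and verify. */
static int
assemble_parallel (const char *tmp_name, const char *final_name,
                   off_t file_size)
{
  int fd = open (tmp_name, O_CREAT | O_WRONLY, 0644);
  if (fd < 0 || posix_fallocate (fd, 0, file_size) != 0)
    return -1;

  /* ... each thread: pwrite (fd, buf, len, range_offset); ... */

  if (close (fd) != 0)
    return -1;
  /* rename(2) is atomic on POSIX: nobody ever sees a partial file
     under the final name. */
  return rename (tmp_name, final_name);
}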

> I was just speaking up to answer the question you posed, about
> why someone might copy to one place then another...it wasn't meant
> to create a problem as to give some insight as to why it might be done.
>
Never tried to insinuate that you were. :)
All help and advice is always welcome here as we try to learn about
and understand new things.



-- 
Thanking You,
Darshit Shah



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-04-08 Thread L Walsh



Steven M. Schweda wrote:


   In some cases, on some operating systems (VMS, for example), UnZip
can pre-allocate disk space when extracting an archive member.  It's
not generally done, because the methods used tend to be OS-specific.

---
Do the posix calls: if the OS is compliant, it works; if not, we're no
worse off than today.




   I'll let you decide what Wget should be doing, but I'd be careful
about faulty analogies to other programs.

-
   I wouldn't call them faulty analogies.  In the cases I've seen
w/7-zip, it's extracting from a network drive onto the local drive.
While it is true my network drive is faster than hard disks of 8-10 years
ago, it's still "downloading" from the net onto the local machine, so
I'm not sure why you'd call that faulty.  The theme in common is how many
writes from other processes are likely to come in and reserve space in
the middle of your download.


FWIW -- I just tried 7z now, and extracting a 6G file to C:/tmp --
it *did go direct*. 


Thing is, some of the things I remember have changed over the years.
So it's hard to say with any given version what does what without
retesting.

For this subject, when downloading in parallel -- if the final size
is known, it sounds like pre-allocating the file would be a good thing.

I know transmission (a torrent client) at least makes that an option
(I don't remember if it's the default or not) so as to not cause fragmentation --
and its fill pattern might not be very different from running,
say, several TCP downloads that fill the file from different locations.





Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-05-07 Thread Jure Grabnar
Hello,

I wrote two patches regarding this issue based on your suggestions.

The 1st one is crucial for the retrieve_from_file() function: it fixes 2
memory-corruption bugs.

The 2nd patch is more of an experiment than the real deal and is
intended to get some feedback. It changes the parallel download to use
one temporary file, located in the selected directory.

Before the download starts it uses posix_fallocate() to allocate space and
mkstemp() for a unique name. After the download is completed it uses rename()
to rename the temporary file to the final file name.

After posix_fallocate(), each thread opens the file with fopen("wb"). I
opted for fopen() rather than messing around with file descriptors
because I believe it's more portable. I don't know how Windows
would react to file descriptors, and I don't have a proper Windows system
to test it out. Now, fopen("wb") means the file, which was fallocate'd, is
truncated to zero, but after the first request from the thread that is
responsible for the last chunk, it grows back to at least file_size
- chunk_size. I'm also not sure how devastating that is.
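
For clarity, the setup phase boils down to something like this (a
simplified sketch of the flow just described, not the patch verbatim;
the "wget-XXXXXX" template is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

/* Unique temp name created in the *destination* directory, so the
   final rename() never crosses a filesystem; space reserved before
   any thread starts writing. */
static int
create_download_temp (char *name, size_t name_len,
                      const char *dir, off_t file_size)
{
  snprintf (name, name_len, "%s/wget-XXXXXX", dir);
  int fd = mkstemp (name);      /* replaces the XXXXXX in place */
  if (fd < 0)
    return -1;
  if (posix_fallocate (fd, 0, file_size) != 0)
    {
      close (fd);
      unlink (name);
      return -1;
    }
  return fd;
}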

I'm attaching a handmade Metalink file which downloads a 50MB file for
testing purposes. Currently all threads connect to the same server, and I
understand we don't support such behaviour, but I guess 2-3 threads for
testing purposes don't hurt anyone. :)

I'm open to suggestions.

Regards,

Jure Grabnar

From ed8acdbf66d74284d6688ad0ac69362bfdbc98a9 Mon Sep 17 00:00:00 2001
From: Jure Grabnar 
Date: Wed, 7 May 2014 22:38:20 +0200
Subject: [PATCH 1/2] Fix bugs causing memory corruption.

---
 src/ChangeLog | 6 ++
 src/multi.c   | 2 +-
 src/retr.c| 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 537a707..55a1278 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,9 @@
+2014-05-07	Jure Grabnar 
+
+	* multi.c: Add condition to fix memory corruption and downloading in parallel
+	in general.
+	* retr.c: Increase buffer size by 1 for '\0' to avoid memory corruption.
+
 2014-03-26  Darshit Shah  
 
 	* ftp.c (getftp): Rearrange parameters to fix compiler warning
diff --git a/src/multi.c b/src/multi.c
index 4b22b2e..43c2f73 100644
--- a/src/multi.c
+++ b/src/multi.c
@@ -153,7 +153,7 @@ fill_ranges_data(int num_of_resources, long long int file_size,
   for (r = 0; r < num_of_resources; ++r)
 ranges[i].resources[r] = false;
   ++i;
-} while (ranges[i-1].last_byte < (file_size - 1));
+} while (i < opt.jobs && ranges[i-1].last_byte < (file_size - 1));
   ranges[i-1].last_byte = file_size -1;
 
   return i;
diff --git a/src/retr.c b/src/retr.c
index 8c361de..2f45fa5 100644
--- a/src/retr.c
+++ b/src/retr.c
@@ -1250,7 +1250,7 @@ retrieve_from_file (const char *file, bool html, int *count)
   int res;
   /* Form the actual file to be downloaded and verify hash. */
   file_path = malloc((opt.dir_prefix ? strlen(opt.dir_prefix) : 0)
-   + strlen(file->name) + (sizeof "/"));
+   + strlen(file->name) + (sizeof "/") + 1);
   if(opt.dir_prefix)
 sprintf(file_path, "%s/%s", opt.dir_prefix, file->name);
   else
-- 
1.9.2

From 7b453b0c1c8355b538d9f7d8d040313a1d345d37 Mon Sep 17 00:00:00 2001
From: Jure Grabnar 
Date: Wed, 7 May 2014 22:49:51 +0200
Subject: [PATCH 2/2] Change parallel download to one temporary file instead of
 multiple.

---
 src/ChangeLog | 16 +++
 src/http.c| 12 ++---
 src/multi.c   | 87 ++-
 src/multi.h   | 13 +
 src/retr.c| 17 ++--
 5 files changed, 79 insertions(+), 66 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 55a1278..2a310e9 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,5 +1,21 @@
 2014-05-07	Jure Grabnar 
 
+	* multi.c: Parallel download is now stored in one temporary file rather than
+	multiple files.
+	(SUFFIX_TEMP): Define.
+	(name_temp_files): Use function mkstemp() instead of tmpnam() which is safer
+	and allows for customized path.
+	(init_temp_files, name_temp_files, delete_temp_files, clean_temp_files):
+	Rewritten to work with one file.
+	(merge_temp_files): Remove.
+	(rename_temp_file): Add.
+	* multi.h: Add global variable (barrier) 'file_rdy_bar'.
+	* retr.c (retrieve_from_file): Change code to work with one temporary file.
+	* http.c (gethttp): Likewise. Use external barrier to sync threads after
+	fopen().
+
+2014-05-07	Jure Grabnar 
+
 	* multi.c: Add condition to fix memory corruption and downloading in parallel
 	in general.
 	* retr.c: Increase buffer size by 1 for '\0' to avoid memory corruption.
diff --git a/src/http.c b/src/http.c
index 388530c..b46fc66 100644
--- a/src/http.c
+++ b/src/http.c
@@ -150,7 +150,7 @@ struct request {
 };
 
 extern int numurls;
-
+extern pthread_barrier_t file_rdy_bar;
 /* Create a new, empty request. Set the request's method and its
argum

Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-05-11 Thread Ángel González

On 07/05/14 23:46, Jure Grabnar wrote:

Hello,

I wrote two patches regarding this issue based on your suggestions.

The 1st one is crucial for retrieve_from_file() function: it fixes 2
memory corruption bugs.

The 2nd patch is more of an experiment than a real deal and is
intended to get some feedback. It changes parallel download to one
temporary file which is located in the selected directory.

Before download starts it uses posix_fallocate() to allocate space and
mkstemp() for unique name. After download is completed it uses rename()
to rename temporary file to the final file name.

After posix_fallocate() each thread opens file with fopen("wb").

You could use w+b, even though you're not going to read from it.


I opted for fopen() rather than messing around with file descriptors
because I believe it's more portable. I don't know how Windows
would react to file descriptors and I don't have a proper Windows system
to test it out.

It works fine.
On Windows, FILE* is a layer on top of fds, which are themselves a
layer over HANDLEs. To make things more complex, gnulib provides a
different abstraction to wget. But it should work. The only special
bit would be the need to add O_BINARY, which gnulib should already
be doing for you.



Now, fopen("wb") means file, which was fallocate'd, is
truncated to zero but after first request from the thread, which is
responsible for the last chunk, it would grow back to at least file_size
- chunk_size. I'm also not sure how devastating that is.
It's up to the filesystem, but I think it would be better to do
open() (or dup()) + fdopen() + fseek() rather than the fopen(, "wb");
it also allows you to dispense with the barrier.
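
A sketch of the per-thread pattern I mean (illustrative only; note that
a dup()'ed descriptor shares its file offset with the original, so a
fresh open() per thread is the safer variant):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Each thread gets its own descriptor (and thus its own file
   offset), wraps it in a FILE*, and seeks to its byte range.
   No "w" mode in sight, so the preallocated file is never
   truncated -- which is what made the barrier necessary with
   fopen(, "wb"). */
static FILE *
open_range (const char *path, long range_start)
{
  int fd = open (path, O_RDWR);
  if (fd < 0)
    return NULL;
  FILE *fp = fdopen (fd, "r+b");
  if (!fp)
    {
      close (fd);
      return NULL;
    }
  if (fseek (fp, range_start, SEEK_SET) != 0)
    {
      fclose (fp);
      return NULL;
    }
  return fp;
}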




I'm attaching a handmade Metalink file which downloads a 50MB file for
testing purposes. Currently all threads connect to the same server and I
understand we don't support such behaviour but I guess 2-3 threads for
testing purpose don't hurt anyone. :)

I'm open for suggestions.

Regards,

Jure Grabnar





Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-06-07 Thread Darshit Shah
On Sun, May 11, 2014 at 11:28 PM, Ángel González  wrote:
> On 07/05/14 23:46, Jure Grabnar wrote:
>>
>> Hello,
>>
>> I wrote two patches regarding this issue based on your suggestions.
>>
>> The 1st one is crucial for retrieve_from_file() function: it fixes 2
>> memory corruption bugs.
>>
>> The 2nd patch is more of an experiment than a real deal and is
>> intended to get some feedback. It changes parallel download to one
>> temporary file which is located in the selected directory.
>>
>> Before download starts it uses posix_fallocate() to allocate space and
>> mkstemp() for unique name. After download is completed it uses rename()
>> to rename temporary file to the final file name.
>>
>> After posix_fallocate() each thread opens file with fopen("wb").
>
> You could use w+b, even though you're not going to read from it.
>
>
>> I opted for fopen() rather than messing around with file descriptors
>> because I believe it's more portable. I don't know how Windows
>> would react to file descriptors and I don't have a proper Windows system
>> to test it out.
>
> It works fine.
> On Windows, FILE* are a layer on top of fds, which are themselves a layer
> over HANDLEs. To
> make things more complex, gnulib provides a different abstraction to wget.
> But it should work. The only special bit would be the need to add O_BINARY,
> which
> gnulib should already be doing for you.
>
>
>
>> Now, fopen("wb") means file, which was fallocate'd, is
>> truncated to zero but after first request from the thread, which is
>> responsible for the last chunk, it would grow back to at least file_size
>> - chunk_size. I'm also not sure how devastating that is.
>
> It's up to the filesystem, but I think it would be better to do open (or
> dup) + fdopen()
> + fseek rather than the fopen(, "wb"); It also allows you to dispense with
> the barrier.
>
>
>
>> I'm attaching a handmade Metalink file which downloads a 50MB file for
>> testing purposes. Currently all threads connect to the same server and I
>> understand we don't support such behaviour but I guess 2-3 threads for
>> testing purpose don't hurt anyone. :)
>>
Does anyone have any objections to the above patches? Else we can merge them.
>> I'm open for suggestions.
>>
>> Regards,
>>
>> Jure Grabnar
>
>
>



-- 
Thanking You,
Darshit Shah



Re: [Bug-wget] [Bug-Wget] Issues with Metalink support

2014-06-07 Thread Jure Grabnar
On Sun, 8 Jun 2014 10:35:30 +0530
Darshit Shah  wrote:

> On Sun, May 11, 2014 at 11:28 PM, Ángel González 
> wrote:
> > On 07/05/14 23:46, Jure Grabnar wrote:
> >>
> >> Hello,
> >>
> >> I wrote two patches regarding this issue based on your suggestions.
> >>
> >> The 1st one is crucial for retrieve_from_file() function: it fixes
> >> 2 memory corruption bugs.
> >>
> >> The 2nd patch is more of an experiment than a real deal and is
> >> intended to get some feedback. It changes parallel download to one
> >> temporary file which is located in the selected directory.
> >>
> >> Before download starts it uses posix_fallocate() to allocate space
> >> and mkstemp() for unique name. After download is completed it uses
> >> rename() to rename temporary file to the final file name.
> >>
> >> After posix_fallocate() each thread opens file with fopen("wb").
> >
> > You could use w+b, even though you're not going to read from it.
> >
> >
> >> I opted for fopen() rather than messing around with file
> >> descriptors because I believe it's more portable. I don't know how
> >> Windows would react to file descriptors and I don't have a proper
> >> Windows system to test it out.
> >
> > It works fine.
> > On Windows, FILE* are a layer on top of fds, which are themselves a
> > layer over HANDLEs. To
> > make things more complex, gnulib provides a different abstraction
> > to wget. But it should work. The only special bit would be the need
> > to add O_BINARY, which
> > gnulib should already be doing for you.
> >
> >
> >
> >> Now, fopen("wb") means file, which was fallocate'd, is
> >> truncated to zero but after first request from the thread, which is
> >> responsible for the last chunk, it would grow back to at least
> >> file_size
> >> - chunk_size. I'm also not sure how devastating that is.
> >
> > It's up to the filesystem, but I think it would be better to do
> > open (or dup) + fdopen()
> > + fseek rather than the fopen(, "wb"); It also allows you to
> > dispense with the barrier.
> >
> >
> >
> >> I'm attaching a handmade Metalink file which downloads a 50MB file
> >> for testing purposes. Currently all threads connect to the same
> >> server and I understand we don't support such behaviour but I
> >> guess 2-3 threads for testing purpose don't hurt anyone. :)
> >>
> Does anyone have any objections to the above patches? Else we can
> merge them.
I'm sending the updated 2nd patch. It uses fopen(,"r+b") and doesn't need
the barrier. I tested it quite a bit and it works OK. The only problem comes
when the number of threads is >= 8: the program sometimes crashes. I tried
debugging it with gdb and valgrind, but to no avail - it doesn't crash
under them (a Heisenbug). Through a core dump I found out the crash is
quite random (probably a race condition?). It's either accessing a free'd
pointer or freeing an already-free'd pointer. I don't want to spend too
much time fixing it right now, because when downloading of a single file
is done, Metalink downloads should migrate to it as well (this bug might
be gone then).

I believe this patch is not connected to the bug, though, because
Wget crashes even with the original code.


Best Regards,

Jure Grabnar
From a15b4f08efdf59471c45d8f322c72248d75ebd54 Mon Sep 17 00:00:00 2001
From: Jure Grabnar 
Date: Sun, 8 Jun 2014 08:08:38 +0200
Subject: [PATCH] Download to single temporary file.

---
 src/ChangeLog | 14 ++
 src/http.c| 10 ---
 src/multi.c   | 87 ++-
 src/multi.h   | 11 
 src/retr.c| 18 ++---
 5 files changed, 74 insertions(+), 66 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 55a1278..fd76bb1 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,5 +1,19 @@
 2014-05-07	Jure Grabnar 
 
+	* multi.c: Parallel download is now stored in one temporary file rather than
+	multiple files.
+	(SUFFIX_TEMP): Define.
+	(name_temp_files): Use function mkstemp() instead of tmpnam() which is safer
+	and allows for customized path.
+	(init_temp_files, name_temp_files, delete_temp_files, clean_temp_files):
+	Rewritten to work with one file.
+	(merge_temp_files): Remove.
+	(rename_temp_file): Add.
+	* retr.c (retrieve_from_file): Change code to work with one temporary file.
+	* http.c (gethttp): Likewise.
+
+2014-05-07	Jure Grabnar 
+
 	* multi.c: Add condition to fix memory corruption and downloading in parallel
 	in general.
 	* retr.c: Increase buffer size by 1 for '\0' to avoid memory corruption.
diff --git a/src/http.c b/src/http.c
index 388530c..d44530e 100644
--- a/src/http.c
+++ b/src/http.c
@@ -150,7 +150,6 @@ struct request {
 };
 
 extern int numurls;
-
 /* Create a new, empty request. Set the request's method and its
arguments.  METHOD should be a literal string (or it should outlive
the request) because it will not be freed.  ARG will be freed by
@@ -2784,7 +2783,7 @@ read_header:
   REGISTER_PERSISTENT_CONNECTION (4);
   return RETRUNNEEDED;
 }
-  else if (!ALLOW_CLOBBER)
+