forum download, cookies?

2007-09-13 Thread Juhana Sadeharju

A forum has topics which are available only to members.
How can wget be used to download a copy of the pages in that
case? How do I get the proper cookies, and how do I get wget to
use them correctly? I use IE on a Windows PC and wget on
a Unix machine. I could use Lynx on the Unix machine
if needed.

(The Windows PC also has Firefox, but I cannot install anything new.
If Firefox already has a downloader plugin suitable for forum
downloading, that would be OK.)
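
One common approach, sketched here with hedged assumptions (the forum
host, login URL and form field names below are only examples, not taken
from the forum in question): either export the browser's cookies in the
Netscape cookies.txt format and hand them to wget with --load-cookies,
or, with wget 1.10 or newer, log in with wget itself and keep the
session cookies.

  # Option 1: reuse cookies exported from the browser (cookies.txt format)
  wget --load-cookies cookies.txt -r -np -k -E \
    "http://forum.example.com/viewforum.php?f=3"

  # Option 2: log in with wget and keep the session cookie
  # ("login.php", "username" and "password" are hypothetical form fields)
  wget --save-cookies cookies.txt --keep-session-cookies \
    --post-data 'username=me&password=secret' \
    "http://forum.example.com/login.php"
  wget --load-cookies cookies.txt -r -np -k -E \
    "http://forum.example.com/viewforum.php?f=3"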

Juhana


Bug in 1.10.2 vs 1.9.1

2006-12-03 Thread Juhana Sadeharju

Hello. Wget 1.10.2 has the following bug compared to version 1.9.1.
First, my bin/wgetdir script is defined as
  wget -p -E -k --proxy=off -e robots=off --passive-ftp \
    -o zlogwget`date +%Y%m%d%H%M%S` -r -l 0 -np -U Mozilla --tries=50 \
    --waitretry=10 "$@"

The download command is
  wgetdir http://udn.epicgames.com

Version 1.9.1 result: the download is OK.
Version 1.10.2 result: only udn.epicgames.com/Main/WebHome is downloaded,
and the other converted URLs are of the form
  http://udn.epicgames.com/../Two/WebHome

Juhana


url accept/reject? accept scripts

2006-08-04 Thread Juhana Sadeharju

Hello. How do I get wget to ignore URLs containing any of the following
strings? Surprisingly, --help did not reveal a suitable option (but see
the sketch after the list):
 action=
 printable=
 redirect=
 article=
 returnto=
 title=
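
A minimal sketch, assuming a wget new enough to have --reject-regex
(added in release 1.14, so later than the versions discussed in these
mails); the regex is matched against the complete URL:

  wget -r -np -E -k \
    --reject-regex 'action=|printable=|redirect=|article=|returnto=|title=' \
    http://nwn2wiki.org/Main_Page.html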

I would also like to point out problems with the existing options:
(1) I downloaded an FTP site with --accept=pdf,PDF, but only the PDF files
were downloaded.
(2) I downloaded an HTTP site with -X forums,wiki, but only one of the two
directories was excluded.
I would therefore like an example of how these LIST options should be
typed exactly.
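
As far as the manual describes it, the lists are plain comma-separated
values with no spaces; a minimal sketch (the hosts and paths are only
examples):

  # accept both lower- and upper-case extensions (matching is case-sensitive)
  wget -r -np -A "pdf,PDF" ftp://ftp.site.dom/pub/docs/

  # exclude several directories; -X takes absolute directory paths
  wget -r -np -X "/forums,/wiki" http://www.site.dom/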

The options are good, but a more general solution would be to have
a script callback system with access to the wget variables. Example:
  wget --accept-script=wikiurls.script [other options] \
    http://nwn2wiki.org/Main_Page.html

wikiurls.script could be:
  /* string_has() is assumed to return 0 when the substring is found */
  accept = 1;
  if (string_has(CurrentURL, "action=") == 0) accept = 0;
  if (string_has(CurrentURL, "printable=") == 0) accept = 0;
  if (string_has(CurrentURL, "redirect=") == 0) accept = 0;
  if (string_has(CurrentURL, "article=") == 0) accept = 0;
  if (string_has(CurrentURL, "returnto=") == 0) accept = 0;
  if (string_has(CurrentURL, "title=") == 0) accept = 0;
  if (accept == 0) fprintf(rejectedfp, "%s\n", CurrentURL);
  return accept;

Another script, for downloading full-size images, could be:
  /* when a thumbnail URL is seen, queue the corresponding full image */
  if (string_has(CurrentURL, "/thumbs/") == 0) {
    newurl = strdup(CurrentURL);
    string_delete(newurl, "thumbs/");
    queue_url(newurl);
  } else if (string_has(CurrentURL, "_small") == 0) {
    newurl = strdup(CurrentURL);
    string_replace(newurl, "_small", "_large");
    queue_url(newurl);
  }
  accept = 1;
  return accept;

It is perhaps not that easy, but the idea is there. Other script types could
be parser scripts, e.g., for additional parsing of OpenWindow('page.html');
and OpenImage('image.jpg'); JavaScript calls.

Juhana
-- 
  http://music.columbia.edu/mailman/listinfo/linux-graphics-dev
  for developers of open source graphics software


wget server?

2006-08-04 Thread Juhana Sadeharju

Hello. The following problem occurred recently. I started downloading
everything under the directory
  http://site.edu/projects/software/
Then, after a day, I found that the subdirectory
  http://site.edu/projects/software/program/manual/
had a wiki with millions of files. Because I wanted the download to
continue into the other directories, I did not interrupt wget. After
a week and a half, wget quit because it ran out of memory.

As a solution, could wget be turned into a wget server? A server would do
the downloads, and the wget client program would send the URL and the
current directory to the server. The above problem would then be solved
by letting the user add rejects on the fly, e.g.,
  wget --add-reject http://site.edu/projects/software/program/manual/
The server would start skipping the queued URLs of the manual
and eventually move on to the other directories.

A client/server model would allow more useful features. I often download
many individual directories from one site. At the moment all the downloads
run in parallel as background processes, because I don't want to wait and
stop what I'm doing. The server could, by default, queue all downloads for
a given site and fetch one URL at a time; downloads from different sites
would still run in parallel in the server.
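
Until something like that exists, one rough approximation is a shell
wrapper that serializes the queue per site while different sites still
run in parallel; a minimal sketch, assuming one URL-list file per site
(site1.urls, site2.urls, ...):

  #!/bin/sh
  # each site's list is processed sequentially, the sites in parallel
  for list in *.urls; do
    (
      while read -r url; do
        wget -r -np "$url"
      done < "$list"
    ) &
  done
  wait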

Juhana
-- 
  http://music.columbia.edu/mailman/listinfo/linux-graphics-dev
  for developers of open source graphics software


news protocol?

2006-08-04 Thread Juhana Sadeharju

Hello. The TODO lists the following:
 * Add more protocols (e.g. gopher and news), implementing them in a
   modular fashion.

Do you mean the NNTP protocol? If yes, I recently wrote an NNTP
downloader:
  http://www.funet.fi/~kouhia/nntppull20060409.tar.gz
I find it good for news archiving; I now archive nearly 700 newsgroups.
But what kind of plans do you have?

What I still need is a way to download newsgroup archives from Google.
I have a free project for which I would like to have everything from
two to four groups. I'm not aware of any other archive, public or
private, that could help me.

Juhana
-- 
  http://music.columbia.edu/mailman/listinfo/linux-graphics-dev
  for developers of open source graphics software


accepted and excluded?

2006-02-10 Thread Juhana Sadeharju

Hello. How would I type the -A option if I want both .pdf and .PDF
files from an FTP site? -A pdf,PDF failed -- only the PDF files were
downloaded.

How would I type the -X option if I want multiple subdirectories
excluded? -X dir1,dir2 failed -- only one of the given directories
was excluded. (E.g., www.site.dom/dir1/ and www.site.dom/dir2/
should both be excluded when the whole site is downloaded.)

For now I only need the exact working options, as I'm not sure what a
comma-separated list of accepted extensions is supposed to mean exactly.
I will investigate later whether and why the options fail. I'm using the
latest wget as far as I know -- version 1.9.1.

Juhana


wget with a log database?

2005-11-30 Thread Juhana Sadeharju

Hello. I would like to have a database within wget. The database
would let wget know what it has downloaded earlier. Wget could then
download only new and changed files, and could continue a download
without the old downloads having to be on my disk.
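
The closest existing mechanism seems to be timestamping; a minimal
sketch (the URL is only an example):

  # --mirror is shorthand for -r -N -l inf --no-remove-listing;
  # -N re-fetches only files that are new or changed since the last run
  wget --mirror -np http://site.example.org/docs/

It still needs the previous copy to stay on disk, though, which is part
of what the proposed database would avoid.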

The database could also be accessed by other programs.
E.g., new downloads could later be merged into the earlier downloads
with another program.
E.g., the database would remind me when I'm trying to download
something I already have.

Is such a downloader, with the requested features, already available?
Could somebody install and test the Nedlib Harvester at
  http://www.csc.fi/sovellus/nedlib/
The Nedlib Harvester was used to download all web pages in Finland
(totalling 400 GB).

I don't know if the Nedlib Harvester has all the features I need. E.g.,
I would like to associate includes and excludes with individual sites
and web page structures, so that the next time I update my copy, the
downloader uses the given includes and excludes.

Juhana



wget problem

2005-04-04 Thread Juhana Sadeharju

Hello.
The following document could not be downloaded at all:
  http://www.greyc.ensicaen.fr/~dtschump/greycstoration/

If you succeed, please tell me how. I want all the HTML files
and the images.

Juhana
-- 
  http://music.columbia.edu/mailman/listinfo/linux-graphics-dev
  for developers of open source graphics software


Re: Help Needed

2004-11-02 Thread Juhana Sadeharju
Hello.
Does wget have NNTP (Usenet newsgroup) support?
For example, I might want to download all articles between
numbers M and N. A date-based system could be useful too.
We would just need to agree on how these queries are represented
to wget.

I can dig out some old Usenet news downloader code if wget
does not have this yet.

Juhana


on tilde bug

2004-11-01 Thread Juhana Sadeharju
Hello.

I traced the URL given on the command line, and it looks like there
is no difference whether one gives ~ or %7E. Is this true?
The URLs end up in url_parse(), which switches ~ (as unsafe) to
%7E. If the original URL is not used at all, as it appears,
then there is no difference. But mysteriously, the URL with ~
and the URL with %7E downloaded files differently!

I also added new log output, and while testing it with the
problem sites, to my surprise, there seemed to be no problems.

So the fact that URLs are not downloaded could be just some
code bug in wget. But why does this problem appear when ~ is
in the download URL? Have I just missed the other cases?
Or is the bug in the code which escapes the unsafe characters?

Juhana


char 5C problem

2004-11-01 Thread Juhana Sadeharju
Hello.
Wget could not download the images of the page
  http://www.fusionindustries.com/alex/combustion/index.html

The image URLs have %5C (an escaped backslash, \) in them:
  http://www.fusionindustries.com/alex/combustion/small%5C0103%20edgepoint-pressure%20small.png
  http://www.fusionindustries.com/alex/combustion/big%5C0103%20edgepoint-pressure.png

The wget 1.9.1 options used included -E, -r, -k, and -np.

Juhana


Tilde bug again

2004-10-16 Thread Juhana Sadeharju
Hello.

Has the ~ / %7E bug always been in wget? When was it added to wget?
Who wrote the code?

I would like to suggest that the person who introduced this severe bug
should fix it back immediately. It does not make sense for us to waste
time trying to fix this bug if that person did not spend a moment
designing the feature and thinking about its consequences. It is better
that the original code be restored and that the person spend plenty of
time redesigning the feature if he wishes to get it back!!

Sorry about the bad tone, but if the bug is not fixed, then we must
restore the original code as soon as possible.

(Here the bug hits almost every time the URL has ~ in it.)

PS. I am willing to look at it myself, but: what piece of code
changes %7E to ~ in the case where the given URL has %7E but
~ appears on my disk? In what form is the URL saved for the -np
option, and by what routine? In what form are requests sent to the
server: with ~ only, with %7E only, or both? I'm not sure if these
questions help, but they are a start.

Juhana


Directory indices?

2004-10-16 Thread Juhana Sadeharju
Hello.
Why does wget generate the following index files?
Why so many index files?

  ftp1.sourceforge.net/gut/index.html
  ftp1.sourceforge.net/gut/index.html?C=M;O=A
  ftp1.sourceforge.net/gut/index.html?C=M;O=D
  ftp1.sourceforge.net/gut/index.html?C=N;O=A
  ftp1.sourceforge.net/gut/index.html?C=N;O=D
  ftp1.sourceforge.net/gut/index.html?C=S;O=A
  ftp1.sourceforge.net/gut/index.html?C=S;O=D
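
Those look like Apache's directory-listing sort links: C= selects the
column (N=name, M=last-modified, S=size) and O= the order (A=ascending,
D=descending), so every auto-generated index page links to six sorted
variants of itself, and a recursive wget fetches them all. A hedged
sketch of one way to skip them (I have not verified exactly how this
wget version matches -R patterns against query strings):

  wget -r -np -R "index.html?C=*" http://ftp1.sourceforge.net/gut/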

Juhana


img dynsrc not downloaded?

2004-10-16 Thread Juhana Sadeharju
Hello.
Wget could not follow dynsrc attributes; the MPEG file was not downloaded:
  <p><img dynsrc="Collision.mpg" CONTROLS LOOP=1>
at
  http://www.wideopenwest.com/~nkuzmenko7225/Collision.htm

Regards,
Juhana


xml files not processed?

2004-10-16 Thread Juhana Sadeharju
Hello.
When the url
  http://zeus.fri.uni-lj.si/%7Ealeks/POIS/Kolaborativno%20delo.htm
is downloaded with -np -r -l 0 etc., the file
  http://zeus.fri.uni-lj.si/~aleks/POIS/Kolaborativno delo_files/filelist.xml
is downloaded correctly. However, the hrefs in the xml file are not
then followed:
 <o:File HRef="slide0008.htm"/>
 <o:File HRef="slide0008_image001.png"/>
 <o:File HRef="slide0008_image002.jpg"/>
 <o:File HRef="slide0011.htm"/>

Note that the pres.xml file in the same directory has
  href="c:\temp\Kolaborativno delo.htm"
which is apparently incorrect and should be ignored. The referred-to
file is the first URL given in this mail.

These XML files are apparently generated by a PowerPoint-to-HTML
converter.
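
As a manual workaround, and assuming the real filelist.xml carries the
HRef values in double quotes, one can pull them out and feed them back
to wget as a plain URL list; a rough sketch (grep -o as in GNU grep):

  base="http://zeus.fri.uni-lj.si/~aleks/POIS/Kolaborativno%20delo_files"
  wget -q -O filelist.xml "$base/filelist.xml"
  # collect every HRef="..." value and turn it into an absolute URL
  grep -o 'HRef="[^"]*"' filelist.xml |
    sed 's/^HRef="//; s/"$//' |
    while read -r f; do echo "$base/$f"; done > urls.txt
  wget -x -i urls.txt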

Regards,
Juhana


Developers here?

2004-10-16 Thread Juhana Sadeharju
Hello.

Recent mails have not been replied to, and the CVS may be out of date.
Who are the developers of wget at the moment?

I just posted a couple of missing-feature reports, but my intent
is not to pour the tasks onto the current developers. However,
without anyone giving hints on what to look at, the features
may go unimplemented by me.

Regards,
Juhana


wget scripting?

2004-10-04 Thread Juhana Sadeharju
Hello.

I have been thinking a little about how wget could be made better.
We would need a scripting system so that features can be programmed
more easily. One way to incorporate scripting into wget would
be to rewrite wget as a data-flow system, in much the same way that
OpenGL (www.opengl.org) is a data flow for graphics. The scripts
would be executed at specific places in the data-flow graph,
much as vertex and fragment programs are executed at specific
places in the OpenGL pipeline.

So, the URLs would enter the data flow and the routines in the
graph would do something to them. I don't yet know what kind of
graph we would have, but here is a simple one:

  url input -- url processing -- site exclusion -- dir path exclusion
  -- get file --

Then the graph goes deeper in parsing the html.

Example: I could add a script just after the "get file" node. The script
would uncompress the downloaded file into a new file and change the
local_filename variable to the name of the new file.
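
To make the idea concrete, such a hook could be an external script.
Everything below is hypothetical: the LOCAL_FILENAME variable and the
convention of echoing the new name back are inventions for illustration,
not an existing wget interface.

  #!/bin/sh
  # hypothetical post-"get file" hook: decompress the fetched file and
  # report the new local name back to the data-flow graph on stdout
  case "$LOCAL_FILENAME" in
    *.gz)
      gunzip -f "$LOCAL_FILENAME"
      echo "local_filename=${LOCAL_FILENAME%.gz}"
      ;;
    *)
      echo "local_filename=$LOCAL_FILENAME"
      ;;
  esac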

The graph would make it possible to work at different granularities;
details can be added later by splitting the large graph nodes.

Regards,
Juhana


compressed html files?

2004-09-23 Thread Juhana Sadeharju
Hello.
The file
  http://www.cs.utah.edu/~gooch/JOT/index.html
is compressed, and wget could not follow the URLs in it.
What can be done? Should wget uncompress compressed *.htm
and *.html files? What about *.asp and *.php?
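
One manual workaround, assuming the file is simply gzip-compressed:
fetch it, decompress it by hand, and then let wget follow the links from
the local copy using -F (treat the input file as HTML) and -B (base URL
for its relative links); a rough sketch:

  wget -O JOT-index.gz "http://www.cs.utah.edu/~gooch/JOT/index.html"
  gunzip -c JOT-index.gz > JOT-index.html
  wget -r -l 1 -np -F -B "http://www.cs.utah.edu/~gooch/JOT/" \
    -i JOT-index.html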

Juhana


Character coding gives problems

2004-08-20 Thread Juhana Sadeharju
Hello.

The character coding of ~ causes problems when downloading.

Example:

 wget -p -E -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np \
   http://www.stanford.edu/~dattorro/

However, not everything was downloaded. The file machines.html has the hrefs
 http://www.stanford.edu/%7Edattorro/images/calloph.jpg
 http://www.stanford.edu/%7Edattorro/Lexicon.htm
instead of the correct relative URLs
 images/calloph.jpg
 Lexicon.htm
and these two files were not downloaded. In fact, 52 files were downloaded
and 105 files were not!! A major bug.

Simply put, ~ and %7E should be treated as the same character. Otherwise
there is no point in the %-escape (URL) coding at all.

For local filenames, only one of ~ and %7E should be used. I would
prefer the %-escaped form, because Linux shell scripts (e.g., for i in `find`)
cannot handle spaces in filenames.

Has this problem been fixed already? Is there any quick solution before
the problem is fixed?

Juhana
-- 
  http://music.columbia.edu/mailman/listinfo/linux-graphics-dev
  for developers of open source graphics software


wget problem: urls behind script

2004-04-16 Thread Juhana Sadeharju
Hello.

One wget problem this time. I downloaded all in
  http://www.planetunreal.com/wod/tutorials/
but most of the files were not downloaded, because the URLs are
in the file
  http://www.planetunreal.com/wod/tutorials/sidebar.js
in the following format:

  Item("Beginner's Guide to UnrealScript", "guide.htm");
  Item("Class Tree", "classtree.htm");
  Item("Download the MASSIVE all inclusive UScript Tutorial",
       "UScript Tutorial.doc");
  Item("My First Mod (Part 1)", "1stmod.html");

Could wget test, for each string inside the function call, whether the
string is a file in the directory? Wget would then continue processing
the file if it exists.

In the above example, wget would try to download additionally the
following files:
  http://www.planetunreal.com/wod/tutorials/Beginner's Guide to UnrealScript
  http://www.planetunreal.com/wod/tutorials/guide.htm
  http://www.planetunreal.com/wod/tutorials/classtree.htm
  http://www.planetunreal.com/wod/tutorials/Class Tree
  http://www.planetunreal.com/wod/tutorials/Download the MASSIVE all inclusive UScript Tutorial
  http://www.planetunreal.com/wod/tutorials/UScript Tutorial.doc
  http://www.planetunreal.com/wod/tutorials/My First Mod (Part 1)
  http://www.planetunreal.com/wod/tutorials/1stmod.html

It could be that the web server reports file-not-found errors and
generates an error page. How can I prevent those pages from being
saved? E.g., I should not end up with a file "My First Mod (Part 1).html"
containing the error page.
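
As a manual workaround, one can guess the file names from sidebar.js and
feed them to wget as a plain list; wget normally does not save the body
of a 404 reply, so the bad guesses should only show up as errors in the
log. A rough sketch (the pattern only catches arguments ending in .htm,
.html or .doc):

  base="http://www.planetunreal.com/wod/tutorials/"
  wget -q -O sidebar.js "${base}sidebar.js"
  # pick out anything that looks like a file name in the Item(...) calls
  grep -Eo '[^",()]+\.(html?|doc)' sidebar.js | sort -u |
    sed "s|^ *|$base|" > urls.txt
  wget -x -i urls.txt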

Regards,
Juhana


wget bug: directory overwrite

2004-04-05 Thread Juhana Sadeharju
Hello.

Problem: When downloading everything in
   http://udn.epicgames.com/Technical/MyFirstHUD
wget overwrites the downloaded MyFirstHUD file with the
MyFirstHUD directory (which comes later).

GNU Wget 1.9.1
wget -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np -U Mozilla $@

Solution: Use of -E option.

Regards,
Juhana


Bug report

2004-03-24 Thread Juhana Sadeharju
Hello. This is a report on some wget bugs. My wgetdir command looks
like the following (wget 1.9.1):
wget -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np -U Mozilla $@

Bugs:

Command: wgetdir "http://www.directfb.org".
Problem: In the file www.directfb.org/index.html, hrefs of the type
  /screenshots/index.xml were not converted to relative links
  with the -k option.

Command: wgetdir "http://threedom.sourceforge.net".
Problem: In the file threedom.sourceforge.net/index.html, the
hrefs were not converted to relative links with the -k option.

Command: wgetdir "http://liarliar.sourceforge.net".
Problem: Files are named
  content.php?content.2
  content.php?content.3
  content.php?content.4
which are interpreted by, e.g., Nautilus as manual pages and are
displayed as plain text. Could the files, and the links to them,
be renamed as follows?
  content.php?content.2.html
  content.php?content.3.html
  content.php?content.4.html
After all, are those pages still PHP files or generated HTML files?
If they are HTML files produced by the PHP scripts, then it would
be a good idea to add a new extension to the files.
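
For the content.php naming above, the -E (--html-extension) option that
is already in wget 1.9.1 appears to do exactly this kind of renaming:
files served as text/html whose names do not end in .htm/.html get .html
appended, and -k then converts the links to the new names. A sketch with
the same wgetdir options plus -E:

  wget -E -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np \
    -U Mozilla "http://liarliar.sourceforge.net"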

Command: wgetdir
  "http://www.newtek.com/products/lightwave/developer/lscript2.6/index.html"
Problem: Images are not downloaded, perhaps because the image links
look like the following:
  <image src="v26_2.jpg">

Regards,
Juhana


will mime coding make the site different?

2004-03-13 Thread Juhana Sadeharju
Hello.

I downloaded
 http://agar.csoft.org/index.html
with the -k option, but the URL
 http://agar.csoft.org/man.cgi?query=widget&amp;sektion=3
in the file was not converted to relative.
(The local filename is man.cgi?query=widget&sektion=3.)

Regards,
Juhana


Re: not downloading at all, help

2004-02-12 Thread Juhana Sadeharju
>    --16:59:21--  http://www.maqamworld.com:80/
>   => `index.html'
>    Connecting to www.maqamworld.com:80... connected!
>
> It looks like you have http_proxy=80 in your wgetrc file.

I placed "use_proxy = off" in .wgetrc (a file I did not have earlier)
and in ~/wget/etc/wgetrc (a file I did have), and tried
  wget --proxy=off http://www.maqamworld.com
and it still does not work.

Could there be a system wgetrc file somewhere else? I have compiled
wget myself into my home directory, and I certainly wish that my own
installation does not use the files of some other installation.
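
A 503 with a non-standard reason like "Unknown site" can come from a
proxy somewhere in the path, so it may be worth checking what wget could
be picking up; a rough sketch (the wgetrc paths depend on the --prefix
used at build time):

  # proxy settings can come from the environment ...
  env | grep -i proxy
  unset http_proxy ftp_proxy

  # ... or from the wgetrc files: the system-wide one lives under the
  # install prefix (here ~/wget/etc/wgetrc), the personal one is ~/.wgetrc
  grep -n -i proxy ~/wget/etc/wgetrc ~/.wgetrc

  wget --proxy=off http://www.maqamworld.com/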

Why did you think the :80 comes from a proxy? I have always thought
it comes from the target site, not from our side. Did you try the
given command yourself, and did it work? If not, please try it now.

If wget adds the :80, then how do I instruct wget not to do that,
no matter what is configured somewhere? Which part of the source code
should I edit, if that is the only thing that helps?

Still, you should fix this in the wget source, because something is
not working now. I wonder why this non-working behaviour is the
default in wget...

Regards,
Juhana


not downloading at all, help

2004-02-11 Thread Juhana Sadeharju
Hello.

What goes wrong in the following? (I will read replies from the list
archives.)

  % wget http://www.maqamworld.com/

  --16:59:21--  http://www.maqamworld.com:80/
  => `index.html'
  Connecting to www.maqamworld.com:80... connected!
  HTTP request sent, awaiting response... 503 Unknown site
  16:59:21 ERROR 503: Unknown site.

Regards,
Juhana