According to Alexis Mikhailov:
> Gilles Detillieux wrote:
> > The 3.2.0b1 release is under a feature freeze, which means only bug fixes
> > go in unchallenged.  Anything else must go to a vote on the htdig3-dev
> > mailing list.
> 
> Will it be possible to put the patch for the '-' argument (to read a
> list of URLs from stdin) into the 3.1.x version of htdig? Or only 3.2.0?
> I need this or a similar argument very much in order to index all
> HTML documents in some directory with a command like
> "find /some/dir -name '*.html' | htdig -".

You can do something similar right now, but in a couple of steps:

   find /some/dir -name '*.html' -print > /confdir/starturls.txt

then in htdig.conf:

   start_url: `/confdir/starturls.txt`

I thought about putting your 'htdig -' option into 3.1.4, but there were
too many other higher-priority items I needed to nail down first.  I
considered this option syntactic sugar, as it only made it a bit easier
to do something you could already do.

It's too late now for 3.1.4, as it's in pre-release, but we could still
put it to a vote for 3.2.0b1.  If it doesn't pass the vote, it can go
into the next 3.2 release.  If it does pass, I'll see if I can find the
time to fit it into 3.2.0b1, but I'll use cin.getline() rather than your
">>" operator.  So, developers, should this go into 3.2.0b1?  Here's my

+1

> And I began to write HtFile class for 3.2.0 version. Will it be
> possible to include this change too?

Is this for handling URLs of the form file:/path/to/file?  There may be
some interest, but I think we'd like to see some code before we vote on
it, given that right now we must focus on stomping out bugs in the beta
release.  Of course, if it doesn't make 3.2.0b1, I'm sure 3.2.0b2 won't
be too far behind (my guess would be 2-3 months, but don't quote me).

> > > This variable was added to allow indexing of local files with the
> > > NOINDEX tag.  For example, all Qt documentation files contain this
> > > tag.  I thought it would be tedious to remove this tag from every file
> > > by hand, so I created this variable.  I forgot about a default value,
> > > though.
> > 
> > Geoff & I have reservations about this.  I think it should be discussed
> > further.
> 
> It is another thing I need very much, because many local documents
> contain this tag and it would be better to just ignore the tag than to
> force the user to remove it manually. I could not tell from the letters
> CC'd to me whether a decision was reached on this point (I'm not
> subscribed to the mailing list, so I could have missed some letters).

Geoff's initial comments on this were...

   I actually disagree on this. I don't think the indexer should ever 
   ignore the directive of the page author. If the author intended that 
   the page should not be indexed, then ht://Dig should follow those 
   wishes. I'd have a similar opinion about something that ignored 
   robots.txt.

And after you explained that you needed this feature to index Qt docs...

   Hmm. Now there's a problem. Obviously I can see a need for indexing 
   local files with NOINDEX, but I think it could open up a nasty can of 
   worms if it's allowed on the web. I guess the easiest way is to check 
   whether the access is local or not. Maybe the local access methods 
   temporarily allow it, then disallow when they're done?

I still have reservations about this.  In the case of robots.txt, I
can see how it would be OK to ignore it when you're indexing using the
local filesystem only.  However, if you're indexing your own system,
you should have control over its robots.txt, so you should be able to
customize it to allow your htdig into the files you want.  In any case,
if you're indexing a site with the site owner's permission, you should be
able to get them to change their robots.txt file to allow your htdig in.
If you're indexing a site without the owner's permission, we certainly
don't want htdig to ignore robots.txt.

I'd apply similar reasoning to noindex tags within a document.  If the
author put noindex tags in all his/her documents, it must be because
the author did not want these documents to be indexed.  Period.  So even
if you have a local copy of the author's documents, htdig should still
honour the author's stated wishes.

In the case of Qt, back when it was not open source, the authors may
have had good reason for not wanting their documents indexed.  Now that
Qt is released under an open source license, it doesn't make much sense
to keep those noindex tags in there, so it would seem to me that they
should be removed.  Did you ever bring this up with the Qt developers?

If the authors don't mind if people index these documents, the noindex
tags should go.  If they put them there just to avoid them being indexed
on their own site, then that's an inappropriate use of noindex tags -
something site specific like that should be done in robots.txt, not in
documents that are part of an open source distribution.

> > I'm still concerned about portability to non-GNU C++ compilers &
> > libraries.  It also seems unnecessary to me.  Most line input in htdig
> > is done via the getline() method, which wouldn't complicate things much
> 
> I haven't seen a single call to getline() in the 3.1.x version. Is it in
> 3.2? 

It's used in both!

$ grep getline htdig3-?-x/*/*.cc
htdig3-1-x/htlib/Configuration.cc:        in.getline(buffer, sizeof(buffer));
htdig3-1-x/htlib/cgi.cc:                        cin.getline(buffer, sizeof(buffer));
htdig3-1-x/htlib/cgi.cc:                cin.getline(buffer, sizeof(buffer));
htdig3-2-x/htcommon/cgi.cc:                     cin.getline(buffer, sizeof(buffer));
htdig3-2-x/htcommon/cgi.cc:             cin.getline(buffer, sizeof(buffer));
htdig3-2-x/test/dbbench.cc:      in.getline(buffer, sizeof(buffer));
$ 

> 
> > > > 6) htlib/cgi.cc & htsearch/htsearch.cc: add a -a option to htsearch, to
> > > > add name=value parameters to those in query string.  This is undocumented
> > > > as well.  I'm not sure how it relates to the other changes, but it seems
> > > > simple enough.
> > >
> > > This option was added to simplify calling htsearch from an external
> > > program. Of course it was possible to set environment variables to
> > > emulate a CGI call, but it seemed unnecessarily complicated.
> > 
> > Understood, but I think Torsten's method, which I adapted for 3.1.4 &
> > 3.2.0b1 is a better approach.  If the separate -a option is still wanted
> > after these releases are out, then I'd say it could go in future releases.
> 
> Please, can you point me to what this method is? In truth I was just
> too lazy to write a function to URL-encode words before passing them to
> htsearch. Nevertheless, if this option is undesirable, I'll write
> (or just rip from somewhere) this function.

E.g., in 3.1.4 or 3.2.0b1:

  /home/httpd/cgi-bin/htsearch "config=htdig&method=and&words=search+words"

In general, the URL encoding is not strictly necessary when you pass your
query string as an argument in this way, unless you need to embed an equals
sign (=) or ampersand (&) in an input parameter definition.  You can find
an encodeURL() function in htcommon/URLTrans.cc in 3.2.
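
If you do need to embed '=' or '&' in a value, a minimal sketch of such
an encoder might look like the following.  This is a hypothetical
stand-in, not the encodeURL() from htcommon/URLTrans.cc: the name
urlEncodeValue and the exact set of characters left unencoded are
assumptions.  It turns spaces into '+', as in the "words=search+words"
example above, and percent-encodes everything else outside a small safe
set.

```cpp
#include <string>

// Hypothetical helper (not htdig's encodeURL()): encode one parameter
// value for use in a query string such as "words=<value>".
std::string urlEncodeValue(const std::string &value)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;

    for (std::string::size_type i = 0; i < value.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(value[i]);

        // Letters, digits and a few punctuation marks pass through.
        bool safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                    (c >= '0' && c <= '9') ||
                    c == '-' || c == '_' || c == '.' || c == '~';

        if (safe)
            out += static_cast<char>(c);
        else if (c == ' ')
            out += '+';                 // form-style encoding of a space
        else
        {
            out += '%';                 // everything else becomes %XX
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```

Only the individual values should go through such a function; the '='
and '&' that separate parameters in the argument string stay as they are.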

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
