[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
indexer from mnogosearch-3.3.14-mysql started with '/usr/local/etc/mnogosearch/indexer.conf'
[57177]{01} URL: https://www.a.com/index.php/code_2007_:_Selection
[57177]{01} Server Path Allow 'https://www.a.com/'
[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'
[57177]{01} ROBOTS: https://www.a.com/robots.txt
[57177]{01} Request.Accept-Encoding: gzip,deflate,compress
[57177]{01} Request.Accept-Language: en, fr
[57177]{01} Request.From: b...@toto.com
[57177]{01} Request.Host: www.a.com
[57177]{01} Request.User-Agent: bot
[57177]{01} Response.Accept-Ranges: bytes
[57177]{01} Response.Connection: close
[57177]{01} Response.Content-Encoding: gzip
[57177]{01} Response.Content-Length: 0
[57177]{01} Response.Content-Type: text/plain
[57177]{01} Response.Date: Sun, 10 Nov 2013 07:41:01 GMT
[57177]{01} Response.DefaultLang: en
[57177]{01} Response.DetectClones: 1
[57177]{01} Response.ETag: "1ea26b-0-4e0f93dabf240"
[57177]{01} Response.Last-Modified: Mon, 08 Jul 2013 05:23:13 GMT
[57177]{01} Response.Method: Disallow
[57177]{01} Response.Period: 604800
[57177]{01} Response.Request.Accept-Language: en, fr
[57177]{01} Response.Request.From: b...@toto.com
[57177]{01} Response.Request.User-Agent: bot
[57177]{01} Response.ResponseLine: HTTP/1.1 200 OK
[57177]{01} Response.ResponseSize: 360
[57177]{01} Response.Server: Apache
[57177]{01} Response.Status: 200
[57177]{01} Response.Tag: www_en
[57177]{01} Response.URL: https://www.a.com/robots.txt
[57177]{01} Response.URL_ID: -1277106540
[57177]{01} Response.Vary: Accept-Encoding
[57177]{01} Response.VaryLang: en fr
[57177]{01} Response.X-Frame-Options: Deny
[57177]{01} Response.X-XSS-Protection: 1; mode=block
[57177]{01} Request.Accept-Encoding: gzip,deflate,compress
[57177]{01} Request.Accept-Language: en, fr
[57177]{01} Request.From: b...@toto.com
[57177]{01} Request.Host: www.a.com
[57177]{01} Request.User-Agent: bot
[57177]{01} Response.body: 
[57177]{01} Response.Cache-Control: private, must-revalidate, max-age=0
[57177]{01} Response.CachedCopy: 
[57177]{01} Response.Charset: 
[57177]{01} Response.Connection: close
[57177]{01} Response.Content-Encoding: gzip
[57177]{01} Response.Content-Language: en
[57177]{01} Response.Content-Length: 7496
[57177]{01} Response.Content-Type: text/html
[57177]{01} Response.crc32: 1003223498
[57177]{01} Response.crc32old: 1003223498
[57177]{01} Response.crosswords: 
[57177]{01} Response.Date: Sun, 10 Nov 2013 07:41:01 GMT
[57177]{01} Response.DefaultLang: en
[57177]{01} Response.DetectClones: 1
[57177]{01} Response.Expires: Thu, 01 Jan 1970 00:00:00 GMT
[57177]{01} Response.Hops: 14
[57177]{01} Response.ID: 405428
[57177]{01} Response.Last-Modified: Mon, 14 Oct 2013 15:14:00 GMT
[57177]{01} Response.MaxDocPerSite: 0
[57177]{01} Response.MaxHops: 256
[57177]{01} Response.meta.description: 
[57177]{01} Response.meta.keywords: 
[57177]{01} Response.Method: Disallow
[57177]{01} Response.msg.from: 
[57177]{01} Response.msg.subject: 
[57177]{01} Response.msg.to: 
[57177]{01} Response.Period: 604800
[57177]{01} Response.PrevStatus: 200
[57177]{01} Response.Request.Accept-Language: en, fr
[57177]{01} Response.Request.From: b...@toto.com
[57177]{01} Response.Request.User-Agent: bot
[57177]{01} Response.ResponseLine: HTTP/1.1 200 OK
[57177]{01} Response.ResponseSize: 7952
[57177]{01} Response.Server: Apache
[57177]{01} Response.Server-Charset: utf-8
[57177]{01} Response.Server_id: -1149994654
[57177]{01} Response.Site_id: -1149994654
[57177]{01} Response.Status: 200
[57177]{01} Response.Tag: www_en
[57177]{01} Response.title: 
[57177]{01} Response.URL: https://www.a.com/index.php/code_2007_:_Selection
[57177]{01} Response.url.file: 
[57177]{01} Response.url.host: 
[57177]{01} Response.url.path: 
[57177]{01} Response.url.proto: 
[57177]{01} Response.URL_ID: 1908964734
[57177]{01} Response.Vary: Accept-Encoding,Cookie
[57177]{01} Response.VaryLang: en fr
[57177]{01} Response.X-Content-Type-Options: nosniff
[57177]{01} Response.X-Frame-Options: Deny
[57177]{01} Response.X-XSS-Protection: 1; mode=block
[57177]{01} Status: 200 OK
[57177]{01} Stored rec_id: 405428 Size: 25459 Ratio: 29.35%
[57177]{01} Guesser: Lang: en, Charset: utf-8
[57177]{01} SectionFilter: Allow by default
[57177]{01} Link '/favicon.ico' https://www.a.com/favicon.ico
[57177]{01}  Server applied: site_id: -1149994654 URL: https://www.a.com/
[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'
[57177]{01} Link '/opensearch_desc.php' https://www.a.com/opensearch_desc.php
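
The output above is long, and most of it is HTTP headers. A minimal sketch, in Python, of how the lines about URLs and filter rules could be pulled out before posting. The `[pid]{thread} text` layout and the list of prefixes are assumptions read off this one log, and `filter_decisions` is a hypothetical helper, not part of mnoGoSearch:

```python
import re

# Line shape as seen in the log above: "[pid]{thread} text".
LINE_RE = re.compile(r"^\[(\d+)\]\{(\d+)\}\s+(.*)$")

# Prefixes that, in this log, describe URLs and rules rather than headers.
# This list is an assumption drawn from the output above only.
INTERESTING = ("URL:", "Allow", "Disallow", "Link", "Server",
               "Status:", "SectionFilter:")


def filter_decisions(lines):
    """Return the text of log lines that mention a URL or a filter rule."""
    kept = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not an indexer debug line
        text = m.group(3)
        # Request.* and Response.* header dumps do not start with these prefixes.
        if text.startswith(INTERESTING):
            kept.append(text)
    return kept


sample = r"""[57177]{01} URL: https://www.a.com/index.php/code_2007_:_Selection
[57177]{01} Server Path Allow 'https://www.a.com/'
[57177]{01} Allow Regex InSensitive '\.php$|\.cgi$|\.pl$'
[57177]{01} Request.Host: www.a.com
[57177]{01} Response.Server: Apache
[57177]{01} Status: 200 OK
[57177]{01}  Server applied: site_id: -1149994654 URL: https://www.a.com/
""".splitlines()

if __name__ == "__main__":
    for text in filter_decisions(sample):
        print(text)
```

On the sample lines this keeps the URL, the Server and Allow rule lines and the status line, and drops the Request/Response headers.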


___
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general


[General] Webboard: Content-type

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
Hi Guys,

While indexing, I see the number of "unsupported content-type" entries growing hugely.

I disallow *.png, for example, and I have also tried declaring it as a specific 
type and marking it as CheckOnly to reduce this, so I don't understand why it is 
reported as an unsupported content type.

It should not be indexed at all, and so should not be listed as unsupported, no?

Thanks



[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
> Hi Alex,
> 
> Thanks for your answer.
> 
> I did not write the URL exactly.
> What you wrote is what I did and it does not work, apparently.
> I am on FreeBSD, mnoGo 3.3.14
> 
> Disallow regex www.a.com/news/*/2000/*
> Disallow regex www.a.com/index.html\?*setlang=za
> Server https://allow www.a.com/
> 
> Is this the correct format ?

Try this:

Disallow regex "www[.]a[.]com/news/.*/2000/.*"
Disallow regex "www[.]a[.]com/index[.]html[?].*setlang=za"
Server allow https://www.a.com/

If it does not help, try this command:

indexer -amv6 -u "https://www.a.com/index.php?title=Toto&value=1&setlang=za"

It will print debug output and explain why this URL
is accepted or rejected. Please post its output here.
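
To see why the shell-style patterns behave differently from the rewritten ones quoted above, here is a small sketch using Python's `re` module. It assumes the indexer applies a pattern as an unanchored, case-insensitive regular expression over the full URL; Python's regex flavor is not necessarily the one mnoGoSearch uses, but these simple constructs should behave the same way. The first two URLs are hypothetical, the third is the one from the log:

```python
import re

# Patterns as first written: "*" used as a shell wildcard.
ORIGINAL_NEWS = r"www.a.com/news/*/2000/*"
ORIGINAL_LANG = r"www.a.com/index.html\?*setlang=za"

# Patterns as suggested in the reply: ".*" for "anything", dots escaped.
FIXED_NEWS = r"www[.]a[.]com/news/.*/2000/.*"
FIXED_LANG = r"www[.]a[.]com/index[.]html[?].*setlang=za"

urls = {
    "news": "https://www.a.com/news/sports/2000/page.html",         # hypothetical
    "html": "https://www.a.com/index.html?title=Toto&setlang=za",   # hypothetical
    "php": "https://www.a.com/index.php?title=Toto&value=1&setlang=za",  # from the log
}


def hits(pattern, url):
    """True if the pattern matches anywhere in the URL, ignoring case."""
    return re.search(pattern, url, re.IGNORECASE) is not None


if __name__ == "__main__":
    for name, pattern in [("original news", ORIGINAL_NEWS),
                          ("fixed news", FIXED_NEWS),
                          ("original lang", ORIGINAL_LANG),
                          ("fixed lang", FIXED_LANG)]:
        for key, url in urls.items():
            print(f"{name:14} {key:5} {hits(pattern, url)}")
```

In a regex, `/*` means "zero or more slashes" and `\?*` means "zero or more question marks", so the original patterns match none of these URLs. Note also that the URL reported in the log uses index.php, so under these assumptions a pattern written for index.html would not exclude it.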


> 
> In the log, I see https://www.a.com/index.php?title=Toto&value=1&setlang=za
> as well as:
> https://www.a.com/index.html?Special/file_2007_Conference
> 
> thanks
> 
> 



[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
Hi Alex,

Thanks for your answer.

I did not write the URL exactly.
What you wrote is what I did and it does not work, apparently.
I am on FreeBSD, mnoGo 3.3.14

Disallow regex www.a.com/news/*/2000/*
Disallow regex www.a.com/index.html\?*setlang=za
Server https://allow www.a.com/

Is this the correct format ?

In the log, I see https://www.a.com/index.php?title=Toto&value=1&setlang=za
as well as:
https://www.a.com/index.html?Special/file_2007_Conference

thanks





[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Alexander Barkov
Email: b...@mnogosearch.org
Message:
Hi,

> Hi Guys,
> 
> It's been a long time since my udm-gw script in Y2K.
> I am back on mnoGoSearch and facing a newbie issue I can't solve.
> 
> I want to index a server, but exclude URLs matching some specific regexes.
> I tried Disallow together with Server, and everything fails.

Can you please clarify what fails?
Does it crawl the entire site?
Or does it crawl nothing?

> As far as I can tell, Server disallow with a pattern is not possible, so I did not try it.
> 
> Here I want to index www.a.com/
> without www.a.com/news/*/2000/*
> and www.a.com/index.html?*setlang=za
> 
> I did:
> Disallow regex www.a.com/news/*/2000/*
> Disallow regex www.a.com/index.html\?*setlang=za
> Server allow www.a.com/

The correct command is:

Server http://www.a.com/

Notice the "http://" prefix.

> 
> I also tried using .* instead of * as the match-anything pattern, with no success.

".*" is correct.

Btw, which version are you using?

> 
> Any help appreciated :-)
> 
> Thanks




[General] Webboard: Indexer with regex

2013-11-09 Thread bar
Author: Laurent
Email: 
Message:
Hi Guys,

It's been a long time since my udm-gw script in Y2K.
I am back on mnoGoSearch and facing a newbie issue I can't solve.

I want to index a server, but exclude URLs matching some specific regexes.
I tried Disallow together with Server, and everything fails.
As far as I can tell, Server disallow with a pattern is not possible, so I did not try it.

Here I want to index www.a.com/
without www.a.com/news/*/2000/*
and www.a.com/index.html?*setlang=za

I did:
Disallow regex www.a.com/news/*/2000/*
Disallow regex www.a.com/index.html\?*setlang=za
Server allow www.a.com/

I also tried using .* instead of * as the match-anything pattern, with no success.

Any help appreciated :-)

Thanks
