Right.  Another case that I'm exploring: crawling an internal site but wanting 
the resulting URL to be the load-balanced one.  So you would crawl something like this:

http://mystaging-server.myco.com/index.html

and then want to change it to:

https://www.myco.com/index.html

Is that better for the url mapper?
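
To make the intended mapping concrete, here is a minimal Java sketch of the rewrite using plain java.util.regex. It is only an illustration: the class and method names are hypothetical, the two host names are the ones from the example above, and it does not use or reproduce the url mapper's own configuration syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ManifoldCF: rewrites the staging host
// from the example above to the public, load-balanced host.
final class StagingToPublicSketch {
    // Group 1 captures the path (and anything after it), if present.
    private static final Pattern STAGING =
        Pattern.compile("^http://mystaging-server\\.myco\\.com(/.*)?$");

    static String toPublicUrl(String url) {
        Matcher m = STAGING.matcher(url);
        if (!m.matches()) {
            return url; // not a staging URL: leave it untouched
        }
        String rest = (m.group(1) == null) ? "/" : m.group(1);
        return "https://www.myco.com" + rest;
    }
}
```

Calling StagingToPublicSketch.toPublicUrl with the crawled staging address returns the https://www.myco.com form, and any URL on another host is returned unchanged.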




--

Michael Cizmar
Managing Director

p: 312.585.6396

d: 312.585.6286
twitter: @michaelcizmar<http://twitter.com/michaelcizmar>

http://www.mcplusa.com/



________________________________
From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, May 28, 2020 12:03 PM
To: user@manifoldcf.apache.org <user@manifoldcf.apache.org>
Subject: Re: URL Mapping

Thanks!  It's far better to implement this than to try and hack it.  A general 
way of removing session information with regular expressions is probably not 
going to cut it either, so for now it's got to be in Java.

Karl


On Thu, May 28, 2020 at 12:47 PM Michael Cizmar 
<michael.ciz...@mcplusa.com> wrote:
The "!ut" followed by a bunch of session information comes from WebSphere Portal.  
Some information about it here:
https://books.google.com/books?id=bqAXnpmj5LwC&pg=PA180&lpg=PA180&dq=%22!ut%22+session+variables+websphere#v=onepage&q=%22!ut%22%20session%20variables%20websphere&f=false

I'll look at making a change to the web crawler to support this like the BV and 
ASP.NET.

________________________________
From: Karl Wright <daddy...@gmail.com>
Sent: Thursday, May 28, 2020 11:41 AM
To: user@manifoldcf.apache.org <user@manifoldcf.apache.org>
Subject: Re: URL Mapping

Hi,

There are provisions in the URL canonicalization part of the world for removal 
of session information from the URL.  It only knows about some kinds of widely 
used sessions; java app server sessions, for example, Broadvision sessions, 
etc.  If you can convince me that your session information is (a) uniquely 
identifiable, and (b) commonly used, the proper approach is to incorporate 
session removal in this framework.  Please let me know.
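
As a rough illustration of what removing one widely used kind of session information looks like, the sketch below strips the ";jsessionid=..." path parameter that Java app servers commonly append. It is an assumption-laden example, not the framework's actual code: the class and method names are hypothetical and the example host is made up.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not ManifoldCF code: drops a Java app server
// ";jsessionid=..." path parameter so equivalent URLs compare equal.
final class JSessionIdStripSketch {
    // The session value runs until the next ';', '/', '?' or '#'.
    private static final Pattern JSESSIONID =
        Pattern.compile(";jsessionid=[^;/?#]*", Pattern.CASE_INSENSITIVE);

    static String stripSession(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }
}
```

This works because the marker ";jsessionid=" is uniquely identifiable, which is the property asked for in (a) above.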

Karl


On Thu, May 28, 2020 at 12:11 PM Michael Cizmar 
<michael.ciz...@mcplusa.com> wrote:
I've got a really long url with a bunch of unnecessary session query string 
parameters.  I've been trying unsuccessfully to map it to the same url without 
the session.

An example of the url is below.  I thought I could do this:

url map regular expression:

(.*)\/!ut

replacement configuration:


[inline image attachment, not preserved]

So the goal would be for the url to be:
http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/

But the url gets rejected.

Sample Crawl Url

http://localhost:8080/mcplusa/myportal/agents/portal/quoteenroll/digs%20-%20quoting%20%20enrollment%20(individual)/!ut/p/a1/rZHLTsMwEEV_hS6yjDx5OWZpdRFImzYCAYk3lZM6D5TYSWoqPh8HFu2GQhHejEeae-aOLmIoQ0zyY1tz3SrJu7lneLfdBtTxI1iRhzsMFEfrpZ_6AFFoBnIzAN88Cj_pXxBDrJR60A3KeS2kvimV1KZaMKhJ886C8U1pIeSkOtNM3Pz5QewO3IJG9WIGDGW7RzkB7hZFIWxyyx3bL8LAJo6L7QoELitMPAH7r4WXLefmpvBkOoqfiTHth6vYTRxIAT1eufMy8D74Z2DqXg2Mf5Fz-zqOjJq05nzeNcr-FpchuVOyTGpjkOvGbmWlUHYmQtmZCGWfoqF_6omHq83G5gUBL-iOa0oXiw9FOxLu/dl5/d5/L0lJS2FZcHBpbW1LYVlwcGltbVlwcGchIS9vSHd3QUFBSXdpRUFJSkRBQ1VZaUVJVTVCZ09DbFFBQUlBQVNvU0FyUnFBQURBQWF0QXdMTzlRQUFFQUJ3WWVBR0tTQUFDa0k1Z21HU3dTaXJTQUFDZ0s5ZzBIUS80SmlHcGhxRWFoR29ScUVhbEdwaC9aNl9PTzVBMTRHMEs4Ukg2MEE2R0xDNFA0MDBHNy9hZ2VudCBjb250ZW50JTBwb3J0YWwlMHF1b3RlZW5yb2xsJTBkaWdzIC0gcXVvdGluZyAgZW5yb2xsbWVudCAoaW5kaXZpZHVhbCkvZjQ0YmEyOWUtODQwOC00YjFlLTg4MzktMTFlMjI4NDgxYTVhL2RpZ3MgLSBxdW90aW5nICBlbnJvbGxtZW50IChpbmRpdmlkdWFsKQ
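
For reference, the sketch below shows what the regular expression above captures when it is applied with plain java.util.regex, outside the crawler. The class and method names are hypothetical, and it says nothing about why the url mapper rejects the url; it only checks that group 1 plus a trailing slash yields the intended result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: applies the expression from the message above
// and keeps only what precedes the "/!ut" marker.
final class PortalStateStripSketch {
    // Same expression as above; '/' needs no escaping in a Java regex.
    // The greedy (.*) means the LAST "/!ut" in the url is the cut point.
    private static final Pattern UT_MARKER = Pattern.compile("(.*)/!ut");

    static String stripPortalState(String url) {
        Matcher m = UT_MARKER.matcher(url);
        if (!m.find()) {
            return url; // no "/!ut" marker: leave the url untouched
        }
        return m.group(1) + "/"; // restore the trailing slash
    }
}
```

Passing the sample crawl url (or a shortened version that still contains "/!ut/") to PortalStateStripSketch.stripPortalState returns the url ending in "enrollment%20(individual)/".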
