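The "?" in your URL is probably being interpreted as the regular-expression "?" operator in your include list, where it makes the preceding character optional instead of matching a literal question mark. You need to escape it there (as "\?").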
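To illustrate, here is a minimal sketch. It assumes the seed URL was pasted unescaped into the include list, which the thread does not confirm. The class IncludePatternDemo and its matches() helper are made up for this demo and are not ManifoldCF code; only the find() check mirrors the snippet quoted below.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of ManifoldCF.
final class IncludePatternDemo {
    // Same style of check as the quoted connector code:
    // true if the pattern is found anywhere in the URL.
    static boolean matches(String regex, String url) {
        return Pattern.compile(regex).matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "https://www.abc.com/societybusiness/entrepreneurship/?lang=en";

        // Assumed include-list entry: the URL used as-is. Here "/?" means
        // "an optional slash", so the regex looks for "entrepreneurship",
        // then an optional "/", then "lang=en". The literal "?" in the URL
        // is never consumed, so nothing matches.
        System.out.println(matches(url, url)); // false

        // Escaping the "?" by hand (and the dots, for strictness).
        String escaped =
            "https://www\\.abc\\.com/societybusiness/entrepreneurship/\\?lang=en";
        System.out.println(matches(escaped, url)); // true

        // Pattern.quote() turns the whole string into a literal pattern.
        System.out.println(matches(Pattern.quote(url), url)); // true
    }
}
```

In the include list itself only the regular expression is entered, so the hand-escaped form would be written with single backslashes, i.e. ending in /\?lang=en.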

Karl

On Wed, May 6, 2020 at 2:54 AM ritika jain <[email protected]> wrote:

> Hi Michael,
>
> Yes, I am testing this in debug mode, and I tested one more scenario.
> Whenever the seed URL is something like this:
> https://www.abc.com/societybusiness/entrepreneurship/?lang=en
> <https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>,
> the Java code of our web connector returns null in this function when
> m.find() is executed, hence giving a null DocumentIdentifier and thus the
> "Illegal seed URL" error.
>
> /** Check if the document identifier is legal.
> */
> public boolean isDocumentLegal(String url)
> {
>   // First, verify that the url matches one of the patterns in the
>   // include list.
>   int i = 0;
>   while (i < includePatterns.size())
>   {
>     Pattern p = includePatterns.get(i);
>     Matcher m = p.matcher(url);
>     if (m.find())
>       break;
>     i++;
>
> Whereas when the seed URL is something like this:
> https://www.abc.com/societybusiness/entrepreneurship/ , this code
> passes without failing.
> Can anybody help me understand why the same code behaves differently?
>
> Thanks
> Ritika
> }
>
> On Tue, May 5, 2020 at 6:09 PM Michael Cizmar <[email protected]>
> wrote:
>
>> Hi Ritika,
>>
>> There are several reasons that you could get that. Have you started
>> ManifoldCF in debug mode? If so, what’s the output just before that
>> statement in the logs?
>>
>> --
>> Michael Cizmar
>>
>> *From: *ritika jain <[email protected]>
>> *Reply-To: *"[email protected]" <[email protected]>
>> *Date: *Tuesday, May 5, 2020 at 4:34 AM
>> *To: *"[email protected]" <[email protected]>
>> *Subject: *Illegal Seed URL
>>
>> Hi All,
>>
>> I am using ManifoldCF 2.14 with the Web crawler as the repository and
>> Elasticsearch as the output. I have specified a seed URL which is valid,
>> as it opens successfully in a browser.
>>
>> Say the URL is https://www.abc.com/societybusiness/entrepreneurship/?lang=en
>> <https://www.rug.nl/society-business/centre-for-entrepreneurship/?lang=en>.
>>
>> It has a "?" query string in the URL.
>>
>> Am I doing anything wrong here?
>>
>> Thanks
>> Ritika
