I don't understand the inclusion of 'n' in the character classes in this pattern. It's pretty clear that the broken examples in the OP are the ones where the letter n occurs in the domain name: each one is cut off just before its first n. I'd expect a similar problem for user parts that contain n.
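To make that concrete, here is a minimal sketch using Python's re module rather than Solr itself. As far as I know, negated character classes behave the same way in Java's regex engine, but treat that as an assumption. ASSUMED_INTENT is only my guess at what the pattern was meant to be (newline instead of the letter n, and an escaped dot after www), and the helper name whole_match and the example.org URL are made up for illustration.

```python
import re

# Pattern exactly as posted in the schema: the negated classes contain a
# literal 'n', so they refuse to match the letter n.
AS_POSTED = re.compile(r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)")

# Assumption: the intent was to exclude newlines (\n), not the letter n,
# and to match a literal dot after "www".
ASSUMED_INTENT = re.compile(r"^https?://(?:[^@/\n]+@)?(?:www\.)?([^:/\n]+)")

URLS = [
    "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
    "http://www.calgarypolicecommission.ca/reports.php",
    "https://attainyourhome.com/",
    "https://liveandplay.calgary.ca/DROPIN/page/dropin",
    "http://john@example.org/x",  # hypothetical URL with an n in the user part
]


def whole_match(pattern, url):
    """Return the entire matched text (group 0), or None if nothing matches."""
    m = pattern.match(url)
    return m.group(0) if m else None


if __name__ == "__main__":
    for url in URLS:
        print(url)
        print("  as posted:     ", whole_match(AS_POSTED, url))
        print("  assumed intent:", whole_match(ASSUMED_INTENT, url))
```

With the pattern as posted, the three failing URLs from the OP come out as http://www.calgarypolicecommissio, https://attai and https://livea, which are the same truncations that were reported. With the assumed-intent pattern the full scheme and host come back.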
^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)

On Tue, Jun 12, 2018 at 7:15 PM, Kevin Risden <kris...@apache.org> wrote:

> Looks like stop words (in, and, on) is what is breaking. The regex looks
> like it is correct.
>
> Kevin Risden
>
> On Tue, Jun 12, 2018, 18:02 Hanjan, Harinder <harinder.han...@calgary.ca>
> wrote:
>
> > Hello!
> >
> > I am indexing web documents and have a need to extract their top-level
> > URL to be stored in a different field. I have had some success with the
> > PatternTokenizerFactory (relevant schema bits at the bottom) but the
> > behavior appears to be inconsistent. Most of the times, the top level
> > URL is extracted just fine but for some documents, it is being cut off.
> >
> > Examples:
> >
> > URL:           http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf
> > Extracted URL: http://www.calgaryarb.ca
> > Comment:       Success
> >
> > URL:           http://www.calgarymlc.ca/about-cmlc/
> > Extracted URL: http://www.calgarymlc.ca
> > Comment:       Success
> >
> > URL:           http://www.calgarypolicecommission.ca/reports.php
> > Extracted URL: http://www.calgarypolicecommissio
> > Comment:       Fail
> >
> > URL:           https://attainyourhome.com/
> > Extracted URL: https://attai
> > Comment:       Fail
> >
> > URL:           https://liveandplay.calgary.ca/DROPIN/page/dropin
> > Extracted URL: https://livea
> > Comment:       Fail
> >
> > Relevant schema:
> >
> > <copyField dest="hostname" source="SolrId"/>
> >
> > <field name="hostname" type="hostnameType" stored="true" indexed="false"
> >        multiValued="false"/>
> >
> > <fieldType name="hostnameType" class="solr.TextField"
> >            sortMissingLast="true">
> >   <analyzer type="index">
> >     <tokenizer class="solr.PatternTokenizerFactory"
> >                pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
> >                group="0"/>
> >   </analyzer>
> > </fieldType>
> >
> > I have tested the Regex and it is matching things fine. Please see
> > https://regex101.com/r/wN6cZ7/358.
> > So it appears that I have a gap in my understanding of how Solr
> > PatternTokenizerFactory works. I would appreciate any insight on the
> > issue.
> > hostname field will be used in facet queries.
> >
> > Thank you!
> > Harinder
> >
> > ________________________________
> > NOTICE -
> > This communication is intended ONLY for the use of the person or entity
> > named above and may contain information that is confidential or legally
> > privileged. If you are not the intended recipient named above or a person
> > responsible for delivering messages or communications to the intended
> > recipient, YOU ARE HEREBY NOTIFIED that any use, distribution, or copying
> > of this communication or any of the information contained in it is
> > strictly prohibited. If you have received this communication in error,
> > please notify us immediately by telephone and then destroy or delete this
> > communication, or return it to us by mail if requested by us. The City of
> > Calgary thanks you for your attention and co-operation.
> >

--
http://www.the111shift.com