Re: Extracting top level URL when indexing document

Gus Heck Tue, 19 Jun 2018 15:09:37 -0700

I don't understand why 'n' is included in the character classes in this
pattern. Every broken example in the original post is cut off at the first
letter n in the domain name, and I would expect the same problem for user
parts that contain n.


^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)
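
To make the effect concrete, here is a minimal sketch that runs the posted pattern against the URLs from the original message. It uses Python's re module as a stand-in for the Java regex engine behind Solr, on the assumption that the two behave the same for this pattern. WITHOUT_N is a hypothetical variant with 'n' removed from both character classes, not something confirmed in this thread, and the example.org URL is made up to illustrate the user-part case.

```python
import re

# Pattern exactly as posted: both negated classes exclude the letter "n".
POSTED = re.compile(r"^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)")

# Hypothetical variant (an assumption, not confirmed in the thread):
# the same pattern with "n" dropped from both character classes.
WITHOUT_N = re.compile(r"^https?://(?:[^@/]+@)?(?:www.)?([^:/]+)")


def whole_match(pattern, url):
    """Return the whole match (what group="0" selects), or None."""
    m = pattern.match(url)
    return m.group(0) if m else None


URLS = [
    "http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf",
    "http://www.calgarypolicecommission.ca/reports.php",
    "https://attainyourhome.com/",
    "https://liveandplay.calgary.ca/DROPIN/page/dropin",
    "https://anna@example.org/x",  # made-up URL with an "n" in the user part
]

if __name__ == "__main__":
    for url in URLS:
        print(url)
        print("  posted:   ", whole_match(POSTED, url))
        print("  without n:", whole_match(WITHOUT_N, url))
```

With the posted pattern, every match stops just before the first n in the host (or in the user part), which lines up with the "Fail" rows in the original message. The variant returns the full scheme and host for the same inputs.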

On Tue, Jun 12, 2018 at 7:15 PM, Kevin Risden <kris...@apache.org> wrote:

> Looks like stop words (in, and, on) are what is breaking it. The regex
> looks like it is correct.
>
> Kevin Risden
>
> On Tue, Jun 12, 2018, 18:02 Hanjan, Harinder <harinder.han...@calgary.ca>
> wrote:
>
> > Hello!
> >
> > I am indexing web documents and need to extract their top-level URL to
> > be stored in a different field. I have had some success with the
> > PatternTokenizerFactory (relevant schema bits at the bottom) but the
> > behavior appears to be inconsistent. Most of the time, the top-level URL
> > is extracted just fine, but for some documents it is being cut off.
> >
> > Examples:
> >
> > URL                                                | Extracted URL                     | Comment
> > ---------------------------------------------------+-----------------------------------+--------
> > http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf  | http://www.calgaryarb.ca          | Success
> > http://www.calgarymlc.ca/about-cmlc/               | http://www.calgarymlc.ca          | Success
> > http://www.calgarypolicecommission.ca/reports.php  | http://www.calgarypolicecommissio | Fail
> > https://attainyourhome.com/                        | https://attai                     | Fail
> > https://liveandplay.calgary.ca/DROPIN/page/dropin  | https://livea                     | Fail
> >
> >
> >
> >
> > Relevant schema:
> > <copyField dest="hostname" source="SolrId"/>
> >
> > <field name="hostname" type="hostnameType" stored="true" indexed="false"
> >        multiValued="false"/>
> >
> > <fieldType name="hostnameType" class="solr.TextField" sortMissingLast="true">
> >   <analyzer type="index">
> >     <tokenizer class="solr.PatternTokenizerFactory"
> >                pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
> >                group="0"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > I have tested the regex and it is matching things fine. Please see
> > https://regex101.com/r/wN6cZ7/358.
> > So it appears that I have a gap in my understanding of how Solr's
> > PatternTokenizerFactory works. I would appreciate any insight on the
> > issue. The hostname field will be used in facet queries.
> >
> > Thank you!
> > Harinder
> >
> >
>



-- 
http://www.the111shift.com
