Yeah, URL Classify only does so much. That's why you need to combine multiple methods.

As a fourth method, you could write a short JavaScript script for the StatelessScriptUpdateProcessor that takes a full domain name (such as the one output by URL Classify) and turns it into multiple values, each with more of the prefix removed, so that "lucene.apache.org" would be indexed as:

lucene.apache.org
apache.org
apache
.org
org

And then the user could query by any of those partial domain names.
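
A minimal sketch of what such a script might look like, assuming the domain arrives in "url_domain_s" (the URL Classify output quoted further down in this thread) and that the expanded values go into a multivalued string field. The field name "url_domain_parts_ss" and the helper domainSuffixes are made up for illustration. The sketch emits only the plain suffix chain (lucene.apache.org, apache.org, org); extra variants such as "apache" or ".org" would have to be added separately.

```javascript
// Expand a host name into itself plus every shorter suffix:
// each value has one more leading label removed.
function domainSuffixes(domain) {
  var labels = String(domain).toLowerCase().split(".").filter(function (label) {
    return label.length > 0;
  });
  var suffixes = [];
  for (var i = 0; i < labels.length; i++) {
    suffixes.push(labels.slice(i).join("."));
  }
  return suffixes;
}

// Hook invoked by StatelessScriptUpdateProcessorFactory for each added document.
// "url_domain_parts_ss" is a hypothetical multivalued field name.
function processAdd(cmd) {
  var doc = cmd.solrDoc; // SolrInputDocument
  var domain = doc.getFieldValue("url_domain_s");
  if (domain == null) {
    return; // nothing to expand for this document
  }
  var suffixes = domainSuffixes(domain);
  for (var i = 0; i < suffixes.length; i++) {
    doc.addField("url_domain_parts_ss", suffixes[i]);
  }
}

// The remaining hooks are defined as no-ops.
function processDelete(cmd) {}
function processMergeIndexes(cmd) {}
function processCommit(cmd) {}
function processRollback(cmd) {}
function finish() {}
```

The script processor would need to sit after the URL Classify processor in the update chain so that "url_domain_s" is already populated when processAdd runs.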

But, if you simply tokenize the URL (copy the URL string to a text field), you automatically get most of that. The user can query by a URL fragment, such as "apache.org", ".org", "lucene.apache.org", etc. and the tokenization will strip out the punctuation.
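
To illustrate the idea only, here is a plain JavaScript simulation, not Solr's actual analysis chain. It assumes an analyzer that lowercases and splits on every run of punctuation; whether a given Solr field type splits host names this way depends on the tokenizer and filters it is configured with. The functions tokenize and containsPhrase are made up for this sketch.

```javascript
// Simulated analyzer: lowercase, then split on every run of
// non-alphanumeric characters (an assumption, see above).
function tokenize(text) {
  return String(text).toLowerCase().split(/[^a-z0-9]+/).filter(function (token) {
    return token.length > 0;
  });
}

// True when queryTokens occur in docTokens as a consecutive run,
// which is roughly what a phrase query checks.
function containsPhrase(docTokens, queryTokens) {
  if (queryTokens.length === 0) {
    return false;
  }
  for (var i = 0; i + queryTokens.length <= docTokens.length; i++) {
    var hit = true;
    for (var j = 0; j < queryTokens.length; j++) {
      if (docTokens[i + j] !== queryTokens[j]) {
        hit = false;
        break;
      }
    }
    if (hit) {
      return true;
    }
  }
  return false;
}

var urlTokens = tokenize("http://lucene.apache.org/solr/4_0_0/changes/Changes.html");
["apache.org", ".org", "lucene.apache.org", "example.com"].forEach(function (query) {
  console.log(query + " -> " + containsPhrase(urlTokens, tokenize(query)));
});
```

Under that assumption the first three fragments match and "example.com" does not, because the punctuation is gone from both the indexed tokens and the query tokens.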

I'll add this script to my list of examples to add in the next rev of my book.

-- Jack Krupansky

-----Original Message-----
From: Flavio Pompermaier
Sent: Tuesday, June 25, 2013 10:06 AM
To: solr-user@lucene.apache.org
Subject: Re: URL search and indexing

I bought the book, but looking at the examples I still don't understand
whether it is possible to query all sub-URLs of my URL.
For example, if the URLClassifyProcessorFactory takes as input
"url_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html" and
produces outputs like
- "url_domain_s":"lucene.apache.org"
- "url_canonical_s":"http://lucene.apache.org/solr/4_0_0/changes/Changes.html"
How should I configure url_domain_s to be able to run queries like
'*.apache.org'?
How should I configure url_canonical_s to be able to run queries
like 'http://lucene.apache.org/solr/*'?
Is it better to have two different fields for the two queries, or could I
create just one field for both kinds of query (obviously, for the former
case I would then query something like *://.apache.org/*)?


On Tue, Jun 25, 2013 at 3:15 PM, Jack Krupansky <j...@basetechnology.com> wrote:

There are examples in my book:
http://www.lulu.com/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-1/ebook/product-21079719.html

But... I still think you should use a tokenized text field as well - use
all three: raw string, tokenized text, and URL classification fields.

-- Jack Krupansky

-----Original Message-----
From: Flavio Pompermaier
Sent: Tuesday, June 25, 2013 9:02 AM
To: solr-user@lucene.apache.org
Subject: Re: URL search and indexing


That sounds like exactly what I'm looking for! However, I cannot find an
example of how to use it. Could you help me, please?
Moreover, about the id field: isn't it true that the id field shouldn't be
analyzed, as suggested in
http://wiki.apache.org/solr/UniqueKey#Text_field_in_the_document ?


On Tue, Jun 25, 2013 at 2:47 PM, Jan Høydahl <jan....@cominvent.com> wrote:

Sure, you can query the url directly. Or, if you choose, you can split it up
into multiple components, e.g. using
http://lucene.apache.org/solr/4_3_0/solr-core/org/apache/solr/update/processor/URLClassifyProcessor.html

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 25 June 2013 at 14:10, Flavio Pompermaier <pomperma...@okkam.it> wrote:

> Sorry, but maybe I'm missing something here. Could I declare url as the
> key field and query it too?
> At the moment, my schema.xml looks like:
>
> <fields>
>   <field name="url" type="string" indexed="true" stored="true" required="true" multiValued="false" />
>
>   <field name="category" type="string" indexed="true" stored="true"/>
>   <field name="language" type="string" indexed="true" stored="true"/>
>  ...
>   <field name="_version_" type="long" indexed="true" stored="true"/>
>
> </fields>
> <uniqueKey>url</uniqueKey>
>
> Is that OK? Or should I add a "baseurl" field of some kind to be able to
> query all URLs coming from a certain domain (1st or 2nd level as well)?
>
> Best,
> Flavio
>
>
> On Tue, Jun 25, 2013 at 12:28 PM, Jan Høydahl <jan....@cominvent.com> wrote:
>
>> Probably a good match for the RegExp feature of Solr (given that your
>> url is not tokenized)
>> e.g. q=url:/.*\.it$/
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> On 25 June 2013 at 12:17, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> Hi everybody,
>>> I'm quite new to Solr, so maybe my question is trivial for you.
>>> In my use case I have to index content found at some URLs, so I use the
>>> url as the key of my document and treat it as a string.
>>>
>>> However, I'd like to be able to query by domain name, like *.it or
>>> *.somesite.com. What's the best strategy? I thought of doing a
>>> URL-to-path transformation and indexing it with
>>> solr.PathHierarchyTokenizerFactory, but maybe there's a simpler
>>> solution, isn't there?
>>>
>>> Best,
>>> Flavio
>>>
>>> --
>>>
>>> Flavio Pompermaier
>>> *Development Department
>>> *_______________________________________________
>>> *OKKAM**Srl **- www.okkam.it*
>>>
>>> *Phone:* +(39) 0461 283 702
>>> *Fax:* + (39) 0461 186 6433
>>> *Email:* f.pomperma...@okkam.it
>>> *Headquarters:* Trento (Italy), fraz. Villazzano, Salita dei Molini 2
>>> *Registered office:* Trento (Italy), via Segantini 23
>>>
>>> Confidentially notice. This e-mail transmission may contain legally
>>> privileged and/or confidential information. Please do not read it if you
>>> are not the intended recipient(S). Any use, distribution, reproduction or
>>> disclosure by any other person is strictly prohibited. If you have received
>>> this e-mail in error, please notify the sender and destroy the original
>>> transmission and its attachments without reading or saving it in any manner.
>>
>>
>
>








