Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
Ah, the difference between open source and a product. With Ultraseek, we chose 
a solid, stable algorithm that worked well for 3000 customers. In open source, 
it is a research project for every single customer.

I love open source. I’ve brought Solr into Netflix and Chegg. But there is a 
clear difference between developer-driven and customer-driven software.

I first learned about bounded binary exponential backoff in the 
Digital/Intel/Xerox (“DIX”) Ethernet spec in 1980. It is a solid algorithm for 
events with a Poisson distribution, like packet arrival times or web page next 
change times. There is no need for configuring algorithms here, especially 
configurations that lead to an unstable estimate. The only meaningful choices 
are the minimum revisit time, the maximum revisit time, and the number of bins. 
Those will be different for CNN (a launch customer for Ultraseek) or Sun 
documentation (another launch customer). CNN news articles change minute by 
minute; new Sun documentation appeared weekly or monthly.
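The bounded backoff described above can be sketched in a few lines. This is a minimal illustration only; the constants, names, and function are assumptions for the sketch, not Ultraseek's actual code:

```python
# Bounded binary exponential backoff for page revisit scheduling (sketch).
# All constants here are assumed for illustration.

MIN_INTERVAL = 15 * 60        # floor: 15 minutes, for fast-changing pages
MAX_INTERVAL = 30 * 86400     # ceiling: 30 days, for static pages

def next_interval(current, changed, min_i=MIN_INTERVAL, max_i=MAX_INTERVAL):
    """Return the next revisit interval in seconds.

    On change, reset to the minimum (the page is "hot" again); on no
    change, double the interval, clamped at the maximum. The number of
    distinct intervals ("bins") is log2(max_i / min_i) + 1.
    """
    if changed:
        return min_i
    return min(current * 2, max_i)
```

As noted above, the only meaningful knobs are the minimum, the maximum, and (implicitly) the number of bins between them.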

Sorry for the rant, but “you can fix the algorithm yourself” almost always 
means a bad installation, an unhappy admin, and another black eye for open 
source.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 4:07 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> Depending on your settings, Nutch does this as well. It is even possible to 
> set different increment/decrement factors per MIME type. 
> The algorithms are pluggable and overridable at any point of interest. You 
> can go all the way.
> 
> -Original message-
>> From:Walter Underwood <wun...@wunderwood.org>
>> Sent: Wednesday 3rd August 2016 20:03
>> To: solr-user@lucene.apache.org
>> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
>> 
>> That’s good news.
>> 
>> It should reset the interval estimate on page change instead of slowly 
>> shortening it.
>> 
>> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
>> page had not changed.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
>>> 
>>> Nutch also has adaptive strategy:
>>> 
>>> This class implements an adaptive re-fetch algorithm. This works as
>>>> follows:
>>>> 
>>>>  - for pages that has changed since the last fetchTime, decrease their
>>>>  fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>>>  - for pages that haven't changed since the last fetchTime, increase
>>>>  their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>>>  If SYNC_DELTA property is true, then:
>>>> - calculate a delta = fetchTime - modifiedTime
>>>> - try to synchronize with the time of change, by shifting the next
>>>> fetchTime by a fraction of the difference between the last modification
>>>> time and the last fetch time. I.e. the next fetch time will be set to 
>>>> fetchTime
>>>> + fetchInterval - delta * SYNC_DELTA_RATE
>>>> - if the adjusted fetch interval is bigger than the delta, then 
>>>> fetchInterval
>>>> = delta.
>>>>  - the minimum value of fetchInterval may not be smaller than
>>>>  MIN_INTERVAL (default is 1 minute).
>>>>  - the maximum value of fetchInterval may not be bigger than
>>>>  MAX_INTERVAL (default is 365 days).
>>>> 
>>>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>>>> the algorithm, so that the fetch interval either increases or decreases
>>>> infinitely, with little relevance to the page changes. Please use
>>>> main(String[])
>>>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>>>> method to test the values before applying them in a production system.
>>>> 
>>> 
>>> From:
>>> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
>>> 
>>> 
>>> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
>>> 
>>>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>>>> in Ultraseek.
>>>> 
>>>> I think we were the only people who built an adaptive crawler for
>>>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>>>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>>>> answer me.

RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
No, just run it continuously, always! By default everything is refetched (if 
possible) every 30 days. Just read the description of the adaptive schedule and 
its javadoc. It is simple to use, but sometimes hard to predict its outcome, 
simply because you never know what will change, or when.

You will be fine with defaults if you have a small site. Just set the interval 
to a few days, or more if your site is slightly larger.
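A hedged sketch of the nutch-site.xml settings this implies. The property names are taken from Nutch 1.x's nutch-default.xml and the interval values are illustrative; verify both against your version before use:

```xml
<!-- Enable the adaptive schedule and set a few-day default interval. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>259200</value><!-- 3 days, in seconds -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>3600.0</value><!-- never refetch more often than hourly -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000.0</value><!-- cap backoff at 30 days -->
</property>
```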

M.

 
 
-Original message-
> From:Musshorn, Kris T CTR USARMY RDECOM ARL (US) 
> <kris.t.musshorn@mail.mil>
> Sent: Wednesday 3rd August 2016 20:08
> To: solr-user@lucene.apache.org
> Subject: RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> Shall I assume that, even though nutch has adaptive capability, I would still 
> have to figure out how to trigger it to go look for content that needs update?
> 
> Thanks,
> Kris
> 
> ~~
> Kris T. Musshorn
> FileMaker Developer - Contractor – Catapult Technology Inc.  
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> kris.t.musshorn@mail.mil
> ~~
> 
> 
> -Original Message-
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Wednesday, August 03, 2016 2:03 PM
> To: solr-user@lucene.apache.org
> Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> All active links contained in this email were disabled.  Please verify the 
> identity of the sender, and confirm the authenticity of all links contained 
> within the message prior to copying and pasting the address to a Web browser. 
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly 
> shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
> page had not changed.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> Caution-http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
> > 
> > Nutch also has adaptive strategy:
> > 
> > This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that has changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>  - calculate a delta = fetchTime - modifiedTime
> >>  - try to synchronize with the time of change, by shifting the next
> >>  fetchTime by a fraction of the difference between the last 
> >> modification
> >>  time and the last fetch time. I.e. the next fetch time will be set to 
> >> fetchTime
> >>  + fetchInterval - delta * SYNC_DELTA_RATE
> >>  - if the adjusted fetch interval is bigger than the delta, then 
> >> fetchInterval
> >>  = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
> >> destabilize the algorithm, so that the fetch interval either 
> >> increases or decreases infinitely, with little relevance to the page 
> >> changes. Please use
> >> main(String[])
> >> <Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutc
> >> h/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> method to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/
> > crawl/AdaptiveFetchSchedule.html
> > 
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
> >> crawler in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for 
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the 
> >> argument to Mike Lynch. He looked at me like I had three heads and 
> >> didn’t even answer me.
> >>

RE: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
Depending on your settings, Nutch does this as well. It is even possible to set 
different increment/decrement factors per MIME type. 
The algorithms are pluggable and overridable at any point of interest. You can 
go all the way.
 
-Original message-
> From:Walter Underwood <wun...@wunderwood.org>
> Sent: Wednesday 3rd August 2016 20:03
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
> 
> That’s good news.
> 
> It should reset the interval estimate on page change instead of slowly 
> shortening it.
> 
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the 
> page had not changed.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
> > 
> > Nutch also has adaptive strategy:
> > 
> > This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >> 
> >>   - for pages that has changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >>   - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>  - calculate a delta = fetchTime - modifiedTime
> >>  - try to synchronize with the time of change, by shifting the next
> >>  fetchTime by a fraction of the difference between the last 
> >> modification
> >>  time and the last fetch time. I.e. the next fetch time will be set to 
> >> fetchTime
> >>  + fetchInterval - delta * SYNC_DELTA_RATE
> >>  - if the adjusted fetch interval is bigger than the delta, then 
> >> fetchInterval
> >>  = delta.
> >>   - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >>   - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >> 
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> >> the algorithm, so that the fetch interval either increases or decreases
> >> infinitely, with little relevance to the page changes. Please use
> >> main(String[])
> >> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> method to test the values before applying them in a production system.
> >> 
> > 
> > From:
> > https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> > 
> > 
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
> > 
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> >> in Ultraseek.
> >> 
> >> I think we were the only people who built an adaptive crawler for
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> >> to Mike Lynch. He looked at me like I had three heads and didn’t even
> >> answer me.
> >> 
> >> Ultraseek also has great support for sites that need login. If you use
> >> that, you’ll need to find a way to do that with another crawler.
> >> 
> >> wunder
> >> Walter Underwood
> >> Former Ultraseek Principal Engineer
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >> 
> >> 
> >>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> >> <kris.t.musshorn@mail.mil> wrote:
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >>> 
> >>> We are currently using ultraseek and looking to deprecate it in favor of
> >> solr/nutch.
> >>> Ultraseek runs all the time and auto detects when pages have changed and
> >> automatically reindexes them.
> >>> Is this possible with SOLR/nutch?
> >>> 
> >>> Thanks,
> >>> Kris
> >>> 
> >>> ~~
> >>> Kris T. Musshorn
> >>> FileMaker Developer - Contractor - Catapult Technology Inc.
> >>> US Army Research Lab
> >>> Aberdeen Proving Ground
> >>> Application Management & Development Branch
> >>> 410-278-7251
> >>> kris.t.musshorn@mail.mil
> >>> ~~
> >>> 
> >>> 
> >>> 
> >>> CLASSIFICATION: UNCLASSIFIED
> >> 
> >> 
> 
> 


RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Musshorn, Kris T CTR USARMY RDECOM ARL (US)
CLASSIFICATION: UNCLASSIFIED

Shall I assume that, even though nutch has adaptive capability, I would still 
have to figure out how to trigger it to go look for content that needs update?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.  
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn@mail.mil
~~


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, August 03, 2016 2:03 PM
To: solr-user@lucene.apache.org
Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

All active links contained in this email were disabled.  Please verify the 
identity of the sender, and confirm the authenticity of all links contained 
within the message prior to copying and pasting the address to a Web browser.  

That’s good news.

It should reset the interval estimate on page change instead of slowly 
shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page 
had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
Caution-http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
> 
> Nutch also has adaptive strategy:
> 
> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that has changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>  - calculate a delta = fetchTime - modifiedTime
>>  - try to synchronize with the time of change, by shifting the next
>>  fetchTime by a fraction of the difference between the last modification
>>  time and the last fetch time. I.e. the next fetch time will be set to 
>> fetchTime
>>  + fetchInterval - delta * SYNC_DELTA_RATE
>>  - if the adjusted fetch interval is bigger than the delta, then 
>> fetchInterval
>>  = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may 
>> destabilize the algorithm, so that the fetch interval either 
>> increases or decreases infinitely, with little relevance to the page 
>> changes. Please use
>> main(String[])
>> <Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutc
>> h/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> Caution-https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/
> crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
>> crawler in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for 
>> enterprise use. I tried to get Ultraseek open-sourced. I made the 
>> argument to Mike Lynch. He looked at me like I had three heads and 
>> didn’t even answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you 
>> use that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> Caution-http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL 
>>> (US)
>> <kris.t.musshorn@mail.mil> wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using ultraseek and looking to deprecate it in 
>>> favor of
>> solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed 
>>> and
>> automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn@mail.mil
>>> ~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 


CLASSIFICATION: UNCLASSIFIED


Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
That’s good news.

It should reset the interval estimate on page change instead of slowly 
shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page 
had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
> 
> Nutch also has adaptive strategy:
> 
> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that has changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>  - calculate a delta = fetchTime - modifiedTime
>>  - try to synchronize with the time of change, by shifting the next
>>  fetchTime by a fraction of the difference between the last modification
>>  time and the last fetch time. I.e. the next fetch time will be set to 
>> fetchTime
>>  + fetchInterval - delta * SYNC_DELTA_RATE
>>  - if the adjusted fetch interval is bigger than the delta, then 
>> fetchInterval
>>  = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>> the algorithm, so that the fetch interval either increases or decreases
>> infinitely, with little relevance to the page changes. Please use
>> main(String[])
>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>> 
> 
> From:
> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>> in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for
>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>> answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you use
>> that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>> <kris.t.musshorn@mail.mil> wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using ultraseek and looking to deprecate it in favor of
>> solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed and
>> automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn@mail.mil
>>> ~~
>>> 
>>> 
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>> 
>> 



Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Marco Scalone
Nutch also has an adaptive strategy:

This class implements an adaptive re-fetch algorithm. This works as
> follows:
>
>- for pages that has changed since the last fetchTime, decrease their
>fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>- for pages that haven't changed since the last fetchTime, increase
>their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>If SYNC_DELTA property is true, then:
>   - calculate a delta = fetchTime - modifiedTime
>   - try to synchronize with the time of change, by shifting the next
>   fetchTime by a fraction of the difference between the last modification
>   time and the last fetch time. I.e. the next fetch time will be set to 
> fetchTime
>   + fetchInterval - delta * SYNC_DELTA_RATE
>   - if the adjusted fetch interval is bigger than the delta, then 
> fetchInterval
>   = delta.
>- the minimum value of fetchInterval may not be smaller than
>MIN_INTERVAL (default is 1 minute).
>- the maximum value of fetchInterval may not be bigger than
>MAX_INTERVAL (default is 365 days).
>
> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> the algorithm, so that the fetch interval either increases or decreases
> infinitely, with little relevance to the page changes. Please use
> main(String[])
> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> method to test the values before applying them in a production system.
>

From:
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
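Transcribed literally into Python, the quoted description looks roughly like this. It is purely illustrative: SYNC_DELTA_RATE is not given in the quote and is assumed here, and the authoritative logic lives in Nutch's AdaptiveFetchSchedule source:

```python
# Illustrative transcription of the AdaptiveFetchSchedule javadoc quoted
# above, following its wording literally. Times are in seconds.

INC_FACTOR = 0.2
DEC_FACTOR = 0.2
MIN_INTERVAL = 60.0                # 1 minute
MAX_INTERVAL = 365 * 86400.0       # 365 days
SYNC_DELTA = True
SYNC_DELTA_RATE = 0.3              # assumed; not stated in the quote

def adaptive_schedule(fetch_time, modified_time, interval, changed):
    """Return (next_fetch_time, new_interval) for one page."""
    if changed:
        interval *= 1.0 - DEC_FACTOR   # page changed: revisit sooner
    else:
        interval *= 1.0 + INC_FACTOR   # unchanged: back off
    next_fetch = fetch_time + interval
    # The sync step only makes sense when a modification time is known.
    if SYNC_DELTA and changed and modified_time > 0:
        delta = fetch_time - modified_time
        # Shift the next fetch toward the observed change time.
        next_fetch = fetch_time + interval - delta * SYNC_DELTA_RATE
        if interval > delta:           # per the quoted wording
            interval = delta
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return next_fetch, interval
```

This makes the NOTE in the quote easy to check empirically: feeding the function its own output with factors above 0.4 shows the interval drifting toward one bound regardless of page behavior.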


2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:

> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> in Ultraseek.
>
> I think we were the only people who built an adaptive crawler for
> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> to Mike Lynch. He looked at me like I had three heads and didn’t even
> answer me.
>
> Ultraseek also has great support for sites that need login. If you use
> that, you’ll need to find a way to do that with another crawler.
>
> wunder
> Walter Underwood
> Former Ultraseek Principal Engineer
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> <kris.t.musshorn@mail.mil> wrote:
> >
> > CLASSIFICATION: UNCLASSIFIED
> >
> > We are currently using ultraseek and looking to deprecate it in favor of
> solr/nutch.
> > Ultraseek runs all the time and auto detects when pages have changed and
> automatically reindexes them.
> > Is this possible with SOLR/nutch?
> >
> > Thanks,
> > Kris
> >
> > ~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor - Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn@mail.mil
> > ~~
> >
> >
> >
> > CLASSIFICATION: UNCLASSIFIED
>
>


Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Walter Underwood
I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler in 
Ultraseek.

I think we were the only people who built an adaptive crawler for enterprise 
use. I tried to get Ultraseek open-sourced. I made the argument to Mike Lynch. 
He looked at me like I had three heads and didn’t even answer me.

Ultraseek also has great support for sites that need login. If you use that, 
you’ll need to find a way to do that with another crawler.

wunder
Walter Underwood
Former Ultraseek Principal Engineer
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) 
> <kris.t.musshorn@mail.mil> wrote:
> 
> CLASSIFICATION: UNCLASSIFIED
> 
> We are currently using ultraseek and looking to deprecate it in favor of 
> solr/nutch.
> Ultraseek runs all the time and auto detects when pages have changed and 
> automatically reindexes them.
> Is this possible with SOLR/nutch?
> 
> Thanks,
> Kris
> 
> ~~
> Kris T. Musshorn
> FileMaker Developer - Contractor - Catapult Technology Inc.  
> US Army Research Lab 
> Aberdeen Proving Ground 
> Application Management & Development Branch 
> 410-278-7251
> kris.t.musshorn@mail.mil
> ~~
> 
> 
> 
> CLASSIFICATION: UNCLASSIFIED



Re: Solr & Nutch

2014-01-28 Thread Jack Krupansky
1. Nutch follows the links within HTML web pages to crawl the full graph of 
a web of pages.


2. Think of a core as an SQL table - each table/core has a different type of 
data.


3. SolrCloud is all about scaling and availability - multiple shards for 
larger collections and multiple replicas for both scaling of query response 
and availability if nodes go down.


-- Jack Krupansky

-Original Message- 
From: rashmi maheshwari
Sent: Tuesday, January 28, 2014 11:36 AM
To: solr-user@lucene.apache.org
Subject: Solr & Nutch

Hi,

Question 1 -- When Solr can parse HTML and documents like doc, excel, pdf,
etc., why do we need Nutch to parse HTML files? What is different?

Question 2: When do we use multiple cores in Solr? Any practical business
case where we need multiple cores?

Question 3: When do we go for cloud? What is the meaning of implementing
SolrCloud?


--
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org 



Re: Solr & Nutch

2014-01-28 Thread Jorge Luis Betancourt Gonzalez
Q1: Nutch doesn’t only handle the parsing of HTML files; it also uses Hadoop to 
achieve large-scale crawling across multiple nodes. It fetches the content of the 
HTML file, and yes, it also parses it.

Q2: In our case we crawl some websites and store the content in one “main” Solr 
core. We also have a web app with the typical “search box”; we use a separate 
core to store the queries made by our users.

Q3: Not currently using SolrCloud, so I’m going to let this one pass to a more 
experienced fellow.

On Jan 28, 2014, at 11:36 AM, rashmi maheshwari <maheshwari.ras...@gmail.com> 
wrote:

 Hi,
 
 Question1 -- When Solr could parse html, documents like doc, excel pdf
 etc, why do we need nutch to parse html files? what is different?
 
 Questions 2: When do we use multiple core in solar? any practical business
 case when we need multiple cores?
 
 Question 3: When do we go for cloud? What is meaning of implementing solr
 cloud?
 
 
 -- 
 Rashmi
 Be the change that you want to see in this world!
 www.minnal.zor.org
 disha.resolve.at
 www.artofliving.org


III International Winter School at UCI, February 17 to 28, 2014. 
See www.uci.cu


Re: Solr & Nutch

2014-01-28 Thread Alexei Martchenko
1) Plus, those files are sometimes binaries with metadata; specific
crawlers need to understand them. HTML is plain text.

2) Yes, different data schemas. Sometimes I replicate the same core and
run A/B tests with different weights, filters, etc. Some people like to
create CoreA and CoreB with the same schema: hammer one core with updates,
commits, and optimizes while the other stays available for searches,
then swap. This produces faster searches.
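The swap pattern described above corresponds to Solr's CoreAdmin SWAP action. A hedged example request; the host, port, and core names are placeholders:

```
# Index into CoreB while CoreA serves queries, then swap the two atomically:
http://localhost:8983/solr/admin/cores?action=SWAP&core=CoreA&other=CoreB
```

After the swap, queries against the name "CoreA" are served by the freshly built index, and the old index can be hammered with the next round of updates.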


alexei martchenko
Facebook <http://www.facebook.com/alexeiramone> |
LinkedIn <http://br.linkedin.com/in/alexeimartchenko> |
Steam <http://steamcommunity.com/id/alexeiramone> |
4sq <https://pt.foursquare.com/alexeiramone> | Skype: alexeiramone |
Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |


2014-01-28 Jack Krupansky <j...@basetechnology.com>

 1. Nutch follows the links within HTML web pages to crawl the full graph
 of a web of pages.

 2. Think of a core as an SQL table - each table/core has a different type
 of data.

 3. SolrCloud is all about scaling and availability - multiple shards for
 larger collections and multiple replicas for both scaling of query response
 and availability if nodes go down.

 -- Jack Krupansky

 -Original Message- From: rashmi maheshwari
 Sent: Tuesday, January 28, 2014 11:36 AM
 To: solr-user@lucene.apache.org
 Subject: Solr & Nutch


 Hi,

 Question1 -- When Solr could parse html, documents like doc, excel pdf
 etc, why do we need nutch to parse html files? what is different?

 Questions 2: When do we use multiple core in solar? any practical business
 case when we need multiple cores?

 Question 3: When do we go for cloud? What is meaning of implementing solr
 cloud?


 --
 Rashmi
 Be the change that you want to see in this world!
 www.minnal.zor.org
 disha.resolve.at
 www.artofliving.org



Re: Solr & Nutch

2014-01-28 Thread rashmi maheshwari
Thanks all for the quick response.

Today I crawled a webpage using Nutch. The page has many links, but all
anchor tags have href="#", with JavaScript on the onClick event of each
anchor tag to open a new page.

So the crawler didn't crawl any of the links that open via the onClick
event and have the "#" href value.

How can these links be crawled using Nutch?




On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko 
ale...@martchenko.com.br wrote:

 1) Plus, those files are binaries sometimes with metadata, specific
 crawlers need to understand them. html is a plain text

 2) Yes, different data schemes. Sometimes I replicate the same core and
 make some A-B tests with different weights, filters etc etc and some people
 like to creare CoreA and CoreB with the same schema and hammer CoreA with
 updates and commits and optmizes, they make it available for searches while
 hammering CoreB. Then swap again. This produces faster searches.


 alexei martchenko
 Facebook <http://www.facebook.com/alexeiramone> |
 LinkedIn <http://br.linkedin.com/in/alexeimartchenko> |
 Steam <http://steamcommunity.com/id/alexeiramone> |
 4sq <https://pt.foursquare.com/alexeiramone> | Skype: alexeiramone |
 Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |


 2014-01-28 Jack Krupansky j...@basetechnology.com

  1. Nutch follows the links within HTML web pages to crawl the full graph
  of a web of pages.
 
  2. Think of a core as an SQL table - each table/core has a different type
  of data.
 
  3. SolrCloud is all about scaling and availability - multiple shards for
  larger collections and multiple replicas for both scaling of query
 response
  and availability if nodes go down.
 
  -- Jack Krupansky
 
  -Original Message- From: rashmi maheshwari
  Sent: Tuesday, January 28, 2014 11:36 AM
  To: solr-user@lucene.apache.org
  Subject: Solr & Nutch
 
 
  Hi,
 
  Question1 -- When Solr could parse html, documents like doc, excel pdf
  etc, why do we need nutch to parse html files? what is different?
 
  Questions 2: When do we use multiple core in solar? any practical
 business
  case when we need multiple cores?
 
  Question 3: When do we go for cloud? What is meaning of implementing solr
  cloud?
 
 
  --
  Rashmi
  Be the change that you want to see in this world!
  www.minnal.zor.org
  disha.resolve.at
  www.artofliving.org
 




-- 
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org


Re: Solr & Nutch

2014-01-28 Thread Markus Jelsma
Short answer: you can't.

rashmi maheshwari <maheshwari.ras...@gmail.com> wrote:

Thanks All for quick response.

Today I crawled a webpage using nutch. This page have many links. But all
anchor tags have href=# and javascript is written on onClick event of
each anchor tag to open a new page.

So crawler didnt crawl any of those links which were opening using onClick
event and has # href value.

How these links are crawled using nutch?






Re: Solr Nutch

2014-01-28 Thread Alexei Martchenko
Well, not even Google parses those. I'm not sure about Nutch, but some
crawlers (jsoup, I believe) have an option to try to extract full URLs from
plain text, so you can capture some URLs of the form someClickFunction('
http://www.someurl.com/whatever') or even URLs in the middle of a
paragraph. Sometimes it works beautifully; sometimes it misleads you into
parsing URLs shortened with an ellipsis in the middle.
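That jsoup-style heuristic can be sketched as a plain regex sweep over the raw markup — a rough illustration, not Nutch's actual link extractor:

```python
import re

# Grab anything that looks like an absolute URL, even inside an onClick
# handler. As noted above, this heuristic can also mis-capture URLs that
# were shortened with an ellipsis in running text.
URL_RE = re.compile(r"https?://[^\s'\"<>()]+")

html = "<a href='#' onclick=\"someClickFunction('http://www.someurl.com/whatever')\">open</a>"
print(URL_RE.findall(html))
```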



alexei martchenko
Facebook http://www.facebook.com/alexeiramone |
Linkedin http://br.linkedin.com/in/alexeimartchenko |
Steam http://steamcommunity.com/id/alexeiramone/ |
4sq https://pt.foursquare.com/alexeiramone | Skype: alexeiramone |
Github https://github.com/alexeiramone | (11) 9 7613.0966 |





Re: Solr Nutch

2014-01-28 Thread rashmi maheshwari
Thanks Markus and Alexei.




Re: Solr Nutch

2014-01-28 Thread Koji Sekiguchi

1. Nutch follows the links within HTML web pages to crawl the full graph of a 
web of pages.


In addition, I think Nutch has a PageRank-like scoring function, as opposed to
Lucene/Solr, which score using the vector space model.

koji
--
http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html


Re: solr nutch url indexing

2009-08-26 Thread last...@gmail.com

Uri Boness wrote:
Well... yes, it's a tool the Nutch ships with. It also ships with an 
example Solr schema which you can use. 


hi,
is there any documentation explaining what is going on in this schema?

<requestHandler name="/nutch" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">content^0.5 anchor^1.0 title^5.2</str>
     <str name="pf">content^0.5 anchor^1.5 title^5.2 site^1.5</str>
     <str name="fl">url</str>
     <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
     <int name="ps">100</int>
     <bool name="hl">true</bool>
     <str name="q.alt">*:*</str>
     <str name="hl.fl">title url content</str>
     <str name="f.title.hl.fragsize">0</str>
     <str name="f.title.hl.alternateField">title</str>
     <str name="f.url.hl.fragsize">0</str>
     <str name="f.url.hl.alternateField">url</str>
     <str name="f.content.hl.fragmenter">regex</str>
   </lst>
</requestHandler>


Re: solr nutch url indexing

2009-08-26 Thread Uri Boness

Do you mean the schema or the solrconfig.xml?

The request handler is configured in solrconfig.xml, and you can find
out more about this particular configuration at
http://wiki.apache.org/solr/DisMaxRequestHandler.



To understand the schema better, you can read 
http://wiki.apache.org/solr/SchemaXml


Uri




Re: solr nutch url indexing

2009-08-25 Thread Thibaut Lassalle
Thanks for your help.

I use the default Nutch configuration and I use solrindex to hand the Nutch
results to Solr. I get results when I query, therefore Nutch works properly
(it gives a url, title, content, ...).

I would like my Solr queries to emphasize the title field rather than the
content field.

Here is a sample of my schema.xml

..
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
<copyField source="url" dest="id"/>
...


Here is a sample of my solrconfig.xml

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^5.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^5.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>


This configuration queries on content only.
How do I change it to query mostly on title?

I tried changing defaultSearchField to title, but it doesn't work.

Where can I find docs on solr.SearchHandler?

Thanks
t.


Re: solr nutch url indexing

2009-08-25 Thread Uri Boness
It seems to me that this configuration actually does what you want:
queries on title mostly. The default search field doesn't influence a
dismax query. I suggest you include the debugQuery=true parameter; it
will help you figure out how the matching is performed.


You can read more about dismax queries here: 
http://wiki.apache.org/solr/DisMaxRequestHandler
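A dismax request with debugQuery=true can be assembled like this (hypothetical host and query terms; the qf boosts mirror the /nutch handler under discussion):

```python
from urllib.parse import urlencode

# Ask Solr to explain each document's score; the response includes a
# per-document breakdown showing which field matches contributed.
params = {
    "q": "intranet handbook",
    "defType": "dismax",
    "qf": "content^0.5 anchor^1.0 title^5.2",  # title dominates scoring
    "debugQuery": "true",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```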






RE: solr nutch url indexing

2009-08-25 Thread Fuad Efendi
Thanks for the link; so SolrIndex is NOT a plugin, it is an application... I
use a similar approach...

-Original Message-
From: Uri Boness 
Hi,

Nutch comes with support for Solr out of the box. I suggest you follow 
the steps as described here: 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

Cheers,
Uri

Fuad Efendi wrote:
 Is SolrIndex plugin for Nutch? 
 Thanks!


   




Re: solr nutch url indexing

2009-08-25 Thread Uri Boness
Well... yes, it's a tool the Nutch ships with. It also ships with an 
example Solr schema which you can use.




Re: solr nutch url indexing

2009-08-24 Thread Uri Boness

How did you configure nutch?

Make sure you have the parse-html and index-basic plugins configured. The 
HtmlParser should by default extract the page title and add it to the 
parsed data, and the BasicIndexingFilter by default adds this title to 
the NutchDocument and stores it in the title field. All the SolrIndex 
(actually the SolrWriter) does is convert the NutchDocument to a 
SolrInputDocument. So having these plugins configured in Nutch and 
having a field in the schema named title should work. (I'm assuming 
you're using the solrindex tool.)
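Conceptually, that last step — SolrWriter turning a NutchDocument into a SolrInputDocument — is just a field mapping. A toy sketch (the field names follow the example Nutch schema; the JSON-update form is only for illustration, not Nutch's actual code):

```python
import json

def to_solr_add(nutch_doc: dict) -> str:
    # Map a parsed document's fields onto a Solr JSON add command.
    # Assumed fields: url (doubles as the unique key), title, content.
    solr_doc = {
        "id": nutch_doc["url"],
        "url": nutch_doc["url"],
        "title": nutch_doc.get("title", ""),
        "content": nutch_doc.get("content", ""),
    }
    return json.dumps([solr_doc])

print(to_solr_add({"url": "http://example.com/", "title": "Example"}))
```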


Cheers,
Uri

Lassalle, Thibaut wrote:

Hi,

I would like to crawl intranets with nutch and index them with solr.

I would like to search mostly on the title of the pages (the one in
<title>This is a title</title>)

I tried to tweak the schema.xml to do that but nothing is working. I
just have the content indexed.

How do I index on title ?

Thanks
t.


RE: solr nutch url indexing

2009-08-24 Thread Fuad Efendi
Is SolrIndex a plugin for Nutch?
Thanks!

