Re: SOLR + Nutch set up (UNCLASSIFIED)
Ah, the difference between open source and a product. With Ultraseek, we chose a solid, stable algorithm that worked well for 3,000 customers. In open source, it is a research project for every single customer.

I love open source. I’ve brought Solr into Netflix and Chegg. But there is a clear difference between developer-driven and customer-driven software.

I first learned about bounded binary exponential backoff in the Digital/Intel/Xerox (“DIX”) Ethernet spec in 1980. It is a solid algorithm for events with a Poisson distribution, like packet arrival times or web page change times. There is no need for configurable algorithms here, especially configurations that lead to an unstable estimate. The only meaningful choices are the minimum revisit time, the maximum revisit time, and the number of bins. Those will be different for CNN (a launch customer for Ultraseek) than for Sun documentation (another launch customer): CNN news articles change minute by minute, while new Sun documentation appeared weekly or monthly.

Sorry for the rant, but “you can fix the algorithm yourself” almost always means a bad installation, an unhappy admin, and another black eye for open source.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 3, 2016, at 4:07 PM, Markus Jelsma wrote:
>
> Depending on your settings, Nutch does this as well. It is even possible to
> set up different increment/decrement values per MIME type.
> The algorithms are pluggable and overridable at any point of interest. You
> can go all the way.
>
> [...]
RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
No, just run it continuously, always! By default everything is refetched (if possible) every 30 days. Just read the description of the adaptive schedule and its javadoc. It is simple to use, but sometimes hard to predict its outcome, just because you never know what changes, or at what time. You will be fine with the defaults if you have a small site. Just set the interval to a few days, or more if your site is slightly larger.

M.

-Original message-
> From: Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> Sent: Wednesday 3rd August 2016 20:08
> To: solr-user@lucene.apache.org
> Subject: RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
>
> CLASSIFICATION: UNCLASSIFIED
>
> Shall I assume that, even though Nutch has adaptive capability, I would still
> have to figure out how to trigger it to go look for content that needs updating?
>
> Thanks,
> Kris
>
> [...]
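As a concrete starting point, enabling and bounding the adaptive schedule might look like the nutch-site.xml fragment below. The property names are taken from Nutch 1.x's nutch-default.xml and should be verified against your version; the values are only examples, not recommendations.

```xml
<!-- nutch-site.xml: switch to the adaptive schedule and bound it. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>259200</value> <!-- initial interval: 3 days, in seconds -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400</value> <!-- never refetch more often than daily -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>2592000</value> <!-- cap the interval at 30 days -->
</property>
```

The inc_rate/dec_rate properties in the same `db.fetch.schedule.adaptive.*` namespace tune the per-fetch adjustment; the javadoc's warning about values above 0.4 applies to those.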
RE: SOLR + Nutch set up (UNCLASSIFIED)
Depending on your settings, Nutch does this as well. It is even possible to set up different increment/decrement values per MIME type. The algorithms are pluggable and overridable at any point of interest. You can go all the way.

-Original message-
> From: Walter Underwood
> Sent: Wednesday 3rd August 2016 20:03
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
>
> That’s good news.
>
> It should reset the interval estimate on page change instead of slowly
> shortening it.
>
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the
> page had not changed.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> [...]
RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)
CLASSIFICATION: UNCLASSIFIED

Shall I assume that, even though Nutch has adaptive capability, I would still have to figure out how to trigger it to go look for content that needs updating?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.
US Army Research Lab
Aberdeen Proving Ground
Application Management & Development Branch
410-278-7251
kris.t.musshorn@mail.mil
~~

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Wednesday, August 03, 2016 2:03 PM
To: solr-user@lucene.apache.org
Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

All active links contained in this email were disabled. Please verify the identity of the sender, and confirm the authenticity of all links contained within the message prior to copying and pasting the address to a Web browser.

That’s good news.

It should reset the interval estimate on page change instead of slowly shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
Caution-http://observer.wunderwood.org/ (my blog)

> [...]

CLASSIFICATION: UNCLASSIFIED
Re: SOLR + Nutch set up (UNCLASSIFIED)
That’s good news.

It should reset the interval estimate on page change instead of slowly shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 3, 2016, at 10:51 AM, Marco Scalone wrote:
>
> Nutch also has an adaptive strategy:
>
> [...]
Re: SOLR + Nutch set up (UNCLASSIFIED)
Nutch also has an adaptive strategy:

> This class implements an adaptive re-fetch algorithm. This works as follows:
>
> - for pages that have changed since the last fetchTime, decrease their
>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> - for pages that haven't changed since the last fetchTime, increase their
>   fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>   If the SYNC_DELTA property is true, then:
>   - calculate a delta = fetchTime - modifiedTime
>   - try to synchronize with the time of change, by shifting the next
>     fetchTime by a fraction of the difference between the last modification
>     time and the last fetch time. I.e. the next fetch time will be set to
>     fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>   - if the adjusted fetch interval is bigger than the delta, then
>     fetchInterval = delta.
> - the minimum value of fetchInterval may not be smaller than MIN_INTERVAL
>   (default is 1 minute).
> - the maximum value of fetchInterval may not be bigger than MAX_INTERVAL
>   (default is 365 days).
>
> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> the algorithm, so that the fetch interval either increases or decreases
> infinitely, with little relevance to the page changes. Please use the
> main(String[])
> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> method to test the values before applying them in a production system.

From:
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html

2016-08-03 14:45 GMT-03:00 Walter Underwood:

> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> in Ultraseek.
>
> I think we were the only people who built an adaptive crawler for
> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> to Mike Lynch. He looked at me like I had three heads and didn’t even
> answer me.
>
> Ultraseek also has great support for sites that need login. If you use
> that, you’ll need to find a way to do that with another crawler.
>
> wunder
> Walter Underwood
> Former Ultraseek Principal Engineer
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> [...]
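The rules quoted above can be transcribed into a small Python sketch for experimentation. This is not Nutch's actual AdaptiveFetchSchedule class; the constants mirror the javadoc defaults where they are stated, and SYNC_DELTA_RATE is an assumed value that should be checked against nutch-default.xml.

```python
INC_RATE = 0.2   # grow interval for unchanged pages
DEC_RATE = 0.2   # shrink interval for changed pages
MIN_INTERVAL = 60.0            # 1 minute, in seconds
MAX_INTERVAL = 365 * 86400.0   # 365 days
SYNC_DELTA = True
SYNC_DELTA_RATE = 0.3          # assumed value, not from the javadoc

def next_fetch(fetch_time, modified_time, interval, changed):
    """Return (next_fetch_time, new_interval); all times in seconds."""
    if changed:
        interval *= 1.0 - DEC_RATE
    else:
        interval *= 1.0 + INC_RATE

    shift = 0.0
    if SYNC_DELTA and changed:
        delta = fetch_time - modified_time
        if interval > delta:   # adjusted interval must not exceed the gap
            interval = delta
        shift = delta * SYNC_DELTA_RATE  # pull next fetch toward change time

    # clamp to the configured bounds
    interval = max(MIN_INTERVAL, min(interval, MAX_INTERVAL))
    return fetch_time + interval - shift, interval
```

Driving this in a loop with simulated change times makes the NOTE's warning visible: with rates above 0.4 the interval oscillates instead of converging.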
Re: SOLR + Nutch set up (UNCLASSIFIED)
I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler in Ultraseek.

I think we were the only people who built an adaptive crawler for enterprise use. I tried to get Ultraseek open-sourced. I made the argument to Mike Lynch. He looked at me like I had three heads and didn’t even answer me.

Ultraseek also has great support for sites that need login. If you use that, you’ll need to find a way to do that with another crawler.

wunder
Walter Underwood
Former Ultraseek Principal Engineer
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) wrote:
>
> CLASSIFICATION: UNCLASSIFIED
>
> We are currently using ultraseek and looking to deprecate it in favor of solr/nutch.
> Ultraseek runs all the time and auto detects when pages have changed and automatically reindexes them.
> Is this possible with SOLR/nutch?
>
> Thanks,
> Kris
>
> [...]
SOLR + Nutch set up (UNCLASSIFIED)
CLASSIFICATION: UNCLASSIFIED

We are currently using Ultraseek and looking to deprecate it in favor of Solr/Nutch.
Ultraseek runs all the time, auto-detects when pages have changed, and automatically reindexes them.
Is this possible with Solr/Nutch?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor - Catapult Technology Inc.
US Army Research Lab
Aberdeen Proving Ground
Application Management & Development Branch
410-278-7251
kris.t.musshorn@mail.mil
~~

CLASSIFICATION: UNCLASSIFIED
Re: Solr & Nutch
1. Nutch follows the links within HTML web pages to crawl the full graph of a web of pages. In addition, I think Nutch has a PageRank-like scoring function, as opposed to Lucene/Solr, which are based on vector space model scoring.

koji
--
http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html
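For readers unfamiliar with the contrast Koji draws: vector-space scoring ranks a document by term overlap with the query, while PageRank-style scoring uses the link graph. Below is a toy cosine-similarity sketch of the vector-space side only; it is an illustration, not Lucene's actual formula, which adds IDF weighting, length norms, and boosts.

```python
import math
from collections import Counter

def cosine_score(query: str, doc: str) -> float:
    """Toy vector-space similarity: raw term-frequency vectors, cosine angle.

    Real Lucene/Solr scoring is considerably richer; this only shows the
    geometric idea of scoring by the angle between two term vectors.
    """
    q, d = Counter(query.split()), Counter(doc.split())
    dot = sum(q[t] * d[t] for t in q)          # shared-term contribution
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```

A link-based score, by contrast, is a property of the page alone (its position in the web graph) and is query-independent.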
Re: Solr & Nutch
Thanks Markus and Alexei.

On Wed, Jan 29, 2014 at 12:08 AM, Alexei Martchenko <ale...@martchenko.com.br> wrote:

> Well, not even Google parses those. I'm not sure about Nutch, but in some
> crawlers (jsoup, I believe) there's an option to try to get full URLs from
> plain text, so you can capture some URLs in the form of
> someClickFunction('http://www.someurl.com/whatever') or even if they are in
> the middle of some paragraph. Sometimes it works beautifully, sometimes it
> misleads you into parsing URLs shortened with an ellipsis in the middle.
>
> [...]

--
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org
Re: Solr & Nutch
Well, not even Google parses those. I'm not sure about Nutch, but in some crawlers (jsoup, I believe) there's an option to try to get full URLs from plain text, so you can capture some URLs in the form of someClickFunction('http://www.someurl.com/whatever') or even if they are in the middle of some paragraph. Sometimes it works beautifully, sometimes it misleads you into parsing URLs shortened with an ellipsis in the middle.

alexei martchenko
Facebook <http://www.facebook.com/alexeiramone> | Linkedin <http://br.linkedin.com/in/alexeimartchenko> | Steam <http://steamcommunity.com/id/alexeiramone/> | 4sq <https://pt.foursquare.com/alexeiramone> | Skype: alexeiramone | Github <https://github.com/alexeiramone> | (11) 9 7613.0966

2014-01-28 rashmi maheshwari:

> Thanks all for the quick response.
>
> Today I crawled a web page using Nutch. This page has many links, but all
> anchor tags have href="#", and JavaScript is written on the onClick event
> of each anchor tag to open a new page.
>
> So the crawler didn't crawl any of those links, which open via the onClick
> event and have a # href value.
>
> How are these links crawled using Nutch?
>
> On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko wrote:
>
>> 1) Plus, those files are sometimes binaries with metadata; specific
>> crawlers need to understand them. HTML is plain text.
>>
>> 2) Yes, different data schemas. Sometimes I replicate the same core and
>> run some A-B tests with different weights, filters, etc., and some people
>> like to create CoreA and CoreB with the same schema and hammer CoreA with
>> updates, commits, and optimizes, then make it available for searches while
>> hammering CoreB. Then swap again. This produces faster searches.
>>
>> 2014-01-28 Jack Krupansky:
>>
>>> 1. Nutch follows the links within HTML web pages to crawl the full graph
>>> of a web of pages.
>>>
>>> 2. Think of a core as an SQL table - each table/core has a different type
>>> of data.
>>>
>>> 3. SolrCloud is all about scaling and availability - multiple shards for
>>> larger collections and multiple replicas for both scaling of query
>>> response and availability if nodes go down.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: rashmi maheshwari
>>> Sent: Tuesday, January 28, 2014 11:36 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Solr & Nutch
>>>
>>> Hi,
>>>
>>> Question 1: When Solr can parse HTML and documents like doc, excel, pdf,
>>> etc., why do we need Nutch to parse HTML files? What is different?
>>>
>>> Question 2: When do we use multiple cores in Solr? Any practical business
>>> case when we need multiple cores?
>>>
>>> Question 3: When do we go for cloud? What is the meaning of implementing
>>> SolrCloud?
>>>
>>> --
>>> Rashmi
>>> Be the change that you want to see in this world!
>>> www.minnal.zor.org
>>> disha.resolve.at
>>> www.artofliving.org
>
> [...]
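The "URLs from plain text" heuristic Alexei describes can be approximated with a simple regular-expression scan over the raw page source, including JavaScript such as someClickFunction('http://...'). This is an illustration only, not jsoup's or any crawler's actual extractor, and the pattern is deliberately naive: it finds absolute http(s) URLs and exhibits exactly the failure mode he mentions (it happily captures URLs that were truncated with an ellipsis).

```python
import re

# Stop a URL at whitespace, quotes, parentheses, and angle brackets, so a
# URL embedded in code like someClickFunction('http://...') ends cleanly.
URL_RE = re.compile(r"https?://[^\s'\"()<>]+")

def extract_urls(text: str) -> list[str]:
    """Pull absolute http(s) URLs out of arbitrary text, jsoup-style."""
    return URL_RE.findall(text)
```

Running this over a page whose anchors are all href="#" with onClick handlers would recover the handler URLs that a link-following crawler misses.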
Re: Solr & Nutch
Short answer: you can't.

rashmi maheshwari wrote:
> Thanks all for the quick response. Today I crawled a web page using Nutch. This page has many links, but all anchor tags have href="#" and JavaScript is attached to the onClick event of each anchor tag to open a new page. So the crawler didn't crawl any of the links that open via onClick and have a # href value. How are these links crawled using Nutch?
Re: Solr & Nutch
Thanks all for the quick response.

Today I crawled a web page using Nutch. This page has many links, but all anchor tags have href="#" and JavaScript is attached to the onClick event of each anchor tag to open a new page.

So the crawler didn't crawl any of the links that open via onClick and have a # href value.

How are these links crawled using Nutch?

On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko <ale...@martchenko.com.br> wrote:
> 1) Plus, those files are sometimes binaries with metadata; specific crawlers need to understand them. HTML is plain text.
>
> 2) Yes, different data schemes.

--
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org
Re: Solr & Nutch
1) Plus, those files are sometimes binaries with metadata; specific crawlers need to understand them. HTML is plain text.

2) Yes, different data schemes. Sometimes I replicate the same core and run some A-B tests with different weights, filters, etc., and some people like to create CoreA and CoreB with the same schema, hammer CoreA with updates, commits, and optimizes, then make it available for searches while hammering CoreB, then swap again. This produces faster searches.

2014-01-28 Jack Krupansky

> 1. Nutch follows the links within HTML web pages to crawl the full graph of a web of pages.
>
> 2. Think of a core as an SQL table - each table/core has a different type of data.
>
> 3. SolrCloud is all about scaling and availability - multiple shards for larger collections and multiple replicas for both scaling of query response and availability if nodes go down.
Re: Solr & Nutch
Q1: Nutch doesn't only handle the parsing of HTML files; it also uses Hadoop to achieve large-scale crawling across multiple nodes. It fetches the content of the HTML file and, yes, it also parses its content.

Q2: In our case we use Nutch to crawl some websites and store the content in one "main" Solr core. We also have a web app with the typical "search box"; we use a separate core to store the queries made by our users.

Q3: We're not currently using SolrCloud, so I'll let this one pass to a more experienced fellow.

On Jan 28, 2014, at 11:36 AM, rashmi maheshwari wrote:
> Question 1: When Solr can parse HTML and documents like Word, Excel, PDF, etc., why do we need Nutch to parse HTML files? What is different?
>
> Question 2: When do we use multiple cores in Solr? Any practical business case where we need multiple cores?
>
> Question 3: When do we go for cloud? What is the meaning of implementing SolrCloud?
Re: Solr & Nutch
1. Nutch follows the links within HTML web pages to crawl the full graph of a web of pages.

2. Think of a core as an SQL table - each table/core holds a different type of data.

3. SolrCloud is all about scaling and availability - multiple shards for larger collections, and multiple replicas for both scaling of query response and availability if nodes go down.

-- Jack Krupansky

-----Original Message----- From: rashmi maheshwari
Sent: Tuesday, January 28, 2014 11:36 AM
To: solr-user@lucene.apache.org
Subject: Solr & Nutch
Solr & Nutch
Hi,

Question 1: When Solr can parse HTML and documents like Word, Excel, PDF, etc., why do we need Nutch to parse HTML files? What is different?

Question 2: When do we use multiple cores in Solr? Any practical business case where we need multiple cores?

Question 3: When do we go for cloud? What is the meaning of implementing SolrCloud?

--
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org
AjaxSolr + Solr + Nutch question
I followed https://github.com/evolvingweb/ajax-solr/wiki/reuters-tutorial for the Ajax-Solr setup. Ajax-Solr is running, but it only searches the Reuters data. If I want to crawl the web using Nutch and integrate it with Solr, I have to replace Solr's schema.xml with Nutch's schema.xml, which will not match the Ajax-Solr configuration. By replacing the schema.xml files, Ajax-Solr won't work (correct me if I am wrong). How would I integrate Solr with Nutch along with Ajax-Solr, so that Ajax-Solr can search other data on the web as well?

Thanks
Regards
Praful Bagai

--
View this message in context: http://lucene.472066.n3.nabble.com/AjaxSolr-Solr-Nutch-question-tp3995030.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Spellcheck in solr-nutch integration
First, go through the schema.xml file and look at the different components.

On Sat, Feb 5, 2011 at 1:01 PM, 666 [via Lucene] <ml-node+2429702-1399813783-146...@n3.nabble.com> wrote:
> Hello Anurag, I'm facing the same problem. Will you please elaborate on how you solved it?

-- Kumar Anurag
Re: Spellcheck in solr-nutch integration
Hello Anurag, I'm facing the same problem. Will you please elaborate on how you solved it? It would be great if you gave me a step-by-step description, as I'm new to Solr.
Re: Spellcheck in solr-nutch integration
I solved the problem. All we need is to modify the schema file. Also, the spellcheck index is first created when spellcheck.build=true.

- Kumar Anurag
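For anyone landing here, the pieces that typically have to change line up with Anurag's summary. This is a hedged sketch for a Solr 1.3/1.4-era setup (the field name, type, and index directory are illustrative, not taken from his actual files):

```xml
<!-- schema.xml: a spell field fed from the crawled content -->
<field name="spell" type="textSpell" indexed="true" stored="false"/>
<copyField source="content" dest="spell"/>

<!-- solrconfig.xml: a spellcheck component reading that field -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>
```

The spelling index itself is then built lazily, e.g. by a first request carrying &spellcheck=true&spellcheck.build=true, matching the "created first when spellcheck.build=true" behavior described above.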
Spellcheck in solr-nutch integration
I have integrated Solr and Nutch using this tutorial: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

As the tutorial says, the schema.xml and solrconfig.xml of Solr have to be modified. I did the same. I am using Solr 1.3. But my problem is that I am not able to implement spellcheck in this Solr-Nutch integration. I have a separate Solr 1.4 where options are available for spellcheck. What I want to ask is:

1. Is indexing for spellcheck to be done at the same time as indexing the contents? What are the steps to follow?
2. How can I implement spellcheck in the Solr-Nutch integration?

Please help.

- Kumar Anurag
Seeking Solr/Nutch consultant in San Jose, CA
Hi, I am working with a SaaS vendor who is integrated with Nutch 0.9 and SOLR. We are looking for some help to migrate this to Nutch 1.0. The work involves: 1) We made changes to Nutch 0.9; these need to be ported to Nutch 1.0. 2) Configure SOLR integration with Nutch 1.0 3) Configure SOLR to do Japanese indexing; expose this configuration as part of Baynote configuration. 4) Check if indexes are portable between Nutch 0.9 and Nutch 1.0 - should we re-index? Please email me if there is interest. The work is in San Jose, CA. Duration and rate are not yet known. Best regards, Leann Leann Pereira | o: +1 650.425.7950 | le...@1sourcestaffing.com | Senior Technical Recruiter
Re: solr nutch url indexing
Do you mean the schema or the solrconfig.xml? The request handler is configured in solrconfig.xml, and you can find out more about this particular configuration at http://wiki.apache.org/solr/DisMaxRequestHandler. To understand the schema better, you can read http://wiki.apache.org/solr/SchemaXml

Uri

last...@gmail.com wrote:
> hi, is there any documentation to understand what is going on in the schema?
Re: solr nutch url indexing
Uri Boness wrote:
> Well... yes, it's a tool Nutch ships with. It also ships with an example Solr schema which you can use.

Hi, is there any documentation to understand what is going on in the schema?

dismax explicit 0.01 content^0.5 anchor^1.0 title^5.2 content^0.5 anchor^1.5 title^5.2 site^1.5 url 2<-1 5<-2 6<90% 100 *:* title url content 0 title 0 url regex
Re: solr nutch url indexing
Well... yes, it's a tool Nutch ships with. It also ships with an example Solr schema which you can use.

Fuad Efendi wrote:
> Thanks for the link. So SolrIndex is NOT a plugin, it is an application... I use a similar approach...
RE: solr nutch url indexing
Thanks for the link. So SolrIndex is NOT a plugin, it is an application... I use a similar approach...

-----Original Message----- From: Uri Boness

> Hi, Nutch comes with support for Solr out of the box. I suggest you follow the steps as described here: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
>
> Cheers,
> Uri
Re: solr nutch url indexing
It seems to me that this configuration actually does what you want - it queries mostly on "title". The default search field doesn't influence a dismax query. I suggest you include the debugQuery=true parameter; it will help you figure out how the matching is performed. You can read more about dismax queries here: http://wiki.apache.org/solr/DisMaxRequestHandler

Thibaut Lassalle wrote:
> I would like the query to emphasize the "title" field and not the "content" field. This configuration queries on "content" only. How do I change it to query mostly on "title"? I tried to change "defaultSearchField" to "title" but it doesn't work. Where can I find docs on solr.SearchHandler?
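For reference, the tag-stripped excerpt quoted in this thread lines up with the dismax request handler from the standard Nutch/Solr tutorial config. A reconstruction (the handler name is an assumption; the values are the ones visible in the mail):

```xml
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^5.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^5.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
```

With qf already boosting title^5.2 against content^0.5, title matches dominate, which is Uri's point; to push further toward the title, raise the title boost in qf rather than touching defaultSearchField.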
Re: solr nutch url indexing
Thanks for your help.

I use the default Nutch configuration, and I use solrindex to hand the Nutch results to Solr. I get results when I query, so Nutch works properly (it gives a url, title, content, ...).

I would like the query to emphasize the "title" field and not the "content" field.

Here is a sample of my schema.xml: .. id content ...

Here is a sample of my solrconfig.xml: dismax explicit 0.01 content^0.5 anchor^1.0 title^5.2 content^0.5 anchor^1.5 title^5.2 site^1.5 url 2<-1 5<-2 6<90% 100 *:* title url content 0 title 0 url regex

This configuration queries on "content" only. How do I change it to query mostly on "title"? I tried to change "defaultSearchField" to "title" but it doesn't work. Where can I find docs on solr.SearchHandler?

Thanks
t.
Re: solr nutch url indexing
Hi,

Nutch comes with support for Solr out of the box. I suggest you follow the steps as described here: http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

Cheers,
Uri

Fuad Efendi wrote:
> Is SolrIndex a plugin for Nutch? Thanks!
RE: solr nutch url indexing
Is SolrIndex a plugin for Nutch? Thanks!

-----Original Message----- From: Uri Boness [mailto:ubon...@gmail.com]
Sent: August-24-09 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: solr nutch url indexing

> How did you configure Nutch? Make sure you have the "parse-html" and "index-basic" plugins configured.
Re: solr nutch url indexing
How did you configure Nutch? Make sure you have the "parse-html" and "index-basic" plugins configured. The HtmlParser should by default extract the page title and add it to the parsed data, and the BasicIndexingFilter by default adds this title to the NutchDocument and stores it in the "title" field. All SolrIndex (actually the SolrWriter) does is convert the NutchDocument to a SolrInputDocument. So having these plugins configured in Nutch, and having a field in the schema named "title", should work. (I'm assuming you're using the "solrindex" tool.)

Cheers,
Uri

Lassalle, Thibaut wrote:
> I would like to crawl intranets with Nutch and index them with Solr. I would like to search mostly on the title of the pages. I tried to tweak the schema.xml to do that but nothing is working; I just have the content indexed. How do I index the title?
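Those two plugins are enabled through the plugin.includes property in nutch-site.xml. A hedged sketch (the surrounding plugin list is illustrative; keep whatever else your crawl already needs):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```

With parse-html and index-basic in that list, the crawled page title lands in the NutchDocument's "title" field, which solrindex then maps onto the Solr schema field of the same name.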
solr nutch url indexing
Hi,

I would like to crawl intranets with Nutch and index them with Solr.

I would like to search mostly on the title of the pages (the one in <title>This is a title</title>).

I tried to tweak the schema.xml to do that but nothing is working. I just have the content indexed.

How do I index the title?

Thanks
t.
NYC Apache Lucene/Solr/Nutch/etc. Meetup
Hi All,

(sorry for the cross-post) For those in NYC, there will be a Lucene ecosystem (Lucene/Solr/Mahout/Nutch/Tika/Droids/Lucene ports) Meetup on July 22, hosted by MTV Networks and co-sponsored with Lucid Imagination. For more info and to RSVP, see http://www.meetup.com/NYC-Apache-Lucene-Solr-Meetup/ . There is limited seating, so get your spot early. Note, you must register with your first and last name so that security badges can be printed ahead of time for access.

Cheers,
Grant
Re: Snipets Solr/nutch
On 15-Apr-08, at 1:37 PM, khirb7 wrote:
> Thank you a lot, you are helpful. Concerning my Solr, I am using the 1.2.0 version; I downloaded it from the Apache download mirror http://www.apache.org/dyn/closer.cgi/lucene/solr/ . I didn't quite understand you when you said: "you're trying to apply a patch that has long since been applied to Solr."

Hi khirb,

You could try looking at "trunk" (the development version of Solr that hasn't yet been released). It contains all the features you were trying to add manually to your version. You can download a "nightly" build of Solr here: http://people.apache.org/builds/lucene/solr/nightly/

regards,
-Mike
Re: Snipets Solr/nutch
Mike Klaas wrote:
> It might be easier for people to help you if you keep things in one thread.
>
> I notice that you're trying to apply a patch that has long since been applied to Solr (another thread). What version of Solr are you using? How did you acquire it?
>
> -Mike

Hi Mike,

Thank you a lot, you are helpful. Concerning my Solr, I am using the 1.2.0 version; I downloaded it from the Apache download mirror http://www.apache.org/dyn/closer.cgi/lucene/solr/ . I didn't quite understand you when you said: "you're trying to apply a patch that has long since been applied to Solr."

Thank you, Mike.
Re: Snipets Solr/nutch
On 13-Apr-08, at 3:25 AM, khirb7 wrote:
> It doesn't work; Solr still uses the default value fragsize=100. Also, I am not able to specify the regex fragmenter, due to this version problem I suppose, or the way I am declaring it in <highlighting>.

Hi khirb,

It might be easier for people to help you if you keep things in one thread.

I notice that you're trying to apply a patch that has long since been applied to Solr (another thread). What version of Solr are you using? How did you acquire it?

-Mike
Re: Snipets Solr/nutch
hello,

Mike advised me last time:

> This is done by the fragmenting stage of highlighting. Solr (trunk) ships with a fragmenter that looks for sentence-like snippets using regular expressions: try hl.fragmenter=regex (see config in solrconfig.xml).

The problem is I wasn't able either to do that or to specify the fragsize from solrconfig.xml. I think it is due to the version of Solr I use and which class and package I specify. I put this in solrconfig.xml: a gap fragmenter with fragsize 400, and a regex fragmenter with fragsize 70, slop 0.5, and pattern [-\w ,/\n\"']{20,200}. So whether I use org.apache.solr.util.GapFragmenter (specific to Solr 1.2) or not, it doesn't work; Solr still uses the default value fragsize=100. Also, I am not able to specify the regex fragmenter, due to this version problem I suppose, or the way I am declaring it in <highlighting>, because both ways it still uses fragsize=100 while I am using 400 as shown above.

thank you.
Re: Snipets Solr/nutch
On 10-Apr-08, at 12:26 AM, khirb7 wrote:
> To analyse and modify Solr's snippets, I want to know if org.apache.solr.util.HighlightingUtils is the class generating the snippets, and which method generates them. All that in order not to return the first word encountered highlighted, but to return another one, because of the problem I explained in my previous messages.

Unfortunately I am not familiar with Nutch's snippet generation. Solr's highlighting is located in org.apache.solr.util.HighlightingUtils in version 1.2; in the current (trunk) version, it is located in the org.apache.solr.highlight.* package.

Your use case is a little tricky. The best way to deal with it, in my opinion, is to strip out the header before sending the data to Solr. This will improve your highlighting _and_ your search relevance.

-Mike
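Mike's "strip out the header before indexing" suggestion can be as simple as a pre-processing pass before the document is posted to Solr. A minimal sketch, assuming the shared banner sits in a div with id="header" (the id, class name, and sample page are hypothetical):

```java
import java.util.regex.Pattern;

public class HeaderStripper {
    // Assumption: the repeated site banner lives in <div id="header">...</div>.
    // (?s) lets '.' cross newlines; the lazy quantifier stops at the first </div>.
    private static final Pattern HEADER =
        Pattern.compile("(?s)<div id=\"header\">.*?</div>");

    public static String stripHeader(String html) {
        // Remove the banner so its text never reaches the indexed field.
        return HEADER.matcher(html).replaceFirst("");
    }

    public static void main(String[] args) {
        String page = "<div id=\"header\">President Bush daily brief</div>"
                    + "<p>Article body about Bush.</p>";
        System.out.println(stripHeader(page));
    }
}
```

Note that a regex like this ignores nested <div>s inside the header; for real pages an HTML parser is sturdier, but the idea is the same: drop the boilerplate before indexing, and the highlighter can only pick fragments from the actual article text.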
Re: Snipets Solr/nutch(maxFragSize?)
khirb7 wrote:
> Just one other question: to analyse and modify Solr's snippets, I want to know if org.apache.solr.util.HighlightingUtils is the class generating the snippets, and which method generates them.

I have done a deeper search and I found that Lucene provides this method: getBestFragments - highlighter.getBestFragments(tokenStream, text, maxNumFragments, "..."); - so with this method we can tell Lucene to return maxNumFragments fragments (each with a highlighted word) of fragsize characters, but there is no maxFragSize parameter in Solr. This would be useful in my case if I want to highlight not only the first occurrence of a searched word but other occurrences of the same word as well.

cheers
Re: Snipets Solr/nutch
hello everybody,

Just one other question: to analyse and modify Solr's snippets, I want to know if org.apache.solr.util.HighlightingUtils is the class generating the snippets, and which method generates them. Could you please explain how they are generated in that class and where exactly to modify it? All that in order not to return the first word encountered highlighted, but to return another one, because of the problem I explained in my previous messages.

Cheers
Re: Snipets Solr/nutch
Thank you for your response. I have another problem with snippets. Here it is: I transform the HTML code into text, then I index all this generated text into one field called myText. Many pages have a common header with common information (for example, a web site about president Bush), and the word "bush" appears in this header. If I highlight the field myText while searching for the word "bush", I will always get the same sentence containing "bush" highlighted (the sentence of the common header containing the word "bush"), because I have set fragsize to 150 and Solr returns, from the whole text, the first encountered word ("bush") highlighted. How can I deal with that? I was told that NutchWAX handles this problem; is that true? If true, how can I integrate the Nutch classes into Solr?

Thank you in advance.
Re: Snipets Solr/nutch
On 7-Apr-08, at 7:12 AM, khirb7 wrote:
> My snippets are truncated and not really meaningful. I think the best way is to show the whole sentence and not to truncate it, maybe by paying attention to the punctuation (the comma or the capital letter).

This is done by the fragmenting stage of highlighting. Solr (trunk) ships with a fragmenter that looks for sentence-like snippets using regular expressions: try hl.fragmenter=regex (see config in solrconfig.xml).

regards,
-Mike
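The regex fragmenter Mike mentions is declared in the example solrconfig.xml on trunk; it is roughly this shape (the exact defaults may differ between nightlies):

```xml
<highlighting>
  <fragmenter name="gap" default="true" class="org.apache.solr.highlight.GapFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">100</int>
    </lst>
  </fragmenter>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">70</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">[-\w ,/\n\&quot;']{20,200}</str>
    </lst>
  </fragmenter>
</highlighting>
```

Select it per request with &hl=true&hl.fragmenter=regex, or make it the default by moving default="true" onto the regex fragmenter.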
Re: Snipets Solr/nutch
khirb7 wrote:
> I am using Solr in my project, and I want to use the Solr snippets generated by the highlighting. The problem is that these snippets aren't really well displayed; they are truncated and not really meaningful. I heard that Nutch provides good snippets. Is it possible, and how, to integrate them into my Solr?

Hi everybody,

I am digging in the Solr classes, and I am looking for a solution for the generated snippets. First of all, I want to know in which class and where these snippets are generated. My snippets look like this: "project, and I want to use solr snipets generated by the highlighting". Do you see? Starting with "project" makes no sense. I think the best way is to show the whole sentence, like this: "I am using solr in my project, and I want to use solr snipets generated by the highlighting", and not to truncate it, maybe by paying attention to the punctuation (the comma or the capital letter).

thank you in advance.
Snipets Solr/nutch
hello everybody,

I am using Solr in my project, and I want to use the Solr snippets generated by the highlighting. The problem is that these snippets aren't really well displayed; they are truncated and not really meaningful. I heard that Nutch provides good snippets. Is it possible, and how, to integrate them into my Solr?

thank you in advance.