Thanks Markus and Alexei.
On Wed, Jan 29, 2014 at 12:08 AM, Alexei Martchenko <ale...@martchenko.com.br> wrote:

> Well, not even Google parses those. I'm not sure about Nutch, but some
> crawlers (jSoup, I believe) have an option to try to extract full URLs from
> plain text, so you can capture URLs in the form of
> someClickFunction('http://www.someurl.com/whatever'), or even URLs sitting
> in the middle of a paragraph. Sometimes it works beautifully; sometimes it
> misleads you into parsing URLs that were shortened with an ellipsis in the
> middle.
>
> alexei martchenko
> Facebook <http://www.facebook.com/alexeiramone> |
> Linkedin <http://br.linkedin.com/in/alexeimartchenko> |
> Steam <http://steamcommunity.com/id/alexeiramone/> |
> 4sq <https://pt.foursquare.com/alexeiramone> | Skype: alexeiramone |
> Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |
>
> 2014-01-28 rashmi maheshwari <maheshwari.ras...@gmail.com>
>
> > Thanks, all, for the quick response.
> >
> > Today I crawled a web page using Nutch. The page has many links, but
> > every anchor tag has href="#", with JavaScript on the onClick event of
> > each anchor tag to open a new page.
> >
> > So the crawler didn't crawl any of the links that open via the onClick
> > event and have a "#" href value.
> >
> > How can these links be crawled using Nutch?
> >
> > On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko
> > <ale...@martchenko.com.br> wrote:
> >
> > > 1) Plus, those files are sometimes binaries with metadata; specific
> > > crawlers are needed to understand them. HTML is plain text.
> > >
> > > 2) Yes, different data schemas. Sometimes I replicate the same core
> > > and run A-B tests with different weights, filters, etc., and some
> > > people like to create CoreA and CoreB with the same schema, hammer
> > > CoreA with updates, commits and optimizes, then make it available for
> > > searches while hammering CoreB. Then swap again. This produces faster
> > > searches.
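[The CoreA/CoreB exchange described above maps onto Solr's CoreAdmin SWAP action. A minimal sketch, assuming a Solr instance at localhost:8983 and hypothetical core names coreA/coreB:]

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Sketch of the CoreA/CoreB swap pattern, using Solr's CoreAdmin SWAP
// action. Host, port and core names here are assumptions for illustration.
public class CoreSwap {

    // Build the CoreAdmin request that atomically exchanges two core names,
    // so the freshly indexed core starts serving searches.
    static String swapUrl(String solrBase, String core, String other) {
        return solrBase + "/admin/cores?action=SWAP&core=" + core + "&other=" + other;
    }

    public static void main(String[] args) throws IOException {
        String url = swapUrl("http://localhost:8983/solr", "coreA", "coreB");
        System.out.println(url);
        // Uncomment to actually issue the swap against a running Solr:
        // try (InputStream in = new URL(url).openStream()) {
        //     in.transferTo(System.out);
        // }
    }
}
```

[After the swap, queries that hit "coreA" are served by what was "coreB", so indexing load and search load never share a core.]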
> > > alexei martchenko
> > >
> > > 2014-01-28 Jack Krupansky <j...@basetechnology.com>
> > >
> > > > 1. Nutch follows the links within HTML web pages to crawl the full
> > > > graph of a web of pages.
> > > >
> > > > 2. Think of a core as an SQL table - each table/core holds a
> > > > different type of data.
> > > >
> > > > 3. SolrCloud is all about scaling and availability - multiple shards
> > > > for larger collections, and multiple replicas for both scaling of
> > > > query response and availability if nodes go down.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > -----Original Message----- From: rashmi maheshwari
> > > > Sent: Tuesday, January 28, 2014 11:36 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Solr & Nutch
> > > >
> > > > Hi,
> > > >
> > > > Question 1: When Solr can parse HTML and documents like Word, Excel,
> > > > PDF, etc., why do we need Nutch to parse HTML files? What is
> > > > different?
> > > >
> > > > Question 2: When do we use multiple cores in Solr? Any practical
> > > > business case where we need multiple cores?
> > > >
> > > > Question 3: When do we go for cloud? What does implementing
> > > > SolrCloud mean?
> > > >
> > > > --
> > > > Rashmi
> > > > Be the change that you want to see in this world!
> > > > www.minnal.zor.org
> > > > disha.resolve.at
> > > > www.artofliving.org

--
Rashmi
Be the change that you want to see in this world!
www.minnal.zor.org
disha.resolve.at
www.artofliving.org
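[As a footnote to the thread above: the "full URLs from plain text" trick Alexei mentions - which would also recover the onClick targets rashmi describes - can be approximated with a plain regular expression. This is a minimal, hypothetical sketch, not jSoup's or Nutch's actual implementation:]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pull absolute http(s) URLs out of arbitrary markup or script text,
// e.g. onclick="someClickFunction('http://www.someurl.com/whatever')".
public class PlainTextUrlExtractor {

    // Stop at whitespace, quotes, angle brackets and parentheses, which
    // commonly delimit URLs embedded in HTML attributes and JavaScript.
    private static final Pattern URL = Pattern.compile("https?://[^\\s'\"<>()]+");

    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\" onclick=\"someClickFunction('http://www.someurl.com/whatever')\">more</a>";
        System.out.println(extract(html)); // prints [http://www.someurl.com/whatever]
    }
}
```

[As Alexei warns, a scan like this will just as happily "find" truncated or ellipsis-shortened URLs, so extracted links should be validated before being injected into a crawl.]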