Re: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

2018-09-28 Thread Jorge Betancourt
orn@mail.mil> wrote: > Please remove me from this list > > -Original Message- > From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] > Sent: Friday, September 28, 2018 2:25 AM > To: user@nutch.apache.org > Subject: [Non-DoD Source] Re: Include parent URL

Re: Include parent URL in pdf data - nutch

2018-09-28 Thread Jorge Betancourt
will allow you to index all the outlinks of a given URL. So if A is the parent URL of B (pdf file), then you should be able to find the B URL in the outlinks of A. This is basically inverting the problem: instead of looking for the parent of B, you would be looking for any URL that has B as an ou
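
The inverted lookup described in this message can be sketched roughly as follows. This is a minimal Python illustration over plain dictionaries, not Nutch or Solr code; the field names `url` and `outlinks` and the example.com URLs are assumptions made up for the example.

```python
def find_parents(docs, target_url):
    """Return the URLs of all documents that list target_url among their outlinks."""
    return [doc["url"] for doc in docs if target_url in doc.get("outlinks", [])]


# Hypothetical indexed documents: A links to B (a PDF), C links elsewhere.
docs = [
    {"url": "http://example.com/a.html", "outlinks": ["http://example.com/b.pdf"]},
    {"url": "http://example.com/c.html", "outlinks": ["http://example.com/d.html"]},
    {"url": "http://example.com/b.pdf", "outlinks": []},
]

parents = find_parents(docs, "http://example.com/b.pdf")
```

A page can be linked from several places, so the lookup returns a list rather than a single parent.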

RE: [Non-DoD Source] Re: Include parent URL in pdf data - nutch

2018-09-28 Thread Musshorn, Kris T CTR USARMY CECOM (US)
Please remove me from this list -Original Message- From: Sebastian Nagel [mailto:wastl.na...@googlemail.com.INVALID] Sent: Friday, September 28, 2018 2:25 AM To: user@nutch.apache.org Subject: [Non-DoD Source] Re: Include parent URL in pdf data - nutch All active links contained in this

Re: Include parent URL in pdf data - nutch

2018-09-28 Thread UMA MAHESWAR
Hi Sir, by parent URL I mean the page the PDF document is linked from. In other words, the name of the website where the PDF is present. Example: I am crawling multiple PDFs from multiple websites. I just wanted to index the respective website name along with each pdf crawled

Re: Include parent URL in pdf data - nutch

2018-09-27 Thread Sebastian Nagel
Hi, could you explain in detail what is meant by "parent URL"? - the page the PDF document is linked from - a redirect pointing to the PDF doc - the "directory" of the PDF URL (clip URL after last "/") - ... Nutch indexes all successfully fetched pages but not r

Include parent URL in pdf data - nutch

2018-09-27 Thread UMA MAHESWAR
I am using nutch1.x for website crawling and indexing in solr(5.5.0). I am trying to include the parent URL along with pdf data. Can someone please suggest a way to do it? Thanks in advance for your comments and suggestions -- Sent from: http://lucene.472066.n3.nabble.com/Nutch-User

Re: Need to index Parent URL also

2016-11-29 Thread Sebastian Nagel
Hi, great to hear. Possibly you want to add a check whether the parent URL is already set, to avoid it being overwritten if another page links to the same target. But that depends on your crawling setup and structure of the crawled sites. And of course, in real web crawling there may be
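
The check suggested in this message amounts to "first parent wins". A minimal sketch of that logic in Python, assuming the metadata behaves like a key/value map; the key name `parentUrl`, the helper name, and the URLs are hypothetical, and an actual Nutch plugin would be written in Java against the CrawlDatum metadata instead.

```python
PARENT_KEY = "parentUrl"  # hypothetical metadata key, not a Nutch constant


def set_parent_once(metadata, from_url):
    """Store from_url as the parent only if no parent has been recorded yet."""
    metadata.setdefault(PARENT_KEY, from_url)
    return metadata


meta = {}
set_parent_once(meta, "http://example.com/a.html")      # first linking page is kept
set_parent_once(meta, "http://example.com/other.html")  # later linking page is ignored
```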

Re: Need to index Parent URL also

2016-11-29 Thread AshokRaj.Lourdusamy
tian Nagel Sent: Monday, November 28, 2016 1:09 AM To: user@nutch.apache.org Subject: Re: Need to index Parent URL also Hi, have a look at the scoring filter interface, esp. the plugin scoring-depth. In the method distributeScoreToOutlinks the fromUrl is at hand and it's no big deal to add

Re: Need to index Parent URL also

2016-11-27 Thread Sebastian Nagel
's CrawlDatum "datum". Just modify an existing plugin or implement your own. To finally index the parent URL, add the metadata key which holds the parent/from URL to the property index.db.md: index.db.md Comma-separated list of keys to be taken from the crawldb met

Need to index Parent URL also

2016-11-27 Thread AshokRaj.Lourdusamy
Hi, While nutch1.x is indexing in solr (or Elasticsearch) I need to include the immediate parent URL too. There is no clear help online on where to do this. I don't need the hierarchy till seed url, but just the immediate parent of current parsing document. Someone suggested to do

Re: [MASSMAIL]Parent URL

2015-07-02 Thread Jorge Luis Betancourt González
hani Chaushu" To: user@nutch.apache.org Sent: Thursday, July 2, 2015 4:01:15 AM Subject: [MASSMAIL]Parent URL Hi, I'm using Nutch 1.9 with Solr 4.10. Is there any way to see in Solr, for each page, the parent/root page it came from? Thanks, Shani --

Re: Parent URL

2015-07-02 Thread Julien Nioche
Hi Shani Tracking the seed URL which led to a given page is easy : you can add a custom metadata to the seeds being the seed URL itself e.g. *http://www.guardian.co.uk seed=http://www.guardian.co.uk * then specify 'seed' as a value for the co
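
The seed annotation in this message can be generated mechanically. A minimal Python sketch, assuming a seed list in which each line holds the URL followed by tab-separated `name=value` metadata; the tab separator and the helper name `seed_line` are assumptions (the archived snippet shows only whitespace), so check the seed format of your Nutch version.

```python
def seed_line(url):
    """Build one seed-list line: the URL, a tab, then seed=<url> as custom metadata."""
    return f"{url}\tseed={url}"


seeds = ["http://www.guardian.co.uk"]
lines = [seed_line(u) for u in seeds]
```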

Parent URL

2015-07-02 Thread Chaushu, Shani
Hi, I'm using Nutch 1.9 with Solr 4.10. Is there any way to see in Solr, for each page, the parent/root page it came from? Thanks, Shani - Intel Electronics Ltd. This e-mail and any attachments may contain confidential material

Re: Prevent crawl of parent URL

2013-08-13 Thread feng lu
e the parent directories: > > deleteByQuery( "-anchor:[* TO *]" ) > > However, I would prefer to delete these with Nutch if possible. Any > suggestions would be appreciated. > > Regards. > > > > -- > View this message in context: > http://lucene.47206

Re: Prevent crawl of parent URL

2013-08-13 Thread stone2dbone
. Regards. -- View this message in context: http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4084252.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Prevent crawl of parent URL

2013-08-12 Thread stone2dbone
://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4084057.html Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Prevent crawl of parent URL

2013-08-08 Thread feng lu
> this skips all the subdirectories as well. Is there another solution, or > possibly some way to go back and remove all the parent directories after > the > crawl? > > Thanks for your help. > > > > -- > View this message in context: > http://lucene.472066.n3.

RE: Prevent crawl of parent URL

2013-08-08 Thread stone2dbone
your help. -- View this message in context: http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4083287.html Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Prevent crawl of parent URL

2013-07-25 Thread Markus Jelsma
-Original message- > From:stone2dbone > Sent: Wednesday 24th July 2013 18:25 > To: user@nutch.apache.org > Subject: RE: Prevent crawl of parent URL > > Thanks Markus. I will give this a try. I did refilter the crawldb. One more > question: > > I'm not

RE: Prevent crawl of parent URL

2013-07-24 Thread stone2dbone
ntext: http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4080111.html Sent from the Nutch - User mailing list archive at Nabble.com.

RE: Prevent crawl of parent URL

2013-07-24 Thread Markus Jelsma
Hi -Original message- > From:stone2dbone > Sent: Wednesday 24th July 2013 14:56 > To: user@nutch.apache.org > Subject: Prevent crawl of parent URL > > I would like to crawl everything in > > http://my.domain.name/dir/subdir > > but nothing in its parent
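
The goal stated in the quoted question (accept everything under http://my.domain.name/dir/subdir, reject its parents) can be modeled as an ordered list of accept/reject patterns. The Python sketch below only illustrates that first-match-wins logic; it is not Nutch's URL filter implementation, and the rule list and function name are assumptions for the example.

```python
import re

# Ordered rules: (accept?, pattern). The first pattern that matches decides.
RULES = [
    # Accept the subdirectory itself and anything below it.
    (True, re.compile(r"^http://my\.domain\.name/dir/subdir(/|$)")),
    # Reject everything else, including the parent directories.
    (False, re.compile(r".")),
]


def accept(url):
    """Return True if the first matching rule accepts the URL, else False."""
    for allowed, pattern in RULES:
        if pattern.search(url):
            return allowed
    return False  # no rule matched
```

With these rules, `http://my.domain.name/dir/subdir/page.html` passes while `http://my.domain.name/dir/` does not. Parent pages that were already fetched or indexed are not removed by such a filter alone.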

Prevent crawl of parent URL

2013-07-24 Thread stone2dbone
URLs. Any suggestions how to correct this behavior? -- View this message in context: http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032.html Sent from the Nutch - User mailing list archive at Nabble.com.