Re: index-metadata, lowercasing field names?

2018-03-07 Thread Chris Mattmann
+1 On 3/7/18, 3:00 PM, "lewis john mcgibbney" wrote: Patch it Markus. On Wed, Mar 7, 2018 at 1:58 PM, wrote: > > From: Markus Jelsma > To: User > Cc: > Bcc: > Date: Wed, 7 Mar 2018 11:24:09 + > Subject: index-metadata, lowercasing field

Re: index-metadata, lowercasing field names?

2018-03-07 Thread lewis john mcgibbney
Patch it Markus. On Wed, Mar 7, 2018 at 1:58 PM, wrote: > > From: Markus Jelsma > To: User > Cc: > Bcc: > Date: Wed, 7 Mar 2018 11:24:09 + > Subject: index-metadata, lowercasing field names? > Hi, > > I've got metadata, containing a capital in the field name. But > index-metadata lowercase

Re: Need Tutorial on Nutch

2018-03-07 Thread Eric Valencia
Yeah, I'm currently learning Java (from scratch) and taking a crash course in Solr / Hadoop / Pig / Hive and Cloudera after hearing your prior response. The result of my efforts must be the scraper, a data analysis pipeline (data munging), and ultimately refined output to populate a MySQL database (which

RE: Need Tutorial on Nutch

2018-03-07 Thread Markus Jelsma
Hello, Yes, we have used headless browsers with and without Nutch. But I am unsure which of the mentioned challenges a headless browser is going to help solve, except for dealing with sites that serve only AJAXed web pages. Semyon is right, if you really want this, Nutch and Hadoop can be gre

Re: Need Tutorial on Nutch

2018-03-07 Thread Eric Valencia
How about using Nutch with a headless browser like CasperJS? Will this work? Have any of you tried this? On Tue, Mar 6, 2018 at 1:00 PM Markus Jelsma wrote: > Hi, > > Yes you are going to need code, and a lot more than just that, probably > including dropping the 'every two hour' requirement. >

indexer-solr is failing to de-duplicate URL encoded URLs

2018-03-07 Thread Michael Portnoy
indexer-solr is failing to de-duplicate URL-encoded URLs. Nutch writes URLs into Solr in URL-encoded form; however, SolrIndexWriter.java explicitly decodes the key when deleting, hence it fails to match the URL stored in Solr and therefore fails to delete the duplicates. In SolrIndexWriter.java, there is a comment: // WORK
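A minimal, self-contained sketch of the mismatch described above, using only the standard java.net encoder/decoder classes (the example URL is hypothetical and this is not the actual SolrIndexWriter code):

    import java.net.URLDecoder;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class DeleteKeyMismatch {
      public static void main(String[] args) throws Exception {
        String rawUrl = "http://example.com/a page?x=1&y=2";
        // What indexing stores as the document id: the URL-encoded form.
        String storedId = URLEncoder.encode(rawUrl, StandardCharsets.UTF_8.name());
        // What a delete that first decodes the key would query for.
        String deleteKey = URLDecoder.decode(storedId, StandardCharsets.UTF_8.name());
        System.out.println("stored id : " + storedId);
        System.out.println("delete key: " + deleteKey);
        // Prints false: deleteById(deleteKey) cannot match the stored document.
        System.out.println("match     : " + storedId.equals(deleteKey));
      }
    }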

RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
1. Go to https://issues.apache.org/jira/projects/NUTCH
2. Click "Log-In" (upper right corner). Create a user if needed and log in.
3. Click "Create" (in the top banner).
4. Fill in the fields. They are mostly self-explanatory, and those that you don't understand can probably be ignored. The import

index-metadata, lowercasing field names?

2018-03-07 Thread Markus Jelsma
Hi, I've got metadata containing a capital in the field name, but index-metadata lowercases its field names: parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag); This means index-metadata is useless if your metadata fields contain uppercase characters. Was this done for a reason?
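A simplified, hypothetical sketch of the behaviour quoted above and of the case-preserving alternative a patch would amount to (this is not the actual plugin source, and "Author-Name" is an invented field name):

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class MetadataFieldNames {
      public static void main(String[] args) {
        Map<String, String> parseFieldnames = new HashMap<>();
        String metatag = "Author-Name"; // hypothetical configured metadata field

        // Current behaviour: the key is forced to lower case, so the original
        // spelling can never be looked up again.
        parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag);
        System.out.println(parseFieldnames.containsKey("Author-Name")); // false

        // Case-preserving alternative.
        parseFieldnames.clear();
        parseFieldnames.put(metatag, metatag);
        System.out.println(parseFieldnames.containsKey("Author-Name")); // true
      }
    }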

RE: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
Yossi, I tried with both the original URL and the newer one, but it didn't work. However, for now I have disabled scoring-opic as suggested by Sebastian and that works. I will open a Jira issue, but I am new to the open source world, so can you please help me with this? Thanks a lot, Yossi
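For reference, disabling scoring-opic usually amounts to dropping it from the plugin.includes regex in nutch-site.xml; the plugin list below is only an illustration and has to match the local setup:

    <!-- nutch-site.xml: example plugin.includes without scoring-opic -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|indexer-elastic-rest|urlnormalizer-(pass|regex|basic)</value>
    </property>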

RE: Regarding Internal Links

2018-03-07 Thread Yossi Tamari
Yash, just to be sure: you are using the original URL (the one that was in the ParseResult passed as a parameter to the filter) in the ParseResult constructor, right? > -Original Message- > From: Sebastian Nagel > Sent: 07 March 2018 12:36 > To: user@nutch.apache.org > Subject: Re: Regardi
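A hedged sketch of the point being checked here: when a parse filter splits a page into partial documents, the new ParseResult should be anchored on the original URL passed into the filter, with the partial documents added under derived keys. Class and method names follow the Nutch 1.x parse API as far as I recall it, and the "#part-1" key naming is invented, so verify against your Nutch version:

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.parse.ParseText;

    public class SplitParseHelper {

      /** Re-key a parse into a fresh ParseResult anchored on the original URL. */
      public static ParseResult splitParse(String originalUrl, ParseResult incoming,
                                           String partText) {
        Parse original = incoming.get(originalUrl);
        ParseData data = original.getData();

        // Anchor the new result on the URL the CrawlDb actually knows about.
        ParseResult result = new ParseResult(originalUrl);

        // Keep the original document ...
        result.put(new Text(originalUrl), new ParseText(original.getText()), data);

        // ... and add one partial document under a derived key.
        result.put(new Text(originalUrl + "#part-1"), new ParseText(partText), data);
        return result;
      }
    }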

Re: Regarding Internal Links

2018-03-07 Thread Sebastian Nagel
Hi, that needs to be fixed. It's because there is no CrawlDb entry for the partial documents. It may also happen after NUTCH-2456. Could you open a Jira issue to address the problem? Thanks! As a quick work-around:
- either disable scoring-opic while indexing
- or check dbDatum for null in scorin
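A hedged sketch of the second work-around, the null check on dbDatum: the method shape follows OPICScoringFilter.indexerScore in Nutch 1.x as far as I recall it, and scorePower (a configuration field in the real plugin) is passed as a parameter here only to keep the example self-contained:

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class NullSafeIndexerScore {

      public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
                                CrawlDatum fetchDatum, Parse parse, Inlinks inlinks,
                                float initScore, float scorePower) {
        if (dbDatum == null) {
          // Partial document without a CrawlDb entry: fall back to the initial
          // score instead of throwing a NullPointerException.
          return initScore;
        }
        return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
      }
    }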

Re: Regarding Internal Links

2018-03-07 Thread Yash Thenuan Thenuan
Thanks Yossi, I am now able to parse the data successfully, but I am getting an error at indexing time. Below are the Hadoop logs for indexing:
ElasticRestIndexWriter
  elastic.rest.host : hostname
  elastic.rest.port : port
  elastic.rest.index : elastic index command
  elastic.rest.max.bulk.docs : el
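For reference, the configuration keys printed in the log excerpt are normally set in nutch-site.xml; the values below are placeholders for the local setup, not the ones that produced the error:

    <property>
      <name>elastic.rest.host</name>
      <value>localhost</value>
    </property>
    <property>
      <name>elastic.rest.port</name>
      <value>9200</value>
    </property>
    <property>
      <name>elastic.rest.index</name>
      <value>nutch</value>
    </property>
    <property>
      <name>elastic.rest.max.bulk.docs</name>
      <value>250</value>
    </property>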