+1
On 3/7/18, 3:00 PM, "lewis john mcgibbney" wrote:
Patch it Markus.
On Wed, Mar 7, 2018 at 1:58 PM, wrote:
>
> From: Markus Jelsma
> To: User
> Cc:
> Bcc:
> Date: Wed, 7 Mar 2018 11:24:09 +
> Subject: index-metadata, lowercasing field names?
> Hi,
>
> I've got metadata containing a capital in the field name. But
> index-metadata lowercases its field names.
Yeah, I'm currently learning Java (from scratch) and taking a crash course in
Solr / Hadoop / Pig / Hive and Cloudera after hearing your prior response. The
result of my efforts must be the scraper, the data analysis pipeline (data
munging), and ultimately the refined output to populate a MySQL database
(which
Hello,
Yes, we have used headless browsers with and without Nutch. But I am unsure
which of the mentioned challenges a headless browser is going to help solve,
except for dealing with sites that serve only AJAXed web pages.
Semyon is right, if you really want this, Nutch and Hadoop can be gre
How about using Nutch with a headless browser like CasperJS? Will this
work? Have any of you tried this?
On Tue, Mar 6, 2018 at 1:00 PM Markus Jelsma
wrote:
> Hi,
>
> Yes you are going to need code, and a lot more than just that, probably
> including dropping the 'every two hour' requirement.
>
indexer-solr is failing to de-duplicate URL-encoded URLs. Nutch writes URLs
URL-encoded into Solr; however, SolrIndexWriter.java explicitly decodes them
when deleting, hence failing to match the URL in Solr and therefore failing
to delete them.
In SolrIndexWriter.java, there is a comment:
// WORK
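For illustration, a minimal, hypothetical demo (not the actual SolrIndexWriter
code) of the mismatch: a document indexed under its URL-encoded id can never be
matched by a delete that decodes the key first.

    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;

    public class DecodeMismatchDemo {
      public static void main(String[] args) throws UnsupportedEncodingException {
        // The id as Nutch wrote it into Solr: URL-encoded.
        String indexedId = "http://example.com/a%20b";
        // The key after the explicit decode described above:
        String deleteKey = URLDecoder.decode(indexedId, "UTF-8");
        System.out.println(deleteKey);                   // http://example.com/a b
        System.out.println(deleteKey.equals(indexedId)); // false -> delete misses
      }
    }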
1. Go to https://issues.apache.org/jira/projects/NUTCH
2. Click "Log-In" (upper right corner). Create a user if needed and log in.
3. Click "Create" (in the top banner).
4. Fill in the fields. They are mostly self-explanatory, and those that you
don't understand can probably be ignored. The import
Hi,
I've got metadata containing a capital in the field name. But index-metadata
lowercases its field names:
parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag);
This means index-metadata is useless if your metadata fields contain uppercase
characters. Was this done for a reason?
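For illustration, a minimal standalone sketch (the field name "DC.Creator" is a
hypothetical example) of why a capitalized name no longer matches once the map
is keyed by the lowercased form:

    import java.util.HashMap;
    import java.util.Locale;
    import java.util.Map;

    public class LowercaseKeyDemo {
      public static void main(String[] args) {
        Map<String, String> parseFieldnames = new HashMap<>();
        String metatag = "DC.Creator"; // hypothetical metadata field name
        // The line quoted above: the map is keyed by the lowercased name.
        parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag);
        System.out.println(parseFieldnames.get("DC.Creator")); // null
        System.out.println(parseFieldnames.get("dc.creator")); // DC.Creator
      }
    }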
Yossi, I tried with both the original URL and the newer one, but it didn't
work!
However, for now I have disabled scoring-opic as suggested by Sebastian and
it works.
And I will open a Jira issue, but I am new to the open source world, so can
you please help me with this?
Thanks a lot, Yossi
Yas, just to be sure, you are using the original URL (the one that was in the
ParseResult passed as a parameter to the filter) in the ParseResult constructor,
right?
> -----Original Message-----
> From: Sebastian Nagel
> Sent: 07 March 2018 12:36
> To: user@nutch.apache.org
> Subject: Re: Regardi
Hi,
that needs to be fixed. It's because there is no CrawlDb entry for the
partial documents. It may also happen after NUTCH-2456. Could you open
a Jira issue to address the problem? Thanks!
As a quick work-around:
- either disable scoring-opic while indexing
- or check dbDatum for null in scoring-opic (see the sketch below)
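A rough, hypothetical sketch of the second work-around, assuming the Nutch 1.x
ScoringFilter.indexerScore() signature; the fallback and the OPIC-style scaling
are illustrative assumptions, not the actual plugin code:

    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class NullSafeOpicSketch {
      public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
          CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
        if (dbDatum == null) {
          // Partial document without a CrawlDb entry: fall back to the
          // initial score instead of dereferencing a null dbDatum.
          return initScore;
        }
        return dbDatum.getScore() * initScore; // assumed OPIC-style scaling
      }
    }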
Thanks Yossi, I am now able to parse the data successfully, but I am getting
an error at indexing time.
Below are the Hadoop logs for indexing.
ElasticRestIndexWriter
elastic.rest.host : hostname
elastic.rest.port : port
elastic.rest.index : elastic index command
elastic.rest.max.bulk.docs : el
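For reference, a hypothetical nutch-site.xml fragment for these settings; the
property names come from the listing above, but the values are placeholders and
the exact file/location is an assumption:

    <property>
      <name>elastic.rest.host</name>
      <value>localhost</value>
    </property>
    <property>
      <name>elastic.rest.port</name>
      <value>9200</value>
    </property>
    <property>
      <name>elastic.rest.index</name>
      <value>nutch</value>
    </property>
    <property>
      <name>elastic.rest.max.bulk.docs</name>
      <value>250</value>
    </property>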