Hi Michael,
If you are using 1.14, there is a parameter -sample that allows you to request
a random sample. See https://issues.apache.org/jira/browse/NUTCH-2463.
Yossi.
> -----Original Message-----
> From: Michael Coffey
> Sent: 01 May 2018 23:47
> To: User
> Subject: random sampling
Ah crap, I got it wrong: >0.1 should not return 10% of the records but 90%.
If you could add debugging lines that emit the direct output of Math.random()
and the result of the comparison as well, we might learn more. Maybe Math.random()
is evaluated just once; I have no idea how Jexl works under the hood.
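
To make the direction of the comparison concrete, here is a small stand-alone sketch in plain Python. It is not Jexl or Nutch code: retained_fraction is a hypothetical helper, and the only assumption is that each record gets its own fresh uniform random number in [0, 1). It says nothing about how Jexl actually evaluates Math.random().

```python
import random


def retained_fraction(predicate, n=100_000, seed=0):
    """Return the share of n simulated records for which predicate(r) holds.

    r is drawn independently per record, uniform in [0, 1). This models only
    the comparison, not how Jexl evaluates Math.random() inside Nutch.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    kept = sum(1 for _ in range(n) if predicate(rng.random()))
    return kept / n


# The three comparisons discussed in the thread.
greater_than_01 = retained_fraction(lambda r: r > 0.1)   # expected near 0.90
less_than_01 = retained_fraction(lambda r: r < 0.1)      # expected near 0.10
at_least_099 = retained_fraction(lambda r: r >= 0.99)    # expected near 0.01

print(f"r >  0.1  keeps about {greater_than_01:.3f} of the records")
print(f"r <  0.1  keeps about {less_than_01:.3f} of the records")
print(f"r >= 0.99 keeps about {at_least_099:.3f} of the records")
```

If the random value really were drawn once per record, ">=.99" would keep roughly 1% of the records. Seeing all or none of them kept would be consistent with the guess that the value is evaluated only once.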
Again,
Just to clarify: .99 does NOT work fine. It should have rejected most of the
records when I specified "((Math.random())>=.99)".
I have used expressions not involving Math.random. For example, I can extract
records above a specific score with "score>1.0". But the random thing doesn't
work even
Hello Michael,
I would think this should work as well. But since you mention .99 works fine,
did you try .1 as well to get ~10% output? It seems the expressions themselves do
work at some level, and since this is a Jexl-specific thing, you might want to
try the Jexl list as well. I could not find a
I want to extract a random sample of URLs from my big crawldb. I think I should
be able to do this using readdb -dump with a Jexl expression, but I haven't
been able to get it to work.
I have tried several variations of the following command.
$NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/p
Hi Sebastian,
Yes, that explains it! Now I wish I'd pasted my crawl command in the first
place. I'll leave it alone for now, but if it becomes an issue again I know
where to check. Thank you.
Chip
From: Sebastian Nagel
Sent: Monday, April 30, 2018 4:53:20 PM
Dear Apache Enthusiast,
We are pleased to announce our schedule for ApacheCon North America
2018. ApacheCon will be held September 23-27 at the Montreal Marriott
Chateau Champlain in Montreal, Canada.
Registration is open! The early bird rate of $575 lasts until July 21,
at which time it goe