Just to clarify: .99 does NOT work fine. It should have rejected most of the 
records when I specified "((Math.random())>=.99)".
 
I have used expressions not involving Math.random. For example, I can extract 
records above a specific score with "score>1.0". But the Math.random expression 
doesn't work with any of the thresholds I have tried.
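
As a point of comparison, here is a minimal sketch in plain Java of what a filter of the form "random >= threshold" would be expected to do. It is not Nutch or Jexl code and does not reproduce CrawlDbReader; the class name, method name, and example URLs are made up for illustration. Each record is kept independently when a uniform draw in [0, 1) is at least the threshold, so roughly a fraction (1 - threshold) of the records should survive: about 1% for .99, about 90% for 0.1, and all of them for 0.0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomSample {
    // Hypothetical helper: keep a record when a uniform draw in [0, 1)
    // is >= threshold, so the expected fraction kept is (1 - threshold).
    static <T> List<T> sample(List<T> records, double threshold, Random rng) {
        List<T> kept = new ArrayList<>();
        for (T record : records) {
            if (rng.nextDouble() >= threshold) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up URLs standing in for crawlDb records.
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            urls.add("http://example.com/page" + i);
        }
        Random rng = new Random(42L); // fixed seed so runs are repeatable
        for (double t : new double[] {0.0, 0.1, 0.99}) {
            int kept = sample(urls, t, rng).size();
            System.out.printf("threshold %.2f kept %d of %d%n", t, kept, urls.size());
        }
    }
}
```

If the readdb expression behaved like this sketch, a threshold of 0.0 could never produce zero records and .99 could never pass everything, which is why the observed output looks like the Math.random call is not being evaluated as intended.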

    On Tuesday, May 1, 2018, 2:00:48 PM PDT, Markus Jelsma 
<markus.jel...@openindex.io> wrote:  
 
 Hello Michael,

I would think this should work as well. But since you mention .99 works fine, 
did you try .1 as well? With >=, that should pass roughly 90% of the records. 
It seems the expressions themselves do work at some level, and since this is a 
Jexl-specific thing, you might want to try the Jexl list as well. I could not 
find an online Jexl parser to test this question; one would be really helpful! 

Regards,
Markus

-----Original message-----
> From:Michael Coffey <mcof...@yahoo.com.INVALID>
> Sent: Tuesday 1st May 2018 22:47
> To: User <user@nutch.apache.org>
> Subject: random sampling of crawlDb urls
> 
> I want to extract a random sample of URLs from my big crawldb. I think I 
> should be able to do this using readdb -dump with a Jexl expression, but I 
> haven't been able to get it to work.
> 
> I have tried several variations of the following command.
> $NUTCH_HOME/runtime/deploy/bin/nutch readdb /crawls/pop2/data/crawldb -dump 
> /crawls/pop2/data/crawldb/pruned/current -format crawldb -expr 
> "((Math.random())>=0.1)"
> 
> 
> Typically, it produces zero records. I know the expression is getting through 
> to the CrawlDbReader (without quotes) because I get this message:
> 18/05/01 13:22:48 INFO crawl.CrawlDbReader: CrawlDb db: expr: 
> ((Math.random())>=0.1)
> 
> Even when I use the expression "((Math.random())>=0.0)" I get zero output 
> records.
> 
> If I use the expression "((Math.random())>=.99)" it lets all records pass 
> through to the output. I guess it has something to do with the lack of a 
> leading zero on the numeric constant.
> 
> Does anyone know a good way to extract a random sample of records from a 
> crawlDb?
>   
