It looks like the InjectorJob phase successfully injected your 1 URL into
the Cassandra keyspace.

On Thu, Jun 5, 2014 at 12:14 PM, Manikandan Saravanan <
[email protected]> wrote:

>
> 14/06/05 15:01:02 INFO mapred.JobClient:     Map input records=1
>
> ...

> 14/06/05 15:01:02 INFO mapred.JobClient:     Map output records=1
> 14/06/05 15:01:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=110
> 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of
> urls rejected by filters: 0
> 14/06/05 15:01:02 INFO crawl.InjectorJob: InjectorJob: total number of
> urls injected after normalization and filtering: 1
> 14/06/05 15:01:02 INFO crawl.InjectorJob: Injector: finished at 2014-06-05
> 15:01:02, elapsed: 00:00:28
>

So that looks fine. I would advise you to read the database dump after
injecting.


> Thu Jun 5 15:01:02 EDT 2014 : Iteration 1 of 2
>

What does this mean? Did you manually edit this? I have never seen this
logging before.


>
>
> 14/06/05 15:02:14 INFO mapred.JobClient:     Map input records=0
>

If the URL has already been fetched, then a fetchmark will not exist for it
to be re-fetched. Could this perhaps be the case?

It seems that you have been tinkering with crawl cycles without first
understanding the individual steps of the crawl cycle itself. If you are just
starting out, I really advise you to use the nutch script with individual
commands. Reading the database dump is an essential step in a young crawl
cycle.
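
As a rough sketch of what I mean by individual commands, one cycle could be
stepped through as below. This assumes Nutch 2.x run from the runtime
directory, a seed directory called "urls" and an output directory called
"dump_out" (both names are just placeholders). Options differ between
releases, so check the usage message each command prints before relying on
it. By default the script only prints the commands so you can run them one
at a time.

```shell
#!/bin/sh
# Sketch only: one Nutch 2.x crawl cycle, step by step.
# Assumptions: NUTCH points at the nutch script; the seed and dump
# directory names are placeholders; add -crawlId <id> to every command
# if you injected with a crawl id.
NUTCH="${NUTCH:-bin/nutch}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, do not run them

step() {
  # Print the command; execute it only when DRY_RUN is not 1.
  echo "$NUTCH $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$NUTCH" "$@"
  fi
}

crawl_cycle() {
  seed_dir="$1"
  dump_dir="$2"
  step inject "$seed_dir"          # seed the web table
  step readdb -dump "$dump_dir"    # read the dump right after injecting
  step generate -topN 10           # mark URLs for fetching
  step fetch -all                  # fetch the generated batch
  step parse -all                  # parse what was fetched
  step updatedb                    # some releases expect -all or a batch id
  step readdb -stats               # check the status counts afterwards
}

crawl_cycle urls dump_out
```

Set DRY_RUN=0 to actually execute the steps, and look at the dump between
steps to see which marks each URL carries.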
Lewis
