RE: Arch 1.9.2 is available

2016-09-29 Thread Arkadi.Kosmynin
You are welcome.

> -Original Message-
> From: lewis john mcgibbney [mailto:lewi...@apache.org]
> Sent: Friday, 30 September 2016 2:22 AM
> To: user@nutch.apache.org
> Subject: Re: Arch 1.9.2 is available
> 
> Cool... thanks for posting.
> 
> On Wed, Sep 28, 2016 at 1:36 AM, 
> wrote:
> 
> >
> > user Digest 28 Sep 2016 08:36:56 - Issue 2648
> >
> > Topics (messages 32792 through 32792)
> >
> > Arch 1.9.2 is available
> > 32792 by: Arkadi.Kosmynin.csiro.au
> >
> > -- Forwarded message --
> > From: 
> > To: 
> > Cc:
> > Date: Tue, 27 Sep 2016 07:00:18 +
> > Subject: Arch 1.9.2 is available
> > Hello,
> >
> > I am announcing release of Arch 1.9.2, based on Nutch 1.9.
> >
> > Arch is a free, open source extension of Nutch designed for indexing
> > and searching of intranets. Many features have been added that make
> > this task easier and deliver high precision search results.
> >
> > For details and downloads, please see Arch home page:
> >
> > http://www.atnf.csiro.au/computing/software/arch/
> >
> > You may know that Google Search Appliance is being discontinued. See,
> > for example,
> > http://fortune.com/2016/02/04/google-ends-search-appliance/. If you
> > need a replacement, you may want to try Arch. It is at least comparable to
> GSA in terms of search quality. See more in this article:
> >
> > http://www.atnf.csiro.au/computing/software/arch/ArchVsGSA.pdf
> >
> > Regards,
> >
> > Arkadi Kosmynin
> >
> >
> >
> 
> 
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney


Arch 1.9.2 is available

2016-09-27 Thread Arkadi.Kosmynin
Hello,

I am announcing the release of Arch 1.9.2, based on Nutch 1.9.

Arch is a free, open-source extension of Nutch designed for indexing and 
searching intranets. Many features have been added that make this task easier 
and deliver high-precision search results.

For details and downloads, please see the Arch home page:

http://www.atnf.csiro.au/computing/software/arch/

You may know that the Google Search Appliance is being discontinued. See, for 
example, http://fortune.com/2016/02/04/google-ends-search-appliance/. If you 
need a replacement, you may want to try Arch; it is at least comparable to GSA 
in terms of search quality. For more detail, see this article:

http://www.atnf.csiro.au/computing/software/arch/ArchVsGSA.pdf

Regards,

Arkadi Kosmynin



RE: Bug: redirected URLs lost on indexing stage?

2015-11-05 Thread Arkadi.Kosmynin
Hi Sebastian,

I meant #1, and I used http.redirect.max == 3.

Thanks,
Arkadi
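
(For reference, the behaviour discussed in this thread is controlled by a standard
property; a minimal, illustrative override in conf/nutch-site.xml would look like
this, with 3 being the value used above:)

<!-- Illustrative nutch-site.xml override. With the default of 0, the fetcher
     records the redirect target and fetches it in a later round instead. -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>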

> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Tuesday, 3 November 2015 6:13 PM
> To: user@nutch.apache.org
> Subject: Re: Bug: redirected URLs lost on indexing stage?
> 
> Hi Arkadi,
> 
> > Example: use http://www.atnf.csiro.au/observers/  as seed and set
> > depth to 1. It will be redirected to
> > http://www.atnf.csiro.au/observers/index.html, fetched and parsed
> successfully and then lost. If you set depth to 2, it will get indexed.
> 
> Just to be sure we use the same terminology: What does "depth" mean?
> 1 number of rounds: number of generate-fetch-update cycles when running
> nutch,
>   see command-line help of bin/crawl
> 2 value of property http.redirect.max
> 3 value of property scoring.depth.max (used by plugin scoring-depth)
> 
> If it's about #1 and if http.redirect.max == 0 (the default):
> you need at least two rounds to index a redirected page.
> During the first round the redirect is fetched and the redirect target is
> recorded. The second round will fetch, parse and index the redirect target.
> 
> If http.redirect.max is set to a value > 0, the fetcher will follow redirects
> immediately in the current round. But there are some drawbacks, and that's
> why this isn't the default:
> - no deduplication if multiple pages are redirected
>   to the same target, e.g., an error page.
>   This means you'll spend extra network bandwidth
>   to fetch the same content multiple times.
>   Nutch will keep only one instance of the page anyway.
> - by setting http.redirect.max to a high value you
>   may get lost in round-trip redirects
> - if http.redirect.max is too low, longer redirect
>   chains are cut off. Nutch will not follow these
>   redirects.
> 
> Cheers,
> Sebastian
> 
> 
> On 11/03/2015 01:21 AM, arkadi.kosmy...@csiro.au wrote:
> > Hi Sebastian,
> >
> > Thank you for very quick and detailed response. I've checked again and
> found that redirected URLs get lost if they had been injected in the last
> iteration.
> >
> > Example: use http://www.atnf.csiro.au/observers/  as seed and set depth
> to 1. It will be redirected to http://www.atnf.csiro.au/observers/index.html,
> fetched and parsed successfully and then lost. If you set depth to 2, it will 
> get
> indexed.
> >
> > If you use http://www.atnf.csiro.au/observers/index.html as seed, it will
> be fetched, parsed and indexed successfully even if you set depth to 1.
> >
> >  Regards,
> > Arkadi
> >
> >> -Original Message-
> >> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> >> Sent: Thursday, 29 October 2015 7:23 AM
> >> To: user@nutch.apache.org
> >> Subject: Re: Bug: redirected URLs lost on indexing stage?
> >>
> >> Hi Arkadi,
> >>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124
> >>> applied),
> >>
> >> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max
> >> > 0
> >>
> >>
> >>> fetches target content, parses and saves it, but loses on the indexing
> stage.
> >>
> >> Can you give a concrete example?
> >>
> >> While testing NUTCH-2124, I've verified that redirect targets get indexed.
> >>
> >>
> >>> Therefore, when this condition is checked
> >>>
> >>> if (fetchDatum == null || dbDatum == null|| parseText == null ||
> >>> parseData
> >> == null) {
> >>>   return; // only have inlinks
> >>> }
> >>>
> >>> both sets get ignored because each one is incomplete.
> >>
> >> This code snippet is correct, a redirect is pretty much the same as a
> >> link: the crawler follows it. Ok, there are many differences, but the
> >> central point: a link does not get indexed, but only the link target.
> >> And that's the same for redirects. There are always at least 2 URLs:
> >> - the source or redirect
> >> - and the target of the redirection
> >> Only the latter gets indexed after it has been fetched and it is not
> >> a redirect itself.
> >>
> >> The source has no parseText and parseData, and that's why cannot be
> >> indexed.
> >>
> >> If the target does not make it into the index:
> >> - first, check whether it passes URL filters and is not changed by
> >> normalizers
> >> - was it successfully fetched and parsed?
> >> - not excluded by robots=noindex?
> >>
> >> You should check the CrawlDb and the segments for this URL.
> >>
> >> If you could provide a concrete example, I'm happy to have a detailed
> >> look on it.
> >>
> >> Cheers,
> >> Sebastian
> >>
> >>
> >> On 10/28/2015 08:57 AM, arkadi.kosmy...@csiro.au wrote:
> >>> Hi,
> >>>
> >>> I am using Nutch 1.9 with NUTCH-2124 patch applied. I've put a
> >>> question
> >> mark in the subject because I work with Nutch modification called
> >> Arch (see http://www.atnf.csiro.au/computing/software/arch/). This is
> >> why I am only 99% sure that the same bug would occur in the original
> Nutch 1.9.
> >>>
> >>> In my experience, Nutch follows redirects OK (after NUTCH-2124
> >>> 

RE: Bug: redirected URLs lost on indexing stage?

2015-11-02 Thread Arkadi.Kosmynin
Hi Sebastian,

Thank you for the very quick and detailed response. I've checked again and found 
that redirected URLs get lost if they were injected in the last iteration. 

Example: use http://www.atnf.csiro.au/observers/  as seed and set depth to 1. 
It will be redirected to http://www.atnf.csiro.au/observers/index.html, fetched 
and parsed successfully and then lost. If you set depth to 2, it will get 
indexed.

If you use http://www.atnf.csiro.au/observers/index.html as seed, it will be 
fetched, parsed and indexed successfully even if you set depth to 1.

 Regards,
Arkadi
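
(As Sebastian suggests below, a quick way to see what happened to the redirect
target is to look it up in the CrawlDb and in the segment. The readdb/readseg
commands are standard in Nutch 1.x; the paths and the segment name here are only
examples:)

# What does the CrawlDb record for the target URL?
bin/nutch readdb crawl/crawldb -url http://www.atnf.csiro.au/observers/index.html

# Was it fetched and parsed in the segment? (substitute your own segment directory)
bin/nutch readseg -get crawl/segments/20151102120000 http://www.atnf.csiro.au/observers/index.html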

> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Thursday, 29 October 2015 7:23 AM
> To: user@nutch.apache.org
> Subject: Re: Bug: redirected URLs lost on indexing stage?
> 
> Hi Arkadi,
> 
> > In my experience, Nutch follows redirects OK (after NUTCH-2124
> > applied),
> 
> Yes, 1.9 is affected by NUTCH-2124 / NUTCH-1939 if http.redirect.max > 0
> 
> 
> > fetches target content, parses and saves it, but loses on the indexing 
> > stage.
> 
> Can you give a concrete example?
> 
> While testing NUTCH-2124, I've verified that redirect targets get indexed.
> 
> 
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null|| parseText == null || parseData
> == null) {
> >   return; // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> 
> This code snippet is correct, a redirect is pretty much the same as a link: 
> the
> crawler follows it. Ok, there are many differences, but the central point: a
> link does not get indexed, but only the link target. And that's the same for
> redirects. There are always at least 2 URLs:
> - the source or redirect
> - and the target of the redirection
> Only the latter gets indexed after it has been fetched and it is not a 
> redirect
> itself.
> 
> The source has no parseText and parseData, and that's why it cannot be
> indexed.
> 
> If the target does not make it into the index:
> - first, check whether it passes URL filters and is not changed by normalizers
> - was it successfully fetched and parsed?
> - not excluded by robots=noindex?
> 
> You should check the CrawlDb and the segments for this URL.
> 
> If you could provide a concrete example, I'm happy to have a detailed look
> on it.
> 
> Cheers,
> Sebastian
> 
> 
> On 10/28/2015 08:57 AM, arkadi.kosmy...@csiro.au wrote:
> > Hi,
> >
> > I am using Nutch 1.9 with NUTCH-2124 patch applied. I've put a question
> mark in the subject because I work with Nutch modification called Arch (see
> http://www.atnf.csiro.au/computing/software/arch/). This is why I am only
> 99% sure that the same bug would occur in the original Nutch 1.9.
> >
> > In my experience, Nutch follows redirects OK (after NUTCH-2124
> > applied), fetches target content, parses and saves it, but loses on
> > the indexing stage. This happens because the db datum is being mapped
> > with the original URL as the key, but the fetch and parse data and
> > parse text are being mapped with the final URL in IndexerMapReduce.
> > Therefore, when this condition is checked
> >
> > if (fetchDatum == null || dbDatum == null|| parseText == null || parseData
> == null) {
> >   return; // only have inlinks
> > }
> >
> > both sets get ignored because each one is incomplete.
> >
> > I am going to fix this for Arch, but can't offer a patch for Nutch, sorry. 
> > This is
> because I am not completely sure that this is a bug in Nutch (see above) and
> also because what will work for Arch may not work for Nutch. They are
> different in the use of crawl db.
> >
> > Regards,
> >
> > Arkadi
> >
> >
> >



Bug: redirected URLs lost on indexing stage?

2015-10-28 Thread Arkadi.Kosmynin
Hi,

I am using Nutch 1.9 with the NUTCH-2124 patch applied. I've put a question mark in 
the subject because I work with a Nutch modification called Arch (see 
http://www.atnf.csiro.au/computing/software/arch/). This is why I am only 99% 
sure that the same bug would occur in the original Nutch 1.9.

In my experience, Nutch follows redirects OK (after NUTCH-2124 is applied), 
fetches the target content, parses and saves it, but loses it at the indexing 
stage. This happens because the db datum is keyed by the original URL, while the 
fetch datum, parse data and parse text are keyed by the final URL in 
IndexerMapReduce. Therefore, when this condition is checked

if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
  return; // only have inlinks
}

both sets get ignored because each one is incomplete.

I am going to fix this for Arch, but can't offer a patch for Nutch, sorry. This 
is because I am not completely sure that this is a bug in Nutch (see above), and 
also because what will work for Arch may not work for Nutch: they differ in 
their use of the crawl db.

Regards,

Arkadi




RE: A parser failure on a single document may fail crawling job

2015-07-29 Thread Arkadi.Kosmynin
Hi Sebastian,

> -Original Message-
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Friday, 24 July 2015 6:39 AM
> To: user@nutch.apache.org
> Cc: Kosmynin, Arkadi (CASS, Marsfield) 
> Subject: Re: A parser failure on a single document may fail crawling job
> 
> Hi Arkadi,
> 
> does the problem persist?

Yes.

> Which version of Nutch are you using?

1.9

> Can you point to one file or URL to reproduce it?

To reproduce:

- Remove a jar file that one of your parsers depends on. 
- Make Nutch parse any file using this parser.

This will result in a NoSuchMethodError being thrown and the crawling job failing.

I've created a JIRA issue, NUTCH-2071, and attached a patch. I believe that this 
problem should be handled at the ParseUtil level, because people may use their 
own or third-party parsers and Nutch should be protected from parser problems.

Regards,
Arkadi

> 
> Thanks,
> Sebastian
> 
> On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> > Hi Arkadi,
> >
> > thanks for reporting that. Can you open a Jira ticket [1] to address this 
> > bug?
> >
> > It's rather a bug of the plugin parse-tika and should be solved there,
> > cf. https://issues.apache.org/jira/browse/TIKA-1240
> > A plugin should be able to load all required classes.
> >
> > Thanks,
> > Sebastian
> >
> > [1] https://issues.apache.org/jira/browse/NUTCH
> >
> > 2015-06-23 3:59 GMT+02:00  >:
> >
> > Hi,
> >
> > This is what happened:
> >
> > java.io.IOException: Job failed!
> > at 
> > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> > at
> org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> > <...>
> > Caused by: java.lang.IncompatibleClassChangeError: class
> > org.apache.tika.parser.asm.XHTMLClassVisitor has interface
> org.objectweb.asm.ClassVisitor as
> > super class
> > at java.lang.ClassLoader.defineClass1(Native Method)
> > at 
> > java.lang.ClassLoader.defineClass(ClassLoader.java:760)
> > at
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> > at 
> > java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> > at 
> > java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> > at 
> > java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> > at 
> > java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> > at java.security.AccessController.doPrivileged(Native 
> > Method)
> > at 
> > java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> > at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> > at
> org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
> > at
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
> > at
> > org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
> >
> > Suggested fix in ParseUtil:
> >
> > Replace
> >
> > if (maxParseTime!=-1)
> >parseResult = runParser(parsers[i], content);
> > else
> >parseResult = parsers[i].getParse(content);
> >
> > with
> >
> >   try
> >   {
> > if (maxParseTime!=-1)
> >parseResult = runParser(parsers[i], content);
> > else
> >parseResult = parsers[i].getParse(content);
> >   } catch( Throwable e )
> >   {
> > LOG.warn( "Parsing " + content.getUrl() + " with " +
> parsers[i].getClass().getName() + "
> > failed: " + e.getMessage() ) ;
> > parseResult = null ;
> >   }
> >
> > Also replace
> >
> >   if (maxParseTime!=-1)
> >   parseResult = runParser(p, content);
> >else
> >   parseResult = p.getParse(content);
> >
> > with
> >
> > try
> > {
> >   if (maxParseTime!=-1)
> >   parseResult = runParser(p, content);
> >else
> >   parseResult = p.getParse(content);
> > } catch( Throwable e )
> > {
> >   LOG.warn( "Parsing " + content.getUrl() + " with " +
> p.getClass().getName() + " failed: "
> > + e.getMessage() ) ;
> > }
> >
> > Regards,
> > Arkadi
> >
> >



RE: A parser failure on a single document may fail crawling job

2015-07-23 Thread Arkadi.Kosmynin
Hi Sebastian,

I apologise for the long silence on this issue. I have been out of town and will 
be back on Monday. Then I will do what you are asking within 2-3 days.

Regards,
Arkadi

From: Sebastian Nagel [wastl.na...@googlemail.com]
Sent: Friday, 24 July 2015 6:38 AM
To: user@nutch.apache.org
Cc: Kosmynin, Arkadi (CASS, Marsfield)
Subject: Re: A parser failure on a single document may fail crawling job

Hi Arkadi,

does the problem persist?
Which version of Nutch are you using?
Can you point to one file or URL to reproduce it?

Thanks,
Sebastian

On 06/26/2015 03:26 PM, Sebastian Nagel wrote:
> Hi Arkadi,
>
> thanks for reporting that. Can you open a Jira ticket [1] to address this bug?
>
> It's rather a bug of the plugin parse-tika and should be solved there,
> cf. https://issues.apache.org/jira/browse/TIKA-1240
> A plugin should be able to load all required classes.
>
> Thanks,
> Sebastian
>
> [1] https://issues.apache.org/jira/browse/NUTCH
>
> 2015-06-23 3:59 GMT+02:00  >:
>
> Hi,
>
> This is what happened:
>
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> at 
> org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
> <...>
> Caused by: java.lang.IncompatibleClassChangeError: class
> org.apache.tika.parser.asm.XHTMLClassVisitor has interface 
> org.objectweb.asm.ClassVisitor as
> super class
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
> at 
> java.net.URLClassLoader.access$100(URLClassLoader.java:73)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
> at java.security.AccessController.doPrivileged(Native 
> Method)
> at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:361)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> at 
> org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
> at 
> org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
> at 
> org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)
>
> Suggested fix in ParseUtil:
>
> Replace
>
> if (maxParseTime!=-1)
>parseResult = runParser(parsers[i], content);
> else
>parseResult = parsers[i].getParse(content);
>
> with
>
>   try
>   {
> if (maxParseTime!=-1)
>parseResult = runParser(parsers[i], content);
> else
>parseResult = parsers[i].getParse(content);
>   } catch( Throwable e )
>   {
> LOG.warn( "Parsing " + content.getUrl() + " with " + 
> parsers[i].getClass().getName() + "
> failed: " + e.getMessage() ) ;
> parseResult = null ;
>   }
>
> Also replace
>
>   if (maxParseTime!=-1)
>   parseResult = runParser(p, content);
>else
>   parseResult = p.getParse(content);
>
> with
>
> try
> {
>   if (maxParseTime!=-1)
>   parseResult = runParser(p, content);
>else
>   parseResult = p.getParse(content);
> } catch( Throwable e )
> {
>   LOG.warn( "Parsing " + content.getUrl() + " with " + 
> p.getClass().getName() + " failed: "
> + e.getMessage() ) ;
> }
>
> Regards,
> Arkadi
>
>



A parser failure on a single document may fail crawling job

2015-06-22 Thread Arkadi.Kosmynin
Hi,

This is what happened:

java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:213)
<...>
Caused by: java.lang.IncompatibleClassChangeError: class 
org.apache.tika.parser.asm.XHTMLClassVisitor has interface 
org.objectweb.asm.ClassVisitor as super class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at 
org.apache.tika.parser.asm.ClassParser.parse(ClassParser.java:51)
at 
org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:98)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:103)

Suggested fix in ParseUtil:

Replace

if (maxParseTime != -1)
  parseResult = runParser(parsers[i], content);
else
  parseResult = parsers[i].getParse(content);

with

try {
  if (maxParseTime != -1)
    parseResult = runParser(parsers[i], content);
  else
    parseResult = parsers[i].getParse(content);
} catch (Throwable e) {
  LOG.warn("Parsing " + content.getUrl() + " with "
      + parsers[i].getClass().getName() + " failed: " + e.getMessage());
  parseResult = null;
}

Also replace

if (maxParseTime != -1)
  parseResult = runParser(p, content);
else
  parseResult = p.getParse(content);

with

try {
  if (maxParseTime != -1)
    parseResult = runParser(p, content);
  else
    parseResult = p.getParse(content);
} catch (Throwable e) {
  LOG.warn("Parsing " + content.getUrl() + " with "
      + p.getClass().getName() + " failed: " + e.getMessage());
}
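
(A hypothetical helper that consolidates the two guards above; the name and
placement are illustrative, not the actual patch, and it relies only on the
fields and methods already visible in ParseUtil:)

// Hypothetical helper: run one parser and never let its failure escape.
private ParseResult safeParse(Parser parser, Content content) {
  try {
    return (maxParseTime != -1) ? runParser(parser, content)
                                : parser.getParse(content);
  } catch (Throwable e) {
    // A broken or misconfigured parser should not fail the whole crawling job.
    LOG.warn("Parsing " + content.getUrl() + " with "
        + parser.getClass().getName() + " failed: " + e.getMessage());
    return null;
  }
}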

Regards,
Arkadi


RE: A bug in org.apache.nutch.parse.ParseUtil?

2015-04-20 Thread Arkadi.Kosmynin
Hi Sebastian,

Yes, I considered parseResult.isSuccess(), but the problem is that it returns 
success only if all parses were successful. So, if the first parser succeeds, it 
will break the loop; otherwise all parsers will be used, and I don't think that 
was the idea.

If retaining the ParseStatus of failed parses is important, perhaps a similar 
isAnySuccess() function could help.

Regards,

Arkadi
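
(A rough sketch of what such an isAnySuccess() could look like on ParseResult,
assuming the same iteration style as the existing isSuccess(); this is an
illustration, not existing API:)

// Hypothetical addition to ParseResult: true if at least one parse succeeded.
public boolean isAnySuccess() {
  for (Map.Entry<Text, Parse> entry : this) {
    if (entry.getValue().getData().getStatus().isSuccess()) {
      return true;
    }
  }
  return false;
}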

-Original Message-
From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
Sent: Saturday, 18 April 2015 7:37 AM
To: user@nutch.apache.org
Subject: Re: A bug in org.apache.nutch.parse.ParseUtil?

Hi Arkadi,

agreed that's a bug.

> if ( parseResult != null ) parseResult.filter() ;

parseResult.isSuccess()
  would do the check without modifying the ParseResult

In case the fall-back parsers also fail, it could be useful to return one (the 
first? the last?) failed ParseResult. Luckily the parser places a meaningful 
error message or a minor ParseStatus, which could be used by the caller for 
diagnostics.

Thanks,
Sebastian

On 04/17/2015 06:31 AM, arkadi.kosmy...@csiro.au wrote:
> Hi,
> 
> From reading the code it is clear that it is designed to allow using 
> several parsers to parse a document in a sequence, until it is 
> successfully parsed. In practice, this does not work because these 
> lines
> 
> if (parseResult != null && !parseResult.isEmpty())
>   return parseResult;
> 
> break the loop even if the parsing has failed because parseResult is not 
> empty anyway, it contains a ParseData with ParseStatus.FAILED.
> This is easy to fix, for example, by adding a line before the two lines 
> mentioned above:
> 
> if ( parseResult != null ) parseResult.filter() ;
> 
> This will remove failed ParseData objects from the parseResult and leave it 
> empty if the parsing had been unsuccessful. I believe that this fix is 
> important because it allows use of backup parsers as originally designed and 
> thus increase index completeness.
> 
> Regards,
> Arkadi
> 
> 
> 



A bug in org.apache.nutch.parse.ParseUtil?

2015-04-17 Thread Arkadi.Kosmynin
Hi,

From reading the code it is clear that it is designed to allow using several 
parsers to parse a document in a sequence, until it is successfully parsed. In 
practice, this does not work because these lines

if (parseResult != null && !parseResult.isEmpty())
  return parseResult;

break the loop even if the parsing has failed, because parseResult is not empty 
anyway: it contains a ParseData with ParseStatus.FAILED.
This is easy to fix, for example, by adding a line before the two lines 
mentioned above:

if (parseResult != null) parseResult.filter();

This will remove failed ParseData objects from the parseResult and leave it 
empty if the parsing was unsuccessful. I believe that this fix is important 
because it allows backup parsers to be used as originally designed and thus 
increases index completeness.
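
(Put in context, the fallback loop with the suggested line added would look
roughly like this; simplified and based only on the snippets quoted in this
thread, not on the full ParseUtil source:)

for (int i = 0; i < parsers.length; i++) {
  if (maxParseTime != -1)
    parseResult = runParser(parsers[i], content);
  else
    parseResult = parsers[i].getParse(content);

  // Drop failed ParseData entries so an unsuccessful parse leaves the result
  // empty and the next (backup) parser gets its chance.
  if (parseResult != null) parseResult.filter();

  if (parseResult != null && !parseResult.isEmpty())
    return parseResult;
}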

Regards,
Arkadi




RE: Nutch crawl commands and efficiency

2012-09-03 Thread Arkadi.Kosmynin
Hi,

I can't see from your description what exactly is slow, but I'd suggest making 
sure that Nutch is using the Hadoop native libraries. They make a huge 
difference for some operations.

Regards,

Arkadi

> -Original Message-
> From: george123 [mailto:daniel.tarase...@gmail.com]
> Sent: Tuesday, 28 August 2012 9:59 PM
> To: user@nutch.apache.org
> Subject: Nutch crawl commands and efficiency
> 
> Hi
> as per the below I am running Nutch 1.2 in what I think is local mode.
> http://lucene.472066.n3.nabble.com/nutch-stops-when-my-ssh-connection-
> drops-out-td4001938.html
> 
> I have a large crawl, about eventually 5000 sites that I am using nutch
> to scrape from.
> Right now I have a list of sites, 20 of them, and the total urls within
> those sites will amount to about 200 000 eventually crawled/scraped.
> 
> I have to seed these sites with a range of urls for them to crawl. Some
> are very simple (like domain.com/results.php) others are very
> difficult, some sites have between 1500 and 15000 seed urls just to
> make sure they are crawled properly.
> 
> So my seeds.txt has about 20 000 seed urls in it (but only 20 domains -
> 1 domain has 15000 seed urls).
> 
> I SSH in, navigate to the nutch install, and run *bin/nutch crawl urls
> -dir crawl -depth 1000 -topN 100 -threads 500
> *
> Now, its very very very slow, it has been running for 2 hrs and not
> much is happening. It seems to take about 10 minutes to generate a
> crawl list, then about 30 seconds to crawl that.
> 
> The average website will have a list of anywhere between 10 and 200
> results on each page, that require crawling further to get the listing.
> There is anywhere between 10 and 5000 pages of results so there is a
> bit of a crawl involved.
> 
> I still think its slow, its certainly not even close to the servers
> resources.
> 
> So I think there is either a nutch politeness delay because its trying
> to crawl everything in the seeds.txt first, including the first 15000
> urls from
> 1 site (so not making many requests to kick that off).
> 
> Or its just trying to generate a long list for the 500 threads setting
> that takes 10+ minutes.
> Or trying to go 1000 deep is slowing it down, but I dont think so
> because my crawlfilter is pretty tightly controlled.
> 
> Any ideas on how to speed this up? Am I just needing  to wait for nutch
> to process this original 20 000 list then it speeds up?
> 
> What are some other things I can look at?
> 
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-
> crawl-commands-and-efficiency-tp4003690.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


RE: focused crawl extended with user generated content

2012-06-12 Thread Arkadi.Kosmynin
Hi Magnus

> -Original Message-
> From: Magnús Skúlason [mailto:magg...@gmail.com]
> Sent: Wednesday, 13 June 2012 1:57 AM
> To: nutch-u...@lucene.apache.org
> Subject: focused crawl extended with user generated content
> 
> Hi,
> 
> I am using nutch for a focused crawl vertical search engine, so far I
> am only extracting information to be stored in the index in the crawl
> process. However I would like to allow users to edit and extend the
> content showed on my site. Like adding a better description, adding
> tags and sorting items into categories.
> 
> What would be the best approach to do that? If I simply store the
> additional information in the index what happens next time when a page
> is re indexed? Would the user generated content be overwritten?

If you store your additional information as extra fields that you add to Nutch 
documents before sending them to Solr, yes, this content will be overwritten. 
You can store it separately from your Nutch document, even in the same Solr 
index. Then it will not be overwritten by Nutch, but will be less trivial to 
search and retrieve together with Nutch index entries. 

> If so what would be the best way to prevent that? creating a solr pluggin
> (that would not re index documents that have been modified externally)
> or shhould I maybe store the user generated content in a database
> instead and flash the index with the information from the database
> after each crawl if changed? Something completely different?

Should you decide to add your extra information to Nutch documents, you can do 
it in a Nutch index filter plugin. You will have to add it each time you re-index 
your documents. To do that, you can either maintain it separately in a database 
(including the same Solr index, just with different ids), or get it from the old 
Solr document about to be replaced and copy it to the new document. 

What exactly is optimal to do depends on what you are trying to achieve. 

> Are there already some plugins for nutch or solr to do something like
> this?

AFAIK, there are none that do exactly this, but the index-more plugin will give 
you an example of how to add extra fields. You will also have to extend the Solr 
schema (see schema.xml) and the Nutch-to-Solr mapping (see solrindex-mapping.xml).
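
(Illustrative fragments only, with a made-up field name; the extra field has to
be declared in the Solr schema and mapped in solrindex-mapping.xml:)

<!-- schema.xml: declare the extra field (name and attributes are examples) -->
<field name="usertags" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- solrindex-mapping.xml: map the Nutch document field of the same name to Solr -->
<field dest="usertags" source="usertags"/>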

Regards,

Arkadi
> 
> Any thoughts and / or best practices on this would be greatly
> appreciated :)
> 
> best regards,
> Magnus


RE: Deletion of duplicates fails with org.apache.lucene.search.BooleanQuery$TooManyClauses

2012-01-16 Thread Arkadi.Kosmynin
> 
> hi
> 
> 
> > Hi,
> >
> > I started having this problem recently. For some reason, I did not
> have it
> > before, when working with Nutch 1.4 pre-release code. The stack trace
> > would be:
> >
> >
> >
> org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getS
> pli
> > ts(SolrDeleteDuplicates.java:200) at
> > org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at
> >
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781
> )
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at
> > org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) <...>
> >
> > I wrote "would be" because it happens in my derivative from
> > SolrDeleteDuplicates, but there is almost no difference between them.
> 
> Please make sure and try the original class.

Did this. The bug is confirmed.

> 
> > The
> > stack trace on the Solr side is:
> >
> > SEVERE: org.apache.lucene.search.BooleanQuery$TooManyClauses:
> > maxClauseCount is set to 1024 at
> > org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) at
> > org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127) at
> >
> org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java
> :51
> > ) at
> >
> org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java
> :41
> > ) at
> >
> org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:9
> 5)
> > at
> >
> org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollect
> ing
> > Rewrite.java:38) at
> >
> org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93)
> at
> >
> org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:296
> )
> > <...>
> > This happens because the "get all" query used to get Solr index size
> is
> > "id:[* TO *]" which is a range query. Lucene is trying to expand it
> to a
> > Boolean query and gets as many clauses as there are ids in the index.
> This
> > is too many in a real situation and it throws an exception.
> >
> > Am I missing something? If I am right, to correct this problem,
> change the
> > "get all" (SOLR_GET_ALL_QUERY) query to "*:*", which is the standard
> Solr
> > "get all" query.
> 
> I would think so too. I've no idea why the author of solrdedup choose
> that
> approach. Please file an issue at Jira with a description on how to
> reproduce
> this bad behaviour.

The solution is tested and confirmed too. I tried to create an issue at Jira, 
but my browser timed out: "The server at issues.apache.org is taking too long 
to respond." Will try again later.

> 
> >
> > Regards,
> >
> > Arkadi


Deletion of duplicates fails with org.apache.lucene.search.BooleanQuery$TooManyClauses

2012-01-15 Thread Arkadi.Kosmynin
Hi,

I started having this problem recently. For some reason, I did not have it 
before, when working with Nutch 1.4 pre-release code. The stack trace would be:


org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getSplits(SolrDeleteDuplicates.java:200)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
<...>

I wrote "would be" because it happens in my derivative of SolrDeleteDuplicates, 
but there is almost no difference between them. The stack trace on the Solr side 
is:

SEVERE: org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is 
set to 1024
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
   at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127)
   at 
org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51)
   at 
org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:41)
   at 
org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95)
   at 
org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38)
   at 
org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93)
   at 
org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:296)
<...>
This happens because the "get all" query used to get the Solr index size is 
"id:[* TO *]", which is a range query. Lucene tries to expand it into a Boolean 
query and gets as many clauses as there are ids in the index. This is too many 
in a real index, so it throws an exception.

Am I missing something? If I am right, to correct this problem, change the "get 
all" (SOLR_GET_ALL_QUERY) query to "*:*", which is the standard Solr "get all" 
query.
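
(The change amounts to one line in SolrDeleteDuplicates, sketched here with the
original expression paraphrased rather than quoted:)

// was a range query over ids, roughly  id + ":[* TO *]"
public static final String SOLR_GET_ALL_QUERY = "*:*";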

Regards,

Arkadi




RE: Start crawl from Java without bin/nutch script

2012-01-15 Thread Arkadi.Kosmynin
The path should be C:/server/nutch/urls. I know this is not what you would 
expect from Cygwin, but it works.

Regards,

Arkadi
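
(A minimal sketch of launching the one-shot crawl from Java with a plain Windows
path, as suggested above; the arguments mirror what bin/nutch crawl passes in
Nutch 1.x, and all values are examples to adjust for your setup:)

public class CrawlLauncher {
  public static void main(String[] args) throws Exception {
    // Same arguments the bin/nutch script would hand to the Crawl class.
    org.apache.nutch.crawl.Crawl.main(new String[] {
        "C:/server/nutch/urls",            // seed directory, plain Windows path
        "-dir", "C:/server/nutch/crawl",   // crawl output directory
        "-solr", "http://localhost:8983/solr/",
        "-threads", "1", "-depth", "1", "-topN", "10"
    });
  }
}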

> -Original Message-
> From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> Sent: Monday, 16 January 2012 4:35 AM
> To: user@nutch.apache.org
> Subject: Re: Start crawl from Java without bin/nutch script
> 
> Mmmm, I am not using Nutch on Windows at all, generally don't know too
> much
> about configuring Cygwin and really hope thetreis some more help out
> there.
> 
> The main problem here seems to be that the relative path to
> /cygdrive/c/server/nutch/urls is not being interpreted correctly.
> 
> You mention
> {bq}
> where al files are in C:/server/nutch
> {bq}
> 
> would this not mean that your rootUrlDir should be something like
> /cygdrive/C:/server/nutch/urls???
> 
> HTH
> 
> On Sun, Jan 15, 2012 at 4:15 PM, Max Stricker 
> wrote:
> 
> > Hi Mailinglist,
> >
> > I currently need to start the nutch crawl process from Java, as it
> should
> > be accessible through a WebApp.
> > I fugured out that calling Crawl.main() with the right parameters
> should
> > be the right way, as this is also done
> > by the nutch script.
> > However I get an exception I cannot solve:
> >
> > crawl started in: /cygdrive/c/server/nutch/crawl
> > rootUrlDir = /cygdrive/c/server/nutch/urls
> > threads = 1
> > depth = 1
> > indexer=solr
> > solrUrl=http://localhost:8983/solr/
> > topN = 10
> > Injector: starting at 2012-01-15 16:51:44
> > Injector: crawlDb: /cygdrive/c/server/nutch/crawl/crawldb
> > Injector: urlDir: /cygdrive/c/server/nutch/urls
> > Injector: Converting injected urls to crawl db entries.
> > org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist:
> > file:/cygdrive/c/server/nutch/urls
> >  at
> >
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.jav
> a:232)
> >  at
> >
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java
> :252)
> >  at
> >
> org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.ja
> va:428)
> >  at
> >
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:
> 420)
> >  at
> >
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter
> .java:338)
> >  at org.apache.hadoop.mapreduce.Job.submit(Job.java:960)
> >  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:534)
> >  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:779)
> >  at org.apache.nutch.crawl.Injector.inject(Injector.java:217)
> >  at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
> >  at testapp.MyTest.main(MaxTest.java:33)
> >  at testapp.Main.main(Main.java:26)
> >
> > However /cygdrive/c/server/nutch/urls exists and contains a file
> holding
> > the urls to be crawled.
> > The development environment is Windows 7 where al files are in
> > C:/server/nutch
> > From my App I build a jar, put it into nutch/libs and call it using
> > bin/nutch testapp.Main from within Cygwin.
> > I call it through Cygwin because executing it on Windows throws an
> > Exception because the Injector
> > wants to perform a chmod.
> >
> > Any ideas what is going wrong here?
> > Or is there any other way to start a full nutch cycle from within
> Java?
> > I could not find a dedicated API for that.
> >
> > Regards
> 
> 
> 
> 
> --
> *Lewis*


RE: Drupal Integration with Nutch via CSIRO's Arch ?

2011-12-29 Thread Arkadi.Kosmynin
Hi Nicholas,

Thank you very much for your interest. I have good news for you: we are moving 
our web sites to Drupal and thus will _have_ to integrate Arch with Drupal 
pretty soon. We will probably do it for Joomla and WordPress too, but Drupal 
comes first. So, this is in the plans. I am on leave right now (it is summer in 
Australia) and coming back to work on the 9th of January. 

Regards,

Arkadi

> -Original Message-
> From: niccolo.robe...@gmail.com [mailto:niccolo.robe...@gmail.com] On
> Behalf Of Nicholas Roberts
> Sent: Thursday, 29 December 2011 6:27 PM
> To: user@nutch.apache.org
> Subject: Re: Drupal Integration with Nutch via CSIRO's Arch ?
> 
> Sarnia documented
> http://drupal.org/node/1379476
> 
> On Wed, Dec 28, 2011 at 11:24 PM, Nicholas Roberts <
> nicho...@themediasociety.org> wrote:
> 
> > will add this new module to the top of the list
> > http://drupal.org/project/sarnia
> >
> >
> >
> > On Wed, Dec 28, 2011 at 10:31 PM, Nicholas Roberts <
> > nicho...@themediasociety.org> wrote:
> >
> >> hi Arkadi
> >>
> >> just poking around the website for Arch and am really excited by the
> >> potential
> >>
> >> am wondering if there are possible integration points with Drupal ?
> >>
> >>  thinking possible integration points could be via these Drupal
> contrib
> >> modules;
> >>
> >> Nutch http://drupal.org/project/nutch
> >> Apache Solr Integration http://drupal.org/project/apachesolr
> >> Search API http://drupal.org/project/search_api
> >>
> >>
> >> cheers
> >>
> >> -N
> >>
> >>
> >> On Thu, Dec 22, 2011 at 6:23 PM,  wrote:
> >>
> >>>
> >>>
> >>> http://www.atnf.csiro.au/computing/software/arch/
> >>>
> >>>
> >>>


RE: nutch solr index process to add tag when indexing solr

2011-12-22 Thread Arkadi.Kosmynin
Hi,

This can be done using an index filter. For a source code example see this:

http://www.atnf.csiro.au/computing/software/arch/

Please see class au.csiro.cass.arch.filters.Index.
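
(Not the Arch class itself, just a bare-bones illustration of the IndexingFilter
extension point with made-up names; the interface details vary slightly between
Nutch 1.x releases, so treat this as a sketch:)

// Illustrative indexing filter that tags every document with a fixed source label.
public class SourceTagIndexingFilter implements IndexingFilter {
  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    doc.add("source", "xyz1");   // example field name and value
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }
  public Configuration getConf() { return conf; }
}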

If you are trying to implement a corporate search engine or a search hosting 
service for multiple web sites, it is quite likely that Arch can do everything 
you need. We've just released a version based on Nutch 1.4.

Regards,

Arkadi




> -Original Message-
> From: abhayd [mailto:ajdabhol...@hotmail.com]
> Sent: Friday, 23 December 2011 6:21 AM
> To: nutch-u...@lucene.apache.org
> Subject: nutch solr index process to add tag when indexing solr
> 
> hi
> We use use nuth to crawl site and index data is pushed using sorlindex
> command.
> 
> We have three sites that we crawl using nutch
> http://xyz1.com/,http://xyz2.com/,http://xyz3.com/
> 
> we create one crawldb's for each site.  We use single solr core to
> consolidate three sites, And when we send data from each crawl db to
> solr we
> want to tag each site docs with source info
> 
> so
> doc1|xyz1|
> doc23|xyz2|
> 
> I dont see anyway to do this in nutch. Any help?
> 
> 
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/nutch-
> solr-index-process-to-add-tag-when-indexing-solr-tp3607311p3607311.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


RE: Runaway fetcher threads

2011-12-19 Thread Arkadi.Kosmynin


> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, 20 December 2011 10:08 AM
> To: user@nutch.apache.org
> Subject: Re: Runaway fetcher threads
> 
> Hi,
> 
> > Hi Markus,
> >
> > > -Original Message-
> > > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > > Sent: Monday, 19 December 2011 9:24 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Runaway fetcher threads
> > >
> > > On Monday 19 December 2011 08:32:53 arkadi.kosmy...@csiro.au wrote:
> > > > Hi,
> > > >
> > > > I've observed an interesting phenomenon that is not hard to
> reproduce
> > >
> > > and
> > >
> > > > that I think should not be happening:
> > > >
> > > > If you have N fetcher threads, inject, say, 2xN URLs of VERY
> large
> > >
> > > files
> > >
> > > > plus a few smaller files to fetch and run something that uses
> > > > org.apache.nutch.crawl.Crawl. The big files will take forever to
> > >
> > > download
> > >
> > > > and the threads will be killed. The process then will proceed to
> the
> > > > indexing stage. However, you will see fetcher threads output in
> the
> > >
> > > logs
> > >
> > > > intermixed with the output of the indexer. This shows that they
> were
> > >
> > > not
> > >
> > > > terminated properly (or at all?).
> > >
> > > Hi, what version are you running? Sounds like a old one. Can you
> try
> > > with a more recent version if that is the case?
> >
> > I am using 1.4 latest release.
> 
> Then how can fetcher logs be `intermixed` with indexer logs? Or is this
> a
> local instance where you run multiple local jobs concurrently?

Yes, I am running Nutch in local mode. All output goes to one log file. But in 
this file, fetcher records appear after, or mixed with, the indexer records. 
This is what looks abnormal. By the time the indexer starts, the fetcher call 
must have returned (see the Crawl class). Evidently, some fetcher threads were 
left running.

> 
> I've never seen fetcher and indexer output together in one log or part
> of a
> log (in that case it's running local).
> 
> 
> >
> > > In anyway, if this is about evenly distributing files across fetch
> > > lists, this
> > > cannot be based on file size as it is unknown beforehand. That is
> only
> > > possible when recrawling large files with a modified generator and
> and
> > > updater
> > > that adds the Content-Length field as CrawlDatum metadata.
> >
> > No, this is not related to evenly distributing files across fetch
> lists.
> >
> > > > Regards,
> > > >
> > > > Arkadi
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex


RE: Runaway fetcher threads

2011-12-19 Thread Arkadi.Kosmynin
Hi Markus,

> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Monday, 19 December 2011 9:24 PM
> To: user@nutch.apache.org
> Subject: Re: Runaway fetcher threads
> 
> 
> 
> On Monday 19 December 2011 08:32:53 arkadi.kosmy...@csiro.au wrote:
> > Hi,
> >
> > I've observed an interesting phenomenon that is not hard to reproduce
> and
> > that I think should not be happening:
> >
> > If you have N fetcher threads, inject, say, 2xN URLs of VERY large
> files
> > plus a few smaller files to fetch and run something that uses
> > org.apache.nutch.crawl.Crawl. The big files will take forever to
> download
> > and the threads will be killed. The process then will proceed to the
> > indexing stage. However, you will see fetcher threads output in the
> logs
> > intermixed with the output of the indexer. This shows that they were
> not
> > terminated properly (or at all?).
> 
> Hi, what version are you running? Sounds like a old one. Can you try
> with a more recent version if that is the case?

I am using 1.4 latest release.


> 
> In anyway, if this is about evenly distributing files across fetch
> lists, this
> cannot be based on file size as it is unknown beforehand. That is only
> possible when recrawling large files with a modified generator and and
> updater
> that adds the Content-Length field as CrawlDatum metadata.

No, this is not related to evenly distributing files across fetch lists.

> 
> >
> > Regards,
> >
> > Arkadi
> 
> --
> Markus Jelsma - CTO - Openindex


Runaway fetcher threads

2011-12-18 Thread Arkadi.Kosmynin
Hi,

I've observed an interesting phenomenon that is not hard to reproduce and that 
I think should not be happening:

If you have N fetcher threads, inject, say, 2xN URLs of VERY large files plus a 
few smaller files to fetch, and run something that uses 
org.apache.nutch.crawl.Crawl. The big files will take forever to download and 
the threads will be killed. The process will then proceed to the indexing 
stage. However, you will see fetcher thread output in the logs intermixed with 
the output of the indexer. This shows that they were not terminated properly 
(or at all?).

Regards,

Arkadi


RE: Nutch Hadoop Optimization

2011-12-18 Thread Arkadi.Kosmynin
Hi,

Some info on optimisation that I can share:

1. Make sure that Hadoop can use the native libraries. Nutch comes without 
them, but it is not hard to get them from a Hadoop distribution or to compile 
them. They make a BIG difference. In my tests, a job that had been running for 
two days without them took only 5 hours to finish when they were available (a 
quick check for this is sketched below).

2. Apparently, Nutch is faster if the load (the number of URLs to fetch) is 
spread more or less evenly over iterations. This is not surprising, given that 
sorting is involved. As you may know, Arch (an extension of Nutch I am working 
on) allows dividing web sites into areas and processing the areas sequentially, 
one by one. This is convenient when you want to get intranet crawling right and 
do not want to have to reindex everything after fixing a problem local to some 
area, or if you want to configure different refresh intervals for different 
parts of your site. The latest version of Arch (in testing, not released yet) 
also allows processing all areas in parallel, injecting all known links as 
seeds. It turned out that the sequential mode was faster, probably because the 
load was spread much more evenly in sequential processing. It was finishing the 
job on our sites in about 12 hours. When I switched the mode to parallel, it 
took 2 days. I then added the native libraries and the time dropped to 5 hours.

It is possible that, when the native libraries are available, the dependency on 
the load size is not so prominent. If it is still a problem, use the topN 
parameter to limit the number of URLs fetched per iteration and thus spread the 
load over iterations.
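
(The quick check promised in point 1, using the standard Hadoop NativeCodeLoader
message; the paths and platform directory name are only examples:)

# If the Hadoop/Nutch log contains this warning, the native libraries are NOT in use:
#   WARN util.NativeCodeLoader - Unable to load native-hadoop library for your
#   platform... using builtin-java classes where applicable

# One way to make them available (example paths; the platform directory must
# match your system):
cp /path/to/hadoop/lib/native/Linux-amd64-64/* "$NUTCH_HOME/lib/native/Linux-amd64-64/"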

Regards,

Arkadi

  

> -Original Message-
> From: Bai Shen [mailto:baishen.li...@gmail.com]
> Sent: Saturday, 17 December 2011 3:05 AM
> To: user@nutch.apache.org
> Subject: Re: Nutch Hadoop Optimization
> 
> The parse takes under two minutes.
> 
> One of the problems I'm running into is how to make nutch run more
> jobs,
> and how to run that many jobs on the machine without thrashing the hard
> drive.
> 
> On Fri, Dec 16, 2011 at 5:33 AM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
> 
> > It looks like its the parsing of these segments that is taking
> time... no?
> >
> > On Thu, Dec 15, 2011 at 9:57 PM, Bai Shen 
> wrote:
> > > On Thu, Dec 15, 2011 at 12:47 PM, Lewis John Mcgibbney <
> > > lewis.mcgibb...@gmail.com> wrote:
> > >
> > >> This is overwhelmingly weighted towards Hadoop configuration.
> > >>
> > >> There are some guidance notes on the Nutch wiki for performance
> issues
> > >> so you may wish to give them a try first.
> > >> --
> > >>  Lewis
> > >>
> > >
> > > I'm assuming you're referring to this page?
> > > http://wiki.apache.org/nutch/OptimizingCrawls
> > >
> > >
> > > On Thu, Dec 15, 2011 at 2:01 PM, Markus Jelsma
> > > wrote:
> > >
> > >> Well, if performance is low its likely not a Hadoop issue. Hadoop
> > tuning is
> > >> only required if you start pushing it to limits.
> > >>
> > >> I would indeed check the Nutch wiki. There are important settings
> such
> > as
> > >> threads, queues etc that are very important.
> > >>
> > >>
> > > I did end up tweaking some of the hadoop settings, as it looked
> like it
> > was
> > > thrashing the disk due to not spreading out the map tasks.
> > >
> > >
> > > On Thu, Dec 15, 2011 at 3:00 PM, Julien Nioche <
> > > lists.digitalpeb...@gmail.com> wrote:
> > >
> > >>
> > >> Having beefy machines is not going to be very useful for the
> fetching
> > step
> > >> which is IO bound and usually takes most of the time.
> > >> How big is your crawldb?  How long do the generate / parse and
> update
> > steps
> > >> take? Having more than one machine won't make a massive difference
> if
> > your
> > >> crawldb or segments are small.
> > >>
> > >> Julien
> > >>
> > >>
> > > The machines were all I had handy to make the cluster with.
> > >
> > >
> > > I'm looking at the time for a recent job and here's what I'm
> seeing.
> >  This
> > > is with 12k urls queued by domain with a max of 50 urls per domain.
> > > I know why the fetcher takes so long.  Most of the fetcher map jobs
> > finish
> > > in 3-4 minutes, but 1-2 always end up getting stuck on a single
> site and
> > > taking an additional ten minutes to work through the remaining
> urls.  Not
> > > sure how to fix that.
> > > The crawldb had around 1.2 million urls in it when I looked this
> > afternoon.
> > >
> > > nutch-1.4.job SUCCEEDED Thu Dec 15 16:14:30 EST 2011 Thu Dec 15
> 16:14:44
> > > EST 2011generate: select from crawl/crawldb SUCCEEDED Thu Dec 15
> 16:14:45
> > > EST 2011 Thu Dec 15 16:16:17 EST 2011generate: partition
> > > crawl/segments/20111215161618 SUCCEEDED Thu Dec 15 16:16:19 EST
> 2011 Thu
> > > Dec 15 16:16:42 EST 2011fetch crawl/segments/20111215161618
> SUCCEEDED Thu
> > > Dec 15 16:16:44 EST 2011 Thu Dec 15 16:33:29 EST 2011parse
> > > crawl/segments/20111215161618 SUCCEEDED Thu Dec 15 16:33:30 EST
> 2011 Thu
> > > Dec 15 16:35:11 EST 2011crawldb crawl/crawldb

RE: Fetching just some urls outside domain

2011-12-01 Thread Arkadi.Kosmynin
Hi Adriana,

You can try Arch for this:

http://www.atnf.csiro.au/computing/software/arch

You can configure it to crawl your web sites plus sets of miscellaneous URLs 
called "bookmarks" in Arch. Arch is a free extension of Nutch. Right now, only 
Arch based on Nutch 1.2 is available for downloading. We are about to release 
Arch based on Nutch 1.4.

Regards,

Arkadi
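
(For the specific case in the quoted message below, plain URL filter rules can
also get quite far. An illustrative conf/regex-urlfilter.txt fragment, where the
patterns are examples only and the first matching rule wins:)

# let through any URL that mentions the documents of interest, even off-domain
+alb[oi]
# otherwise stay inside the target domain
+^https?://([a-z0-9.-]+\.)?gov\.it/
# skip everything else
-.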



> -Original Message-
> From: Adriana Farina [mailto:adriana.farin...@gmail.com]
> Sent: Thursday, 1 December 2011 7:58 PM
> To: user@nutch.apache.org
> Subject: Re: Fetching just some urls outside domain
> 
> Hi!
> 
> Thank you for your answer. You're right, maybe an example would explain
> better what I need to do.
> 
> I have to perform the following task. I have to explore a specific
> domain (.
> gov.it) and I have an initial set of seeds, for example www.aaa.it,
> www.bbb.gov.it, www.ccc.it. I configured nutch so that it doesn't fetch
> pages outside that domain. However some resources I need to download
> (documents) are stored on web sites that are not inside the domain I'm
> interested in.
> For example: www.aaa.it/subfolder/albi redirects to www.somesite.it
> (where
> www.somesite.it is not inside "my" domain). Nutch will not fetch that
> page
> since I told it to behave that way, but I need to download documents
> stored
> on www.somesite.it. So I need nutch to go outside the domain I
> specified
> only when it sees the words "albi" or "albo" inside the url, since that
> words identify the documents I need. How can I do this?
> 
> I hope I've been clear. :)
> 
> 
> 
> 2011/11/30 Lewis John Mcgibbney 
> 
> > Hi Adriana,
> >
> > This should be achievable through fine grained URL filters. It is
> kindof
> > hard to substantiate on this without you providing some examples of
> the
> > type of stuff you're trying to do!
> >
> > Lewis
> >
> > On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <
> > adriana.farin...@gmail.com
> > > wrote:
> >
> > > Hello,
> > >
> > > I'm using nutch 1.3 from just a month, so I'm not an expert. I
> configured
> > > it so that it doesn't fetch pages outside a specific domain.
> However now
> > I
> > > need to let it fetch pages outside the domain I choosed but only
> for some
> > > urls (not for all the urls I have to crawl). How can I do this? I
> have to
> > > write a new plugin?
> > >
> > > Thanks.
> > >
> >
> >
> >
> > --
> > *Lewis*
> >


RE: Nutch and Sharepoint authentication

2011-11-29 Thread Arkadi.Kosmynin
Hi Lewis,

Thank you for the nice invitation. I don't consider myself an expert in the 
area, but I have added a small section on troubleshooting which hopefully will 
help people to pinpoint and fix their problems quicker. Please feel free to add 
more or correct my text.

Regards,

Arkadi 

> -Original Message-
> From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> Sent: Saturday, 26 November 2011 12:35 AM
> To: user@nutch.apache.org
> Subject: Re: Nutch and Sharepoint authentication
> 
> Yes thanks for the feedback Arkadi.
> 
> I know this is possibly outside the scope of your work, but it would be
> really great if you could add some of your experience to
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
> 
> This is an area which has been unclear for some users for sometime, if
> you
> are happy with your working implementation, your thoughts would be
> extremely appreciated from the rest of the community.
> 
> Thank you, and glad to hear that things are working.
> 
> On Fri, Nov 25, 2011 at 7:16 AM,  wrote:
> 
> > Hi Lewis,
> >
> > I am saying that my configuration works with our SharePoint server.
> The
> > authentication scheme is NTLM. Two versions of Nutch are working: a
> > snapshot of Nutch 1.4 in my development and Nutch 1.2 that is being
> used in
> > production.
> >
> > I have to admit that it took some tweaking to get authentication
> working.
> >
> > Regards,
> >
> > Arkadi
> >
> > > -Original Message-
> > > From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> > > Sent: Thursday, 24 November 2011 10:29 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Nutch and Sharepoint authentication
> > >
> > > Hi Arkadi,
> > >
> > > Are you saying that this has been solved and that are successfully
> able
> > > to
> > > crawl the server?
> > >
> > > Thanks
> > >
> > > On Thu, Nov 24, 2011 at 12:48 AM,  wrote:
> > >
> > > > Hi,
> > > >
> > > > I am crawling a SharePoint server, no major problems. I do have
> to
> > > use
> > > > protocol-httpclient for this. Here is an extract from my
> > > > httpclient-auth.xml file, if it helps:
> > > >
> > > > 
> > > >  
> > > >
> > > >  
> > > > 
> > > >
> > > > Regards,
> > > >
> > > > Arkadi
> > > >
> > > > > -Original Message-
> > > > > From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> > > > > Sent: Tuesday, 22 November 2011 9:43 PM
> > > > > To: user@nutch.apache.org
> > > > > Subject: Re: Nutch and Sharepoint authentication
> > > > >
> > > > > Hi,
> > > > >
> > > > > From what I have read on the Nutch user@ archives [1] it is
> > > possible to
> > > > > crawl a MS Sharepoint server which includes setting up NTLM
> > > > > authentication
> > > > > for your crawler. It is becoming a pretty major problem now the
> the
> > > > > protocol-httpclient plugin is unstable, there are Jira issues
> open
> > > for
> > > > > this.
> > > > >
> > > > > Unfortunately as Manifold CF is in incubation status, it can
> only
> > > be
> > > > > expected that they might have not completed all documentation
> yet,
> > > > > however
> > > > > I advise you to try there as well, as them about the Sharepoint
> > > > > configuration/documentation if it is not possible for you to
> work
> > > with
> > > > > Nutch protocol-httpclient.
> > > > >
> > > > > hth
> > > > >
> > > > > [1]
> > > > > http://www.mail-
> > > > > archive.com/search?q=sharepoint&l=user%40nutch.apache.org
> > > > >
> > > > > On Tue, Nov 22, 2011 at 5:27 AM, remi tassing
> > > 
> > > > > wrote:
> > > > >
> > > > > > Hello guys,
> > > > > >
> > > > > > I read the wiki on
> > > > > > "HttpAuthenticationSchemes<
> > > > > > http://wiki.apache.org/nutch/HttpAuthenticationSchemes>".
> > > > > > I previously managed to make Nutch crawl local folders and
> > > websites
> > > > > (with
> > > > > > SSL authentication). However, I'm trying to crawl some sites
> in a
> > > > > corporate
> > > > > > intranet environment running under MS Sharepoint. I was
> > > unsucceful so
> > > > > far
> > > > > > and I believe it's because of authentication.
> > > > > >
> > > > > >
> > > > > >   - Is Nutch able to crawl Sharepoint? If yes, do you have a
> > > > > link/mail
> > > > > >   tutorial on this?
> > > > > >
> > > > > >
> > > > > > I was recently aware of the ManifoldCF initiative and it
> seems to
> > > be
> > > > > an
> > > > > > eventual solution to my problem. But it's currently poorly
> > > documented
> > > > > (as
> > > > > > far as Sharepoint connector is concerned).
> > > > > >
> > > > > >   - Do you have any recommendation on this regards?
> > > > > >
> > > > > >
> > > > > > Thanks in advance for your help, I'll really appreciate it!
> > > > > >
> > > > > > --
> > > > > > Remi Tassing
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > *Lewis*
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> >
> 
> 
> 
> --
> *Lewis*


RE: Nutch and Sharepoint authentication

2011-11-24 Thread Arkadi.Kosmynin
Hi Lewis,

I am saying that my configuration works with our SharePoint server. The 
authentication scheme is NTLM. Two versions of Nutch are working: a snapshot of 
Nutch 1.4 in my development environment, and Nutch 1.2 in production.

I have to admit that it took some tweaking to get authentication working. 

Regards,

Arkadi

> -Original Message-
> From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> Sent: Thursday, 24 November 2011 10:29 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch and Sharepoint authentication
> 
> Hi Arkadi,
> 
> Are you saying that this has been solved and that are successfully able
> to
> crawl the server?
> 
> Thanks
> 
> On Thu, Nov 24, 2011 at 12:48 AM,  wrote:
> 
> > Hi,
> >
> > I am crawling a SharePoint server, no major problems. I do have to
> use
> > protocol-httpclient for this. Here is an extract from my
> > httpclient-auth.xml file, if it helps:
> >
> > 
> >  
> >
> >  
> > 
> >
> > Regards,
> >
> > Arkadi
> >
> > > -Original Message-
> > > From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> > > Sent: Tuesday, 22 November 2011 9:43 PM
> > > To: user@nutch.apache.org
> > > Subject: Re: Nutch and Sharepoint authentication
> > >
> > > Hi,
> > >
> > > From what I have read on the Nutch user@ archives [1] it is
> possible to
> > > crawl a MS Sharepoint server which includes setting up NTLM
> > > authentication
> > > for your crawler. It is becoming a pretty major problem now the the
> > > protocol-httpclient plugin is unstable, there are Jira issues open
> for
> > > this.
> > >
> > > Unfortunately as Manifold CF is in incubation status, it can only
> be
> > > expected that they might have not completed all documentation yet,
> > > however
> > > I advise you to try there as well, as them about the Sharepoint
> > > configuration/documentation if it is not possible for you to work
> with
> > > Nutch protocol-httpclient.
> > >
> > > hth
> > >
> > > [1]
> > > http://www.mail-
> > > archive.com/search?q=sharepoint&l=user%40nutch.apache.org
> > >
> > > On Tue, Nov 22, 2011 at 5:27 AM, remi tassing
> 
> > > wrote:
> > >
> > > > Hello guys,
> > > >
> > > > I read the wiki on
> > > > "HttpAuthenticationSchemes<
> > > > http://wiki.apache.org/nutch/HttpAuthenticationSchemes>".
> > > > I previously managed to make Nutch crawl local folders and
> websites
> > > (with
> > > > SSL authentication). However, I'm trying to crawl some sites in a
> > > corporate
> > > > intranet environment running under MS Sharepoint. I was
> unsucceful so
> > > far
> > > > and I believe it's because of authentication.
> > > >
> > > >
> > > >   - Is Nutch able to crawl Sharepoint? If yes, do you have a
> > > link/mail
> > > >   tutorial on this?
> > > >
> > > >
> > > > I was recently aware of the ManifoldCF initiative and it seems to
> be
> > > an
> > > > eventual solution to my problem. But it's currently poorly
> documented
> > > (as
> > > > far as Sharepoint connector is concerned).
> > > >
> > > >   - Do you have any recommendation on this regards?
> > > >
> > > >
> > > > Thanks in advance for your help, I'll really appreciate it!
> > > >
> > > > --
> > > > Remi Tassing
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> >
> 
> 
> 
> --
> *Lewis*


RE: Nutch and Sharepoint authentication

2011-11-23 Thread Arkadi.Kosmynin
Hi,

I am crawling a SharePoint server, no major problems. I do have to use 
protocol-httpclient for this. Here is an extract from my httpclient-auth.xml 
file, if it helps:
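(The XML extract seems to have been stripped by the list archive. Purely as an 
illustration of the general shape - host, realm, username and password below 
are placeholders, not our real settings - an NTLM entry in 
conf/httpclient-auth.xml looks roughly like this; see the 
HttpAuthenticationSchemes wiki page for the exact attributes:

<auth-configuration>
  <credentials username="crawler" password="secret">
    <authscope host="sharepoint.example.com" port="80" realm="EXAMPLEDOMAIN"
               scheme="NTLM"/>
  </credentials>
</auth-configuration>
)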


  

  


Regards,

Arkadi

> -Original Message-
> From: Lewis John Mcgibbney [mailto:lewis.mcgibb...@gmail.com]
> Sent: Tuesday, 22 November 2011 9:43 PM
> To: user@nutch.apache.org
> Subject: Re: Nutch and Sharepoint authentication
> 
> Hi,
> 
> From what I have read on the Nutch user@ archives [1] it is possible to
> crawl a MS Sharepoint server which includes setting up NTLM
> authentication
> for your crawler. It is becoming a pretty major problem now the the
> protocol-httpclient plugin is unstable, there are Jira issues open for
> this.
> 
> Unfortunately as Manifold CF is in incubation status, it can only be
> expected that they might have not completed all documentation yet,
> however
> I advise you to try there as well, as them about the Sharepoint
> configuration/documentation if it is not possible for you to work with
> Nutch protocol-httpclient.
> 
> hth
> 
> [1]
> http://www.mail-
> archive.com/search?q=sharepoint&l=user%40nutch.apache.org
> 
> On Tue, Nov 22, 2011 at 5:27 AM, remi tassing 
> wrote:
> 
> > Hello guys,
> >
> > I read the wiki on
> > "HttpAuthenticationSchemes<
> > http://wiki.apache.org/nutch/HttpAuthenticationSchemes>".
> > I previously managed to make Nutch crawl local folders and websites
> (with
> > SSL authentication). However, I'm trying to crawl some sites in a
> corporate
> > intranet environment running under MS Sharepoint. I was unsucceful so
> far
> > and I believe it's because of authentication.
> >
> >
> >   - Is Nutch able to crawl Sharepoint? If yes, do you have a
> link/mail
> >   tutorial on this?
> >
> >
> > I was recently aware of the ManifoldCF initiative and it seems to be
> an
> > eventual solution to my problem. But it's currently poorly documented
> (as
> > far as Sharepoint connector is concerned).
> >
> >   - Do you have any recommendation on this regards?
> >
> >
> > Thanks in advance for your help, I'll really appreciate it!
> >
> > --
> > Remi Tassing
> >
> 
> 
> 
> --
> *Lewis*


RE: A bug has been fixed in protocol-httpclient

2011-11-08 Thread Arkadi.Kosmynin
Before opening a ticket, I searched JIRA and found that this is the same 
problem as reported in NUTCH-1089, NUTCH-990 and NUTCH-1112; I just found it 
via different symptoms. NUTCH-1089 offers a patch.

> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Tuesday, 8 November 2011 7:19 PM
> To: user@nutch.apache.org
> Subject: Re: A bug has been fixed in protocol-httpclient
> 
> Hi,
> 
> Can you open a Jira ticket and attach a patch file so we can track it?
> 
> Thanks
> 
> > Hi guys,
> >
> > I know that protocol-httpclient is not recommended to use because of
> known
> > problems, but I don't have much choice because I need authentication
> > support, as a few other people do as well, I am sure.
> >
> > I've reported a problem with too aggressive de-duplication recently.
> On the
> > example that I had, I traced that problem to an empty content field.
> > Digging further, I found this in httpclient/HttpResponse.java (lines
> > 126-130):
> >
> > while ((bufferFilled = in.read(buffer, 0, buffer.length)) !=
> -1
> > && totalRead + bufferFilled < contentLength) {
> >   totalRead += bufferFilled;
> >   out.write(buffer, 0, bufferFilled);
> > }
> >
> > This should be changed to
> >
> > while ( ( bufferFilled = in.read( buffer, 0, buffer.length )
> ) !=
> > -1 ) {
> >   int toWrite = totalRead + bufferFilled < contentLength ?
> > totalRead +
> bufferFilled :
> > contentLength - totalRead ; totalRead += bufferFilled;
> >   out.write( buffer, 0, toWrite ) ;
> >   if ( totalRead >= contentLength ) break ;
> > }
> >
> > Else the last read portion quite often is not stored. Obviously, this
> is
> > causing problems, especially in small documents where the last read
> > portion is the only one, and in PDF documents, as well as other
> document
> > types that are sensitive to truncation.
> >
> > This problem explains a large part of false de-duplication cases, as
> well
> > as parsing errors with truncated content symptoms, but it does not
> seem to
> > explain all of them.
> >
> > Regards,
> >
> > Arkadi


A bug has been fixed in protocol-httpclient

2011-11-07 Thread Arkadi.Kosmynin
Hi guys,

I know that protocol-httpclient is not recommended because of known problems, 
but I don't have much choice: I need authentication support, as I am sure a few 
other people do as well.

I recently reported a problem with overly aggressive de-duplication. Using the 
example that I had, I traced that problem to an empty content field. Digging 
further, I found this in httpclient/HttpResponse.java (lines 126-130):

while ((bufferFilled = in.read(buffer, 0, buffer.length)) != -1
&& totalRead + bufferFilled < contentLength) {
  totalRead += bufferFilled;
  out.write(buffer, 0, bufferFilled);
}

This should be changed to

while ( ( bufferFilled = in.read( buffer, 0, buffer.length ) ) != -1 ) {
  int toWrite = totalRead + bufferFilled < contentLength ?
                bufferFilled : contentLength - totalRead ;
  totalRead += bufferFilled;
  out.write( buffer, 0, toWrite ) ;
  if ( totalRead >= contentLength ) break ;
}

Otherwise the last read portion is quite often not stored. Obviously, this 
causes problems, especially in small documents where the last read portion is 
the only one, and in PDF documents and other document types that are sensitive 
to truncation.

This problem explains a large share of the false de-duplication cases, as well 
as parsing errors whose symptoms point to truncated content, but it does not 
seem to explain all of them.

Regards,

Arkadi



RE: De-duplication seems to work too aggressively

2011-11-02 Thread Arkadi.Kosmynin
Hi Markus,

The default one, which, I believe, is MD5. I did not change anything in this 
part.
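For completeness: the algorithm is selected by the db.signature.class property. 
A minimal sketch of what the default amounts to, as far as I can tell from 
nutch-default.xml (the description text is mine):

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.MD5Signature</value>
  <description>Implementation used to compute page signatures for
  de-duplication.</description>
</property>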

Regards,

Arkadi

> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Wednesday, 2 November 2011 6:34 PM
> To: user@nutch.apache.org
> Subject: Re: De-duplication seems to work too aggressively
> 
> Hi
> 
> You didn't mention the signature algorithm you're using.
> 
> Thanks
> 
> > Hi,
> >
> > I stopped using de-duplication in Nutch 0.9-1.2 versions because too
> many
> > URLs were being removed for no apparent reason. I did not report the
> > problem to the list though. I am working with version 1.4 now, tried
> > de-duplication again, and the problem appears to be still there.
> There are
> > significant numbers of URLs being removed when de-duplication is
> applied.
> > I could blame it on duplicated content, but it is hard to believe
> that so
> > much is duplicated. One small site is represented by 1639 URLs in the
> > index, and this number goes down to 1068 after de-duplication is
> done. OK,
> > theoretically, this can happen, but, here is another example. Another
> site
> > has just one (root) page in the index. This entry gets removed by
> > de-duplication. How can this happen? There can be a collision in
> digests,
> > but this is hard to believe, especially given other suspicious
> phenomena.
> >
> > I am not going to use de-duplication anyway, because duplicated
> entries may
> > exist in Arch index for a valid reason (e.g. different owners).
> However,
> > it seems that I have a good case that could help to pinpoint the
> problem,
> > if it indeed exists. If anyone would want to do it, I am happy to
> help.
> >
> > Regards,
> >
> > Arkadi


De-duplication seems to work too aggressively

2011-11-01 Thread Arkadi.Kosmynin
Hi,

I stopped using de-duplication in Nutch 0.9-1.2 versions because too many URLs 
were being removed for no apparent reason. I did not report the problem to the 
list though. I am working with version 1.4 now, tried de-duplication again, and 
the problem appears to be still there. There are significant numbers of URLs 
being removed when de-duplication is applied. I could blame it on duplicated 
content, but it is hard to believe that so much is duplicated. One small site 
is represented by 1639 URLs in the index, and this number goes down to 1068 
after de-duplication is done. OK, theoretically, this can happen, but here is 
another example. Another site has just one (root) page in the index. This entry 
gets removed by de-duplication. How can this happen? There can be a collision 
in digests, but this is hard to believe, especially given other suspicious 
phenomena.

I am not going to use de-duplication anyway, because duplicated entries may 
exist in an Arch index for a valid reason (e.g. different owners). However, it 
seems that I have a good case that could help to pinpoint the problem, if it 
indeed exists. If anyone wants to investigate it, I am happy to help.

Regards,

Arkadi



RE: OutOfMemoryError when indexing into Solr

2011-10-30 Thread Arkadi.Kosmynin
Confirming that this worked. Also, times look interesting: to send 73K 
documents in 1000 doc batches (default) took 16 minutes; to send 73K documents 
in 100 doc batches took 15 minutes 24 seconds.

Regards,

Arkadi

> -Original Message-
> From: arkadi.kosmy...@csiro.au [mailto:arkadi.kosmy...@csiro.au]
> Sent: Friday, 28 October 2011 12:11 PM
> To: user@nutch.apache.org; markus.jel...@openindex.io
> Subject: [ExternalEmail] RE: OutOfMemoryError when indexing into Solr
> 
> Hi Markus,
> 
> > -Original Message-
> > From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> > Sent: Thursday, 27 October 2011 11:33 PM
> > To: user@nutch.apache.org
> > Subject: Re: OutOfMemoryError when indexing into Solr
> >
> > Interesting, how many records and how large are your records?
> 
> There a bit more than 80,000 documents.
> 
> 
>   http.content.limit 15000
> 
> 
> 
>indexer.max.tokens10
> 
> 
> > How did you increase JVM heap size?
> 
> opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -
> XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -
> XX:+CMSClassUnloadingEnabled"
> 
> > Do you have custom indexing filters?
> 
> Yes. They add a few fields to each document. These fields are small,
> within a hundred of bytes per document.
> 
> > Can you decrease the commit.size?
> 
> Yes. Thank you. Good idea. I did not even consider it because, for
> whatever reason, this option was not in my nutch-default.xml. I've put
> it to 100. I hope that Solr commit is not done after sending each
> bunch. Else this would have a very negative impact on performance
> because Solr commits are very expensive.
> 
> 
> > Do you also index large amounts of anchors (without deduplication)
> and pass in a very large linkdb?
> 
> I do index anchors, but don't think that there is anything
> extraordinary about them. As I only index less than 100K pages, my
> linkdb should not be nearly as large as in cases when people index
> millions of documents.
> 
> > The reducer of IndexerMapReduce is a notorious RAM consumer.
> 
> If reducing solr.commit.size helps, it would make sense to decrease the
> default value. Sending small bunches of documents to Solr without
> commits is not that expensive to risk having memory problems.
> 
> Thanks again.
> 
> Regards,
> 
> Arkadi
> 
> 
> >
> > On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
> > > Hi,
> > >
> > > I am working with a Nutch 1.4 snapshot and having a very strange
> > problem
> > > that makes the system run out of memory when indexing into Solr.
> This
> > does
> > > not look like a trivial lack of memory problem that can be solved
> by
> > > giving more memory to the JVM. I've increased the max memory size
> > from 2Gb
> > > to 3Gb, then to 6Gb, but this did not make any difference.
> > >
> > > A log extract is included below.
> > >
> > > Would anyone have any idea of how to fix this problem?
> > >
> > > Thanks,
> > >
> > > Arkadi
> > >
> > >
> > > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000
> documents
> > > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000
> documents
> > > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner -
> job_local_0254
> > > java.lang.OutOfMemoryError: Java heap space
> > >at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > >at java.lang.String.(String.java:215)
> > >at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> > >at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> > >at org.apache.hadoop.io.Text.decode(Text.java:350)
> > >at org.apache.hadoop.io.Text.decode(Text.java:322)
> > >at org.apache.hadoop.io.Text.readString(Text.java:403)
> > >at
> > org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> > >at
> > >
> >
> org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWri
> > tab
> > > leConfigurable.java:54) at
> > >
> >
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeseriali
> > zer
> > > .deserialize(WritableSerialization.java:67) at
> > >
> >
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeseriali
> > zer
> > > .deserialize(WritableSerialization.java:40) at
> > >
> >
> org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:99
> > 1)
> > > at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> > at
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(Red
> > uce
> > > Task.java:241) at
> > >
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTas
> > k.j
> > > ava:237) at
> > >
> >
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
> > 81)
> > > at
> > >
> >
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
> > 50)
> > > at
> >
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) at
> > >
> >
> org.apache.hado

RE: OutOfMemoryError when indexing into Solr

2011-10-27 Thread Arkadi.Kosmynin
Hi Markus,

> -Original Message-
> From: Markus Jelsma [mailto:markus.jel...@openindex.io]
> Sent: Thursday, 27 October 2011 11:33 PM
> To: user@nutch.apache.org
> Subject: Re: OutOfMemoryError when indexing into Solr
> 
> Interesting, how many records and how large are your records?

There are a bit more than 80,000 documents.


  http.content.limit 15000



   indexer.max.tokens10 


> How did you increase JVM heap size?

opts="-XX:+UseConcMarkSweepGC -Xms500m -Xmx6000m -XX:MinHeapFreeRatio=10 
-XX:MaxHeapFreeRatio=30 -XX:MaxPermSize=512m -XX:+CMSClassUnloadingEnabled"

> Do you have custom indexing filters?

Yes. They add a few fields to each document. These fields are small, within a 
hundred bytes per document.

> Can you decrease the commit.size?

Yes. Thank you. Good idea. I did not even consider it because, for whatever 
reason, this option was not in my nutch-default.xml. I have set it to 100. I 
hope that a Solr commit is not done after sending each batch; that would have a 
very negative impact on performance, because Solr commits are very expensive.
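For anyone reading this in the archives, the override amounts to this property 
in nutch-site.xml; 100 is simply the value I am trying, not a recommended 
default:

<property>
  <name>solr.commit.size</name>
  <value>100</value>
  <description>Number of documents buffered before they are sent to Solr in one
  update request.</description>
</property>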
 

> Do you also index large amounts of anchors (without deduplication) and pass 
> in a very large linkdb?

I do index anchors, but don't think that there is anything extraordinary about 
them. As I only index less than 100K pages, my linkdb should not be nearly as 
large as in cases when people index millions of documents.
 
> The reducer of IndexerMapReduce is a notorious RAM consumer.

If reducing solr.commit.size helps, it would make sense to decrease the default 
value. Sending smaller batches of documents to Solr without commits is not so 
expensive that it is worth risking memory problems.

Thanks again.

Regards,

Arkadi


> 
> On Thursday 27 October 2011 05:54:54 arkadi.kosmy...@csiro.au wrote:
> > Hi,
> >
> > I am working with a Nutch 1.4 snapshot and having a very strange
> problem
> > that makes the system run out of memory when indexing into Solr. This
> does
> > not look like a trivial lack of memory problem that can be solved by
> > giving more memory to the JVM. I've increased the max memory size
> from 2Gb
> > to 3Gb, then to 6Gb, but this did not make any difference.
> >
> > A log extract is included below.
> >
> > Would anyone have any idea of how to fix this problem?
> >
> > Thanks,
> >
> > Arkadi
> >
> >
> > 2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
> > 2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
> > 2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
> > java.lang.OutOfMemoryError: Java heap space
> >at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >at java.lang.String.(String.java:215)
> >at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
> >at java.nio.CharBuffer.toString(CharBuffer.java:1157)
> >at org.apache.hadoop.io.Text.decode(Text.java:350)
> >at org.apache.hadoop.io.Text.decode(Text.java:322)
> >at org.apache.hadoop.io.Text.readString(Text.java:403)
> >at
> org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
> >at
> >
> org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWri
> tab
> > leConfigurable.java:54) at
> >
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeseriali
> zer
> > .deserialize(WritableSerialization.java:67) at
> >
> org.apache.hadoop.io.serializer.WritableSerialization$WritableDeseriali
> zer
> > .deserialize(WritableSerialization.java:40) at
> >
> org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:99
> 1)
> > at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
> at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(Red
> uce
> > Task.java:241) at
> >
> org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTas
> k.j
> > ava:237) at
> >
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
> 81)
> > at
> >
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:
> 50)
> > at
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
> > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) at
> >
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216
> )
> > 2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException:
> Job
> > failed!
> 
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350


OutOfMemoryError when indexing into Solr

2011-10-26 Thread Arkadi.Kosmynin
Hi,

I am working with a Nutch 1.4 snapshot and having a very strange problem that 
makes the system run out of memory when indexing into Solr. This does not look 
like a trivial lack of memory problem that can be solved by giving more memory 
to the JVM. I've increased the max memory size from 2Gb to 3Gb, then to 6Gb, 
but this did not make any difference.

A log extract is included below.

Would anyone have any idea of how to fix this problem?

Thanks,

Arkadi


2011-10-27 07:08:22,162 INFO  solr.SolrWriter - Adding 1000 documents
2011-10-27 07:08:42,248 INFO  solr.SolrWriter - Adding 1000 documents
2011-10-27 07:13:54,110 WARN  mapred.LocalJobRunner - job_local_0254
java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOfRange(Arrays.java:3209)
   at java.lang.String.(String.java:215)
   at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
   at java.nio.CharBuffer.toString(CharBuffer.java:1157)
   at org.apache.hadoop.io.Text.decode(Text.java:350)
   at org.apache.hadoop.io.Text.decode(Text.java:322)
   at org.apache.hadoop.io.Text.readString(Text.java:403)
   at org.apache.nutch.parse.ParseText.readFields(ParseText.java:50)
   at 
org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
   at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
   at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
   at 
org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:991)
   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:931)
   at 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:241)
   at 
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:237)
   at 
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:81)
   at 
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:50)
   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
   at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2011-10-27 07:13:54,382 ERROR solr.SolrIndexer - java.io.IOException: Job 
failed!



RE: example of searching Nutch with Lucene

2011-08-17 Thread Arkadi.Kosmynin
Nutch 1.3 indexes into Solr. Solr supports Lucene syntax, with a few minor 
differences:

http://wiki.apache.org/solr/SolrQuerySyntax
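For example, prefix and wildcard queries go straight through the standard 
request handler; the host, port and field names below are only placeholders:

  http://localhost:8983/solr/select?q=title:luc*
  http://localhost:8983/solr/select?q=content:inde*ing&fl=url,title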

If, for any reason, you would rather use an earlier version of Nutch, you can 
have a look at how Lucene syntax support is implemented in Arch, which is an 
extension of Nutch:

http://www.atnf.csiro.au/computing/software/arch/

Regards,

Arkadi

> -Original Message-
> From: acse [mailto:a20047...@yahoo.com]
> Sent: Wednesday, 17 August 2011 7:25 PM
> To: nutch-u...@lucene.apache.org
> Subject: Re: example of searching Nutch with Lucene
> 
> Could anybody find a solution to this problem ? Is there a way to add
> Lucene
> search capabilities to Nutch or to search indexes ,indexed by Nutch,
> with
> Lucene?
> For example, I want to create a PrefixSearch or Wildcard Search with
> Nutch.
> Please help me if anybody knows anything about this topic.
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/example-of-searching-Nutch-with-
> Lucene-tp617702p3261086.html
> Sent from the Nutch - User mailing list archive at Nabble.com.


RE: How to crawl fast a large site

2011-03-06 Thread Arkadi.Kosmynin
Hello Marseld,

I think you should have a look at Arch:

http://www.atnf.csiro.au/computing/software/arch/

Arch is a free, open source extension of Nutch. Among other added features, it 
supports partial recrawls.

Regards,

Arkadi


>-Original Message-
>From: Marseld Dedgjonaj [mailto:marseld.dedgjo...@ikubinfo.com]
>Sent: Saturday, March 05, 2011 3:21 AM
>To: user@nutch.apache.org
>Subject: How to crawl fast a large site
>
>Hello everybody,
>
>I am trying to use nutch for searching within my site.
>
>I have configured an instance of nutch and start it to crawl the whole
>website. Now that all urls of my site are crawled (about 150'000 urls)
>and I
>need to crawl only the newest urls(about 10-20 per hour), a crawl
>process
>with depth = 1 and topN = 50 takes more than 15 hours.
>
>The most consuming time steps are merging segments and indexing.
>
>I need to have the newest urls searchable in my website as soon as
>possible.
>
>
>
>I was trying to configure an other instance  of nutch just to take the
>latest articles.
>
>In this instance I injected 40 urls that changes very often and the new
>articles added to the site will appear in one of this links.(links are:
>Homepage, latest news, etc)
>
>I set "db.fetch.interval.default" to 3600 (1 hr) to crawl my injected
>pages
>and fetch all newest urls .
>
>And clear crawldb,  segments and indexes of this site every 24 hr
>because
>now this urls should been crawled by the main instance.
>
>When a user search will search in both of instances and merge the
>results.
>
>
>
>My problem is:
>
>I need that my second instance to fetch only the injected urls and urls
>founded to the injected pages, but if I run the crawl continually to
>crawl
>fast newest urls, the crawl process crawls every url founded.
>
>
>
>Please any suggestion to make possible that when update crawlDB, to put
>inside only the urls that agree with my requests.
>
>
>
>Any other suggestion will be very valuable to me.
>
>
>
>Thanks in advance and
>
>Best regards,
>
>Marseldi
>
>
>
>
>
>Gjeni
>Punë të Mirë dhe të Mirë për
>Punë... Vizitoni: href="http://www.punaime.al/";>www.punaime.al
>http://www.punaime.al/";>src="http://www.ikub.al/images/punaime.al_small.png"; />


RE: Indexing question - Setting low boost

2011-02-07 Thread Arkadi.Kosmynin
You can exclude documents by returning NULL from an index filter.

Regards,

Arkadi

>-Original Message-
>From: .: Abhishek :. [mailto:ab1s...@gmail.com]
>Sent: Tuesday, February 08, 2011 11:44 AM
>To: markus.jel...@openindex.io; user@nutch.apache.org
>Subject: Re: Indexing question - Setting low boost
>
>Hi folks,
>
> Some help would be appreciated. Thanks a bunch..
>
>Cheers,
>Abi
>
>
>On Mon, Feb 7, 2011 at 10:46 AM, .: Abhishek :. 
>wrote:
>
>> Hi,
>>
>>  Thanks again for your time and patience.
>>
>>  The boost makes sense now. I am kind of not sure how to exclude the
>entire
>> document because there are only two methods,
>>
>>- public NutchDocument filter(NutchDocument doc, Parse parse, Text
>url,
>>CrawlDatum datum, Inlinks inlinks)
>>throws IndexingException
>>- public void addIndexBackendOptions(Configuration conf)
>>
>>
>>  May be should I add nothing in the document and/or return a null??
>>
>> ./Abi
>>
>>
>> On Mon, Feb 7, 2011 at 10:07 AM, Markus Jelsma
>> > wrote:
>>
>>> Hi,
>>>
>>> A high boost depends on the index and query time boosts on other
>fields.
>>> If the
>>> highest boost on a field is N, then N*100 will certainly do the
>trick.
>>>
>>> I haven't studied the LuceneWriter but storing and indexing
>parameters are
>>> very familiar. Storing a field means it can be retrieved along with
>the
>>> document if it's queried. Having it indexed just means it can be
>queried.
>>> But
>>> this is about fields, not on the entire document itself.
>>>
>>> In an indexing filter you want to exclude the entire document.
>>>
>>> Cheers,
>>>
>>> > Hi Markus,
>>> >
>>> >  Thanks for the quick reply.
>>> >
>>> >  Could you tell me a possible a value for the high boost such that
>its
>>> to
>>> > be negated? or Is there a way I can calculate or find that out.
>>> >
>>> >  Also, for the other approach on using indexing filter does the
>("...",
>>> > LuceneWriter.STORE.YES, LuceneWriter.INDEX.NO, conf); does the
>work?
>>> >
>>> > Thanks,
>>> > Abi
>>> >
>>> > On Mon, Feb 7, 2011 at 9:34 AM, Markus Jelsma
>>> wrote:
>>> > > Hi,
>>> > >
>>> > > A negative boost does not exist and a very low boost is still a
>boost.
>>> In
>>> > > queries, you can work around the problem by giving a very high
>boost
>>> do
>>> > > documents that do not match; the negation parameter with a high
>boost
>>> > > will do
>>> > > the trick.
>>> > >
>>> > > If you don't want to index certain documents then you'll need an
>>> indexing
>>> > > filter. That's a different approach.
>>> > >
>>> > > Cheers,
>>> > >
>>> > > > Hi all,
>>> > > >
>>> > > >  I was looking at the following example,
>>> > > >
>>> > > >  http://wiki.apache.org/nutch/WritingPluginExample
>>> > > >
>>> > > >  In the example, the author sets a boost of 5.0f for the
>recommended
>>> > > >  tag.
>>> > > >
>>> > > >  In this same way, can I also set a boost value such that a tag
>or
>>> > >
>>> > > content
>>> > >
>>> > > > is never indexed at all? If so, what would be the boost value?
>On a
>>> > >
>>> > > related
>>> > >
>>> > > > note, what are the default content that are usually(by default)
>>> indexed
>>> > >
>>> > > by
>>> > >
>>> > > > Lucene?
>>> > > >
>>> > > >  Thanks a bunch for all your time and patience. Have a good
>day.
>>> > > >
>>> > > > Cheers,
>>> > > > Abi
>>>
>>
>>


RE: Restarting Tomcat after a crawl.

2011-02-01 Thread Arkadi.Kosmynin
Hi Jonathan,

>-Original Message-
>From: Alexander Aristov [mailto:alexander.aris...@gmail.com]
>Sent: Friday, January 28, 2011 7:37 AM
>To: user@nutch.apache.org; jonathan_ou...@mcafee.com
>Subject: Re: Restarting Tomcat after a crawl.
>
>Hi
>
>Which Nutch version are you using? If you use old Nutch which might work
>stand-along then yes, you need to restart Tomcat in order to initialize
>new
>index otherwise some functionality will not work.
>
>But if you are using a current version and use solr as indexer then you
>don't need to restart Tomcat.
>
>For old Nutch I wrote code (in NutchBean) which reinitialized index and
>eliminated necessity of restarting but it was a year ago.

Working code based on Alexander's idea can be found here:

http://www.atnf.csiro.au/computing/software/arch/

See arch.jsp. If you use the OpenSearch interface, OpenSearchServlet should 
also be modified; see the Arch OpenSearchServlet for an example.

>
>Best Regards
>Alexander Aristov
>
>
>On 27 January 2011 21:29, Jonathan Oulds 
>wrote:
>
>> Hello there,
>>
>> This is my first foray into managing a search engine so please bear
>with
>> me.  I am trying to index all our in house documentation that we
>generate
>> during development.  I am trying Nutch on an Ubuntu server using
>Tomcat to
>> serve the results.
>>
>> In the process of building our indexes I have noticed that I need to
>> restart Tomcat after a crawl for it to properly respond to search
>queries.
>>  Is this the expected behaviour?
>>
>> Jonathan.
>>
>>


RE: Crawling PDF documents

2011-01-11 Thread Arkadi.Kosmynin
Hi Julien,

>-Original Message-
>From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
>Sent: Tuesday, January 11, 2011 8:38 PM
>To: user@nutch.apache.org
>Subject: Re: Crawling PDF documents
>
>Hi Arkadi,
>
>The latest release of Tika (0.8) has indeed some issues with pdf but the
>version in 1.2 (0.7) should be fine. Did you have any specific issues?

Sorry, it was a few months ago; I can't tell now. Generally, I found that Tika 
was, at that stage, not as mature as some of the pre-Tika plugins, so I used 
whatever worked best for each document type.

>Note that the parse-pdf plugin has been removed from the next release of
>Nutch (1.3)

Let's hope that Tika is more mature now. If not, I will have to stick with 1.2 
for a while. We have a lot of PDF docs on our site and I want them indexed.

Regards,

Arkadi

 

>
>J.
>
>On 11 January 2011 03:12,  wrote:
>
>> Hi,
>>
>> >-Original Message-
>> >From: nutch_guy [mailto:adrian.stadelm...@bluewin.ch]
>> >Sent: Tuesday, January 11, 2011 12:17 AM
>> >To: nutch-u...@lucene.apache.org
>> >Subject: RE: Crawling PDF documents
>> >
>> >
>> >
>> >Hi
>> >
>> >Thanks for your answer.
>> >My Problem is stil existing, i can crawl pdf documents
>> >but, there are a lot of pdf documents wich are
>> >not supported.
>>
>> I also had this problem. The parse-pdf plugin uses old pdf libraries.
>Many
>> of the problems will go away if you upgrade it to the new libraries
>and use
>> it (not Tika!) to parse pdf. You can do it yourself or get upgraded
>sources
>> from here:
>>
>> http://www.atnf.csiro.au/computing/software/arch/
>>
>> Regards,
>>
>> Arkadi
>>
>> >
>> >Thank for help
>> >
>> >nutch_guy
>> >
>> >--
>> >View this message in context:
>> >http://lucene.472066.n3.nabble.com/Crawling-PDF-documents-
>> >tp1173626p2226962.html
>> >Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>
>
>
>--
>*
>*Open Source Solutions for Text Engineering
>
>http://digitalpebble.blogspot.com/
>http://www.digitalpebble.com


RE: Crawling PDF documents

2011-01-10 Thread Arkadi.Kosmynin
Hi,

>-Original Message-
>From: nutch_guy [mailto:adrian.stadelm...@bluewin.ch]
>Sent: Tuesday, January 11, 2011 12:17 AM
>To: nutch-u...@lucene.apache.org
>Subject: RE: Crawling PDF documents
>
>
>
>Hi
>
>Thanks for your answer.
>My Problem is stil existing, i can crawl pdf documents
>but, there are a lot of pdf documents wich are
>not supported.

I also had this problem. The parse-pdf plugin uses old PDF libraries. Many of 
the problems will go away if you upgrade it to the new libraries and use it 
(not Tika!) to parse PDF. You can do the upgrade yourself or get upgraded 
sources from here:

http://www.atnf.csiro.au/computing/software/arch/

Regards,

Arkadi

>
>Thank for help
>
>nutch_guy
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Crawling-PDF-documents-
>tp1173626p2226962.html
>Sent from the Nutch - User mailing list archive at Nabble.com.


RE: Problem with tika and MS-Office file - Nucth 1.2

2010-11-18 Thread Arkadi.Kosmynin
Hi German,

I was having the same problem (and others). For an immediate effect, try 
switching the parser to use, e.g.:
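The snippet seems to have been stripped by the list archive. The gist, assuming 
for illustration that the legacy parse-msword plugin is still in your Nutch 1.2 
build and listed in plugin.includes, is a mapping along these lines in 
conf/parse-plugins.xml:

<mimeType name="application/msword">
    <plugin id="parse-msword" />
</mimeType>

<mimeType name="application/x-tika-msoffice">
    <plugin id="parse-msword" />
</mimeType>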






In my experience, this parser is a lot more tolerant. For a long-term solution, 
these problems should be reported to Tika, although I doubt they are not 
already aware of them.

Regards,

Arkadi


>-Original Message-
>From: Germán Biozzoli [mailto:germanbiozz...@gmail.com]
>Sent: Friday, November 19, 2010 1:53 PM
>To: user@nutch.apache.org
>Subject: Problem with tika and MS-Office file - Nucth 1.2
>
>Sorry to repeat this mail, no one using 1.2 nutch has problems
>indexing MS-Office documents? Should I send this problem to tika list?
>
>Regards and thanks
>German
>
>
>-- Forwarded message --
>From: Germán Biozzoli 
>Date: Wed, Nov 17, 2010 at 3:16 PM
>Subject: problem with tika and MS-Office file - Nucth 1.2
>To: user@nutch.apache.org
>
>
>Hi everybody
>
>I'm using Nutch 1.2 to crawl a set of specialized sites. I could parse
>OK html and pdf files, but when it tries to parse doc files, the
>following message appears:
>
>Unable to successfully
>parse content http://xxx of type
>application/x-tika-msoffice
>
>I've tried to follow what is shown here:
>
>http://www.mail-archive.com/user@nutch.apache.org/msg01073.html
>
>But really cannot find a solution. Only if I test the same command,
>nutch returns:
>
>
>r...@tango06:/home/apache-nutch-1.2# bin/nutch
>org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
>Exception in thread "main" org.apache.nutch.parse.ParseException:
>parser not found for contentType=application/x-tika-msoffice
>url=http://ridder.uio.no/wtest2.doc
>       at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
>       at
>org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)
>
>I have at nutch-default.xml the plugin folder in
>
>
> plugin.folders
> /home/apache-nutch-1.2/build/plugins
> Directories where nutch plugins are located.  Each
> element may be a relative or absolute path.  If absolute, it is used
> as is.  If relative, it is searched for on the classpath.
>
>
>The path is ok
>
>and the tika-mimetypes.xml
>
> 
>   
>   Microsoft Word Document
>   
>     offset="2080"/>
>     offset="2080"/>
>     
>     
>     
>     
>     
>     
>     offset="0"/>
>     
>     offset="512"/>
>   
>   
>   
>   
> 
>
>I can't imagine what I'm doing wrong. Somebody could help me?
>Regards and thanks
>German


RE: Excluding javascript files from indexing and search results.

2010-09-29 Thread Arkadi.Kosmynin
Hi Mark,

I am not sure, maybe there is a simpler way, but if you want something to be 
fetched and processed but not indexed, you can write an index filter plugin and 
return null for documents that you don't want in the index. This is relatively 
easy to do; just use the index-basic filter as an example.
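A minimal sketch of such a filter (the class name is made up; the method 
signatures are the standard Nutch 1.x IndexingFilter interface):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Illustration only: javascript files are still fetched and parsed (so their
// outlinks are followed), but they never reach the index.
public class ExcludeJsIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Returning null drops the document from the index.
    if (url.toString().toLowerCase().endsWith(".js")) {
      return null;
    }
    return doc;
  }

  public void addIndexBackendOptions(Configuration conf) {
    // nothing to add: this filter contributes no fields
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}

The plugin is declared at the org.apache.nutch.indexer.IndexingFilter extension 
point in its plugin.xml and added to plugin.includes, exactly as index-basic is.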

Regards,

Arkadi

>-Original Message-
>From: Mark Stephenson [mailto:mstep...@us.ibm.com]
>Sent: Thursday, September 30, 2010 9:29 AM
>To: user@nutch.apache.org
>Subject: Excluding javascript files from indexing and search results.
>
>Hi,
>
>I'm wondering if there's a way to prevent nutch from indexing
>javascript files.  I still would like to fetch and parse javascript
>files to find valuable outlinks, but I don't want them to show up in
>my search results.  Is there a good way to do this?
>
>Thanks a lot,
>Mark


Arch 1.2 has been released

2010-09-17 Thread Arkadi.Kosmynin
Hello,

I am announcing release of Arch 1.2 based on Nutch 1.2. Arch is an extension of 
Nutch. It is designed for indexing and search of intranets. Many features have 
been added that make this task easier and deliver high precision search results.

For details and downloads, please see Arch home page:

http://www.atnf.csiro.au/computing/software/arch/

This version includes a tuning and evaluation module that lets you compare a 
few search engines in blind tests and/or side-by-side. This is a useful thing 
if you want to get an idea of real performance of different search engines or 
trace effects of changes you have made.  You can use it even if you don't use 
Arch. It includes search plugins for Nutch, Arch, Google and Funnelback. It is 
easy to write ones for other engines.

People who use Nutch may want to use the upgraded version of the parse-pdf 
plugin. You can also do the upgrade yourself: it just requires switching to the 
latest libraries (plus a minor change in the sources to fix the imports). I 
highly recommend this upgrade because it fixed a lot of PDF parsing errors for 
us.


Regards,

Arkadi Kosmynin


RE: Embed the Crawl API in my application

2010-08-08 Thread Arkadi.Kosmynin
Hi,

>-Original Message-
>From: Roger Marin [mailto:rsman...@gmail.com]
>Sent: Saturday, August 07, 2010 5:01 AM
>To: user@nutch.apache.org
>Subject: Embed the Crawl API in my application
>

...

>
>The other stuff I need to figure out is if it's possible to
>programmaticaly
>set some of the parameters needed to use the crawler, for instance I
>need to
>programmatically set the values of the urls instead of having a url
>file, or
>a crawl-urlfilter file as well as the properties in the nutch-site.xml,
>because these can be configured dynamically by the application and are
>relative to the application itself
>so I cannot hardcode these properties.

This is done in Arch. You can get the source here 

http://www.atnf.csiro.au/computing/software/arch/

and use it as an example.
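Regarding setting parameters programmatically, the general idea is to build a 
Configuration in code and hand it straight to the Nutch tools. A rough sketch 
only (this is not Arch's actual code, and all paths and values are 
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

// Sketch: configure Nutch from application code (no hand-edited seed file or
// nutch-site.xml) and drive the crawl steps directly.
public class EmbeddedCrawlSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "MyAppCrawler");                        // placeholder
    conf.set("urlfilter.regex.file", "/tmp/myapp/regex-urlfilter.txt"); // placeholder

    // The application writes its seed URLs into /tmp/myapp/seeds beforehand.
    Injector injector = new Injector(conf);
    injector.inject(new Path("/tmp/myapp/crawldb"), new Path("/tmp/myapp/seeds"));

    // Generator, Fetcher, ParseSegment, CrawlDb update and indexing can be
    // driven the same way; the Arch source shows a complete working sequence.
  }
}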

>
>Any help you can give me or documentation you can point me to, will be
>greatly appreciated.
>
>
>Thank you.


RE: mysql

2010-07-19 Thread Arkadi.Kosmynin
Yes, it is an extension of Nutch. You could use its index filter code as an 
example. It does exactly what you want to do. I will send you the source off 
the list.
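In the meantime, the basic pattern is to open the connection once in setConf() 
and reuse it for every document. A rough sketch (this is not the actual Arch 
code; the table, column and connection details are invented, and the MySQL JDBC 
driver is assumed to be on the plugin's classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class MySqlIndexingFilter implements IndexingFilter {

  private Configuration conf;
  private Connection connection;   // shared across all documents of the task

  public void setConf(Configuration conf) {
    this.conf = conf;
    try {
      Class.forName("com.mysql.jdbc.Driver"); // pre-JDBC4 drivers need this
      // Opened once per indexing task, not once per URL; credentials are placeholders.
      connection = DriverManager.getConnection(
          "jdbc:mysql://localhost/mydb", "user", "password");
    } catch (Exception e) {
      throw new RuntimeException("Could not open MySQL connection", e);
    }
  }

  public Configuration getConf() { return conf; }

  public void addIndexBackendOptions(Configuration conf) { }

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    try {
      PreparedStatement st = connection.prepareStatement(
          "SELECT category FROM pages WHERE url = ?");
      st.setString(1, url.toString());
      ResultSet rs = st.executeQuery();
      if (rs.next()) {
        doc.add("category", rs.getString(1));  // illustrative extra field
      }
      rs.close();
      st.close();
    } catch (Exception e) {
      throw new IndexingException(e);
    }
    return doc;
  }
}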

>-Original Message-
>From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
>Sent: Tuesday, July 20, 2010 2:50 PM
>To: user@nutch.apache.org
>Subject: Re: mysql
>
>I don't understand, it is some kind of a extension of nutch which seems
>to have
>nothing to do with my question.  Even if it does, it seems to be a
>overkill...
>
>
>
>
>
>From: "arkadi.kosmy...@csiro.au" 
>To: user@nutch.apache.org
>Sent: Mon, July 19, 2010 9:46:03 PM
>Subject: RE: mysql
>
>Hi Savannah,
>
>>-Original Message-
>>From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
>>Sent: Tuesday, July 20, 2010 2:42 PM
>>To: user@nutch.apache.org
>>Subject: mysql
>>
>>Hi,
>>  I use DriverManager.getConnection to connect to my mysql db and do
>>query in my
>>plugin that extend indexfilter.  It seems that indexfilter is run for
>>every url
>>being indexed.  It means that I have to open and close connection to
>>mysql db
>>each time a url is being indexed, it is unefficient.  Is there a way to
>>open and
>>close connection to mysql db only once for all urls, not for each url?
>
>Yes, there is. See Arch index filter:
>
>http://www.atnf.csiro.au/computing/software/arch/
>
>Regards,
>
>Arkadi
>
>>
>>Thanks.
>>
>>
>>
>
>
>
>


RE: mysql

2010-07-19 Thread Arkadi.Kosmynin
Hi Savannah,

>-Original Message-
>From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
>Sent: Tuesday, July 20, 2010 2:42 PM
>To: user@nutch.apache.org
>Subject: mysql
>
>Hi,
>  I use DriverManager.getConnection to connect to my mysql db and do
>query in my
>plugin that extend indexfilter.  It seems that indexfilter is run for
>every url
>being indexed.  It means that I have to open and close connection to
>mysql db
>each time a url is being indexed, it is unefficient.  Is there a way to
>open and
>close connection to mysql db only once for all urls, not for each url?

Yes, there is. See Arch index filter:

http://www.atnf.csiro.au/computing/software/arch/

Regards,

Arkadi

>
>Thanks.
>
>
>


RE: How to Index Only Pages with Certain Urls?

2010-07-15 Thread Arkadi.Kosmynin
Hi Savannah,

You can control indexing with an index filter plugin. If you don't want a 
particular URL in the index, just return null.
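A minimal sketch of such a filter, using your car.abc.com example (the class 
name is invented; the signatures are the standard Nutch IndexingFilter 
interface):

import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Sketch: index only pages whose host is car.abc.com; everything else is
// still crawled (so links are followed) but never reaches the index.
public class HostRestrictedIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    try {
      String host = new URL(url.toString()).getHost();
      return "car.abc.com".equalsIgnoreCase(host) ? doc : null;
    } catch (Exception e) {
      return null; // unparsable URL: leave it out of the index
    }
  }

  public void addIndexBackendOptions(Configuration conf) { }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}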

Regards,

Arkadi 

>-Original Message-
>From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
>Sent: Friday, July 16, 2010 1:41 AM
>To: user@nutch.apache.org
>Subject: How to Index Only Pages with Certain Urls?
>
>Hi,
>  I want nutch to crawl abc.com, but  I want to index only car.abc.com.
> car.abc.com links can in any levels in abc.com.  So, basically, I want
>nutch to
>keep crawl abc.com normally, but index only pages that start as
>car.abc.com.
> e.g. car.abc.com/toyota...car.abc.com/honda...
>
>
>
>I set the regex-urlfilter.txt to include only car.abc.com and run the
>command
>"generate crawl/crawldb crawl/segments", but it just say "Generator: 0
>records
>selected for fetching, exiting ..." .  I guess car.abc.com links exist
>only in
>several levels deep.
>
>
>How to do this?  I am using nutch 1.1 and solr 1.4.1
>Thanks.
>
>
>


RE: More question about plugin entry point

2010-07-12 Thread Arkadi.Kosmynin
Hi Jeff,

>-Original Message-
>From: jeff [mailto:jefferson.z...@gmail.com]
>Sent: Tuesday, July 13, 2010 4:22 PM
>To: user@nutch.apache.org
>Subject: More question about plugin entry point
>
>Hi,
>
>I am looking at the plugin.xml under $nutch_home\src\plugin\query-site\,
>and notice the following code:
>
>   id="query-site"
>   name="Site Query Filter"
>   version="1.0.0"
>   provider-name="nutch.org">
>
>   
>  
> 
>  
>   
>
>   
>  
>   
>
> name="Nutch Site Query Filter"
>  point="org.apache.nutch.searcher.QueryFilter">
>  
>class="org.apache.nutch.searcher.site.SiteQueryFilter">
>
>  
>
>   
>
>
>
>
>For this plugin to work, there must be some entry point in some .xml
>file such that nutch will know where to locate this SiteQueryFilter
>class when the filter function within
>org.apache.nutch.searcher.QueryFilter is called. However, I searched all
>the workspace and tried to find some entry and had no luck as of now.

Please see plugins/nutch-extensionpoints/plugin.xml
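For reference, that file does nothing but declare the extension points 
themselves; each declaration looks roughly like this (attribute values quoted 
from memory):

<extension-point
    id="org.apache.nutch.searcher.QueryFilter"
    name="Nutch Query Filter"/>

The point attribute of a plugin's extension element must match one of these 
ids; that is the whole mechanism that ties SiteQueryFilter to the QueryFilter 
extension point.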

Regards,

Arkadi

>
>Does any of you know the mechanism that would make this SiteQueryFilter
>plugin to work?
>
>Thanks,
>Jeff



RE: Nutch Categorizer Plugin

2010-07-04 Thread Arkadi.Kosmynin
Hi Sravan,

>-Original Message-
>From: Sravan Suryadevara [mailto:sravan.suryadev...@gmail.com]
>Sent: Monday, June 28, 2010 11:58 PM
>To: user@nutch.apache.org
>Subject: Nutch Categorizer Plugin
>
>Hi guys,
>
>I'm slightly confused about how the plugin is supposed to be built and
>included. Since different tutorials say different things.
>
>All I need to do is build the jar file and create a plugin.xml, include
>it
>in nutch-site.xml and that's good correct?

Almost. Your plugin must be "pluggable" into one of Nutch's extension points. 
You have to place your plugin folder in the plugins folder where Nutch can find 
it, and you have to name the folder and the jar correctly, too.
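Concretely, the layout Nutch expects (using a made-up plugin called 
my-categorizer as an example) is:

  plugins/my-categorizer/plugin.xml
  plugins/my-categorizer/my-categorizer.jar

and plugin.xml must tie the jar and the implementing class to an existing 
extension point, roughly like this (the class name and the extension point are 
only examples - use whichever point your categorizer actually extends):

<plugin id="my-categorizer" name="My Categorizer" version="1.0.0"
        provider-name="example.org">
  <runtime>
    <library name="my-categorizer.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.example.nutch.MyCategorizer"
             name="My Categorizer"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="MyCategorizer"
                    class="org.example.nutch.MyCategorizer"/>
  </extension>
</plugin>

Finally, the plugin id has to be matched by the plugin.includes regular 
expression in your nutch-site.xml, or it will never be loaded.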

>
>Or do I have to download the svn nutch and rebuild the entire nutch
>project,
>with my plugin included? (making a build.xml)

No, this is not required.

>
>I tried both ways, but I have test print statements in my code, and it
>never
>appeared.

To start troubleshooting, look in the logs. Nutch lists the plugins it has 
found there.

Regards,

Arkadi

>
>Sincerely,
>Sravan Suryadevara


RE: Generator problems in Nutch 1.1

2010-07-01 Thread Arkadi.Kosmynin
>>you have set adddays to 5 days.
>>in this case it will be refetched and refetched again.
>>try adddays with value 0.

This helped. However, the other problem is still there: the Generator schedules 
only 1 URL for fetching per iteration. Please see an extract from the log 
below. These are fetching attempts - a partial output of "grep fetching 
hadoop.log".

2010-07-01 20:14:34,397 INFO  fetcher.Fetcher - fetching 
http://www.atnf.csiro.au/vlbi/wiki/index.php?n=Pmcal.Pmcal?month=5&day=1&year=2002
2010-07-01 20:15:08,188 INFO  fetcher.Fetcher - fetching 
http://www.atnf.csiro.au/vlbi/wiki/index.php?n=Pmcal.Pmcal?month=4&day=1&year=2002
2010-07-01 20:15:43,830 INFO  fetcher.Fetcher - fetching 
http://www.atnf.csiro.au/vlbi/wiki/index.php?n=Pmcal.Pmcal?month=3&day=1&year=2002
2010-07-01 20:16:17,529 INFO  fetcher.Fetcher - fetching 
http://www.atnf.csiro.au/vlbi/wiki/index.php?n=Pmcal.Pmcal?month=2&day=1&year=2002
2010-07-01 20:16:50,117 INFO  fetcher.Fetcher - fetching 
http://www.atnf.csiro.au/vlbi/wiki/index.php?n=Pmcal.Pmcal?month=1&day=1&year=2002
2010-07-01 20:17:26,211 INFO  fetcher.Fetcher - fetching 
http://www.atnf.csiro.au/vlbi/wiki/index.php?n=Pmcal.Pmcal?month=12&day=1&year=2001

The time gaps are explained by the different iterations.

This would not trouble me, but the crawl finished because the depth limit was 
reached. There are still 1000+ URLs with status db_unfetched and score 30 left 
in the crawldb. They should have been scheduled for fetching; most fetched URLs 
have this score.
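To see these numbers directly, by the way:

  bin/nutch readdb crawl/crawldb -stats

prints the URL counts per status, including db_unfetched.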

>
>I will try 0, thanks. The interesting thing is, I don't recall this
>happening with Nutch 1.0. I used it with the same script. The other
>problem that I mentioned, scheduling 1 new URL per iteration, is also
>new. I definitely did not have it with Nutch 1.0.
>
>
>Regards,
>
>Arkadi
>
>>
>>arkadi.kosmy...@csiro.au schrieb:
>>> Hi,
>>>
>>> I am trying to use Nutch 1.1 to build a complete index of our
>>corporate web sites. I am using a script based on this one:
>>>
>>> http://wiki.apache.org/nutch/Crawl
>>>
>>> The text of my script is included below. I set the crawling depth to
>>100 to make sure that everything is indexed, expecting that the process
>>will stop after about 20 iterations. The problem is that the Generator
>>keeps re-scheduling fetching of failed URLs. The process stopped
>because
>>the max number of iterations (the depth) was reached. After a few
>>iterations, only failing URLs were being repeatedly fetched. I checked
>>the log for one of them. It failed with code 403 and was re-scheduled
>>for fetching 94 times. This does not seem right.
>>>
>>> Another problem that I've noticed is that sometimes the Generators
>>schedules fetching of just one URL per iteration. This happened once
>and
>>I did not try to repeat this effect, but this does not seem right
>>either.
>>>
>>> Here is my script text:
>>>
>>> ---
>>> #!/bin/bash
>>>
>>> depth=100
>>> threads=10
>>> adddays=5
>>> topN=5
>>>
>>> NUTCH_HOME=/data/HORUS_1/nutch1.1
>>> NUTCH_HEAPSIZE=8000
>>> nutch=$NUTCH_HOME
>>> JAVA_HOME=/usr/lib/jvm/java-6-sun
>>> export NUTCH_HOME
>>> export JAVA_HOME
>>> export NUTCH_HEAPSIZE
>>>
>>> steps=6
>>> echo "- Inject (Step 1 of $steps) -"
>>> $nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed
>>>
>>> echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
>>> for((i=0; i < $depth; i++))
>>> do
>>>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>>   $nutch/bin/nutch generate $nutch/crawl/crawldb
>$nutch/crawl/segments
>>$topN -adddays $adddays
>>>   if [ $? -ne 0 ]
>>>   then
>>> echo "Stopping at depth $depth. No more URLs to fetch."
>>> break
>>>   fi
>>>   segment=`ls -d $nutch/crawl/segments/* | tail -1`
>>>
>>>   $nutch/bin/nutch fetch $segment -threads $threads
>>>   if [ $? -ne 0 ]
>>>   then
>>> echo "fetch $segment at depth $depth failed. Deleting it."
>>> rm -rf $segment
>>> continue
>>>   fi
>>>
>>> #  echo "--- Parsing Segment $segment ---"
>>> #  $nutch/bin/nutch parse $segment
>>>
>>>   $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
>>> done
>>>
>>> echo "- Invert Links (Step 3 of $steps) -"
>>> $nutch/bin/nutch invertlinks $nutch/crawl/linkdb
>>$nutch/crawl/segments/*
>>>
>>> echo "- Index (Step 4 of $steps) -"
>>> $nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb
>>$nutch/crawl/linkdb $nutch/crawl/segments/*
>>>
>>> echo "- Dedup (Step 5 of $steps) -"
>>> $nutch/bin/nutch dedup $nutch/crawl/preIndex
>>>
>>> echo "- Merge Indexes (Step 6 of $steps) -"
>>> $nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex
>>>
>>> # in nutch-site, hadoop.tmp.dir points to crawl/tmp
>>> rm -rf $nutch/crawl/tmp/*
>>> ---
>>>
>>> Is anyone experiencing same problems? Is there anything wrong in what
>>I am doing?
>>>
>>> Regards,
>>>
>>> Arkadi
>>>
>>>
>>>
>>>
>>>
>>>
>>>



RE: Generator problems in Nutch 1.1

2010-06-30 Thread Arkadi.Kosmynin
Hi Reinhard,

>-Original Message-
>From: reinhard schwab [mailto:reinhard.sch...@aon.at]
>Sent: Thursday, July 01, 2010 2:10 PM
>To: user@nutch.apache.org
>Subject: Re: Generator problems in Nutch 1.1
>
>could you dump the entry of this url in crawl db with
>
>bin/nutch readdb crawl/crawldb -url 

Here it is:

URL: http://www.atnf.csiro.au/people/bkoribal/obs/C848.html
Version: 7
Status: 3 (db_gone)
Fetch time: Thu Jul 01 20:51:33 GMT+10:00 2010
Modified time: Thu Jan 01 10:00:00 GMT+10:00 1970
Retries since fetch: 94
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _pst_: exception(16), lastModified=0: Http code=403, 
url=http://www.atnf.csiro.au/people/bkoribal/obs/C848.html

Please note that this is not the only URL which was repeatedly re-fetched. 
There were a number of them, about 150. 

>it also depends on your configuration.
>what is the value for
>db.fetch.interval.default and for
>db.fetch.schedule.class?

I did not change these parameters in nutch-default.xml:


<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a
  page.</description>
</property>

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30
  days).</description>
</property>

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.</description>
</property>

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property>




>
>a guess. i have not done crawling with nutch for a while, so im not
>sure.
>i guess it has CrawlDatum.STATUS_FETCH_RETRY when crawling and it uses
>
> schedule.setPageRetrySchedule((Text)key, result, prevFetchTime,
>  prevModifiedTime, fetch.getFetchTime());
>
>in AbstractFetchSchedule you can read
>
> /**
>   * This method adjusts the fetch schedule if fetching needs to be
>   * re-tried due to transient errors. The default implementation
>   * sets the next fetch time 1 day in the future and increases
>   * the retry counter.
>   * @param url URL of the page
>   * @param datum page information
>   * @param prevFetchTime previous fetch time
>   * @param prevModifiedTime previous modified time
>   * @param fetchTime current fetch time
>   * @return adjusted page information, including all original
>information.
>   * NOTE: this may be a different instance than {...@param datum}, but
>   * implementations should make sure that it contains at least all
>   * information from {...@param datum}.
>   */
>  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
>  long prevFetchTime, long prevModifiedTime, long fetchTime) {
>datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);
>datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
>return datum;
>  }
>
>you have set adddays to 5 days.
>in this case it will be refetched and refetched again.
>try adddays with value 0.
>or change the code in AbstractFetchSchedule
>
>datum.setFetchTime(fetchTime + (long)SECONDS_PER_DAY*1000);

I will try 0, thanks. The interesting thing is that I don't recall this happening 
with Nutch 1.0, which I used with the same script. The other problem I mentioned, 
the Generator scheduling just one new URL per iteration, is also new. I definitely 
did not see it with Nutch 1.0.
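
For the record, here is a rough sketch of the kind of change you suggest, i.e.
capping re-tries in a custom fetch schedule instead of patching
AbstractFetchSchedule itself. This is only an illustration, not code from Nutch
or from my setup; the class name, the retry cap of 3 and the 30-day back-off are
made-up example values:

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

// Illustration only: stop re-trying a page after a fixed number of transient
// failures instead of pushing it just one day ahead forever.
public class CappedRetryFetchSchedule extends DefaultFetchSchedule {

  private static final int MAX_RETRIES = 3;
  private static final long ONE_DAY_MS = 24L * 3600L * 1000L;

  @Override
  public CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum,
      long prevFetchTime, long prevModifiedTime, long fetchTime) {
    if (datum.getRetriesSinceFetch() >= MAX_RETRIES) {
      // Give up for a long while, so the Generator (even with a large
      // adddays value) stops selecting this URL on every iteration.
      datum.setFetchTime(fetchTime + 30 * ONE_DAY_MS);
    } else {
      datum.setFetchTime(fetchTime + ONE_DAY_MS);
      datum.setRetriesSinceFetch(datum.getRetriesSinceFetch() + 1);
    }
    return datum;
  }
}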

Regards,

Arkadi

>
>arkadi.kosmy...@csiro.au schrieb:
>> Hi,
>>
>> I am trying to use Nutch 1.1 to build a complete index of our
>corporate web sites. I am using a script based on this one:
>>
>> http://wiki.apache.org/nutch/Crawl
>>
>> The text of my script is included below. I set the crawling depth to
>100 to make sure that everything is indexed, expecting that the process
>will stop after about 20 iterations. The problem is that the Generator
>keeps re-scheduling fetching of failed URLs. The process stopped because
>the max number of iterations (the depth) was reached. After a few
>iterations, only failing URLs were being repeatedly fetched. I checked
>the log for one of them. It failed with code 403 and was re-scheduled
>for fetching 94 times. This does not seem right.
>>
>> Another problem that I've noticed is that sometimes the Generator
>schedules fetching of just one URL per iteration. This happened once and
>I did not try to repeat this effect, but this does not seem right
>either.
>>
>> Here is my script text:
>>
>> ---
>> #!/bin/bash
>>
>> depth=100
>> threads=10
>> adddays=5
>> topN=5
>>
>> NUTCH_HOME=/data/HORUS_1/nutch1.1
>> NUTCH_HEAPSIZE=8000
>> nutch=$NUTCH_HOME
>> JAVA_HOME=/usr/lib/jvm/java-6-sun
>> export NUTCH_HOME
>> export JAVA_HOME
>> export NUTCH_HEAPSIZE
>>
>> steps=6
>> echo "- Inject (Step 1 of $steps) -"
>> $nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed
>>
>> echo "- G

Generator problems in Nutch 1.1

2010-06-30 Thread Arkadi.Kosmynin
Hi,

I am trying to use Nutch 1.1 to build a complete index of our corporate web 
sites. I am using a script based on this one:

http://wiki.apache.org/nutch/Crawl

The text of my script is included below. I set the crawling depth to 100 to 
make sure that everything is indexed, expecting that the process will stop 
after about 20 iterations. The problem is that the Generator keeps 
re-scheduling fetching of failed URLs. The process stopped because the max 
number of iterations (the depth) was reached. After a few iterations, only 
failing URLs were being repeatedly fetched. I checked the log for one of them. 
It failed with code 403 and was re-scheduled for fetching 94 times. This does 
not seem right.

Another problem that I've noticed is that sometimes the Generator schedules 
fetching of just one URL per iteration. This happened once and I did not try to 
reproduce it, but this does not seem right either.

Here is my script text:

---
#!/bin/bash

depth=100
threads=10
adddays=5
topN=5

NUTCH_HOME=/data/HORUS_1/nutch1.1
NUTCH_HEAPSIZE=8000
nutch=$NUTCH_HOME
JAVA_HOME=/usr/lib/jvm/java-6-sun
export NUTCH_HOME
export JAVA_HOME
export NUTCH_HEAPSIZE

steps=6
echo "- Inject (Step 1 of $steps) -"
$nutch/bin/nutch inject $nutch/crawl/crawldb $nutch/crawl/seed

echo "- Generate, Fetch, Parse, Update (Step 2 of $steps) -"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $nutch/bin/nutch generate $nutch/crawl/crawldb $nutch/crawl/segments $topN 
-adddays $adddays
  if [ $? -ne 0 ]
  then
echo "Stopping at depth $depth. No more URLs to fetch."
break
  fi
  segment=`ls -d $nutch/crawl/segments/* | tail -1`

  $nutch/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
echo "fetch $segment at depth $depth failed. Deleting it."
rm -rf $segment
continue
  fi

#  echo "--- Parsing Segment $segment ---"
#  $nutch/bin/nutch parse $segment

  $nutch/bin/nutch updatedb $nutch/crawl/crawldb $segment
done

echo "- Invert Links (Step 3 of $steps) -"
$nutch/bin/nutch invertlinks $nutch/crawl/linkdb $nutch/crawl/segments/*

echo "- Index (Step 4 of $steps) -"
$nutch/bin/nutch index $nutch/crawl/preIndex $nutch/crawl/crawldb 
$nutch/crawl/linkdb $nutch/crawl/segments/*

echo "- Dedup (Step 5 of $steps) -"
$nutch/bin/nutch dedup $nutch/crawl/preIndex

echo "- Merge Indexes (Step 6 of $steps) -"
$nutch/bin/nutch merge $nutch/crawl/index $nutch/crawl/preIndex

# in nutch-site, hadoop.tmp.dir points to crawl/tmp
rm -rf $nutch/crawl/tmp/*
---

Is anyone experiencing the same problems? Is there anything wrong in what I am 
doing?

Regards,

Arkadi







RE: How to make nutch take distance between terms in document in account?

2010-06-27 Thread Arkadi.Kosmynin
Hi Dmitriy,


>-Original Message-
>From: Dmitriy V. Kazimirov [mailto:dmitriy.kazimi...@viorsan.com]
>Sent: Saturday, June 26, 2010 9:53 PM
>To: user@nutch.apache.org
>Subject: How to make nutch take distance between terms in document in
>account?
>
>Hi,
>
>Is it possible to make nutch scoring take into account distance between
>terms?
>
>i.e. if we have the query "president bush medvedev", should a document where
>all 3 terms are near each other (how to define 'near' here is also
>interesting) score higher than one where they are far apart?
>
>If that's not possible right now, am I correct that a new QueryFilter
>should be implemented? How should this be done?

This is implemented but, if I am not wrong, it is not being used. Please see
addSloppyPhrases in BasicQueryFilter.java. Note that SLOP (the proximity
parameter) is set to Integer.MAX_VALUE, which in effect defines 'near' as 'very
far'. I did not find any code in Nutch that would change it.
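
To illustrate what the slop value means at the Lucene level, here is a generic
sketch (not code from BasicQueryFilter; the field name, the terms and the slop
value of 10 are made up):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Generic Lucene sketch of a sloppy phrase query: a small slop only matches
// (and boosts) documents where the terms occur close together, while
// Integer.MAX_VALUE effectively ignores the distance between them.
public class SloppyPhraseExample {
  public static PhraseQuery buildQuery() {
    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("content", "president"));
    pq.add(new Term("content", "bush"));
    pq.add(new Term("content", "medvedev"));
    pq.setSlop(10); // allow up to ~10 position moves between the phrase terms
    return pq;
  }
}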

Regards,

Arkadi

>
>
>
>
>With regards, Dmitriy



RE: Parsing PostScript files

2010-06-24 Thread Arkadi.Kosmynin
Thanks, Andrzej!

>-Original Message-
>From: Andrzej Bialecki [mailto:a...@getopt.org]
>Sent: Friday, June 25, 2010 3:41 AM
>To: user@nutch.apache.org
>Subject: Re: Parsing PostScript files
>
>On 2010-06-24 10:56, arkadi.kosmy...@csiro.au wrote:
>> Hi,
>>
>> It looks like Tika does not include a PostScript parser. At least the
>> copy that comes with Nutch 1.1. Is this right? I just want to double
>> check because PostScript is a major file format. I get errors "Can't
>> retrieve Tika parser for mime-type application/postscript" in the log
>> when Nutch comes across a PostScript file. I've found a reference to
>> parser-pdf associated with PostScript, but it does not work any
>> better. It tries to treat PostScript files as pdf and fails, if I
>> correctly interpret its complains.
>
>PDF parser can't properly parse Postscript, sorry. On the other hand,
>Postscript parsers may be (and often are) able to parse PDF-s.
>
>>
>> Could anyone help with parsing PostScript in Nutch, please? It is
>> hard to believe that this is not implemented.
>
>You can use Ghostscript via the parse-ext plugin - see examples in
>plugin.xml file there.
>
>(...and BTW, parsing Postscript is definitely not on the same level of
>complexity as parsing PDF - Postscript is a full programming language,
>whereas PDF is "just" a page description format).
>
>--
>Best regards,
>Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __
>[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>___|||__||  \|  ||  |  Embedded Unix, System Integration
>http://www.sigram.com  Contact: info at sigram dot com



Parsing PostScript files

2010-06-24 Thread Arkadi.Kosmynin
Hi,

It looks like Tika does not include a PostScript parser, at least not the copy that 
comes with Nutch 1.1. Is this right? I just want to double check because 
PostScript is a major file format. I get errors "Can't retrieve Tika parser for 
mime-type application/postscript" in the log when Nutch comes across a 
PostScript file. I've found a reference to parser-pdf associated with 
PostScript, but it does not work any better: it tries to treat PostScript files 
as PDF and fails, if I correctly interpret its complaints.

Could anyone help with parsing PostScript in Nutch, please? It is hard to 
believe that this is not implemented.

Thanks,

Arkadi


RE: Running Nutch in a single VM

2010-05-25 Thread Arkadi.Kosmynin
Hi Hannes,

You are welcome. Forking is optional, based on a config parameter. The 
interface can be used within a single JVM as well. I use this for debugging and 
small crawls.
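
As a generic illustration only (this is not Arch's Starter interface; the choice
of tool and the paths are just examples), the standard Nutch tools, which
implement Hadoop's Tool interface, can also be driven in-process through
ToolRunner, which is one way to run a step inside a single JVM:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.util.NutchConfiguration;

// Runs the inject step in the current JVM instead of via bin/nutch.
// The crawldb and seed directory paths are illustrative.
public class InProcessInject {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    int res = ToolRunner.run(conf, new Injector(),
        new String[] { "crawl/crawldb", "crawl/seed" });
    System.exit(res);
  }
}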

Regards,

Arkadi



> -Original Message-
> From: Hannes Carl Meyer [mailto:hannesc...@googlemail.com]
> Sent: Tuesday, May 25, 2010 5:01 PM
> To: Kosmynin, Arkadi (CASS, Marsfield); user@nutch.apache.org
> Subject: Re: Running Nutch in a single VM
> 
> Hi Arkadi,
> 
> thanks for the great reference. I guess I can't fork child JVMs in my
> use
> case though. Fortunately I don't have to crawl more than 100k sites and
> also
> don't need a lot of multiple threads.
> I'm going to take a look at your referenced class, thank you!
> 
> Regards,
> 
> Hannes
> 
> On Mon, May 24, 2010 at 1:07 AM,  wrote:
> 
> > Hi Hannes,
> >
> > This is done in Arch. See the au.csiro.cass.arch.utils.Starter class
> and
> > its use from other classes. It was not straightforward because the
> used RAM
> > tends to grow with iterations. To get around this, I had to fork
> child JVMs.
> >
> > You can find Arch here:
> >
> > http://www.atnf.csiro.au/computing/software/arch/
> >
> > Regards,
> >
> > Arkadi
> >
> > > -Original Message-
> > > From: Hannes Carl Meyer [mailto:hannesc...@googlemail.com]
> > > Sent: Friday, May 21, 2010 7:33 PM
> > > To: user@nutch.apache.org
> > > Subject: Running Nutch in a single VM
> > >
> > > Hi,
> > >
> > > is it possible to run nutch in a single virtual machine for
> intranet
> > > crawling? Even inside a Java Application Server?
> > >
> > > Normally I'm using custom Nutch crawl scripts and start from the OS
> > > command
> > > line by cron. In a new project it is required to use a running
> Virtual
> > > Machine for deloyment and invocation of crawler tasks.
> > >
> > > Does anybody has experiences in deploying Nutch in such a scenario?
> > >
> > > Kind Regards
> > >
> > > Hannes
> > >
> > > --
> >
> 
> 
> 
> --
> 
> https://www.xing.com/profile/HannesCarl_Meyer
> http://de.linkedin.com/in/hannescarlmeyer
> http://twitter.com/hannescarlmeyer


RE: Running Nutch in a single VM

2010-05-23 Thread Arkadi.Kosmynin
Hi Hannes,

This is done in Arch. See the au.csiro.cass.arch.utils.Starter class and its 
use from other classes. It was not straightforward because RAM usage tends to 
grow with iterations. To get around this, I had to fork child JVMs.
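
In case it helps, below is a generic sketch of what forking a child JVM for one
crawl step can look like. This is not the actual Starter code; the class name,
the heap size and the reuse of the parent's classpath are just example choices:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Runs one step in a child JVM so that its heap is fully released when the
// step finishes, instead of accumulating across iterations in the parent.
public class ChildJvmLauncher {

  public static int runStep(String mainClass, List<String> args)
      throws IOException, InterruptedException {
    List<String> cmd = new ArrayList<String>();
    cmd.add(System.getProperty("java.home") + "/bin/java");
    cmd.add("-Xmx2g");                              // per-step heap, tune as needed
    cmd.add("-cp");
    cmd.add(System.getProperty("java.class.path")); // reuse the parent's classpath
    cmd.add(mainClass);                             // e.g. a Nutch tool's main class
    cmd.addAll(args);

    ProcessBuilder pb = new ProcessBuilder(cmd);
    pb.redirectErrorStream(true);                   // merge stderr into stdout
    Process child = pb.start();

    // Drain the child's output so it cannot block on a full pipe buffer.
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(child.getInputStream()));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    return child.waitFor();                         // exit code of the step
  }
}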

You can find Arch here:

http://www.atnf.csiro.au/computing/software/arch/

Regards,

Arkadi

> -Original Message-
> From: Hannes Carl Meyer [mailto:hannesc...@googlemail.com]
> Sent: Friday, May 21, 2010 7:33 PM
> To: user@nutch.apache.org
> Subject: Running Nutch in a single VM
> 
> Hi,
> 
> is it possible to run nutch in a single virtual machine for intranet
> crawling? Even inside a Java Application Server?
> 
> Normally I'm using custom Nutch crawl scripts and start from the OS
> command
> line by cron. In a new project it is required to use a running Virtual
> Machine for deployment and invocation of crawler tasks.
> 
> Does anybody has experiences in deploying Nutch in such a scenario?
> 
> Kind Regards
> 
> Hannes
> 
> --


RE: Writing a Book on Nutch

2010-05-17 Thread Arkadi.Kosmynin
Hi Dennis,

I think you should include info on:

- Data structures and data flow in Nutch, since understanding of these helps 
understand other things better;
- Common problems, solutions, troubleshooting and tuning, because everyone 
working with Nutch faces these issues sooner or later.

Regards,

Arkadi



> -Original Message-
> From: Dennis Kubes [mailto:ku...@apache.org]
> Sent: Monday, May 17, 2010 11:28 AM
> To: user@nutch.apache.org
> Subject: Writing a Book on Nutch
> 
> Hi Everyone,
> 
> It has been a long time coming but I have finally started to write a
> book on Nutch.  It will be self published and should be available in
> PDF
> / paperback form in less than a month hopefully.
> 
> A while back we discussed a Nutch training seminar on the list.  I am
> not ready to do a full on seminar yet but I will be putting up some
> training and tutorial videos in the next few weeks.  I will update the
> list as those become available.
> 
> I already have a general outline but it would help me to know the
> following:
> 
> 1) What types of things you would want explained in a book / videos on
> Nutch?
> 2) What are the biggest problems you face using Nutch?
> 3) Anything special you would like answered or explained?
> 
> Thanks in advance for any responses.
> 
> Dennis