[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-05-30 Thread Marcel Schnippe (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413983 ] 

Marcel Schnippe commented on NUTCH-292:
---

The cause of the OutOfMemoryError in my case was a (large) document 
containing a very large set of tokens. Most of the tokens are made of 
overlapping substrings, as in 

"all your base are belong to us" => all, all-your, your, your-base, 
all-your-base, base-are, etc.


> OpenSearchServlet: OutOfMemoryError: Java heap space
> 
>
>  Key: NUTCH-292
>  URL: http://issues.apache.org/jira/browse/NUTCH-292
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
>  Attachments: summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>   
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
> It seems to be a problem specific to the data I'm working with. Moving the 
> start from 0 to 10 or changing the query works fine.
> Or maybe it doesn't have to do with sorting but it's just that I hit one "bad 
> search-result" that has a broken summary?
> !! The problem is repeatable. So if anybody has an idea where to search / 
> what to fix, I can easily try that out !!

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-05-30 Thread Marcel Schnippe (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413982 ] 

Marcel Schnippe commented on NUTCH-292:
---

Hi Stefan, 

Thanks for trying out the patch. Yes, you were right, it was for 0.7. I should 
definitely switch, but I have made so many custom changes.
The proper place to apply it would be in summary-basic's getTokens(), along the 
lines of:

  private Token[] getTokens(String text) {
    ArrayList result = new ArrayList();
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    Token token = null;
-   while (true)  {
+   while (result.size() < ...)  {   // the exact bound was cut off in the original message

Beware of the above code. I have only proven it correct, not tested it. 
(D. Knuth)
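
For reference, a complete bounded version could look like the sketch below. This is 
not the committed patch: the 10000-token cap is an arbitrary illustration, and the 
method assumes the imports already present in the summarizer class (java.io, 
org.apache.lucene.analysis).

  // Sketch of a token-count bound for summary-basic's getTokens().
  private Token[] getTokens(String text) {
    final int maxTokens = 10000;   // illustrative cap, not the value from the patch
    ArrayList result = new ArrayList();
    TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
    try {
      Token token = ts.next();
      while (token != null && result.size() < maxTokens) {
        result.add(token);
        token = ts.next();
      }
    } catch (IOException e) {
      // return whatever was collected before the failure
    }
    return (Token[]) result.toArray(new Token[result.size()]);
  }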





Re: java 1.4 versus 1.5

2006-05-30 Thread Matthew Hannigan
Do it; Java 1.5 has much better profiling support too.


On Tue, May 30, 2006 at 03:21:00PM -0700, Owen O'Malley wrote:
> Java 1.5 has been out for a couple of years now and has some nice 
> improvements in the libraries. In particular, I wish I had access to 
> the timeout settings on UrlConnections. Would anyone object if starting 
> with the 0.3 release this week, we required java 1.5 to compile and run 
> Hadoop?
> 
> -- Owen


[jira] Commented: (NUTCH-272) Max. pages to crawl/fetch per site (emergency limit)

2006-05-30 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-272?page=comments#action_12413959 ] 

Matt Kangas commented on NUTCH-272:
---

Thanks Doug, that makes more sense now. Running URLFilters.filter() during 
Generate seems very handy, albeit costly for large crawls. (Should there be an 
option to turn it off?)

I also see that URLFilters.filter() is applied in Fetcher (for redirects) and 
ParseOutputFormat, plus other tools.

Another possible choke-point: CrawlDbMerger.Merger.reduce(). The key is the URL, 
and the keys are sorted. You can veto crawldb additions here. Could you 
effectively count URLs per host here? (Not sure how this behaves when 
distributed.) Would it require setting a Partitioner, like 
crawl.PartitionUrlByHost? A rough sketch of the counting idea follows.
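
To illustrate only the counting/veto logic (this is not Nutch code; the 1000-URL 
limit is a hypothetical value): since the keys are sorted by URL, entries for the 
same host arrive largely adjacent, so a reducer could track the current host and 
drop everything past the limit.

  import java.net.MalformedURLException;
  import java.net.URL;

  public class HostLimiter {
    private static final int MAX_URLS_PER_HOST = 1000; // hypothetical limit
    private String currentHost = null;
    private int count = 0;

    /** Returns true if this url should be kept in the crawldb. */
    public boolean accept(String url) {
      String host;
      try {
        host = new URL(url).getHost();
      } catch (MalformedURLException e) {
        return false; // veto unparseable URLs
      }
      if (!host.equals(currentHost)) { // a new host starts: reset the counter
        currentHost = host;
        count = 0;
      }
      return ++count <= MAX_URLS_PER_HOST;
    }
  }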

> Max. pages to crawl/fetch per site (emergency limit)
> 
>
>  Key: NUTCH-272
>  URL: http://issues.apache.org/jira/browse/NUTCH-272
>  Project: Nutch
> Type: Improvement

> Reporter: Stefan Neufeind

>
> If I'm right, there is no way in place right now for setting an "emergency 
> limit" to fetch a certain max. number of pages per site. Is there an "easy" 
> way to implement such a limit, maybe as a plugin?




[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-30 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ] 

Stefan Groschupf commented on NUTCH-289:


+1
Andrzej, I agree that looking up the IP in ParseOutputFormat would be best, as 
Doug suggested.
The biggest problem Nutch has at the moment is spam. The most common spam 
method is to set up a DNS server that returns the same IP for all subdomains and 
then deliver dynamically generated content. The spammers then just randomly 
generate subdomains within the content. It also happens often that they have 
many URLs, but all of them point to the same server, i.e. the same IP.
Buying more IP addresses is possible, but at the moment it is more expensive than 
buying more domains.

Limiting the URLs by IP is a great approach to keep the crawler from getting 
stuck in honey pots with tens of thousands of URLs pointing to the same IP.
However, to do so we need to have the IP already at generation time, not look it 
up while fetching.
We would be able to reuse the IP in the fetcher; we can also try/catch those 
parts in the fetcher and, in case the IP is no longer valid, look it up again 
(a small lookup-and-cache sketch follows below).
I don't think round-robin DNS is a huge problem, since only large sites use it, 
and in such cases each IP is able to handle the requests.
In any case, storing the IP in CrawlDatum and using it for per-IP URL limits 
will be a big step forward in the fight against web spam.
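
As an illustration of the reuse/re-lookup idea only (not Nutch code; a plain 
in-memory map stands in for whatever cache Nutch would actually use):

  import java.net.InetAddress;
  import java.net.UnknownHostException;
  import java.util.HashMap;
  import java.util.Map;

  public class IpCache {
    private final Map cache = new HashMap(); // host -> IP string

    /** Resolve a host, remembering the result; pass refresh=true to force a re-lookup. */
    public synchronized String lookup(String host, boolean refresh) {
      String ip = refresh ? null : (String) cache.get(host);
      if (ip == null) {
        try {
          ip = InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
          return null; // caller decides how to handle unresolvable hosts
        }
        cache.put(host, ip);
      }
      return ip;
    }
  }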

> CrawlDatum should store IP address
> --
>
>  Key: NUTCH-289
>  URL: http://issues.apache.org/jira/browse/NUTCH-289
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of it's URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.




[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-30 Thread Matt Kangas (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413939 ] 

Matt Kangas commented on NUTCH-289:
---

+1 to saving IP address in CrawlDatum, wherever the value comes from. (Fetcher 
or otherwise)





java 1.4 versus 1.5

2006-05-30 Thread Owen O'Malley
Java 1.5 has been out for a couple of years now and has some nice 
improvements in the libraries. In particular, I wish I had access to 
the timeout settings on UrlConnections. Would anyone object if starting 
with the 0.3 release this week, we required java 1.5 to compile and run 
Hadoop?


-- Owen



Do analyzer plugins have access to the Configuration?

2006-05-30 Thread Teruhiko Kurosaka
Jérôme, or anybody familiar with language plugin architecture,

I am writing a language analyzer plugin. This plugin has configurable
parameters, which I am hoping I can add to nutch-site.xml. But
the German and French plugin examples don't access the
Configuration object. Does the current analyzer plugin architecture
allow each plugin implementation to access the Configuration
object? If not, what would it take to allow such access? It would be
best if it were possible both at plugin class-loading time and at
instantiation time.
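
For illustration, the kind of access being asked about might look like the sketch 
below, assuming the plugin can somehow be handed the Hadoop Configuration that 
Nutch builds from nutch-default.xml/nutch-site.xml; the property names here are 
hypothetical.

  import org.apache.hadoop.conf.Configuration;

  public class MyAnalyzerSettings {
    private final String dictionaryPath; // hypothetical parameter
    private final int maxTokenLength;    // hypothetical parameter

    public MyAnalyzerSettings(Configuration conf) {
      // get()/getInt() pick up values overridden in nutch-site.xml
      this.dictionaryPath = conf.get("my.analyzer.dictionary", "dict.txt");
      this.maxTokenLength = conf.getInt("my.analyzer.max.token.length", 255);
    }

    public String getDictionaryPath() { return dictionaryPath; }
    public int getMaxTokenLength()    { return maxTokenLength; }
  }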

-kuro


Re: Fetcher and MapReduce

2006-05-30 Thread Stefan Groschupf

Hi,

so you have 3 boxes, since you run 3 reduce tasks?
What happens is that 3 splits of your data are sorted. In the very
end you will get as many output files as you have reduce tasks.

The sorting itself happens in memory.
Check hadoop-default.xml (it may be inside the hadoop jar) for
  io.sort.factor
and
  io.sort.mb
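
You can also raise those limits in the job configuration itself; a sketch only 
(the values 25 and 200 are illustrative, and an override in hadoop-site.xml works 
just as well):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  public class SortTuning {
    public static Configuration tunedConf() {
      Configuration conf = NutchConfiguration.create();
      conf.set("io.sort.factor", "25"); // streams merged at once during the sort
      conf.set("io.sort.mb", "200");    // buffer memory (MB) used while sorting
      return conf;
    }
  }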

HTH
Stefan


On 24.05.2006, at 11:13, Hamza Kaya wrote:


Hi,

I'm trying to crawl approx. 500,000 URLs. After inject and generate I
started fetchers using 6 map tasks and 3 reduce tasks. All the map tasks
completed successfully, while all the reduce tasks got an OutOfMemory
exception. This exception was caught after the append phase (during the
sort phase).

As far as I observed, during a fetch operation all the map tasks write
their output to a temporary sequence file. During the reduce operation,
each reducer copies all map outputs to its local disk and appends them to
a single sequence file. After this the reducer tries to sort that file and
writes the sorted file to its local disk. Then a record writer is opened
to write the sorted file to the segment, which is in DFS.

If this scenario is correct, then all the reduce tasks are supposed to do
the same job: all try to sort the whole map output, and the winner of this
operation gets to write to DFS. So only one reducer is expected to write
to DFS. If this is the case, then an OutOfMemory exception is not
surprising for 500,000+ URLs, since the reducers will try to sort a file
bigger than 1 GB. Any comments on this scenario are welcome. And how can I
avoid these exceptions? Thanks,

--
Hamza KAYA




Re: Mailing List nutch-agent Reports of Bots Submitting Forms

2006-05-30 Thread Doug Cutting

Ken Krugler wrote:

2. Are the Nutch devs replying to the emails sent to this list? I could
understand if they are replying off-list, but to an outside observer such
as myself it appears as though webmasters are not getting many replies to
their inquiries.



I can speak for myself only .. I'm not tracking that list. What about 
others?


Folks who are running a nutch-based crawler that provides this email 
address as the contact address should subscribe to this list and respond 
to messages, especially those which may have been caused by their 
crawler.  Others are also encouraged to subscribe and help respond to 
messages here, as a bad reputation for the crawler affects the whole 
project.  This list is actually fairly low-volume.


This brings up an issue I've been thinking about. It might make sense to 
require everybody set the user-agent string, versus it having default 
values that point to Nutch.


The first time you run Nutch, it would display an error re the 
user-agent string not being set, but if the instructions for how to do 
this were explicit, this wouldn't be much of a hardship for anybody 
trying it out.


+1

That would be a better solution.

Doug


Re: Extract infos from documents and query external sites

2006-05-30 Thread Stefan Groschupf

Think about using the google API.

However, the way to go could be:

+ fetch your pages
+ do not parse the pages
+ write a map-reduce job that extracts your data
++ build an XHTML DOM from the HTML, e.g. using NekoHTML
++ use XPath queries to extract your data (see the sketch after this list)
++ also check out GATE as a named-entity extraction tool, to extract
names based on patterns and heuristics
++ write the names to a file

+ build your query URLs
+ inject the query URLs into an empty crawl db
+ create a segment, fetch it, and update the segment against a second
empty crawl database

+ remove the first segment and db
+ create a segment with your second db and fetch it.
Your second segment will then contain only the paper pages.
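
A rough sketch of the DOM + XPath step, assuming NekoHTML's DOMParser and the
JDK 1.5 XPath API are available (the //H1 expression is only an example;
NekoHTML upper-cases element names by default):

  import java.io.StringReader;
  import javax.xml.xpath.XPath;
  import javax.xml.xpath.XPathConstants;
  import javax.xml.xpath.XPathFactory;
  import org.cyberneko.html.parsers.DOMParser;
  import org.w3c.dom.Document;
  import org.w3c.dom.NodeList;
  import org.xml.sax.InputSource;

  public class XPathExtractor {
    /** Prints the text of all H1 elements in the given raw HTML. */
    public static void printH1s(String html) throws Exception {
      DOMParser parser = new DOMParser();  // NekoHTML: tolerant HTML -> DOM
      parser.parse(new InputSource(new StringReader(html)));
      Document doc = parser.getDocument();

      XPath xpath = XPathFactory.newInstance().newXPath();
      NodeList nodes = (NodeList) xpath.evaluate("//H1", doc, XPathConstants.NODESET);
      for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getTextContent());
      }
    }
  }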

HTH
Stefan




On 30.05.2006, at 12:14, HellSpawn wrote:



I'm working on a search engine for my university and they want me to do that
to create a repository of scientific articles on the web :D

I read something about XPath for extracting exact parts from a document; once
that is done, building the query is very easy, but my doubts are about how to
insert all of this into the Nutch crawler...

Thank you
--
View this message in context: http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272

Sent from the Nutch - Dev forum at Nabble.com.






Re: JVM error while parsing

2006-05-30 Thread Stefan Groschupf

Hi,
I heard there is a bug in JVM 1.5_06 beta. Can you try an older or
maybe a 1.4 JVM and report whether this happens with another JVM as well?

Thanks,
Stefan

On 30.05.2006, at 14:14, Uygar Yüzsüren wrote:


Hi everyone,

I am using Hadoop-0.2.0 and Nutch-0.8, and at the moment trying to complete
a 1-depth crawl by using DFS and mapreduce structures. However, after a
fetch step, I encounter the below JVM exception at one or more task
trackers during the parsing step. It does not differ whether I use only the
default parsers or I also use the additional ones (pdf, excel etc.). My
task trackers run on AMD X2 64-bit machines and my JVM version is 1.5_06.

Have you ever faced such a problem at the parse stage? Or how do you think
I can spot the cause of this JVM exception? The error report is:

060530 144113 task_0007_m_10_0  Using Signature impl: org.apache.nutch.crawl.MD5Signature
060530 144113 task_0007_m_10_0  5.0391704E-6%/crawl/segments/20060521171305/content/part-4/data:0+12303612
060530 144114 task_0007_m_10_0  Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060530 144114 task_0007_m_07_0  0.084114%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0  0.09551566%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An unexpected error has been detected by HotSpot Virtual Machine:
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 #  SIGSEGV (0xb) at pc=0x003d1d247c10, pid=25093, tid=182894086496
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # Java VM: Java HotSpot(TM) 64-Bit Server VM (1.5.0_06-b05 mixed mode)
060530 144115 task_0007_m_07_0 # Problematic frame:
060530 144115 task_0007_m_07_0 # C  [libc.so.6+0x47c10]  printf_size+0x740
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An error report file with more information is saved as hs_err_pid25093.log
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # If you would like to submit a bug report, please visit:
060530 144115 task_0007_m_07_0 #   http://java.sun.com/webapps/bugreport/crash.jsp
060530 144115 task_0007_m_07_0 #
060530 144115 Server connection on port 51950 from 192.168.15.61: exiting
060530 144115 task_0007_m_07_0 Child Error
java.io.IOException: Task process exit with nonzero status of 134.
   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)

Thank you very much.




[jira] Updated: (NUTCH-283) If the Fetcher times out and abandons Fetcher Threads, severe errors will occur on those Threads

2006-05-30 Thread Scott Ganyo (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-283?page=all ]

Scott Ganyo updated NUTCH-283:
--

Attachment: patch.txt

There was a typo in the earlier patch.  This patch supersedes the first patch.

> If the Fetcher times out and abandons Fetcher Threads, severe errors will 
> occur on those Threads
> 
>
>  Key: NUTCH-283
>  URL: http://issues.apache.org/jira/browse/NUTCH-283
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Scott Ganyo
>  Attachments: patch.txt, patch.txt
>
> If a Fetcher has chosen to time out and has abandoned outstanding Fetcher 
> Threads, resources that those Fetcher Threads may be using are closed.  This 
> naturally causes any abandoned Fetcher Threads to fail when they later 
> attempt to finish up their work in progress.
> I have a patch that addresses this that I am attaching.




RE: NPE When using a merged segment

2006-05-30 Thread Gal Nitzan
I was about to look into it, but wasn't sure which variable holds the new
segment name that should replace the old "segment" value :) Lucky for me you
read this email... :)



-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 30, 2006 6:31 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE When using a merged segment

Gal Nitzan wrote:
> I think it is a bug. It saves the old segment name instead of replacing it
> with the new segment name
>
>   


I confirm, this is a bug - I forgot that Indexer relies on this metadata 
... I'll fix it in a moment - sorry for the trouble!

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






Re: NPE When using a merged segment

2006-05-30 Thread Andrzej Bialecki

Gal Nitzan wrote:

I think it is a bug. It saves the old segment name instead of replacing it
with the new segment name

  



I confirm, this is a bug - I forgot that Indexer relies on this metadata 
... I'll fix it in a moment - sorry for the trouble!


--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: NPE When using a merged segment

2006-05-30 Thread Gal Nitzan
I think it is a bug. It saves the old segment name instead of replacing it
with the new segment name

-Original Message-
From: Dominik Friedrich [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 29, 2006 7:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: NPE When using a merged segment

I have the same problem with a merged segment. I had a look with luke at 
the index and it seems that the indexer puts the old segment names in 
there instead of the name of the merged segment. I'm not sure if I did 
something wrong or if this is a bug.

Dominik

Gal Nitzan schrieb:
> Hi,
>
> I have built a new index based on the new segment only.
>
>
>
> -Original Message-
> From: Stefan Neufeind [mailto:[EMAIL PROTECTED] 
> Sent: Monday, May 29, 2006 10:03 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: NPE When using a merged segment
>
> Gal Nitzan wrote:
>   
>> Hi,
>>
>> After using mergesegs to merge all my segments to one segment only, I moved
>> the new segment to segments.
>>
>> When accessing the web UI I get:
>>
>> java.lang.RuntimeException: java.lang.NullPointerException
>>   org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>>   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>>   org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:175)
>
> Hi,
>
> I'm not sure - but have you tried reindexing that new segment? To my
> understanding the index holds references to the segment (segment-name)
> - and in your case those are invalid. This would also explain the error
> you get (in call to getSummary) because the summary is fetched from the
> segment.
>
> If this works, then maybe you'll need to find a better way of cleaning
> up the index - not reindexing everything but maybe just rewriting the
> segment-names all into one or so.
>
> Feedback welcome.
>
>
> Good luck,
>  Stefan
>
>
>
>
>   






JVM error while parsing

2006-05-30 Thread Uygar Yüzsüren

Hi everyone,

I am using Hadoop-0.2.0 and Nutch-0.8, and at the moment trying to complete
a 1-depth crawl by using DFS and mapreduce structures. However, after a
fetch step, I encounter the below JVM exception at one or more task trackers
during the parsing step. It does not differ whether I use only the default
parsers or I also use the additional ones (pdf, excel etc.). My task
trackers run on AMD X2 64-bit machines and my JVM version is 1.5_06.

Have you ever faced such a problem at the parse stage? Or how do you think
I can spot the cause of this JVM exception? The error report is:

060530 144113 task_0007_m_10_0  Using Signature impl:
org.apache.nutch.crawl.MD5Signature
060530 144113 task_0007_m_10_0
5.0391704E-6%/crawl/segments/20060521171305/content/part-4/data:0+12303612
060530 144114 task_0007_m_10_0  Using URL normalizer:
org.apache.nutch.net.BasicUrlNormalizer
060530 144114 task_0007_m_07_0
0.084114%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0
0.09551566%/crawl/segments/20060521171305/content/part-00011/data:0+12493176
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An unexpected error has been detected
by HotSpot Virtual Machine:
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 #  SIGSEGV (0xb) at
pc=0x003d1d247c10, pid=25093, tid=182894086496
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # Java VM: Java HotSpot(TM) 64-Bit Server
VM (1.5.0_06-b05 mixed mode)
060530 144115 task_0007_m_07_0 # Problematic frame:
060530 144115 task_0007_m_07_0 # C  [libc.so.6+0x47c10]
printf_size+0x740
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # An error report file with more
information is saved as hs_err_pid25093.log
060530 144115 task_0007_m_07_0 #
060530 144115 task_0007_m_07_0 # If you would like to submit a bug
report, please visit:
060530 144115 task_0007_m_07_0 #
http://java.sun.com/webapps/bugreport/crash.jsp
060530 144115 task_0007_m_07_0 #
060530 144115 Server connection on port 51950 from 192.168.15.61: exiting
060530 144115 task_0007_m_07_0 Child Error
java.io.IOException: Task process exit with nonzero status of 134.
   at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:242)
   at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:145)


Thank you very much.


Re: Extract infos from documents and query external sites

2006-05-30 Thread HellSpawn

I'm working on a search engine for my university and they want me to do that
to create a repository of scientific articles on the web :D

I read something about XPath for extracting exact parts from a document; once
that is done, building the query is very easy, but my doubts are about how to
insert all of this into the Nutch crawler...

Thank you
--
View this message in context: 
http://www.nabble.com/Extract+infos+from+documents+and+query+external+sites-t1675003.html#a4624272
Sent from the Nutch - Dev forum at Nabble.com.



[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-05-30 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12413778 ] 

Stefan Neufeind commented on NUTCH-292:
---

That patch is for the 0.7 branch, right? In 0.8-dev you'd want to do that in 
BasicSummarizer.java. But to me it looks like something similar is already in 
place:

// Iterate through as long as we're before the end of
// the document and we haven't hit the max number of items
// in a summary.
//
while ((j < endToken) && (j - startToken < sumLength)) {

But I also suspect it might have something to do with tokens. What I have 
experienced is that several search results currently contain arbitrary binary 
data. Those are the cases where a parser plugin has "failed" and where 
parse-text was used as a fallback. If I'm right, this might lead to quite 
large tokens, because no whitespace is found in a long run of characters.
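
If that is indeed the cause, one crude guard would be to drop abnormally long
tokens before summarizing. A sketch only, not what BasicSummarizer currently
does; the 1000-character cutoff is arbitrary:

  // Skip tokens that are implausibly long (typical of undecoded binary content).
  private static final int MAX_TOKEN_CHARS = 1000; // arbitrary cutoff for illustration
  private boolean looksLikeText(Token token) {
    return token.termText().length() <= MAX_TOKEN_CHARS;
  }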

@Marcel: Thank you for the fix anyway ... your help is very much appreciated.





[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-05-30 Thread Stefan Neufeind (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12413780 ] 

Stefan Neufeind commented on NUTCH-290:
---

The plugin itself IMHO works fine now. It does not throw an exception anymore 
and, if extraction is allowed, outputs text correctly.
However, I still get the "garbage output" from a PDF. Could that be due to the 
fact that, in case no extraction is allowed (an empty parse text is returned), 
the parser will still fall back to using the raw text for indexing?

What I did was delete crawl_parse and parse_* from the segments directory, 
run "nutch parse", and reindex everything. However, the raw chars in the 
search output (summary) remain. :-((

> parse-pdf: Garbage indexed when text-extraction not allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf
