date:20060807

Patch: deflate encoding

2006-08-07 Thread Pascal Beis

Hi all,

I've added support for deflate encoding (next to gzip) to Nutch. Is there interest in 
including this in the main source repository? 

Patch attached.

Cheers

Pascal
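
The patch itself is not reproduced in this thread. As a rough illustration only, decoding a body sent with "Content-Encoding: deflate" in Java could look like the sketch below. The class and method names are hypothetical and are not taken from the patch or from Nutch. The fallback to raw deflate is an assumption based on the common observation that some servers omit the zlib wrapper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper; not the code from the attached patch. */
final class DeflateContent {

  /** Inflates a body that was sent with "Content-Encoding: deflate". */
  static byte[] inflate(byte[] compressed) throws IOException {
    try {
      // First assume the zlib-wrapped format (RFC 1950).
      return inflate(compressed, false);
    } catch (IOException e) {
      // Assumption: fall back to raw deflate (RFC 1951) with no zlib header.
      return inflate(compressed, true);
    }
  }

  private static byte[] inflate(byte[] compressed, boolean nowrap) throws IOException {
    Inflater inflater = new Inflater(nowrap);
    try (InputStream in =
        new InflaterInputStream(new ByteArrayInputStream(compressed), inflater)) {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } finally {
      inflater.end(); // release native zlib resources
    }
  }
}
```

A caller would pass the raw response bytes to DeflateContent.inflate(...) whenever the Content-Encoding header is "deflate", next to the existing gzip handling.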

Re: Patch: deflate encoding

2006-08-07 Thread Dawid Weiss



I believe both deflate and gzip (as well as zip) are included as servlet 
filters in:


http://sourceforge.net/projects/pjl-comp-filter/

Dawid

Pascal Beis wrote:

Hi all,

I've added support for deflate encoding (next to gzip) to Nutch. Is there
interest in
including this in the main source repository?

Patch attached.

Cheers

Pascal

Re: [Fwd: Re: 0.8 Recrawl script updated]

2006-08-07 Thread Lukas Vlcek

Hi again,

I just found related discussion here:
http://www.nabble.com/NullPointException-tf2045994r1.html

I think these guys are discussing a similar problem, and if I understood
the conclusion correctly, the only solution right now is to write
some code that tests which segments are used in the index and which are not.
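
To make the idea concrete, here is a minimal sketch of such a test. It does not use the Nutch or Lucene APIs: it assumes the URLs of each segment have already been extracted into plain Java sets, and that segment names are timestamps so that sorted order equals age. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;

/** Hypothetical sketch; operates on pre-extracted URL sets, not on real segments. */
final class SegmentCoverage {

  /**
   * @param urlsBySegment segment name (timestamp) mapped to the URLs it contains
   * @return segments whose every URL also appears in some newer segment,
   *         listed oldest first
   */
  static List<String> droppable(SortedMap<String, Set<String>> urlsBySegment) {
    List<String> names = new ArrayList<>(urlsBySegment.keySet());
    Collections.reverse(names); // walk from newest to oldest

    Set<String> seenInNewer = new HashSet<>();
    List<String> result = new ArrayList<>();
    boolean newest = true;
    for (String name : names) {
      Set<String> urls = urlsBySegment.get(name);
      // The newest segment is never dropped; older ones only if fully covered.
      if (!newest && seenInNewer.containsAll(urls)) {
        result.add(name);
      }
      seenInNewer.addAll(urls);
      newest = false;
    }
    Collections.reverse(result); // report oldest first
    return result;
  }
}
```

With -topN or failed fetches, an old segment may hold pages that no newer segment contains, and in that case this check would keep it.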

Regards,
Lukas

On 8/4/06, Lukas Vlcek [EMAIL PROTECTED] wrote:

Matthew,

In fact, I didn't realize you were merging the segments (sorry for that).
Frankly, I don't know exactly how merging works, whether this strategy
holds up in the long run, or whether it is a universal approach for
all the cases that can occur during crawling (-topN, frozen threads,
unavailable pages, a crawl that dies, etc.). Maybe it is the correct
path. I would appreciate it if anybody could answer this question
precisely.

Thanks,
Lukas

On 8/4/06, Matthew Holt [EMAIL PROTECTED] wrote:
If anyone doesn't mind taking a look...

-- Forwarded message --
From: Matthew Holt [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Date: Fri, 04 Aug 2006 10:07:57 -0400
Subject: Re: 0.8 Recrawl script updated
Lukas,
Thanks for your e-mail. I assumed I could drop the $depth oldest
segments because I first merged them all into one segment (which
I don't drop). Am I incorrect in my assumption, and can this cause
problems in the future? If so, then I'll go back to the original version
of my script, in which I kept all the segments without merging. However,
if that is the case, it seems it will become a problem after enough
recrawls because of the large number of segments being kept.

Thanks,
Matt

Lukas Vlcek wrote:
Hi Matthew,

I am curious about one thing. How do you know you can just drop the
$depth oldest segments at the end? I haven't studied the Nutch
code regarding this topic yet, but I thought that a segment can be
dropped only once you are sure that all its content has already been
crawled in some newer segments (which should be checked somehow via some
function/script, which hasn't been implemented yet to my knowledge).

Also, I don't think this question has been discussed on the dev/user lists
in detail yet, so I just wanted to ask your opinion. The
situation could get even more complicated if people add the -topN
parameter to the script (which can happen because some might prefer
crawling in ten smaller batches over two huge crawls for various
technical reasons).

Anyway, never mind if you don't want to bother about my silly question
:-)

Regards,
Lukas

On 8/4/06, Matthew Holt [EMAIL PROTECTED] wrote:
Last email regarding this script. I found a sporadic bug in it
(I think it only affects some setups). However, since it would be
a problem sometimes, I refactored the script. I'd suggest you redownload
the script if you are using it.

Matt

Matthew Holt wrote:
I'm currently pretty busy at work. If I have time, I'll do it later.

The version 0.8 recrawl script has a working version online now. I
temporarily modified it on the website yesterday when I ran into some
problems, but I further tested it and the actual working code is
modified now. So if you got it off the web site any time yesterday, I
would redownload the script.

Matt

Lourival Júnior wrote:
Hi Matthew!

Could you update the script for version 0.7.2 with the same
functionality? I wrote a script that does this, but it doesn't work
very well...

Regards!

On 8/2/06, Matthew Holt [EMAIL PROTECTED] wrote:

Just letting everyone know that I updated the recrawl script on the
Wiki. It now merges the created segments, then deletes the old
segments to prevent a lot of unneeded data remaining/growing on the
hard drive.
Matt

http://wiki.apache.org/nutch/IntranetRecrawl?action=show#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03

[jira] Updated: (NUTCH-266) hadoop bug when doing updatedb

2006-08-07 Thread Renaud Richardet (JIRA)

 [ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Renaud Richardet updated NUTCH-266:
---

Attachment: patch_hadoop-0.5.0.diff

Now that Hadoop 0.5 has been released, here's the patch to use hadoop-0.5.0.jar 
in Nutch-0.8.x
HTH,
Renaud

 hadoop bug when doing updatedb
 --

 Key: NUTCH-266
 URL: http://issues.apache.org/jira/browse/NUTCH-266
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
 Fix For: 0.9.0, 0.8.1

 Attachments: patch.diff, patch_hadoop-0.5.0.diff


 I constantly get the following error message
 060508 230637 Running job: job_pbhn3t
 060508 230637 
 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
 060508 230637 job_pbhn3t
 java.io.IOException: Target 
 /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
 at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
 at 
 org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
 Exception in thread main java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Created: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

2006-08-07 Thread Greg Kim (JIRA)

Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
-

 Key: NUTCH-344
 URL: http://issues.apache.org/jira/browse/NUTCH-344
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0
 Environment: All
Reporter: Greg Kim
 Attachments: cleanExpiredServerBlocks.patch

A recent change to the following code in HttpBase.java tends to 
block fetcher threads while one thread busy-waits... 

  private static void cleanExpiredServerBlocks() {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <= LINE 3
        String host = (String) BLOCKED_ADDR_QUEUE.getLast();
        long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
        if (time <= System.currentTimeMillis()) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          BLOCKED_ADDR_QUEUE.removeLast();
        }
      }
    }
  }

LINE 3:  As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the 
thread that first enters this block busy-waits until it becomes empty while all 
other threads block on the synchronized block.  This leads to extremely poor 
fetcher performance.  

Since the check-in to respect crawlDelay in robots.txt, we are no longer 
guaranteed that the BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix is to 
iterate the queue once rather than busy-waiting...
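
The attached patch is not shown here. As an illustration only, a single-pass cleanup could look like the sketch below. It reuses the two field names from the snippet above but is otherwise a self-contained, hypothetical class: the generics, the block(...) and isBlocked(...) helpers, and the explicit "now" parameter are assumptions made for the example and are not taken from HttpBase.java.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

/** Hypothetical stand-in for the relevant part of HttpBase; not the actual patch. */
final class ServerBlocks {
  private static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
  private static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

  /** Records that a host is blocked until the given expiry time (ms). */
  static void block(String host, long expiryMillis) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      BLOCKED_ADDR_TO_TIME.put(host, expiryMillis);
      BLOCKED_ADDR_QUEUE.addFirst(host);
    }
  }

  /** Visits every queued host exactly once and removes the expired ones. */
  static void cleanExpiredServerBlocks(long now) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      for (Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator(); it.hasNext();) {
        String host = it.next();
        Long time = BLOCKED_ADDR_TO_TIME.get(host);
        if (time == null || time <= now) {
          BLOCKED_ADDR_TO_TIME.remove(host);
          it.remove(); // safe removal while iterating
        }
        // Unexpired entries are simply skipped; the loop never waits on them.
      }
    }
  }

  static boolean isBlocked(String host) {
    synchronized (BLOCKED_ADDR_TO_TIME) {
      return BLOCKED_ADDR_TO_TIME.containsKey(host);
    }
  }
}
```

Because the loop ends after one pass regardless of queue order, the lock is held only briefly even when the queue is not FIFO.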


[jira] Commented: (NUTCH-266) hadoop bug when doing updatedb

2006-08-07 Thread KuroSaka TeruHiko (JIRA)

[ 
http://issues.apache.org/jira/browse/NUTCH-266?page=comments#action_12426377 ] 

KuroSaka TeruHiko commented on NUTCH-266:
-

Renaud, thank you for posting the patch.  Is there a patched version of hadoop 
jar file (precompiled) that I can download?



Re: Patch: deflate encoding

2006-08-07 Thread ogjunk-nutch

Yes, yes!

Otis 

- Original Message  
From: Pascal Beis  
To: nutch-dev@lucene.apache.org 
Sent: Monday, August 7, 2006 4:17:33 AM 
Subject: Patch: deflate encoding 

Hi all, 

 I've added support for deflate encoding (next to gzip) to Nutch. Is there 
interest in  
 including this in the main source repository?  

 Patch attached. 

 Cheers 

 Pascal

Re: Patch: deflate encoding

2006-08-07 Thread ogjunk-nutch

Pascal,

Forgot to say - attachments get stripped.  Please put them in JIRA.

Thanks,
Otis


- Original Message 
From: Pascal Beis [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Monday, August 7, 2006 4:17:33 AM
Subject: Patch: deflate encoding

Hi all,
 
 I've added support for deflate encoding (next to gzip) to Nutch. Is there 
interest in 
 including this in the main source repository? 
 
 Patch attached.
 
 Cheers
 
 Pascal

Re: Patch: deflate encoding

2006-08-07 Thread Dawid Weiss



Just to complete this thread (for the archives :), Deflater in the JDK 
has a... feature -- flush() is basically not implemented and thus 
nonfunctional on compressed streams. This is a known limitation (the Bug 
Parade lists it as an 'enhancement request', although for most people 
who have faced the problem it'll be a plain bug).


Anyway, the workaround is to use a custom deflater (such as zlib) and 
perform Z_SYNC_FLUSH, which pads the compressed stream to a complete 
block and allows flushing the content. This way you can flush a partial 
compressed stream to the browser (for those DHTML lovers who like to 
play with JavaScript, for instance).
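
For readers on later JDKs (Java 7 onward), the standard library exposes a sync-flush mode directly, which gives a similar effect without a custom deflater. The sketch below is not the JZlib-based approach described in this message; it only illustrates what a sync flush changes. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Hypothetical demo; requires Java 7 or later for the syncFlush constructor. */
final class SyncFlushDemo {

  /**
   * Writes a small payload, calls flush(), and reports how many compressed
   * bytes have reached the underlying sink before the stream is closed.
   */
  static int bytesAfterFlush(boolean syncFlush) throws IOException {
    ByteArrayOutputStream sink = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(sink, syncFlush);
    gz.write("partial page content".getBytes(StandardCharsets.UTF_8));
    gz.flush(); // with syncFlush=true this forces pending data out of the deflater
    int visible = sink.size();
    gz.close();
    return visible;
  }
}
```

Comparing bytesAfterFlush(true) with bytesAfterFlush(false) shows that only the sync-flush variant pushes the pending payload to the sink when flush() is called.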


We implemented a fixed GZIP/Deflate compression based on JZlib and 
PJL-comp-filter (which in turn we changed slightly to compile under 
JDK1.4). If you're interested, sources are in Carrot2 SVN.


https://svn.sourceforge.net/svnroot/carrot2/trunk/carrot2/components/carrot2-util-gzip/

Dawid

Dawid Weiss wrote:


I believe both deflate and gzip (as well as zip) are included as servlet 
filters in:


http://sourceforge.net/projects/pjl-comp-filter/

Dawid

Pascal Beis wrote:

Hi all,

I've added support for deflate encoding (next to gzip) to Nutch. Is there
interest in
including this in the main source repository?

Patch attached.

Cheers

Pascal
