Nutch trunk IndexWriter Plugin

2013-05-27 Thread AC Nutch
Hi All, I'm using Nutch 1.7 (trunk) and writing a plugin to index to HBase (using Nutch 2.1 is not an option; I had to use 1.7 and write an indexer myself). I believe I'm well on my way, but I have a few questions. My first step in the process was to make sure that the NutchDocument held the fi

Re: handshake alert:unrecognized_name----problems with ssl using https conection

2013-05-27 Thread Tejas Patil
I am not 100% sure if this would work, but you can try passing -Djsse.enableSNIExtension=false to the fetch command. On Mon, May 27, 2013 at 3:22 PM, Tejas Patil wrote: > Which version of Java are you using? This problem is seen with Java 7 > [0]. Downgrading to Java 6 might help you. > > [0]
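
For readers who cannot easily change the JVM arguments of the fetch command, a minimal sketch of the same idea in code follows. It sets the jsse.enableSNIExtension system property programmatically, which should be equivalent to the -D flag as long as it runs before the JVM opens its first HTTPS connection. The class name SniToggle and the helper disableSni are hypothetical and are not part of Nutch; whether this actually resolves the unrecognized_name alert for a given server is untested here.

```java
public class SniToggle {

    // Hypothetical helper (not a Nutch API): mirrors passing
    // -Djsse.enableSNIExtension=false on the command line.
    // Assumption: it is called before any SSL/TLS classes are initialized,
    // because the JVM is expected to read this property only once.
    static void disableSni() {
        System.setProperty("jsse.enableSNIExtension", "false");
    }

    public static void main(String[] args) {
        disableSni();
        // Print the effective value so the setting can be verified in logs.
        System.out.println("jsse.enableSNIExtension="
                + System.getProperty("jsse.enableSNIExtension"));
    }
}
```

Passing the flag on the command line, as suggested above, remains the simpler option because it guarantees the property is in place before any networking code loads.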

Re: Including urls in a nutch crawl that have previously been excluded (nutch 2.1)

2013-05-27 Thread Tejas Patil
The parser has already filtered out the unwanted urls as per the old regex rules, so "update" will not get those urls. Run "bin/nutch parse -all -force" to reparse the segments with the new regexes and then try what you did earlier, i.e. update -> generate -> fetch etc. On Mon, May 27, 2013 at 2:4

Re: handshake alert:unrecognized_name----problems with ssl using https conection

2013-05-27 Thread Tejas Patil
Which version of Java are you using? This problem is seen with Java 7 [0]. Downgrading to Java 6 might help you. [0] : http://stackoverflow.com/questions/7615645/ssl-handshake-alert-unrecognized-name-error-since-upgrade-to-java-1-7-0 On Mon, May 27, 2013 at 8:33 AM, Eyeris Rodriguez Rueda wrote

Re: Fetcher corrupting some segments

2013-05-27 Thread Sebastian Nagel
Hi Markus, a similar problem was posted some time ago: http://lucene.472066.n3.nabble.com/NegativeArraySizeException-and-quot-problem-advancing-port-rec-quot-during-fetching-tt3994633.html#a3996554 Sebastian On 05/27/2013 11:06 AM, Markus Jelsma wrote: > Hi, > > For some reason the fetcher som

handshake alert:unrecognized_name----problems with ssl using https conection

2013-05-27 Thread Eyeris Rodriguez Rueda
Hi all, I'm trying to crawl a web site and I get an SSL error. The exception is javax.net.ssl.SSLProtocolException, specifically: 2013-05-27 10:46:06,226 INFO fetcher.Fetcher - fetch of https://cubatravel.cidi.uci.cu:8443/ failed with: javax.net.ssl.SSLProtocolException: handshake alert:

Including urls in a nutch crawl that have previously been excluded (nutch 2.1)

2013-05-27 Thread Nicholas W
I had previously excluded some urls in a nutch crawl to limit the scope of the crawl during testing by including the appropriate regex in the regex-urlfilter.txt file. I would now like to lift those restrictions and have edited regex-urlfilter.txt to allow more urls. However, after executing

Fetcher corrupting some segments

2013-05-27 Thread Markus Jelsma
Hi, For some reason the fetcher sometimes produces corrupt, unreadable segments. It then exits with an exception like "problem advancing post" or "negative array size exception" etc. java.lang.RuntimeException: problem advancing post rec#702 at org.apache.hadoop.mapred.Task$ValuesIterat

Re: OutOfMemoryError for bin/nutch elasticindex -all

2013-05-27 Thread Nicholas W
Dear List, One thing I should add is that it works fine with Solr: bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex loads the index nicely. Thanks again for your suggestions. Regards, Nicholas W. On Thu, May 23, 2013 at 10:47 AM, Nicholas W <4...@log1.net> wrote: > Dear List, >