Hi Mauricio,

I wonder how you can use the HBase 0.94 series. Nutch uses Gora 0.3, but Gora doesn't support HBase 0.94.x. You can only use HBase 0.90.x.
IMHO your problem may be the HBase version.

On 30 Jan 2014 19:55, "Tejas Patil" <tejas.patil...@gmail.com> wrote:

> Strange.
> Is it possible for you to share nutch-site.xml and nutch-default.xml here?
> You can upload those somewhere and share the link here so that you have
> control over it and can delete those whenever you want to.
> Also, can you please check if you have NUTCH_HOME and NUTCH_CONF_DIR
> mistakenly exported and pointing to some weird location?
> How are you invoking Nutch? Which command were you running?
>
> On Thu, Jan 30, 2014 at 9:49 PM, Ciprian Rodriguez, Mauricio <
> mauricio.cipr...@atos.net> wrote:
>
> > Thanks Tejas.
> >
> > Yes, the regex-urlfilter.txt is present. This is the content of that file:
> > ...
> > # skip file: ftp: and mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > # for a more extensive coverage use the urlfilter-suffix plugin
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > -[?*!@=]
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >
> > # accept anything else
> > +.
> >
> > Kind Regards,
> >
> > Mauricio
> > -----Original Message-----
> > From: Tejas Patil [mailto:tejas.patil...@gmail.com]
> > Sent: Thursday, January 30, 2014 4:51 PM
> > To: user@nutch.apache.org
> > Subject: Re: regex-normalize.xml/regex-urlfilter.txt not found
> >
> > Can you confirm if 'regex-urlfilter.txt' is present inside 'conf'
> > directory at the location where you are running the crawler? If so, what
> > are the contents of that file?
> >
> > Thanks,
> > Tejas
> >
> > On Thu, Jan 30, 2014 at 9:06 PM, Ciprian Rodriguez, Mauricio <
> > mauricio.cipr...@atos.net> wrote:
> >
> > > Hi.
> > >
> > > I'm developing Java software that uses Nutch (2.2.1)+Hbase(0.94.16)
> > > integration. I'm getting a NullPointerException in
> > > org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:179).
> > > I assume this error is related to the following warnings in the log:
> > >
> > > ....
> > > Jan 30, 2014 12:47:10 AM org.apache.hadoop.conf.Configuration getConfResourceAsReader
> > > INFO: regex-normalize.xml not found
> > > Jan 30, 2014 12:47:10 AM org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer setConf
> > > WARNING: Can't load the default rules!
> > > Jan 30, 2014 12:47:10 AM org.apache.hadoop.conf.Configuration getConfResourceAsReader
> > > INFO: regex-urlfilter.txt not found
> > > Jan 30, 2014 12:47:10 AM org.apache.hadoop.mapred.FileOutputCommitter cleanupJob
> > > WARNING: Output path is null in cleanup ....
> > >
> > > Both files are included in the $NUTCH_HOME/conf folder, and both files
> > > are correctly configured in nutch-default.xml:
> > >
> > > ...
> > > <property>
> > > <name>urlnormalizer.regex.file</name>
> > > <value>regex-normalize.xml</value>
> > > <description>Name of the config file used by the RegexUrlNormalizer
> > > class.
> > > </description>
> > > </property>
> > > ...
> > >
> > > <property>
> > > <name>urlfilter.regex.file</name>
> > > <value>regex-urlfilter.txt</value>
> > > <description>Name of file on CLASSPATH containing regular expressions
> > > used by urlfilter-regex (RegexURLFilter) plugin.</description>
> > > </property>
> > >
> > > I don't understand why Nutch doesn't find those files; everything
> > > seems to be in the correct place. Could you help me with this error?
> > > Thanks in advance.
> > >
> > > Kind Regards,
> > >
> > > Mauricio Ciprián Rodríguez
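
The property description quoted above says the filter file is looked up "on CLASSPATH", so when Nutch is embedded in another Java program it may help to check whether the two files are visible to the class loader of that program, independent of what sits in $NUTCH_HOME/conf. Below is a minimal sketch using only plain JDK calls. The class name ClasspathCheck and its helper methods are hypothetical, and the lookup through the thread context class loader is an assumption about how the resources get resolved, not Nutch's or Hadoop's actual code.

```java
import java.net.URL;

// Hypothetical helper, not part of Nutch or Hadoop: reports whether a
// named resource is visible to the current thread's class loader.
class ClasspathCheck {

    // Returns the URL the resource resolves to, or null if it is not found.
    static URL locate(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            // Fall back to the system class loader if no context loader is set.
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(name);
    }

    static boolean isOnClasspath(String name) {
        return locate(name) != null;
    }

    public static void main(String[] args) {
        // The two file names reported as "not found" in the log above.
        String[] names = {"regex-normalize.xml", "regex-urlfilter.txt"};
        for (String name : names) {
            URL url = locate(name);
            System.out.println(name + " -> " + (url == null ? "NOT on classpath" : url));
        }
    }
}
```

Running this with the same classpath as the application shows where each file is picked up from, if at all. If both come back as not found, adding the Nutch conf directory itself (not the individual files) to the classpath is one thing to try.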