[jira] [Created] (NUTCH-1331) limit crawler to defined depth
limit crawler to defined depth
------------------------------

Key: NUTCH-1331
URL: https://issues.apache.org/jira/browse/NUTCH-1331
Project: Nutch
Issue Type: New Feature
Components: generator, parser, storage
Affects Versions: 1.4
Reporter: behnam nikbakht

There is a need to limit the crawler to some defined depth. This option matters because it avoids crawling the infinite loops of dynamically generated URLs that occur on some sites, and it helps the crawler concentrate on important URLs. One option is an iteration limit on the generate, fetch, parse, updatedb cycle, but that only works if all unfetched URLs become fetched in each cycle (without recrawling them, and with some other considerations). Instead, we can define a new parameter in CrawlDatum, named depth, compute the depth of a link after parse (much like the OPIC scoring algorithm computes scores), and in generate select only URLs with a valid depth.
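A minimal sketch of how such a depth field could work, assuming a hypothetical metadata key (_depth_) on CrawlDatum and a hypothetical generate.max.depth property; neither exists in Nutch 1.4:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;

public class DepthLimit {
  // illustrative metadata key and property name, not part of Nutch 1.4
  public static final Text DEPTH_KEY = new Text("_depth_");

  /** After parse: an outlink inherits its parent's depth plus one. */
  public static void setOutlinkDepth(CrawlDatum parent, CrawlDatum outlink) {
    MapWritable meta = outlink.getMetaData() != null ? outlink.getMetaData() : new MapWritable();
    meta.put(DEPTH_KEY, new IntWritable(getDepth(parent) + 1));
    outlink.setMetaData(meta);
  }

  /** In generate: select only URLs whose depth is within the configured limit. */
  public static boolean withinDepthLimit(CrawlDatum datum, Configuration conf) {
    int maxDepth = conf.getInt("generate.max.depth", -1);   // -1 means unlimited
    return maxDepth < 0 || getDepth(datum) <= maxDepth;
  }

  private static int getDepth(CrawlDatum datum) {
    Writable w = datum.getMetaData() == null ? null : datum.getMetaData().get(DEPTH_KEY);
    return (w instanceof IntWritable) ? ((IntWritable) w).get() : 0;   // seed URLs default to 0
  }
}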
[jira] [Created] (NUTCH-1329) parser does not extract outlinks to external web sites
parser does not extract outlinks to external web sites
-------------------------------------------------------

Key: NUTCH-1329
URL: https://issues.apache.org/jira/browse/NUTCH-1329
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht

I found a bug in src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: outlinks such as www.example2.com found on www.example1.com are inserted as www.example1.com/www.example2.com. I corrected this by testing whether the outlink (www.example2.com) is a valid URL on its own; if it is not, it is resolved against its base URL as before. So I replaced these lines:

URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));

with:

// if the target has a resolvable domain name of its own, treat it as an external link
String host_temp = null;
try {
  host_temp = URLUtil.getDomainName(new URL(target));
} catch (Exception e) {
  host_temp = null;
}
URL url = null;
if (host_temp == null) {
  // internal outlink: resolve it against the base URL
  url = URLUtil.resolveURL(base, target);
} else {
  // external outlink: keep it as-is
  url = new URL(target);
}
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
[jira] [Created] (NUTCH-1328) a problem with regex-normalize.xml
a problem with regex-normalize.xml
----------------------------------

Key: NUTCH-1328
URL: https://issues.apache.org/jira/browse/NUTCH-1328
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht

There is a regex pattern in regex-normalize.xml:

([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)

that removes session ids from URLs. But some sites, like http://www.mehrnews.com/fa, have URLs such as:

http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539

and with this pattern the URL is converted to an invalid one:

http://www.mehrnews.com/fa/newsdetail.aspx?New

because the case-insensitive "sid" alternative matches the tail of the NewsID parameter name, so "sID=1567539" is stripped.
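A small standalone test that reproduces the truncation (simulating the normalizer by replacing the match with its trailing delimiter) and shows one possible way to anchor the pattern to a parameter boundary; the tightened pattern is only an illustration, not a proposed official fix:

import java.util.regex.Pattern;

public class SessionIdNormalizerTest {
  public static void main(String[] args) {
    String url = "http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539";

    // pattern from regex-normalize.xml: matches "sID=1567539" inside "NewsID"
    // because "sid" is case-insensitive and not anchored to a parameter boundary
    Pattern current = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");
    System.out.println(current.matcher(url).replaceAll("$4"));
    // -> http://www.mehrnews.com/fa/newsdetail.aspx?New

    // one possible tightening (illustrative only): require the parameter name to
    // start right after '?', '&' or ';' so "NewsID" is no longer clipped
    Pattern anchored = Pattern.compile(
        "(?<=[?&;])(((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");
    System.out.println(anchored.matcher(url).replaceAll("$4"));
    // -> http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539 (unchanged)
  }
}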
[jira] [Created] (NUTCH-1309) fetch queue management
fetch queue management
----------------------

Key: NUTCH-1309
URL: https://issues.apache.org/jira/browse/NUTCH-1309
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht

When fetch runs in Hadoop with multiple concurrent mappers, there are multiple independent fetch queues, which makes them hard to manage. I suggest constructing the fetch queues before the run begins, with this line:

feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
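A rough sketch of the suggested ordering inside Fetcher.run(), based on the Nutch 1.4-era inner classes (FetchItemQueues, QueueFeeder, FetcherThread); it is only a fragment, and constructor signatures may differ between versions. The point is to build and fill the shared queues before any fetcher thread starts:

fetchQueues = new FetchItemQueues(getConf());

// fill the shared queues up front so every thread works against the same,
// already-populated queue structure
feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
feeder.start();

// only then spawn the fetcher threads that drain the shared queues
for (int i = 0; i < threadCount; i++) {
  new FetcherThread(getConf()).start();
}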
[jira] [Created] (NUTCH-1303) Fetcher to skip queues for URLs getting repeated exceptions, based on percentage
Fetcher to skip queues for URLs getting repeated exceptions, based on percentage
---------------------------------------------------------------------------------

Key: NUTCH-1303
URL: https://issues.apache.org/jira/browse/NUTCH-1303
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht

As described in https://issues.apache.org/jira/browse/NUTCH-769, skipping queues that accumulate many exceptions is a good solution, but it is not easy to choose a value for fetcher.max.exceptions.per.queue when queue sizes differ. I suggest defining a ratio instead of an absolute value, so that a queue is cleared once its ratio of exceptions to requests exceeds the threshold. Keeping the fetcher away from failing hosts is not the only concern: fetcher.throughput.threshold.pages ensures that a worthwhile fetch throughput is maintained against slow hosts, but when it triggers it clears all queues, not just the slow ones. I suggest that this factor, like fetcher.max.exceptions.per.queue, be enforced per queue rather than across all of them.
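A hypothetical sketch of a per-queue exception-ratio check; the property name fetcher.max.exceptions.ratio.per.queue and the class are illustrative, not existing Nutch configuration:

import org.apache.hadoop.conf.Configuration;

public class QueueExceptionPolicy {
  private final float maxExceptionRatio;

  public QueueExceptionPolicy(Configuration conf) {
    // illustrative property name; 0.3 means "clear the queue once 30% of requests failed"
    this.maxExceptionRatio = conf.getFloat("fetcher.max.exceptions.ratio.per.queue", 0.3f);
  }

  /**
   * @param exceptions number of exceptions seen for this queue so far
   * @param requests   number of fetch attempts made against this queue
   * @return true if the queue should be emptied and skipped
   */
  public boolean shouldClearQueue(int exceptions, int requests) {
    if (requests == 0) return false;                    // nothing attempted yet
    return ((float) exceptions / requests) > maxExceptionRatio;
  }
}

The same ratio-per-queue idea could be applied to the throughput threshold, so that only the slow queue is cleared instead of all of them.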
[jira] [Created] (NUTCH-1297) it is better for FetchItemQueues to select items from larger queues first
it is better for FetchItemQueues to select items from larger queues first
--------------------------------------------------------------------------

Key: NUTCH-1297
URL: https://issues.apache.org/jira/browse/NUTCH-1297
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht

When a fetch covers multiple hosts of very different sizes, URLs from the large hosts can wait a long time before getFetchItem() in the FetchItemQueues class selects them, so we could give the larger queues more priority. For example, with 10 URLs from host1, 1000 URLs from host2 and 5 threads: if all threads first select from host1, the fetch takes longer overall than if the threads first select from host2 and only fall back to host1 while host2 is busy.
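An illustrative sketch of a largest-queue-first selection policy, using simplified stand-in types rather than Nutch's real FetchItemQueue classes:

import java.util.Comparator;
import java.util.List;

class SimpleQueue {
  String queueId;                                   // e.g. the host name
  java.util.Deque<String> urls = new java.util.ArrayDeque<>();
  int inProgress;                                   // fetches currently running for this host
  int maxThreads;                                   // per-host politeness limit

  boolean busy() { return inProgress >= maxThreads; }
}

class LargestFirstSelector {
  /** Returns the next URL to fetch, preferring larger queues, or null if all are empty or busy. */
  static String nextFetchItem(List<SimpleQueue> queues) {
    return queues.stream()
        .filter(q -> !q.busy() && !q.urls.isEmpty())
        .max(Comparator.comparingInt(q -> q.urls.size()))
        .map(q -> { q.inProgress++; return q.urls.poll(); })
        .orElse(null);
  }
}

With this policy the 1000-URL host is drained first, and the small host is only used while the large one is at its per-host thread limit.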
[jira] [Created] (NUTCH-1288) Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages
Generator should not generate filtered, not-found, denied, gone, or permanently-moved pages
--------------------------------------------------------------------------------------------

Key: NUTCH-1288
URL: https://issues.apache.org/jira/browse/NUTCH-1288
Project: Nutch
Issue Type: Bug
Components: fetcher, generator
Affects Versions: 1.4
Reporter: behnam nikbakht

The Generator should not generate pages that are filtered out or whose fetch status is not found, denied, gone, or permanently moved. In the shouldFetch method of AbstractFetchSchedule, the CrawlDatum should be checked against these special fetch states (such as not found) so that such pages are not generated again. We could add a status to CrawlDatum that marks invalid URLs and set it during fetch.
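A sketch of the kind of check that could be hooked into shouldFetch(), using only the CrawlDatum DB status codes that already exist; the dedicated "invalid" status proposed above would be an addition on top of this:

import org.apache.nutch.crawl.CrawlDatum;

public class InvalidUrlCheck {
  /** Returns true if the datum represents a page that should not be generated again. */
  public static boolean isInvalid(CrawlDatum datum) {
    switch (datum.getStatus()) {
      case CrawlDatum.STATUS_DB_GONE:        // not found / denied / gone after retries
      case CrawlDatum.STATUS_DB_REDIR_PERM:  // permanently moved
        return true;
      default:
        return false;
    }
  }
}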
[jira] [Created] (NUTCH-1282) linkdb scalability
linkdb scalability
------------------

Key: NUTCH-1282
URL: https://issues.apache.org/jira/browse/NUTCH-1282
Project: Nutch
Issue Type: Improvement
Components: linkdb
Affects Versions: 1.4
Reporter: behnam nikbakht

As described in NUTCH-1054, the linkdb is optional in solrindex; it is used only for anchors and has no impact on scoring. The linkdb appears to grow very fast in incremental crawls, which makes it unscalable for huge web sites. So there are two choices: either drop invertlinks and the linkdb from the crawl, or make invertlinks scalable. invertlinks runs two jobs: the first constructs a new linkdb from the newly parsed segments, and the second merges the new linkdb with the old one. The second job is the unscalable one, and we can avoid it with this change in solrindex: in the reduce method of IndexerMapReduce, if fetchDatum == null or dbDatum == null or parseText == null or parseData == null, then add the anchors to the document and update Solr (an update, not an insert). Some changes to NutchDocument are also required for this.
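An illustrative fragment of how IndexerMapReduce.reduce() might emit an anchor-only document instead of dropping the key; making Solr treat it as an update rather than a new insert would still need the NutchDocument changes mentioned above, so this only shows the reduce-side part:

// instead of silently returning when fetch/parse data is missing, keep the anchors
if (fetchDatum == null || dbDatum == null || parseText == null || parseData == null) {
  if (inlinks == null || inlinks.size() == 0) {
    return;                                    // nothing useful to index for this key
  }
  NutchDocument anchorOnlyDoc = new NutchDocument();
  anchorOnlyDoc.add("id", key.toString());     // "id" and "anchor" follow the existing schema fields
  Iterator<Inlink> it = inlinks.iterator();
  while (it.hasNext()) {
    anchorOnlyDoc.add("anchor", it.next().getAnchor());
  }
  output.collect(key, anchorOnlyDoc);
  return;
}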
[jira] [Created] (NUTCH-1281) tika parser does not work properly with unwanted file types that pass through the Nutch filters
tika parser does not work properly with unwanted file types that pass through the Nutch filters
------------------------------------------------------------------------------------------------

Key: NUTCH-1281
URL: https://issues.apache.org/jira/browse/NUTCH-1281
Project: Nutch
Issue Type: Improvement
Components: parser
Reporter: behnam nikbakht

When parse-plugins.xml is configured so that all file types not handled by another plugin are referred to Tika, some file types, such as .flv, cause the Tika parser to hang and the parse job to fail. If these file types pass through regex-urlfilter and the other filters, the parse job fails. For this problem I suggest adding a property listing the valid file types and using code like this in TikaParser.java:

public ParseResult getParse(Content content) {
    String mimeType = content.getContentType();
+   String[] validTypes = new String[] { "application/pdf", "application/x-tika-msoffice",
+       "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text", "text/plain",
+       "application/rtf", "application/rss+xml", "application/x-bzip2", "application/x-gzip",
+       "application/x-javascript", "application/javascript", "text/javascript",
+       "application/x-shockwave-flash", "application/zip", "text/xml", "application/xml" };
+   boolean valid = false;
+   for (int k = 0; k < validTypes.length; k++) {
+     if (validTypes[k].equals(mimeType)) valid = true;
+   }
+   // if the type is not in the list, skip Tika for this content (see the sketch below)
    ...
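A hedged sketch of the guard the snippet above leads up to; returning an empty ParseResult for non-listed types is an assumption about the intended behaviour, following the existing error handling already used in TikaParser:

if (!valid) {
  // skip content types Tika is known to hang on instead of failing the whole parse job
  String message = "Skipping content type " + mimeType + " (not in the configured valid types)";
  LOG.warn(message);
  return new ParseStatus(ParseStatus.FAILED, message)
      .getEmptyParseResult(content.getUrl(), getConf());
}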
[jira] [Created] (NUTCH-1278) Fetch Improvement in threads per host
Fetch Improvement in threads per host
-------------------------------------

Key: NUTCH-1278
URL: https://issues.apache.org/jira/browse/NUTCH-1278
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 1.4
Reporter: behnam nikbakht

The value of maxThreads is equal to fetcher.threads.per.host and is constant for every host. It would be possible to use a dynamic value per host, influenced by the number of blocked requests: if the number of blocked requests for a host increases, we should decrease this value for that host and increase http.timeout.
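A hypothetical sketch of a dynamic per-host thread limit; the class and the adjustment rule are illustrative and not part of Nutch. The idea is simply: the more blocked requests a host accumulates, the fewer concurrent threads it is allowed.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AdaptiveHostThreads {
  private final int defaultMaxThreads;                       // e.g. fetcher.threads.per.host
  private final Map<String, Integer> blocked = new ConcurrentHashMap<>();

  public AdaptiveHostThreads(int defaultMaxThreads) {
    this.defaultMaxThreads = defaultMaxThreads;
  }

  /** Record that a request to this host was blocked (connection refused, queue full, ...). */
  public void recordBlocked(String host) {
    blocked.merge(host, 1, Integer::sum);
  }

  /** Halve the allowed threads for every 10 blocked requests, but keep at least one. */
  public int maxThreadsFor(String host) {
    int b = blocked.getOrDefault(host, 0);
    return Math.max(1, defaultMaxThreads >> (b / 10));
  }
}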
[jira] [Created] (NUTCH-1270) some Deflate-encoded pages are not fetched
some Deflate-encoded pages are not fetched
------------------------------------------

Key: NUTCH-1270
URL: https://issues.apache.org/jira/browse/NUTCH-1270
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.4
Environment: software
Reporter: behnam nikbakht

There is a problem with some web pages: they are fetched, but their content cannot be retrieved. After the change below the error is fixed. We change lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java:

public byte[] processDeflateEncoded(byte[] compressed, URL url) throws IOException {
  if (LOGGER.isTraceEnabled()) {
    LOGGER.trace("inflating");
  }
  byte[] content = DeflateUtils.inflateBestEffort(compressed, getMaxContent());
+ // retry with a small explicit size limit before giving up
+ if (content == null)
+   content = DeflateUtils.inflateBestEffort(compressed, 20);
  if (content == null)
    throw new IOException("inflateBestEffort returned null");
  if (LOGGER.isTraceEnabled()) {
    LOGGER.trace("fetched " + compressed.length
        + " bytes of compressed content (expanded to " + content.length
        + " bytes) from " + url);
  }
  return content;
}
[jira] [Created] (NUTCH-1269) Generate main problems
Generate main problems
----------------------

Key: NUTCH-1269
URL: https://issues.apache.org/jira/browse/NUTCH-1269
Project: Nutch
Issue Type: Improvement
Components: generator
Affects Versions: 1.4
Environment: software
Reporter: behnam nikbakht

There are some problems with the current generate method when the maxNumSegments and maxHostCount options are used:
1. the generated segments differ in size
2. with the maxHostCount option, it is unclear whether the limit was actually applied
3. URLs from one host are distributed non-uniformly between segments

We change Generator.java as described below (the rest of the reducer is sketched after this snippet).

In the Selector class:

private int maxNumSegments;
private int segmentSize;
private int maxHostCount;

public void configure(JobConf job) {
  ...
  maxNumSegments = job.getInt(GENERATOR_MAX_NUM_SEGMENTS, 1);
  segmentSize = (int) job.getInt(GENERATOR_TOP_N, 1000) / maxNumSegments;
  maxHostCount = job.getInt("GENERATE_MAX_PER_HOST", 100);
  ...
}

public void reduce(FloatWritable key, Iterator values, OutputCollector output, Reporter reporter)
    throws IOException {
  int limit2 = (int) ((limit * 3) / 2);
  while (values.hasNext()) {
    if (count == limit)
      break;
    // advance the target segment round-robin every segmentSize entries
    if (count % segmentSize == 0) {
      if (currentsegmentnum < maxNumSegments - 1)
        currentsegmentnum++;
      else
        currentsegmentnum = 0;
    }
    boolean full = true;
    for (int jk = 0; jk < maxNumSegments; jk++) {
      // find a segment that still has room for this host's URLs (see the sketch below)
      ...
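A hypothetical sketch of what the truncated part of the reducer could look like: track how many URLs from each host have been assigned to each segment and only emit to a segment that is still below maxHostCount for that host. Names and structure are illustrative, not the reporter's exact code:

import java.util.HashMap;
import java.util.Map;

class HostSegmentBalancer {
  private final int maxNumSegments;
  private final int maxHostCount;
  // hostCounts[segment] maps host -> number of URLs already assigned to that segment
  private final Map<String, Integer>[] hostCounts;

  @SuppressWarnings("unchecked")
  HostSegmentBalancer(int maxNumSegments, int maxHostCount) {
    this.maxNumSegments = maxNumSegments;
    this.maxHostCount = maxHostCount;
    this.hostCounts = new HashMap[maxNumSegments];
    for (int i = 0; i < maxNumSegments; i++) hostCounts[i] = new HashMap<>();
  }

  /** Returns the segment to place this host's URL in, or -1 if every segment is full for it. */
  int pickSegment(String host, int preferredSegment) {
    for (int j = 0; j < maxNumSegments; j++) {
      int seg = (preferredSegment + j) % maxNumSegments;   // start at the round-robin choice
      int used = hostCounts[seg].getOrDefault(host, 0);
      if (used < maxHostCount) {
        hostCounts[seg].put(host, used + 1);
        return seg;
      }
    }
    return -1;   // host has reached maxHostCount in every segment: skip the URL
  }
}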
[jira] [Created] (NUTCH-1204) not all pages are parsed
not all pages are parsed
------------------------

Key: NUTCH-1204
URL: https://issues.apache.org/jira/browse/NUTCH-1204
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: behnam nikbakht
Priority: Critical

When we fetch a site in multiple segments and dump the crawldb with readdb, the system reports some pages as unfetched, but when we check, we find that these pages were in fact fetched and stored; they were just never parsed. We tried crawling a site containing only HTML pages, edited suffix-urlfilter.txt and the parser.timeout property, and found that still only some of the HTML pages were parsed. This is critical for performance: fetching of the sites works well, but the parse failures cause these sites to be refetched in later iterations.
[jira] [Created] (NUTCH-1199) unfetched URLs problem
unfetched URLs problem
----------------------

Key: NUTCH-1199
URL: https://issues.apache.org/jira/browse/NUTCH-1199
Project: Nutch
Issue Type: Improvement
Components: fetcher, generator
Reporter: behnam nikbakht
Priority: Critical

We wrote a script to fetch the unfetched URLs:

#first dump from readdb to a text file, and extract unfetched urls to a text file:
bin/nutch readdb $crawldb -dump $SITE_DIR/tmp/dump_urls.txt -format csv
cat $SITE_DIR/tmp/dump_urls.txt/part-0 | grep db_unfetched > $SITE_DIR/tmp/dump_unf
unfetched_urls_file="$SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt"
cat $SITE_DIR/tmp/dump_unf | awk -F '"' '{print $2}' > $unfetched_urls_file
unfetched_count=`cat $unfetched_urls_file | wc -l`

#next, we have a list of unfetched urls in unfetched_urls.txt; then we use the freegen command
#to create segments for these urls (we can not use the generate command because these urls
#were generated previously)
if [[ $unfetched_count -lt $it_size ]]
then
  echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
  ((J++))
  bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls.txt $crawlseg
  s2=`ls -d $crawlseg/2* | tail -1`
  bin/nutch fetch $s2
  bin/nutch parse $s2
  bin/nutch updatedb $crawldb $s2
  echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
  get_new_links
  exit
fi

#if the number of urls is greater than it_size, then package them into groups of it_size
ij=1
while read line
do
  let "ind = $ij / $it_size"
  mkdir $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/
  echo $line >> $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt
  echo $ind
  ((ij++))
  let "completed = $ij % $it_size"
  if [[ $completed -eq 0 ]]
  then
    echo "UNFETCHED $J , $it_size URLs from $unfetched_count generated"
    ((J++))
    bin/nutch freegen $SITE_DIR/tmp/unfetched_urls/unfetched_urls$ind/unfetched_urls$ind.txt $crawlseg
    #finally fetch, parse and update the new segment
    s2=`ls -d $crawlseg/2* | tail -1`
    bin/nutch fetch $s2
    bin/nutch parse $s2
    rm $crawldb/.locked
    bin/nutch updatedb $crawldb $s2
    echo "bin/nutch updatedb $crawldb $s2" >> $SITE_DIR/updatedblog.txt
  fi
done <$unfetched_urls_file