[Nutch-dev] RE: [jira] Created: (NUTCH-67) I want crawl the websites including news.yahoo.com,game.yahoo.com,blog.yahoo.com,etc!

2005-07-03 Thread Ilia S. Yatsenko
Try this: +^http://([a-z0-9\.\-]*)\.yahoo\.com/ I hope it helps you :) -Original Message- From: zhangjin (JIRA) [mailto:[EMAIL PROTECTED] Sent: Monday, July 04, 2005 6:42 AM To: nutch-dev@incubator.apache.org Subject: [jira] Created: (NUTCH-67) I want crawl the websites including news.yah
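
A minimal sketch of what the suggested expression matches, assuming the leading + marks an accept rule (as in Nutch's regex URL filter files) and that the rest is an ordinary Java regular expression. The class name YahooUrlFilterSketch and the sample URLs other than the yahoo.com hosts named in the issue are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks URLs against the
// expression from the message with the leading '+' removed.
class YahooUrlFilterSketch {
    // Requires "http://", then any run of [a-z0-9.-], then ".yahoo.com/".
    static final Pattern ACCEPT =
        Pattern.compile("^http://([a-z0-9\\.\\-]*)\\.yahoo\\.com/");

    static boolean accepts(String url) {
        // find() is enough because the pattern is anchored with '^'.
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://news.yahoo.com/",
            "http://game.yahoo.com/",
            "http://blog.yahoo.com/",
            "http://yahoo.com/",        // no subdomain, so no ".yahoo" to match
            "http://news.google.com/"   // different domain
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

Note that the expression requires a dot before "yahoo", so the bare host http://yahoo.com/ would not be accepted by this rule alone.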

[Nutch-dev] RE: both html parser have bug with javascript

2005-07-03 Thread Ilia S. Yatsenko
And this <%@ Language=VBScript %> is shown in summaries. I thought ANY text between < and > should always be ignored, and unknown tags too. :) -Original Message- From: Ilia S. Yatsenko [mailto:[EMAIL PROTECTED] Sent: Monday, July 04, 2005 6:33 AM To: nutch-dev@lucene.apache.org Subject: RE:

[Nutch-dev] [jira] Created: (NUTCH-67) I want crawl the websites including news.yahoo.com,game.yahoo.com,blog.yahoo.com,etc!

2005-07-03 Thread zhangjin (JIRA)
I want crawl the websites including news.yahoo.com,game.yahoo.com,blog.yahoo.com,etc! --- Key: NUTCH-67 URL: http://issues.apache.org/jira/browse/NUTCH-67 Project: Nutch Type: Wish

[Nutch-dev] RE: both html parser have bug with javascript

2005-07-03 Thread Ilia S. Yatsenko
I thought "javascript" was shown in summaries because I had enabled the parse-js plug-in. I have disabled it and made a new database, but got the same result :( -Original Message- From: Ilia S. Yatsenko [mailto:[EMAIL PROTECTED] Sent: Sunday, July 03, 2005 7:09 PM To: nutch-dev@lucene.apache.org Subject: RE

[Nutch-dev] Re: Why Crawl failed to fetch so many pages?

2005-07-03 Thread Nutch开发邮件
Please modify the rule below (# skip URLs containing certain characters as probable queries, etc. # [EMAIL PROTECTED] because the link http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt includes the characters ?=&, so it will be ignored. It will be (# skip URLs containing certain characters as pro
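
A rough sketch of why such a link gets dropped, assuming the filter rejects any URL that contains one of a configured set of characters. The actual rule text is obscured in the archive, so the character set "?=&" used here is taken only from the characters the message names, and the class QueryCharFilterSketch is hypothetical.

```java
// Hypothetical helper, not part of Nutch: rejects a URL when it contains
// any character from skipChars (an assumed stand-in for the skip rule).
class QueryCharFilterSketch {
    static boolean isSkipped(String url, String skipChars) {
        for (char c : skipChars.toCharArray()) {
            if (url.indexOf(c) >= 0) {
                return true; // one offending character is enough
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String link =
            "http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt";
        // With the query characters in the skip set, the link is ignored.
        System.out.println(isSkipped(link, "?=&"));
        // With those characters removed from the skip set, it is kept.
        System.out.println(isSkipped(link, ""));
    }
}
```

Under this assumption, removing the query characters from the skip rule is what lets dynamic pages like the one above be fetched.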

[Nutch-dev] [jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-07-03 Thread Nick Lothian (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-65?page=comments#action_12314976 ] Nick Lothian commented on NUTCH-65: --- Re: It could be a good idea to declare the DataFormat as a final static constant ... no? NO! SimpleDateFormat is not thread safe. Using
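
One common way to respect the thread-safety concern raised in this comment is to give each thread its own SimpleDateFormat instead of sharing a static instance. The sketch below assumes that approach; it is not the fix adopted for NUTCH-65, and the class name DateFormatPerThreadSketch and the "yyyy-MM-dd" pattern are hypothetical.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper: one SimpleDateFormat per thread, never shared.
class DateFormatPerThreadSketch {
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> {
            // Illustrative pattern only; fixed locale and zone keep results stable.
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            f.setLenient(false);
            return f;
        });

    static Date parseDate(String text) {
        try {
            return FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    static String formatDate(Date date) {
        return FORMAT.get().format(date);
    }

    public static void main(String[] args) {
        Date d = parseDate("2005-07-03");
        System.out.println(formatDate(d));
    }
}
```

Creating a new SimpleDateFormat on every call would also avoid sharing; the ThreadLocal variant simply saves repeated construction on busy threads.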

[Nutch-dev] RE: both html parser have bug with javascript

2005-07-03 Thread Chirag Chaman
Actually, I think the JavaScript is there as it's part of the HTML page -- but it should not be part of the summaries. Has anyone found a solution to not showing the "JavaScript" or "text/css" -- that shows up from time to time? CC- -Original Message- From: Ilia S. Yatsenko [mailto:[EMA

[Nutch-dev] RE: both html parser have bug with javascript

2005-07-03 Thread Ilia S. Yatsenko
Oops, I see my mistake O-) -Original Message- From: Ilia S. Yatsenko [mailto:[EMAIL PROTECTED] Sent: Sunday, July 03, 2005 6:06 PM To: nutch-dev@lucene.apache.org Subject: both html parser have bug with javascript Hello :) Sorry for my poor English. I have an issue with both HTML parsers.

[Nutch-dev] both html parser have bug with javascript

2005-07-03 Thread Ilia S. Yatsenko
Hello :) Sorry for my poor English. I have an issue with both HTML parsers. I see the following text in summaries: 2JavaScript1.3JavaScriptJavaScriptjavascriptjavascript1.1javascript1.2javasc ript1.3javascript my text description. Or 2javascript my text description. Or javascriptjavasc