Try this
+^http://([a-z0-9\.\-]*)\.yahoo\.com/
I hope it helps you :)
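
A quick way to check what that expression accepts is to run it through
java.util.regex. This is only a sketch: the class and method names are made
up for illustration, and the leading '+' is left out because it is the
filter file's "include" marker, not part of the regular expression itself.

```java
import java.util.regex.Pattern;

class YahooUrlFilterSketch {
    // The expression from the message, without the leading '+'.
    // The backslashes before '.' and '-' inside the character class are
    // redundant but harmless.
    static final Pattern YAHOO_SUBDOMAIN =
            Pattern.compile("^http://([a-z0-9\\.\\-]*)\\.yahoo\\.com/");

    // Hypothetical helper: true if the URL matches the include rule.
    static boolean accepts(String url) {
        return YAHOO_SUBDOMAIN.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://news.yahoo.com/",
            "http://game.yahoo.com/index.html",
            "http://yahoo.com/",
            "http://news.yahoo.com.example.org/"
        };
        for (String url : urls) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

Note that the rule as written requires at least a dot before "yahoo.com", so
a bare http://yahoo.com/ is not matched, and only the http scheme is covered.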
-Original Message-
From: zhangjin (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Monday, July 04, 2005 6:42 AM
To: nutch-dev@incubator.apache.org
Subject: [jira] Created: (NUTCH-67) I want crawl the websites including
news.yah
And this <%@ Language=VBScript %> is shown in summaries.
I thought ANY text between < and > should always be ignored, and unknown tags
too.
:)
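
For illustration, the "ignore anything between < and >" idea can be sketched
with a plain regular expression. This is not how the Nutch HTML parsers work
internally; the class and method names are hypothetical, and the approach is
deliberately naive.

```java
class TagStripSketch {
    // Naive approach: drop everything from '<' up to the next '>',
    // then collapse the whitespace left behind.
    static String stripTags(String html) {
        String noTags = html.replaceAll("<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String page = "<%@ Language=VBScript %><html><body>"
                + "Hello <b>world</b></body></html>";
        System.out.println(stripTags(page));
    }
}
```

A pattern like this breaks as soon as a '>' appears inside a directive or an
attribute value, which is one reason a real parser is preferable.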
-Original Message-
From: Ilia S. Yatsenko [mailto:[EMAIL PROTECTED]
Sent: Monday, July 04, 2005 6:33 AM
To: nutch-dev@lucene.apache.org
Subject: RE:
I want to crawl websites including
news.yahoo.com, game.yahoo.com, blog.yahoo.com, etc.!
---
Key: NUTCH-67
URL: http://issues.apache.org/jira/browse/NUTCH-67
Project: Nutch
Type: Wish
I thought "javascript" was shown in summaries because I had enabled the
parse-js plug-in.
I have disabled it and built a new database, but got the same result :(
-Original Message-
From: Ilia S. Yatsenko [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 03, 2005 7:09 PM
To: nutch-dev@lucene.apache.org
Subject: RE
Please modify the rule below:
(# skip URLs containing certain characters as probable queries, etc.
# [EMAIL PROTECTED]
because the link
http://news.buaa.edu.cn/dispnews.php?type=1&nid=2500&s_table=news_txt
contains the characters ?, = and &, so it will be ignored.
It will become:
(# skip URLs containing certain characters as pro
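
To see why such a link is dropped, here is a small sketch of an exclusion
rule. The actual rule is redacted in the message, so the character class
below is an assumption built only from the three characters the message
names (?, = and &); the class and method names are hypothetical too.

```java
import java.util.regex.Pattern;

class QueryCharFilterSketch {
    // Assumption: skip any URL containing '?', '=' or '&'.
    static final Pattern QUERY_CHARS = Pattern.compile("[?=&]");

    // Hypothetical helper: true if the exclusion rule would skip the URL.
    static boolean skipped(String url) {
        return QUERY_CHARS.matcher(url).find();
    }

    public static void main(String[] args) {
        String dynamic = "http://news.buaa.edu.cn/dispnews.php"
                + "?type=1&nid=2500&s_table=news_txt";
        String plain = "http://news.buaa.edu.cn/index.html";
        System.out.println(skipped(dynamic) + "  " + dynamic);
        System.out.println(skipped(plain) + "  " + plain);
    }
}
```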
[
http://issues.apache.org/jira/browse/NUTCH-65?page=comments#action_12314976 ]
Nick Lothian commented on NUTCH-65:
---
Re: It could be a good idea to declare the DateFormat as a final static
constant ... no?
NO! SimpleDateFormat is not thread safe. Using
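
One common way around the thread-safety problem is to give each thread its
own formatter. The sketch below is not the code from NUTCH-65; the class
name, the date pattern and the UTC time zone are assumptions chosen for
illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class DateFormatSketch {
    // One SimpleDateFormat per thread instead of one shared static instance.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat f =
                        new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
                f.setTimeZone(TimeZone.getTimeZone("UTC"));
                return f;
            });

    // Safe to call from several threads: each one uses its own formatter.
    static String format(Date date) {
        return FORMAT.get().format(date);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L)));
    }
}
```

On newer Java versions java.time.format.DateTimeFormatter is immutable and
thread safe, so it can be held in a static final constant directly.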
Actually, I think the JavaScript is there as it's part of the HTML page --
but it should not be part of the summaries. Has anyone found a way to stop
"JavaScript" or "text/css" from showing up from time to time?
CC-
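
As a rough illustration of keeping script and style content away from the
text used for summaries, whole <script> and <style> elements can be removed
before the text is extracted. This is only a sketch with hypothetical names,
not what the Nutch parse plug-ins do, and it does not claim to explain where
the stray "JavaScript" text actually comes from.

```java
class ScriptStripSketch {
    // (?i) ignores case, (?s) lets '.' span line breaks, and \1 requires
    // the closing tag to name the same element as the opening tag.
    static String dropScriptAndStyle(String html) {
        return html.replaceAll(
                "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", "");
    }

    public static void main(String[] args) {
        String page = "<script language=\"JavaScript1.3\">var a = 1;</script>"
                + "<style type=\"text/css\">p { color: red; }</style>"
                + "<p>my text description.</p>";
        System.out.println(dropScriptAndStyle(page));
    }
}
```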
-Original Message-
From: Ilia S. Yatsenko [mailto:[EMA
Oops, I see my mistake O-)
-Original Message-
From: Ilia S. Yatsenko [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 03, 2005 6:06 PM
To: nutch-dev@lucene.apache.org
Subject: both html parser have bug with javascript
Hello :)
Sorry for my poor English.
I have an issue with both HTML parsers.
I see the following text in summaries:
2JavaScript1.3JavaScriptJavaScriptjavascriptjavascript1.1javascript1.2javascript1.3javascript my text description.
Or
2javascript my text description.
Or
javascriptjavasc