[jira] Commented: (NUTCH-535) ParseData's contentMeta accumulates unnecessary values during parse

2007-08-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518629
 ] 

Hudson commented on NUTCH-535:
--

Integrated in Nutch-Nightly #175 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/175/])

 ParseData's contentMeta accumulates unnecessary values during parse
 ---

 Key: NUTCH-535
 URL: https://issues.apache.org/jira/browse/NUTCH-535
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Doğacan Güney
Assignee: Doğacan Güney
 Fix For: 1.0.0

 Attachments: NUTCH-535.patch, NUTCH_535_v2.patch


 After NUTCH-506, if you run parse on a segment, ParseData's contentMeta 
 accumulates the metadata of every content parsed so far. This is because 
 NUTCH-506 changed the constructor to create the new metadata (before 
 NUTCH-506, a new metadata was created on every call to readFields). It seems 
 Hadoop somehow caches the Content instance, so each new call to 
 Content.readFields during ParseSegment increases the size of the metadata. 
 Because of this, one can end up with a *huge* parse_data directory (something 
 like 10 times larger than the content directory).
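
The accumulation described above can be reproduced in isolation. The following is a minimal sketch, not Nutch's actual Content class: ReusedRecordSketch, its method names, and the one-pair-per-record wire format are all hypothetical. It assumes only what the report states, namely that the metadata object is created once per instance and that the same instance is reused across readFields calls. Clearing the map at the start of each read is shown as one possible remedy; the attached patches may do something different.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Writable-style record; NOT Nutch's Content class.
final class ReusedRecordSketch {
    // Created once per instance, as the report says happens after NUTCH-506.
    private final Map<String, String> metadata = new HashMap<>();

    // Problematic variant: adds entries on top of whatever the previous read left behind.
    void readFieldsAccumulating(DataInput in) throws IOException {
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            metadata.put(key, value);
        }
    }

    // One possible remedy (an assumption, not the committed fix): reset state first.
    void readFieldsClearing(DataInput in) throws IOException {
        metadata.clear();
        readFieldsAccumulating(in);
    }

    int metadataSize() {
        return metadata.size();
    }

    // Serializes a record holding exactly one key/value pair.
    static DataInput serialize(String key, String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(1);
        out.writeUTF(key);
        out.writeUTF(value);
        out.flush();
        return new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
    }

    // Reads `records` single-pair records with distinct keys into ONE reused instance
    // and returns how many metadata entries that instance holds afterwards.
    static int sizeAfterReads(int records, boolean clearFirst) {
        ReusedRecordSketch reused = new ReusedRecordSketch();
        try {
            for (int i = 0; i < records; i++) {
                DataInput in = serialize("key" + i, "value" + i);
                if (clearFirst) {
                    reused.readFieldsClearing(in);
                } else {
                    reused.readFieldsAccumulating(in);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reused.metadataSize();
    }

    public static void main(String[] args) {
        System.out.println("accumulating: " + sizeAfterReads(3, false)); // prints "accumulating: 3"
        System.out.println("clearing: " + sizeAfterReads(3, true));      // prints "clearing: 1"
    }
}
```

Each record carries one pair, yet the reused instance ends up holding all of them unless it is reset, which matches the growth of parse_data described in the report.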

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-522) Use URLValidator in the Injector

2007-08-09 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518627
 ] 

Hudson commented on NUTCH-522:
--

Integrated in Nutch-Nightly #175 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/175/])

 Use URLValidator in the Injector
 

 Key: NUTCH-522
 URL: https://issues.apache.org/jira/browse/NUTCH-522
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Reporter: Emmanuel Joke
Assignee: Emmanuel Joke
Priority: Minor
 Fix For: 1.0.0

 Attachments: NUTCH-522.patch, NUTCH-522_v2.patch, NUTCH-522_v3.patch, 
 NUTCH_522_v4.patch


 Same as NUTCH-505: we should use the UrlValidator to check URLs in the Injector.




[jira] Created: (NUTCH-540) some problem about the Nutch cache

2007-08-09 Thread crossany (JIRA)
some problem about the Nutch cache
--

 Key: NUTCH-540
 URL: https://issues.apache.org/jira/browse/NUTCH-540
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
Priority: Blocker
 Fix For: 0.9.0


I am Chinese.
I was testing searches for Chinese words in Nutch. I installed Nutch 0.9 in 
Tomcat 5 on Linux, and the Tomcat charset is UTF-8. I used Nutch to crawl a 
Chinese website whose charset is also UTF-8. When I search for Chinese words 
through Nutch on Tomcat, the title and description of each search result are 
displayed correctly. But when I click the cached link, the cached page is 
displayed with the wrong charset, even though I can see that the cached page's 
charset is also UTF-8. I found another website that uses Nutch, 
http://www.synoo.com:8080/zh/, and tested searching for Chinese words there. 
It shows the same error.
I used Luke to look at the segments, and it can display the Chinese words, so 
I think this may be a bug.
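
The report does not establish the cause. As an illustration only, the sketch below shows one common way such symptoms arise: bytes stored as UTF-8 are decoded with a different charset somewhere on the display path. CharsetMismatchSketch and its methods are hypothetical names, and ISO-8859-1 is an assumed example of the mismatched charset, not something taken from the report.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; it does not use or reproduce any Nutch code.
final class CharsetMismatchSketch {
    // Encodes text as UTF-8, then decodes the bytes with the charset a reader assumes.
    static String decodeUtf8BytesAs(String text, Charset assumed) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, assumed);
    }

    // True when the text survives the encode/decode round trip unchanged.
    static boolean roundTrips(String text, Charset assumed) {
        return decodeUtf8BytesAs(text, assumed).equals(text);
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // the two characters of the word "Chinese (language)"
        System.out.println(roundTrips(chinese, StandardCharsets.UTF_8));      // prints "true"
        System.out.println(roundTrips(chinese, StandardCharsets.ISO_8859_1)); // prints "false"
        // Each UTF-8 byte becomes one ISO-8859-1 character, so 2 characters turn into 6.
        System.out.println(decodeUtf8BytesAs(chinese, StandardCharsets.ISO_8859_1).length()); // prints "6"
    }
}
```

If the stored content is intact, as the Luke check suggests, a mismatch of this kind would sit in how the cached page is decoded or served rather than in the crawl itself; that remains an assumption until logs are attached.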




[jira] Created: (NUTCH-541) Index url field untokenized

2007-08-09 Thread Enis Soztutar (JIRA)
Index url field untokenized
---

 Key: NUTCH-541
 URL: https://issues.apache.org/jira/browse/NUTCH-541
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Affects Versions: 1.0.0
Reporter: Enis Soztutar
Assignee: Enis Soztutar
 Fix For: 1.0.0


The url field is indexed as Store.YES, Index.TOKENIZED. We also need the 
untokenized version of the url field in some contexts: 
1. For deleting duplicates by url (at search time); see NUTCH-455.
2. For restricting the search to a certain url (may be used in the case of RSS 
search, where each entry in the RSS feed is added as a distinct document with 
(possibly) the same url). 
   query-url extends FieldQueryFilter, so: 
   Query: url:http://www.apache.org/
   Parsed: url:http http-www http-www-apache www www-apache apache org
   Translated: +url:http-http-www http-www-http-www-apache 
   http-www-apache-www www-www-apache www-apache apache org
3. For accessing document(s) in the search servers (using a query plugin).

I suggest we add url as in index-basic and implement a query-url-untoken 
plugin. 
doc.add(new Field("url", url.toString(), Field.Store.YES, 
Field.Index.TOKENIZED));
doc.add(new Field("url_untoken", url.toString(), Field.Store.NO, 
Field.Index.UN_TOKENIZED));
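
To illustrate why an exact url lookup needs the untokenized field, here is a minimal sketch that does not depend on Lucene. UrlTermsSketch and its methods are hypothetical, and the splitter is a crude stand-in for an analyzer (it does not produce the n-gram terms shown in the Parsed line above). It only shows that a tokenized field holds fragments of the url, while an untokenized field holds the whole url as a single term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; the real terms depend on the analyzer in use.
final class UrlTermsSketch {
    // Crude stand-in for analysis: lower-case, then split on non-alphanumeric runs.
    static List<String> tokenizedTerms(String url) {
        List<String> terms = new ArrayList<>();
        for (String part : url.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // An untokenized field keeps the value as exactly one term.
    static List<String> untokenizedTerms(String url) {
        return Collections.singletonList(url);
    }

    // An exact lookup succeeds only if the full url is itself one of the indexed terms.
    static boolean exactLookupMatches(List<String> indexedTerms, String url) {
        return indexedTerms.contains(url);
    }

    public static void main(String[] args) {
        String url = "http://www.apache.org/";
        System.out.println(tokenizedTerms(url)); // prints "[http, www, apache, org]"
        System.out.println(exactLookupMatches(tokenizedTerms(url), url));   // prints "false"
        System.out.println(exactLookupMatches(untokenizedTerms(url), url)); // prints "true"
    }
}
```

With the second field in place, deleting duplicates by url or restricting a search to one url can be expressed as a single-term match instead of a combination of fragments.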





[jira] Updated: (NUTCH-540) some problem about the Nutch cache

2007-08-09 Thread Renaud Richardet (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Richardet updated NUTCH-540:
---

Priority: Major  (was: Blocker)

Could you please attach log files and error messages? Thanks.

 some problem about the Nutch cache
 --

 Key: NUTCH-540
 URL: https://issues.apache.org/jira/browse/NUTCH-540
 Project: Nutch
  Issue Type: Bug
  Components: searcher
Affects Versions: 0.9.0
 Environment: Red hat AS4 + Tomcat5.5 + Nutch0.9
Reporter: crossany
 Fix For: 0.9.0


 I am Chinese.
 I was testing searches for Chinese words in Nutch. I installed Nutch 0.9 in 
 Tomcat 5 on Linux, and the Tomcat charset is UTF-8. I used Nutch to crawl a 
 Chinese website whose charset is also UTF-8. When I search for Chinese words 
 through Nutch on Tomcat, the title and description of each search result are 
 displayed correctly. But when I click the cached link, the cached page is 
 displayed with the wrong charset, even though I can see that the cached 
 page's charset is also UTF-8. I found another website that uses Nutch, 
 http://www.synoo.com:8080/zh/, and tested searching for Chinese words there. 
 It shows the same error.
 I used Luke to look at the segments, and it can display the Chinese words, so 
 I think this may be a bug.
