[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Sami Siren (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428399 ] 

Sami Siren commented on NUTCH-349:
--

If anything at all should be done, then I'd go for #2. There was also a total 
incompatibility from 0.7 to 0.8, and I didn't see that many complaints.

I also noticed that some file formats contain code that checks the format 
version and reads the data differently from version to version. Those checks 
could also be removed if we go for #2.

 Port Nutch to use Hadoop Text instead of UTF8
 -

 Key: NUTCH-349
 URL: http://issues.apache.org/jira/browse/NUTCH-349
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Andrzej Bialecki 

 Currently Nutch uses org.apache.hadoop.io.UTF8 class to store/read Strings. 
 This class has been deprecated in Hadoop 0.5.0, and Text class should be used 
 instead. Sooner or later we will need to move Nutch to use this class instead 
 of UTF8.
 This raises numerous issues regarding the compatibility of existing data in 
 CrawlDB, LinkDB and segments. I can see two ways to solve this:
 * add code in readers of respective formats to convert UTF8 to Text on the fly. 
 New writers would only use Text. This is less than ideal, because it 
 complicates the code, and also at some point in time the UTF8 class will be 
 removed.
 * create a converter (to be maintained as long as UTF8 exists), which 
 converts existing data in bulk from UTF8 to Text. This requires an additional 
 processing step when upgrading to convert all existing data to the new format.
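
To make the trade-off between the two options concrete, here is a minimal sketch using only the Java standard library. It does not use the real Hadoop UTF8 or Text classes, and the two encodings below are stand-ins rather than their actual wire formats: the "legacy" record uses DataOutput.writeUTF, the "current" record uses a 4-byte length followed by standard UTF-8 bytes. The class name, the version tags and all method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class StringFormatMigration {
    // Hypothetical version tags; not the values Nutch actually uses.
    static final byte LEGACY_VERSION = 1;
    static final byte CURRENT_VERSION = 2;

    // Stand-in for the old encoding: version byte + DataOutput.writeUTF.
    static byte[] writeLegacy(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(LEGACY_VERSION);
        out.writeUTF(s);
        out.flush();
        return buf.toByteArray();
    }

    // Stand-in for the new encoding: version byte + int length + UTF-8 bytes.
    static byte[] writeCurrent(String s) throws IOException {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(CURRENT_VERSION);
        out.writeInt(bytes.length);
        out.write(bytes);
        out.flush();
        return buf.toByteArray();
    }

    // Shared helper: reads the body of a current-format record.
    static String readCurrentBody(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Option 1: the reader accepts both formats and converts on the fly.
    // The legacy branch has to stay for as long as old data may exist.
    static String readAnyVersion(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        byte version = in.readByte();
        switch (version) {
            case LEGACY_VERSION:
                return in.readUTF();
            case CURRENT_VERSION:
                return readCurrentBody(in);
            default:
                throw new IOException("Unknown record version: " + version);
        }
    }

    // Option 2, part one: a one-off converter run during the upgrade.
    static byte[] convert(byte[] legacyRecord) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(legacyRecord));
        byte version = in.readByte();
        if (version != LEGACY_VERSION) {
            throw new IOException("Not a legacy record, version: " + version);
        }
        return writeCurrent(in.readUTF());
    }

    // Option 2, part two: the regular reader only knows the current format.
    static String readCurrentOnly(byte[] record) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
        byte version = in.readByte();
        if (version != CURRENT_VERSION) {
            throw new IOException("Old format, run the converter first");
        }
        return readCurrentBody(in);
    }

    public static void main(String[] args) throws IOException {
        byte[] old = writeLegacy("http://example.org/");
        System.out.println(readAnyVersion(old));            // option 1
        System.out.println(readCurrentOnly(convert(old)));  // option 2
    }
}
```

In this sketch the first option keeps the legacy branch inside every reader, while the second moves all knowledge of the old format into a single conversion step that can be dropped once no old data remains.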

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] 

Stefan Groschupf commented on NUTCH-349:


My vote goes to #2.
From my point of view, having a tool that needs to be started manually would be 
better than complicating the already fragile code.

