[ http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428399 ]
Sami Siren commented on NUTCH-349:
--
If anything at all should be done, then I'd go for #2. There was also a total
incompatibility between 0.7 and 0.8, and I didn't see many complaints.
I also noticed that some file formats contain code that checks the version and
reads the data differently from version to version. That code could be removed
as well if we go for #2.
Port Nutch to use Hadoop Text instead of UTF8
-
Key: NUTCH-349
URL: http://issues.apache.org/jira/browse/NUTCH-349
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Andrzej Bialecki
Currently Nutch uses the org.apache.hadoop.io.UTF8 class to store and read
Strings. This class was deprecated in Hadoop 0.5.0, and the Text class should
be used instead. Sooner or later we will need to move Nutch to Text.
This raises numerous issues regarding the compatibility of existing data in
CrawlDB, LinkDB and segments. I can see two ways to solve this:
* add code in the readers of the respective formats to convert UTF8 to Text on
the fly. New writers would only use Text. This is less than ideal, because it
complicates the code, and at some point the UTF8 class will be removed.
* create a converter (to be maintained as long as UTF8 exists), which
converts existing data in bulk from UTF8 to Text. This requires an additional
processing step when upgrading, to convert all existing data to the new format.
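
To make the two options concrete, here is a minimal sketch in plain Java. It
does not use Hadoop or Nutch classes, and it does not reproduce the actual
UTF8 or Text wire formats. Everything in it is an assumption for illustration:
the class name StringFormatSketch, the one-byte version header, and the two
stand-in encodings (DataOutput.writeUTF for the "old" format, a 4-byte length
plus standard UTF-8 bytes for the "new" one). read() branches on the version
the way an on-the-fly reader would (the first option), and convert() rewrites
old data once so that later readers only need the new branch (the second
option).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; the formats are stand-ins, not Hadoop's UTF8/Text encodings. */
class StringFormatSketch {
    static final byte OLD = 1; // stand-in for data written with the deprecated class
    static final byte NEW = 2; // stand-in for data written with the replacement class

    /** Writes a version byte followed by the values in the chosen format. */
    static byte[] write(byte version, List<String> values) {
        if (version != OLD && version != NEW) {
            throw new IllegalArgumentException("unknown version " + version);
        }
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeByte(version);
            for (String v : values) {
                if (version == OLD) {
                    out.writeUTF(v); // 2-byte length + modified UTF-8
                } else {
                    byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
                    out.writeInt(bytes.length); // 4-byte length + standard UTF-8
                    out.write(bytes);
                }
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Option 1: the reader checks the version and decodes either format on the fly. */
    static List<String> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte version = in.readByte();
            if (version != OLD && version != NEW) {
                throw new IllegalArgumentException("unknown version " + version);
            }
            List<String> values = new ArrayList<>();
            while (in.available() > 0) {
                if (version == OLD) {
                    values.add(in.readUTF());
                } else {
                    byte[] bytes = new byte[in.readInt()];
                    in.readFully(bytes);
                    values.add(new String(bytes, StandardCharsets.UTF_8));
                }
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Option 2: a one-off bulk conversion that rewrites any readable data as NEW. */
    static byte[] convert(byte[] data) {
        return write(NEW, read(data));
    }

    public static void main(String[] args) {
        byte[] oldData = write(OLD, Arrays.asList("http://example.com/", "caf\u00e9"));
        byte[] newData = convert(oldData);
        System.out.println("version " + newData[0] + ": " + read(newData));
    }
}
```

In this sketch the version branch in read() is the part that could be dropped
once all data has been converted, which is the simplification that going for
the second option would allow.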