[ https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575321#comment-13575321 ]
Lewis John McGibbney commented on NUTCH-1511:
---------------------------------------------

In Gora trunk we leverage numerous underlying serializers to persist data into Cassandra, always in bytes. This means that it is down to the client application to do the deserialization when reading from Cassandra. The example below, produced with Gora 0.2.1 (although it is buggy*), shows the same columns and values as above. This behaviour will be present in Gora 0.3 when we release; however, we are also aware that a mechanism within Gora to deserialize data read from the datastore would be advantageous to client applications.

{code}
Authenticated to keyspace: webpage
[default@webpage] list p;
Using default limit of 100
Using default column limit of 100

0 Row Returned.
Elapsed time: 1 msec(s).
[default@webpage] list f;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e676f72613a687474702f
=> (column=6669, value=00278d00, timestamp=1360463612905000)
=> (column=73, value=3f800000, timestamp=1360463612907000)
=> (column=7473, value=0000013cc1f3450d, timestamp=1360463612891000)

1 Row Returned.
Elapsed time: 2 msec(s).
[default@webpage] list sc;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e676f72613a687474702f
=> (super_column=6d6b,
     (column=5f696e6a6d726b5f, value=79, timestamp=1360463612912000)
     (column=64697374, value=30, timestamp=1360463612909000))
=> (super_column=6d746474,
     (column=5f6373685f, value=3f800000, timestamp=1360463612913000))

1 Row Returned.
Elapsed time: 3 msec(s).
{code}

> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
> Key: NUTCH-1511
> URL: https://issues.apache.org/jira/browse/NUTCH-1511
> Project: Nutch
> Issue Type: Bug
> Components: fetcher, injector, storage
> Affects Versions: 2.1
> Environment: Ubuntu 12.04
> Reporter: J. Gobel
> Labels: metadata, mysql, nutch, scoring-opic
> Fix For: 2.2
>
>
> After applying the patch for the Metadata parser (NUTCH-1478) I notice that the
> metadata field just before the crawl ends is populated with the correct
> information. However, when the crawl is completely finished the metadata field
> is populated with 'garbage' _csh_�����
> I notice in my SQL log file that the scoring plugin is overwriting the
> metadata field in a final data insertion with '_csh_ \0\0\0\0\'. When I
> remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml,
> the metadata field is crisp and clear.
> MYSQL LOG FILE: I did a crawl on http://nutch.apache.org. Below you will see
> fragments of my MYSQL log file, only the moments when data is written to
> the METADATA field in the MYSQL table.
> First insertion: here I suppose scoring-opic writes its information, _csh_
> ?€\0\0\0
> 58 Query INSERT INTO webpage
> (fetchInterval,fetchTime,id,markers,metadata,score )VALUES
> (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
> _csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE
> fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 _injmrk_
> y\0',metadata='
> _csh_ ?€\0\0\0',score=1.0
> Second insertion: here the scraped metadata is inserted into the metadata field.
> 81 Query INSERT INTO webpage
> (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES
> ('org.apache.nutch:http/',
> The final insertion - please note that here the metadata field is
> overwritten with _CSH_\0\0\0\0
> 90 Query INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata
> )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
> Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508
> __prsmrk__*1357122982-1745626508 _gnmrk_*1357122982-1745626508
> _ftcmrk_*1357122982-1745626508\0','
> _csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks='
> 0http://nutch.apache.org/
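
To illustrate the point in the comment that deserialization is left to the client, here is a minimal Python sketch that decodes the hex values from the cassandra-cli listing. It is not Gora code. The helper names are hypothetical, and the choice of type per column (UTF-8 text, big-endian integer, big-endian IEEE 754 float) is an assumption, inferred only from the fact that the decoded values line up with the MySQL log (2592000 matches fetchInterval, 1.0 matches score). The last step assumes the SQL log was rendered in Windows-1252; under that assumption the bytes of the float 1.0 come out as the '?€\0\0' text seen after _csh_, which suggests (but does not prove) that the 'garbage' is a raw serialized float.

```python
import struct


def hex_to_text(hex_value: str) -> str:
    """Decode hex-dumped bytes as UTF-8 text (row keys, column names, markers)."""
    return bytes.fromhex(hex_value).decode("utf-8")


def hex_to_int(hex_value: str) -> int:
    """Decode hex-dumped bytes as a big-endian signed integer (4 or 8 bytes)."""
    return int.from_bytes(bytes.fromhex(hex_value), byteorder="big", signed=True)


def hex_to_float(hex_value: str) -> float:
    """Decode 4 hex-dumped bytes as a big-endian IEEE 754 float."""
    return struct.unpack(">f", bytes.fromhex(hex_value))[0]


# Values copied from the "list f" output; the type per column is an assumption.
row_key = hex_to_text("6f72672e6170616368652e676f72613a687474702f")
columns = {
    hex_to_text("6669"): hex_to_int("00278d00"),
    hex_to_text("73"): hex_to_float("3f800000"),
    hex_to_text("7473"): hex_to_int("0000013cc1f3450d"),
}

# Values copied from the "list sc" output (super columns 6d6b and 6d746474).
super_columns = {
    hex_to_text("6d6b"): {
        hex_to_text("5f696e6a6d726b5f"): hex_to_text("79"),
        hex_to_text("64697374"): hex_to_text("30"),
    },
    hex_to_text("6d746474"): {
        hex_to_text("5f6373685f"): hex_to_float("3f800000"),
    },
}

# Assumption: the SQL log shows raw bytes interpreted as Windows-1252 text.
float_bytes_as_text = bytes.fromhex("3f800000").decode("cp1252")

print(row_key)        # org.apache.gora:http/
print(columns)        # {'fi': 2592000, 's': 1.0, 'ts': 1360463611149}
print(super_columns)  # {'mk': {'_injmrk_': 'y', 'dist': '0'}, 'mtdt': {'_csh_': 1.0}}
print(repr(float_bytes_as_text))  # '?€\x00\x00'
```

Read this way, the row holds the same kind of data as the first MySQL insertion: a 30-day interval in seconds, a score of 1.0, a millisecond timestamp, the dist and _injmrk_ markers, and a _csh_ metadata entry whose value is a float rather than printable text.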