[ 
https://issues.apache.org/jira/browse/NUTCH-1511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13575321#comment-13575321
 ] 

Lewis John McGibbney commented on NUTCH-1511:
---------------------------------------------

In Gora trunk we use several underlying serializers, and data is always 
persisted into Cassandra as bytes. This means that it is up to the client 
application to deserialize values when reading from Cassandra. The example 
below, produced with Gora 0.2.1 (although it is buggy*), shows the same columns 
and values as above.
This behaviour will also be present in Gora 0.3 when we release; however, we 
are aware that a mechanism within Gora for deserializing data read from the 
datastore would be advantageous to client applications. 
 
{code}
Authenticated to keyspace: webpage
[default@webpage] list p;
Using default limit of 100
Using default column limit of 100

0 Row Returned.
Elapsed time: 1 msec(s).
[default@webpage] list f;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e676f72613a687474702f
=> (column=6669, value=00278d00, timestamp=1360463612905000)
=> (column=73, value=3f800000, timestamp=1360463612907000)
=> (column=7473, value=0000013cc1f3450d, timestamp=1360463612891000)

1 Row Returned.
Elapsed time: 2 msec(s).
[default@webpage] list sc;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: 6f72672e6170616368652e676f72613a687474702f
=> (super_column=6d6b,
     (column=5f696e6a6d726b5f, value=79, timestamp=1360463612912000)
     (column=64697374, value=30, timestamp=1360463612909000))
=> (super_column=6d746474,
     (column=5f6373685f, value=3f800000, timestamp=1360463612913000))

1 Row Returned.
Elapsed time: 3 msec(s).

{code}
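
To illustrate what client-side deserialization of the output above could look like, here is a minimal sketch in plain Java. It is not Gora's API: the class name HexValueDecoder and its helper methods are hypothetical. It assumes the values are big-endian, fixed-width encodings (4-byte int, 4-byte float, 8-byte long) and that keys and column names are UTF-8 strings. Those assumptions are inferred only from the hex shown above, not from the Gora serializer code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes hex values as printed by cassandra-cli. */
class HexValueDecoder {

    /** Converts a hex string such as "3f800000" into raw bytes. */
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string: " + hex);
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Assumption: row keys and column names are UTF-8 text. */
    static String asUtf8(String hex) {
        return new String(hexToBytes(hex), StandardCharsets.UTF_8);
    }

    /** Assumption: 4-byte big-endian int (ByteBuffer's default byte order). */
    static int asInt(String hex) {
        return ByteBuffer.wrap(hexToBytes(hex)).getInt();
    }

    /** Assumption: 8-byte big-endian long. */
    static long asLong(String hex) {
        return ByteBuffer.wrap(hexToBytes(hex)).getLong();
    }

    /** Assumption: 4-byte big-endian IEEE 754 float. */
    static float asFloat(String hex) {
        return ByteBuffer.wrap(hexToBytes(hex)).getFloat();
    }

    public static void main(String[] args) {
        // Row key and column names from the "list f" output above.
        System.out.println(asUtf8("6f72672e6170616368652e676f72613a687474702f")); // org.apache.gora:http/
        System.out.println(asUtf8("6669") + " = " + asInt("00278d00"));           // fi = 2592000
        System.out.println(asUtf8("73") + " = " + asFloat("3f800000"));           // s = 1.0
        System.out.println(asUtf8("7473") + " = " + asLong("0000013cc1f3450d"));  // ts = 1360463611149

        // Super column "mtdt" from the "list sc" output above.
        System.out.println(asUtf8("5f6373685f") + " = " + asFloat("3f800000"));   // _csh_ = 1.0
    }
}
```

Decoded this way, 00278d00 and 3f800000 line up with the fetchInterval=2592000 and score=1.0 values in the MySQL log quoted below. That suggests the '_csh_' bytes reported as 'garbage' may be a serialized float rather than corrupted text.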
                
> Metadata in MYSQL updated with 'garbage'
> ----------------------------------------
>
>                 Key: NUTCH-1511
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1511
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, injector, storage
>    Affects Versions: 2.1
>         Environment: Ubuntu 12.04
>            Reporter: J. Gobel
>              Labels: metadata, mysql, nutch, scoring-opic
>             Fix For: 2.2
>
>
> After applying patch for Metadata parser (NUTCH-1478) I notice that the 
> metadata field just before the crawl ends is populated with the correct 
> information. However when the crawl is completely finished the metadata field 
> is populated with 'garbage' _csh_����� 
> I notice in my SQL log file that the scoring plugin is overwriting the 
> metadata field in a final data insertion with '_csh_ \0\0\0\0\'. When I 
> remove 'scoring-opic' from the 'plugin.includes' property in nutch-site.xml, 
> the metadata field is crisp and clear.
> MYSQL LOG FILE: I did a crawl on http://nutch.apache.org. Below are fragments 
> of my MYSQL log file, showing only the moments when data is written to the 
> METADATA field in the MYSQL table.
> First Insertion .. here I suppose scoring-opic writes its information, _csh_ 
> ?€\0\0\0 
> 58 Query    INSERT INTO webpage 
> (fetchInterval,fetchTime,id,markers,metadata,score )VALUES 
> (2592000,1357122976493,'org.apache.nutch:http/',' dist 0 _injmrk_ y\0','
> _csh_ ?€\0\0\0',1.0) ON DUPLICATE KEY UPDATE 
> fetchInterval=2592000,fetchTime=1357122976493,markers=' dist 0 _injmrk_ 
> y\0',metadata='
> _csh_ ?€\0\0\0',score=1.0
> Second Insertion - here the scraped metadata is inserted into metadata. 
>  81 Query    INSERT INTO webpage 
> (id,markers,metadata,outlinks,parseStatus,signature,text,title )VALUES 
> ('org.apache.nutch:http/',
> The final insertion -  please note that here the metadata field is 
> overwritten with _CSH_\0\0\0\0
> 90 Query    INSERT INTO webpage (fetchTime,id,inlinks,markers,metadata 
> )VALUES (1359714995075,'org.apache.nutch:http/',' 0http://nutch.apache.org/
> Nutch\0',' dist 0 _injmrk_ y _updmrk_*1357122982-1745626508 
> __prsmrk__*1357122982-1745626508 _gnmrk_*1357122982-1745626508 
> _ftcmrk_*1357122982-1745626508\0','
> _csh_ \0\0\0\0\0') ON DUPLICATE KEY UPDATE fetchTime=1359714995075,inlinks=' 
> 0http://nutch.apache.org/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
