[ 
https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279372#comment-14279372
 ] 

Renato Javier Marroquín Mogrovejo commented on GORA-401:
--------------------------------------------------------

Hi [~alparslan.avci], could you please point me to the code where the dirty 
bits get used? I haven't been able to find it; maybe that is the bug. As far 
as I can recall, what is serialized between the map and reduce phases is the 
configuration object.
As I remember, keeping the dirty bits is not the expected behaviour. If you 
look at the data generator for GoraCI [1], every new node that is created is 
first flushed to the datastore, which clears all of its dirty bits as it is 
written.

{quote} Moreover, a serialization-deserialization operation couple on an object 
should not outcome with a different object from the beginning (including the 
super class members).{quote}
I agree with you, but Gora works as a cache layer: every time you write 
whatever is in the cache to the datastore, the dirty bits (which indicate what 
has been modified in memory) should also be reset, somewhat like the way 
transient variables work with Java serialization.
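
To make the transient analogy concrete, here is a minimal sketch in plain Java. The `Node` class, its fields, and the `BitSet`-based dirty tracking are hypothetical stand-ins for illustration only; this is not Gora's Persistent implementation or its serializer. It only shows that state held in a transient field does not survive a Java serialization round trip, while the regular fields do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.BitSet;

public class TransientDirtyBitsSketch {

  /** Hypothetical entity; NOT Gora's Persistent API. */
  static class Node implements Serializable {
    private static final long serialVersionUID = 1L;

    private long prev;
    private String client;

    // In-memory bookkeeping only: which fields changed since the last write.
    // Being transient, it is skipped by Java serialization.
    private transient BitSet dirty = new BitSet(2);

    void setPrev(long prev) {
      this.prev = prev;
      dirty.set(0);
    }

    void setClient(String client) {
      this.client = client;
      dirty.set(1);
    }

    long getPrev() {
      return prev;
    }

    String getClient() {
      return client;
    }

    // After deserialization the transient field is null, so treat null as "clean".
    boolean isDirty() {
      return dirty != null && !dirty.isEmpty();
    }
  }

  /** Serializes and deserializes a node in memory using plain Java serialization. */
  static Node roundTrip(Node node) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(node);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (Node) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("round trip failed", e);
    }
  }

  public static void main(String[] args) {
    Node node = new Node();
    node.setPrev(42L);
    System.out.println("dirty before: " + node.isDirty());

    Node copy = roundTrip(node);
    System.out.println("dirty after:  " + copy.isDirty());
    System.out.println("prev after:   " + copy.getPrev());
  }
}
```

In this sketch the copy comes back with its data intact but with no dirty information at all. Whether a deserialized entity should then be treated as fully clean or fully dirty is exactly the design question in this issue.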


[1] 
https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java#L262

> Serialization and deserialization of Persistent does not hold the entity 
> dirty state
> ------------------------------------------------------------------------------------
>
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on 
> gora-0.5. HBase backend.
>            Reporter: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>         Attachments: GORA-401-tests.patch
>
>   Original Estimate: 35h
>          Time Spent: 4h
>  Remaining Estimate: 31h
>
> After removing __g__dirty field in GORA-326, dirty field is not serialized. 
> In GORA-321 
> {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
>  went from using 
> {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
>  to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty 
> field to Avro (but really not desirable to have that field as a main field in 
> the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which 
> will serialize the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its 
> phases, serializes entities (from Map to Reduce), and on deserialization finds 
> all fields "dirty", regardless of which fields were modified in the Map, 
> and overwrites all data in the datastore (deleting many things: downloaded 
> content, parsed content, etc).
> This effect can be seen in 
> {{TestPersistentSerialization#testSerderEmployeeTwoFields}}, when debugging in 
> {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections 
> show that entities are "equal" when their fields are equal. This is fine as a 
> definition of "equal", but another test must be added to check that 
> serialization and deserialization keep the dirty state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
