[ 
https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250632#comment-14250632
 ] 

Alfonso Nishikawa commented on GORA-401:
----------------------------------------

Hi, [~drazzib]. Your question is closely related, but not exactly the same. 
When you wrote it I didn't understand, because I was using an older version, 
but after upgrading I understand what you mean, and I comment on it below in 
(1).
The problem I describe arises after GORA-326, applied on August 19th. I will 
answer [~renato2099] at the same time :)

Hi, [~renato2099]. When StateManager was deleted and the {{__g__dirty}} field 
was introduced inside the schema, Avro serialized it together with the rest of 
the fields, so the dirty state traveled with the entity (although, wrongly, it 
was losing the map's k-v dirty state). That was, in my opinion, a bad design. 
In GORA-326, {{__g__dirty}} was removed from the schema fields and became an 
in-memory dirty state that is not serialized by Avro. In my opinion this is a 
better design, because the dirty state is not part of the fields in the schema 
(but it still has flaws).

When an entity is sent from the Map phase to the Reduce phase, it is serialized 
with the Avro serializer, and losing the dirty state is a bad thing. Let's see 
why:
# You load an entity specifying only a few fields (a subset of all fields), as 
we know we can do. Fields not loaded have a null value (or the default value 
for primitive Java types).
# After serializing and deserializing, every field becomes dirty.
# When you write, *all* fields get persisted.
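
The three steps above can be sketched with a toy model. This is only an illustration under stated assumptions: {{ToyPage}}, {{DirtyStateLossDemo}} and the map used as a datastore are hypothetical names, not Gora classes, and the sketch assumes that the wire format carries field values only and that the reader fills the entity through the normal setter.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a Gora Persistent entity; this is NOT Gora's API.
final class ToyPage {
    static final int URL = 0, CONTENT = 1, SCORE = 2, FIELD_COUNT = 3;

    private final Object[] values = new Object[FIELD_COUNT];
    private final BitSet dirty = new BitSet(FIELD_COUNT); // in-memory only

    void put(int field, Object value) { values[field] = value; dirty.set(field); }
    Object get(int field) { return values[field]; }
    boolean isDirty(int field) { return dirty.get(field); }
    void clearDirty() { dirty.clear(); }

    // Assumption: the wire format carries field values only, no dirty bits.
    Object[] toWire() { return values.clone(); }

    // Assumption: the reader fills the entity through put(), marking each field dirty.
    static ToyPage fromWire(Object[] wire) {
        ToyPage page = new ToyPage();
        for (int f = 0; f < FIELD_COUNT; f++) page.put(f, wire[f]);
        return page;
    }
}

final class DirtyStateLossDemo {
    // A toy datastore row with all three fields populated.
    static Map<Integer, Object> newStore() {
        Map<Integer, Object> store = new HashMap<>();
        store.put(ToyPage.URL, "http://example.org/");
        store.put(ToyPage.CONTENT, "<html>downloaded content</html>");
        store.put(ToyPage.SCORE, 1.0);
        return store;
    }

    // Step 1: partial load, only SCORE is read; the other fields stay null.
    static ToyPage loadScoreOnly(Map<Integer, Object> store) {
        ToyPage page = new ToyPage();
        page.put(ToyPage.SCORE, store.get(ToyPage.SCORE));
        page.clearDirty(); // freshly loaded entities start clean
        return page;
    }

    // Step 3: the store persists every field flagged dirty.
    static void write(Map<Integer, Object> store, ToyPage page) {
        for (int f = 0; f < ToyPage.FIELD_COUNT; f++) {
            if (page.isDirty(f)) store.put(f, page.get(f));
        }
    }

    public static void main(String[] args) {
        Map<Integer, Object> store = newStore();
        ToyPage page = loadScoreOnly(store);
        page.put(ToyPage.SCORE, 2.0);                           // only SCORE is dirty
        ToyPage afterShuffle = ToyPage.fromWire(page.toWire()); // step 2: all dirty
        write(store, afterShuffle);
        System.out.println("content after write: " + store.get(ToyPage.CONTENT));
    }
}
```

In this toy model the content field, which was never loaded, ends up overwritten with null after the round trip.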

This, simply, was not the behavior when StateManager was in, nor before 
GORA-326. But there are more important implications:

* Since every field will *eventually* be written with a null value, you will 
have to define your schemas with all fields as "union null". Otherwise you will 
always have to read the whole entity.
* Nutch breaks horribly: after {{updatedb}} all downloaded content is deleted 
because updatedb does not load that field. I don't know why no one noticed it :P
* If you want to update only one field, you *always* have to read all the 
fields. Before this change, you could just read the interesting fields, update 
the interesting field and persist.
* If you create a new entity and are interested in only 1 field, you will have 
to assign a value to all fields or define all of them as nullable.
* etc...

About the "two mappers reading the same entity on different machines and 
modifying the entity differently", the answer is no different than before 
GORA-326: it depends on the situation, and you can mess things up in the same 
way as you can now.

Before GORA-326, the dirty fields were the ones being updated, and that is how 
I think it should be now too. (Obviously, if you wanted to delete a field, you 
wrote it blank).

I took a deep look at Nutch and I described the effect in the description of 
this issue, but I think it would be good if you took a look at Nutch yourself. 
Anyway, I feel a bit hurt noticing your preconception that the problem is 
probably something else :(

What I suggest:

I find DirtyStateManager the best design approach, but since the dirty state 
management has been shifted to the fields' types, I think it is ok to 
reintroduce the {{PersistentDatumWriter/Reader}}.
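
As a rough illustration of the idea (serializing the internal dirty state next to the field values), here is a minimal sketch. It does not reproduce the real {{PersistentDatumWriter/Reader}} nor any Avro API: {{ToyEntity}} and {{DirtyAwareCodec}} are hypothetical names, the fields are plain strings, and the byte layout is invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical entity with string fields; not a Gora class.
final class ToyEntity {
    final String[] values;
    final BitSet dirty;

    ToyEntity(int fieldCount) {
        values = new String[fieldCount];
        dirty = new BitSet(fieldCount);
    }

    void put(int field, String value) { values[field] = value; dirty.set(field); }
}

// Hypothetical codec: writes a dirty flag next to each field value.
final class DirtyAwareCodec {
    static byte[] write(ToyEntity entity) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(entity.values.length);
            for (int f = 0; f < entity.values.length; f++) {
                out.writeBoolean(entity.dirty.get(f));      // internal state
                out.writeBoolean(entity.values[f] != null); // null marker
                if (entity.values[f] != null) out.writeUTF(entity.values[f]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static ToyEntity read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            ToyEntity entity = new ToyEntity(in.readInt());
            for (int f = 0; f < entity.values.length; f++) {
                boolean wasDirty = in.readBoolean();
                // Assign directly so that reading does not mark the field dirty.
                if (in.readBoolean()) entity.values[f] = in.readUTF();
                if (wasDirty) entity.dirty.set(f);
            }
            return entity;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a codec of this kind, an entity that had only one dirty field before the Map-to-Reduce transfer still has only that field dirty afterwards, so a later write would touch just that field.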

(1) And about the question of [~drazzib]: before introducing {{__g__dirty}} in 
the fields, Maps were managing the key-values added and deleted. Now that 
incremental information is not taken into account, forcing you to read and 
write all the key-values every time you read/write. I find it wrong, since that 
information was useful to avoid loading the field (all k-v) just to delete some 
key-values (I used to do that), but well... now there are so many changes to 
roll back, so ok.
If I had to choose between the StateManager and the state managed in the 
instance of Maps, I would vote for the StateManager, because each backend could 
use a state manager suited to that backend. But well... maybe that will 
come some day.
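
To make the incremental map information in (1) concrete, here is a minimal sketch of a map wrapper that records only the key-values added and deleted. {{DeltaTrackingMap}} and its methods are hypothetical and do not correspond to any past or present Gora class; a plain {{java.util.Map}} stands in for the backend row.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical map wrapper that records changes; not a Gora class.
final class DeltaTrackingMap<K, V> {
    private final Map<K, V> puts = new HashMap<>();  // keys added or updated
    private final Set<K> removals = new HashSet<>(); // keys to delete

    void put(K key, V value) {
        removals.remove(key);
        puts.put(key, value);
    }

    // Works even if the key was never loaded: only the intent is recorded.
    void remove(K key) {
        puts.remove(key);
        removals.add(key);
    }

    Map<K, V> pendingPuts() { return Collections.unmodifiableMap(puts); }
    Set<K> pendingRemovals() { return Collections.unmodifiableSet(removals); }

    // Apply only the delta to a backend row, without rewriting untouched keys.
    void flushTo(Map<K, V> backendRow) {
        backendRow.putAll(puts);
        backendRow.keySet().removeAll(removals);
        puts.clear();
        removals.clear();
    }
}
```

The point of the sketch is that deleting or adding a few key-values does not require loading the whole map first; only the recorded delta is applied on write.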

Thanks!


> Serialization and deserialization of Persistent does not hold the entity 
> dirty state
> ------------------------------------------------------------------------------------
>
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on 
> gora-0.5
>            Reporter: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>   Original Estimate: 35h
>  Remaining Estimate: 35h
>
> After removing __g__dirty field in GORA-326, dirty field is not serialized. 
> In GORA-321 
> {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
>  went from using 
> {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
>  to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty 
> field to Avro (although it is really not desirable to have that field as a 
> main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which 
> will serialize the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its 
> phases, serializes entities (from Map to Reduce), and on deserialization finds 
> all fields as "dirty", independently of what fields were modified in the Map, 
> and overwrites all data in the datastore (deleting many things: downloaded 
> content, parsed content, etc).
> This effect can be seen in 
> {{TestPersistentSerialization#testSerderEmployeeTwoFields}}, when debugging in 
> {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections 
> show that entities are "equal" when their fields are equal. This is fine as a 
> definition of "equal", but another test must be added to check that 
> serialization and deserialization keep the dirty state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
