Bruno Rijsman created THRIFT-4638:
-------------------------------------

             Summary: When I use Thrift to serialize map in C++ to disk, and 
then de-serialize it using Python, I do not get back the same object
                 Key: THRIFT-4638
                 URL: https://issues.apache.org/jira/browse/THRIFT-4638
             Project: Thrift
          Issue Type: Bug
          Components: C++ - Library, Python - Library
    Affects Versions: 1.0
            Reporter: Bruno Rijsman


Summary: when I use Thrift to serialize map in C++ to disk, and then 
de-serialize it using Python, I do not get back the same object.

 

See also 
https://stackoverflow.com/questions/52377410/when-i-use-thrift-to-serialize-map-in-c-to-disk-and-then-de-serialize-it-usin

I have the following Thrift model file, which (as you can see) contains a map 
indexed by a struct:

struct Coordinate {
 1: required i32 x;
 2: required i32 y;
 }

struct Terrain {
 1: required map<Coordinate, i32> altitude_samples;
 }

I use the following C++ code to create an object with 3 coordinates in the map 
(see the repo for complete code for all snippets below):

Terrain terrain;
 add_sample_to_terrain(terrain, 10, 10, 100);
 add_sample_to_terrain(terrain, 20, 20, 200);
 add_sample_to_terrain(terrain, 30, 30, 300);

where:

void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t 
altitude)
 {
 Coordinate coordinate;
 coordinate.x = x;
 coordinate.y = y;
 std::pair<Coordinate, int32_t> sample(coordinate, altitude);
 terrain.altitude_samples.insert(sample);
 }

I use the following C++ code to serialize an object to disk:

shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
 shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
 terrain.write(protocol.get());

Important note: for this to work correctly, I had to implement the function 
Coordinate::operator<. Thrift does generate the declaration for the 
Coordinate::operator< but does not generate the implementation of 
Coordinate::operator<. The reason for this is that Thrift does not understand 
the semantics of the struct and hence cannot guess the correct implementation 
of the comparison operator. This is discussed at 
http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%3c4c4e08bd.8030...@facebook.com%3E

// Thrift generates the declaration but not the implementation of operator< 
because it has no way
 // of knowning what the criteria for the comparison are. So, provide the 
implementation here.
 bool Coordinate::operator<(const Coordinate& other) const
 {
 if (x < other.x) {
 return true;
 } else if (x > other.x) {
 return false;
 } else if (y < other.y) {
 return true;
 } else {
 return false;
 }
 }

Then, finally, I use the following Python code to de-serialize the same object 
from disk:

file = open("terrain.dat", "rb")
 transport = thrift.transport.TTransport.TFileObjectTransport(file)
 protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
 terrain = Terrain()
 terrain.read(protocol)
 print(terrain)

This Python program outputs:

Terrain(altitude_samples=None)

In other words, the de-serialized Terrain contains no terrain_samples, instead 
of the expected dictionary with 3 coordinates.

I am 100% sure that the file terrain.dat contains valid data: I also 
de-serialized the same data using C++ and in that case, I *do* get the expected 
results (see repo for details)

I suspect that this has something to do with the comparison operator.

I gut feeling is that I should have done something similar in Python with 
respect to the comparison operator as I did in C++. But I don't know what that 
missing something would be.

Additional information added on 19-Sep-2018:

Here is a hexdump of the encoding produced by the C++ encoding program:

Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 
 00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
 00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
 00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
 00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
 00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
 00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
 00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
 00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
 00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
 00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
 000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
 000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....

The first 4 bytes are 01 00 00 00

Using a debugger to step through the Python decoding function reveals that:

* This is being decoded as a struct (which is expected)

* The first byte 01 is interpreted as the field type. 01 means field type VOID.

* The next two bytes are interpreted as the field id. 00 00 means field ID 0.

* For field type VOID, nothing else is read and we continue to the next field.

* The next byte is interpreted as the field type. 00 means STOP.

* We top reading data for the struct.

* The final result is an empty struct.

All off the above is consistent with the information at 
https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
 which describes the Thrift binary encoding format

My conclusion thus far is that the C++ encoder appears to produce an 
"incorrect" binary encoding (I put incorrect in quotes because certainly 
something as blatant as that would have been discovered by lots of other 
people, so I am sure that I am still missing something).

Additional information added on 19-Sep-2018:

It appears that the C++ implementation of TFileTransport has the concept of 
"events" when writing to disk.

The output which is written to disk is divided into a sequence of "events" 
where each "event" is preceded by a 4-byte length field of the event, followed 
by the contents of the event.

Looking at the hexdump above, the first couple of events are:

`0100 0000 0d` : Event length 1, event value `0d`

`02 0000 0000 01` : Event length 2, event value `00 01`

Etc.

The Python implementation of TFileTransport does not understand this concept of 
events when parsing the file.

It appears that the problem is one of the following two:

1) Either the C++ code should not be inserting these event lengths into the 
encoded file,

2) Or the Python code should understand these event lengths when decoding the 
file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to