Joel Croteau created SPARK-22338:
------------------------------------

             Summary: namedtuple serialization is inefficient
                 Key: SPARK-22338
                 URL: https://issues.apache.org/jira/browse/SPARK-22338
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 2.2.0
            Reporter: Joel Croteau
            Priority: Minor


I greatly appreciate the level of hack that PySpark contains in order to make 
namedtuples serializable, but I feel like it could be done a little better. In 
particular, say I create a namedtuple class with a few long argument names like 
this:
{code:JiraShouldReallySupportPython}
MyTuple = namedtuple('MyTuple', ('longarga', 'longargb', 'longargc'))
{code}
If a create an instance of this, here is how PySpark serializes it:
{code:JiraShouldReallySupportPython}
mytuple = MyTuple(1, 2, 3)
pickle.dumps(mytuple, pickle.HIGHEST_PROTOCOL)
b'\x80\x04\x95]\x00\x00\x00\x00\x00\x00\x00\x8c\x13pyspark.serializers\x94\x8c\x08_restore\x94\x93\x94\x8c\x07MyTuple\x94\x8c\x08longarga\x94\x8c\x08longargb\x94\x8c\x08longargc\x94\x87\x94K\x01K\x02K\x03\x87\x94\x87\x94R\x94.'
{code}
This serialization includes the name of the namedtuple class, the names of each 
of its members, as well as reference to internal functions in 
pyspark.serializers. By comparison, this is what I get if I serialize the bare 
tuple:
{code:JiraShouldReallySupportPython}
shorttuple = (1,2,3)
pickle.dumps(shorttuple, pickle.HIGHEST_PROTOCOL)
b'\x80\x04\x95\t\x00\x00\x00\x00\x00\x00\x00K\x01K\x02K\x03\x87\x94.'
{code}
Much shorter. For another comparison, here is what it looks like if I build a 
dict with the same data and element names:
{code:JiraShouldReallySupportPython}
mydict = {'longarga':1, 'longargb':2, 'longargc':3}
pickle.dumps(mydict, pickle.HIGHEST_PROTOCOL)
b'\x80\x04\x95,\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x08longarga\x94K\x01\x8c\x08longargb\x94K\x02\x8c\x08longargc\x94K\x03u.'
{code}
In other words, even using a dict is substantially shorter than using a 
namedtuple in its current form. There shouldn't be any need for namedtuples to 
have this much overhead in their serialization. For one thing, if the class 
object is being broadcast to the nodes, there should be no need for each 
namedtuple instance to include all of the field names; the class name should be 
enough. If you use namedtuples heavily, this can create a lot of overhead in 
memory and disk use. I am going to try and improve the serialization and submit 
a patch if I can find the time, but I don't know the pyspark code too well, so 
if anyone has suggestions for where to start, I would love to hear them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to