Suvayu Ali created ARROW-3806:
---------------------------------

             Summary: [Python] When converting nested types to pandas, use tuples
                 Key: ARROW-3806
                 URL: https://issues.apache.org/jira/browse/ARROW-3806
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.11.1
         Environment: Fedora 29, pyarrow installed with conda
            Reporter: Suvayu Ali


When converting to pandas, convert nested types (e.g. list) to tuples.  Columns 
with lists are difficult to query.  Here are a few unsuccessful attempts:

{code}
>>> mini
    CHROM    POS           ID            REF    ALTS  QUAL
80     20  63521  rs191905748              G     [A]   100
81     20  63541  rs117322527              C     [A]   100
82     20  63548  rs541129280              G    [GT]   100
83     20  63553  rs536661806              T     [C]   100
84     20  63555  rs553463231              T     [C]   100
85     20  63559  rs138359120              C     [A]   100
86     20  63586  rs545178789              T     [G]   100
87     20  63636  rs374311122              G     [A]   100
88     20  63696  rs149160003              A     [G]   100
89     20  63698  rs544072005              A     [C]   100
90     20  63729  rs181483669              G     [A]   100
91     20  63733   rs75670495              C     [T]   100
92     20  63799    rs1418258              C     [T]   100
93     20  63808   rs76004960              G     [C]   100
94     20  63813  rs532151719              G     [A]   100
95     20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
96     20  63865  rs551938596              G     [A]   100
97     20  63902  rs571779099              A     [T]   100
98     20  63963  rs531152674              G     [A]   100
99     20  63967  rs116770801              A     [G]   100
100    20  63977  rs199703510              C     [G]   100
101    20  64016  rs143263863              G     [A]   100
102    20  64062  rs148297240              G     [A]   100
103    20  64139  rs186497980              G  [A, T]   100
104    20  64150    rs7274499              C     [A]   100
105    20  64151  rs190945171              C     [T]   100
106    20  64154  rs537656456              T     [G]   100
107    20  64175  rs116531220              A     [G]   100
108    20  64186  rs141793347              C     [G]   100
109    20  64210  rs182418654              G     [C]   100
110    20  64303  rs559929739              C     [A]   100
{code}

# I think this one fails because pandas treats the list as an array-like and tries an element-wise comparison against the whole column (31 rows vs. 2 elements).
{code}
>>> mini[mini.ALTS == ["A", "T"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1283, in wrapper
    res = na_op(values, other)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1143, in na_op
    result = _comp_method_OBJECT_ARRAY(op, x, y)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1120, in _comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x, y, op)
  File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 31 vs 2
{code}
# I think this fails for a similar reason, but the broadcasting happens inside each row: the comparison returns an array per row instead of a single boolean.
{code}
>>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
    indexer = check = labels.get_indexer(objarr)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
    indexer = self._engine.get_indexer(target._ndarray_values)
  File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
  File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'
>>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
80     [True, False]
81     [True, False]
82    [False, False]
83    [False, False]
84    [False, False]
{code}
# Unfortunately this clever hack fails as well!
{code}
>>> c = np.empty(1, object)
>>> c[0] = ["A", "T"]
>>> mini[mini.ALTS.values == c]
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
>>> mini.ALTS.values == c
False
{code}
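
The {{[True, False]}} rows printed by the second attempt suggest that each ALTS cell is a NumPy array rather than a Python list, so {{==}} against a list compares element by element (broadcasting a one-element cell) instead of yielding one boolean per row. A minimal sketch of that behaviour follows; the object arrays are hand-built stand-ins for the cells, which is an assumption, since the real cells come from the pyarrow conversion:

```python
import numpy as np

# Hypothetical stand-ins for three ALTS cells (assumed to be object arrays).
cell_a = np.array(["A"], dtype=object)
cell_gt = np.array(["GT"], dtype=object)
cell_at = np.array(["A", "T"], dtype=object)

target = ["A", "T"]

# `==` is element-wise: a one-element cell is broadcast against both
# entries of the target, so the result is an array, not a single bool.
print(cell_a == target)
print(cell_gt == target)
print(cell_at == target)

# np.array_equal checks shape and contents and returns one bool per cell.
matches = [bool(np.array_equal(c, target)) for c in (cell_a, cell_gt, cell_at)]
print(matches)
```

Only the last cell matches under {{np.array_equal}}, which is the per-row answer the queries above were after.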

Finally, what does succeed is the following (probably because tuples are immutable):
{code}
>>> mini["ALTS2"] = mini.ALTS.apply(tuple)
>>> mini.head()
   CHROM    POS           ID REF  ALTS  QUAL  ALTS2
80    20  63521  rs191905748   G   [A]   100   (A,)
81    20  63541  rs117322527   C   [A]   100   (A,)
82    20  63548  rs541129280   G  [GT]   100  (GT,)
83    20  63553  rs536661806   T   [C]   100   (C,)
84    20  63555  rs553463231   T   [C]   100   (C,)
>>> mini[mini["ALTS2"] == ("A", "T")]
    CHROM    POS           ID REF    ALTS  QUAL   ALTS2
103    20  64139  rs186497980   G  [A, T]   100  (A, T)
>>> mini[mini["ALTS2"] == ("GT",)]
   CHROM    POS           ID REF  ALTS  QUAL  ALTS2
82    20  63548  rs541129280   G  [GT]   100  (GT,)
>>> mini[mini["ALTS2"] == tuple("C")]
    CHROM    POS           ID            REF ALTS  QUAL ALTS2
83     20  63553  rs536661806              T  [C]   100  (C,)
84     20  63555  rs553463231              T  [C]   100  (C,)
89     20  63698  rs544072005              A  [C]   100  (C,)
93     20  63808   rs76004960              G  [C]   100  (C,)
95     20  63857  rs543686274  CCTGGAAAGGATT  [C]   100  (C,)
109    20  64210  rs182418654              G  [C]   100  (C,)
{code}
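
Until the conversion itself produces tuples, the workaround above could be wrapped in a small helper. This is only a sketch: {{lists_to_tuples}} is a hypothetical name, not a pyarrow or pandas API, and the three-row frame is a hand-built stand-in for {{mini}}.

```python
import pandas as pd


def lists_to_tuples(df, columns):
    """Return a copy of df with every cell in `columns` converted to a tuple."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(tuple)
    return out


# Hand-built stand-in for the frame in the report (assumption).
mini = pd.DataFrame({
    "POS": [63521, 63548, 64139],
    "ALTS": [["A"], ["GT"], ["A", "T"]],
})
converted = lists_to_tuples(mini, ["ALTS"])

# Build the mask row by row so it does not depend on how pandas treats
# a tuple on the right-hand side of `==`.
mask = converted["ALTS"].apply(lambda alts: alts == ("A", "T"))
print(converted[mask])
```

The helper returns a new frame and leaves the original column of lists untouched, so it can be applied right after {{to_pandas()}} without affecting other code that expects lists.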



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
