Suvayu Ali created ARROW-3806:
---------------------------------
Summary: [Python] When converting nested types to pandas, use tuples
Key: ARROW-3806
URL: https://issues.apache.org/jira/browse/ARROW-3806
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 0.11.1
Environment: Fedora 29, pyarrow installed with conda
Reporter: Suvayu Ali
When converting to pandas, convert nested types (e.g. list) to tuples. Columns
with lists are difficult to query. Here are a few unsuccessful attempts:
{code}
>>> mini
     CHROM    POS           ID            REF    ALTS  QUAL
80      20  63521  rs191905748              G     [A]   100
81      20  63541  rs117322527              C     [A]   100
82      20  63548  rs541129280              G    [GT]   100
83      20  63553  rs536661806              T     [C]   100
84      20  63555  rs553463231              T     [C]   100
85      20  63559  rs138359120              C     [A]   100
86      20  63586  rs545178789              T     [G]   100
87      20  63636  rs374311122              G     [A]   100
88      20  63696  rs149160003              A     [G]   100
89      20  63698  rs544072005              A     [C]   100
90      20  63729  rs181483669              G     [A]   100
91      20  63733   rs75670495              C     [T]   100
92      20  63799    rs1418258              C     [T]   100
93      20  63808   rs76004960              G     [C]   100
94      20  63813  rs532151719              G     [A]   100
95      20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
96      20  63865  rs551938596              G     [A]   100
97      20  63902  rs571779099              A     [T]   100
98      20  63963  rs531152674              G     [A]   100
99      20  63967  rs116770801              A     [G]   100
100     20  63977  rs199703510              C     [G]   100
101     20  64016  rs143263863              G     [A]   100
102     20  64062  rs148297240              G     [A]   100
103     20  64139  rs186497980              G  [A, T]   100
104     20  64150    rs7274499              C     [A]   100
105     20  64151  rs190945171              C     [T]   100
106     20  64154  rs537656456              T     [G]   100
107     20  64175  rs116531220              A     [G]   100
108     20  64186  rs141793347              C     [G]   100
109     20  64210  rs182418654              G     [C]   100
110     20  64303  rs559929739              C     [A]   100
{code}
# I think this one fails because it tries to broadcast the comparison.
{code}
>>> mini[mini.ALTS == ["A", "T"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1283, in wrapper
    res = na_op(values, other)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1143, in na_op
    result = _comp_method_OBJECT_ARRAY(op, x, y)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1120, in _comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x, y, op)
  File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 31 vs 2
{code}
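To illustrate the broadcasting behaviour suspected above: pandas reads a list on the right-hand side of == as one value per row, so its length has to match the column. The following is a minimal sketch with a hypothetical three-row stand-in for the ALTS column (plain Python lists, which may differ from what pyarrow actually produces); the exact error message varies between pandas versions.

```python
import pandas as pd

# Hypothetical three-row stand-in for the ALTS column (not the real data).
alts = pd.Series([["A"], ["GT"], ["A", "T"]], index=[80, 82, 103])

# With matching lengths, the list is compared row by row.
same_length = pd.Series(["A", "T"]) == ["A", "T"]
print(same_length.tolist())

# A two-element list cannot be lined up against three rows.
try:
    alts == ["A", "T"]
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

So the list is never treated as a single value to look for in each cell, which is what the query intended.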
# I think this fails for a similar reason, but the broadcasting happens in a
different place.
{code}
>>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
    indexer = check = labels.get_indexer(objarr)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
    indexer = self._engine.get_indexer(target._ndarray_values)
  File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
  File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'
>>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
80 [True, False]
81 [True, False]
82 [False, False]
83 [False, False]
84 [False, False]
{code}
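The elementwise [True, False] results suggest that each cell holds an array-like rather than a plain list. Assuming that is the case, one possible workaround is to coerce each cell to a list before comparing, so that every row yields a single boolean. This is only a sketch: alts and target are hypothetical names, and the stand-in data uses plain lists.

```python
import pandas as pd

# Hypothetical stand-in for the ALTS column (not the real data).
alts = pd.Series([["A"], ["GT"], ["A", "T"]], index=[80, 82, 103])
target = ["A", "T"]

# list(x) also accepts a NumPy array, and list == list gives one bool per row.
mask = alts.apply(lambda x: list(x) == target)
selected = alts[mask]
print(selected)
```

Because mask is a boolean Series with one entry per row, it can be used directly as a row filter.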
# Unfortunately this clever hack fails as well!
{code}
>>> c = np.empty(1, object)
>>> c[0] = ["A", "T"]
>>> mini[mini.ALTS.values == c]
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
>>> mini.ALTS.values == c
False
{code}
Finally, the following does succeed (probably because tuples are immutable):
{code}
>>> mini["ALTS2"] = mini.ALTS.apply(tuple)
>>> mini.head()
    CHROM    POS           ID  REF  ALTS  QUAL  ALTS2
80     20  63521  rs191905748    G   [A]   100   (A,)
81     20  63541  rs117322527    C   [A]   100   (A,)
82     20  63548  rs541129280    G  [GT]   100  (GT,)
83     20  63553  rs536661806    T   [C]   100   (C,)
84     20  63555  rs553463231    T   [C]   100   (C,)
>>> mini[mini["ALTS2"] == ("A", "T")]
     CHROM    POS           ID  REF    ALTS  QUAL   ALTS2
103     20  64139  rs186497980    G  [A, T]   100  (A, T)
>>> mini[mini["ALTS2"] == ("GT",)]
    CHROM    POS           ID  REF  ALTS  QUAL  ALTS2
82     20  63548  rs541129280    G  [GT]   100  (GT,)
>>> mini[mini["ALTS2"] == tuple("C")]
     CHROM    POS           ID            REF  ALTS  QUAL  ALTS2
83      20  63553  rs536661806              T   [C]   100   (C,)
84      20  63555  rs553463231              T   [C]   100   (C,)
89      20  63698  rs544072005              A   [C]   100   (C,)
93      20  63808   rs76004960              G   [C]   100   (C,)
95      20  63857  rs543686274  CCTGGAAAGGATT   [C]   100   (C,)
109     20  64210  rs182418654              G   [C]   100   (C,)
{code}
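The apply(tuple) call above converts only the outermost list. For columns with deeper nesting, a recursive helper along the following lines might be used on the pandas side. This is a sketch only: nested_to_tuple is a hypothetical name and is not part of pyarrow or pandas.

```python
def nested_to_tuple(value):
    """Recursively turn lists and other non-string iterables into tuples."""
    # Strings are iterable, but should stay as they are.
    if isinstance(value, (str, bytes)):
        return value
    try:
        items = iter(value)
    except TypeError:
        # Not iterable (e.g. int, float, None): return unchanged.
        return value
    return tuple(nested_to_tuple(item) for item in items)


converted = nested_to_tuple([["A", "T"], ["C"]])
print(converted)
```

With such a helper, the conversion above would read mini["ALTS2"] = mini.ALTS.apply(nested_to_tuple). The resulting tuples are hashable as long as their leaf values are.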