jorisvandenbossche commented on a change in pull request #12010:
URL: https://github.com/apache/arrow/pull/12010#discussion_r781008612
##########
File path: python/pyarrow/table.pxi
##########
@@ -2442,6 +2602,46 @@ def _from_pydict(cls, mapping, schema, metadata):
raise TypeError('Schema must be an instance of pyarrow.Schema')
+def _from_pylist(cls, mapping, schema, metadata):
+ """
+ Construct a Table/RecordBatch from list of dictionary of rows.
+
+ Parameters
+ ----------
+ cls : Class Table/RecordBatch
+ mapping : list of dicts of rows
+ A mapping of strings to row values.
+ schema : Schema, default None
+ If not passed, will be inferred from the Mapping values.
Review comment:
Maybe you can specify here that it will be inferred from the _first_ row
(also in the actual user-facing docstrings above)
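To make the reviewer's point concrete, here is a minimal pure-Python sketch of the row-to-column pivot in `_from_pylist`, showing that inference uses the first row's keys only (no pyarrow dependency; `rows_to_columns` is an illustrative name, not part of the PR):

```python
# Sketch of the row -> column pivot in _from_pylist: column names are
# inferred from the FIRST row only.
def rows_to_columns(rows):
    names = list(rows[0].keys()) if rows else []
    # Keys absent from a later row become None; keys absent from the
    # first row are silently dropped.
    return {n: [row.get(n) for row in rows] for n in names}

rows = [{'int': 1, 'str': 'a'}, {'int': 2, 'str': 'b', 'extra': 99}]
print(rows_to_columns(rows))
# -> {'int': [1, 2], 'str': ['a', 'b']}  ('extra' is dropped)
```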
##########
File path: python/pyarrow/table.pxi
##########
@@ -671,13 +671,61 @@ cdef class RecordBatch(_PandasConvertible):
Returns
-------
RecordBatch
+
+ Examples
+ --------
+ >>> import pyarrow as pa
+ >>> pydict = {'int': [1, 2], 'str': ['a', 'b']}
+ >>> pa.RecordBatch.from_pydict(pydict)
+ pyarrow.RecordBatch
+ int: int64
+ str: string
"""
return _from_pydict(cls=RecordBatch,
mapping=mapping,
schema=schema,
metadata=metadata)
+ @staticmethod
+ def from_pylist(mapping, schema=None, metadata=None):
+ """
+ Construct a RecordBatch from list of dictionary of rows.
Review comment:
```suggestion
Construct a RecordBatch from list of rows / dictionaries.
```
Each dictionary represents a row, so "dictionary of rows" sounds a bit
strange ("dictionary of row values" could be strictly speaking more correct,
but I still find that not super clear)
##########
File path: python/pyarrow/table.pxi
##########
@@ -1016,6 +1064,28 @@ cdef class RecordBatch(_PandasConvertible):
entries.append((name, column))
return ordered_dict(entries)
+ def to_pylist(self, index=None):
+ """
+ Convert the RecordBatch to a list of dictionaries of rows.
+
+ Parameters
+ ----------
+ index: list
+ A list of column names to index.
Review comment:
Is this `index` keyword needed? (It selects a subset of columns to
export.) E.g. `to_pydict` doesn't have it (we should probably add it there as well
if we want to keep it)
We nowadays have the `select()` method, so it is relatively straightforward
to do `table.select([...]).to_pylist()` instead of
`table.to_pylist(index=[...])`.
Or if we keep it, I would call it something other than `index`, for example
`columns=[...]`.
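A small pure-Python sketch of the `columns=[...]` semantics being suggested (a dict of lists stands in for a columnar batch; `to_pylist_sketch` is a hypothetical name, not pyarrow API):

```python
# Sketch of to_pylist with a column-subset keyword named 'columns'
# instead of 'index'.
def to_pylist_sketch(batch, columns=None):
    # batch: dict of column name -> list of values
    names = columns if columns is not None else list(batch)
    n_rows = len(next(iter(batch.values()), []))
    return [{n: batch[n][i] for n in names} for i in range(n_rows)]

batch = {'int': [1, 2], 'str': ['a', 'b']}
print(to_pylist_sketch(batch, columns=['int']))
# -> [{'int': 1}, {'int': 2}]
```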
##########
File path: python/pyarrow/table.pxi
##########
@@ -2442,6 +2602,46 @@ def _from_pydict(cls, mapping, schema, metadata):
raise TypeError('Schema must be an instance of pyarrow.Schema')
+def _from_pylist(cls, mapping, schema, metadata):
+ """
+ Construct a Table/RecordBatch from list of dictionary of rows.
+
+ Parameters
+ ----------
+ cls : Class Table/RecordBatch
+ mapping : list of dicts of rows
+ A mapping of strings to row values.
+ schema : Schema, default None
+ If not passed, will be inferred from the Mapping values.
+ metadata : dict or Mapping, default None
+ Optional metadata for the schema (if inferred).
+
+ Returns
+ -------
+ Table/RecordBatch
+ """
+
+ arrays = []
+ if schema is None:
+ names = []
+ if mapping:
+ names = list(mapping[0].keys())
+ for n in names:
+ v = [i[n] if n in i else None for i in mapping]
+ arrays.append(asarray(v))
+ return cls.from_arrays(arrays, names, metadata=metadata)
+ else:
+ if isinstance(schema, Schema):
+ for n in schema.names:
+ v = [i[n] if n in i else None for i in mapping]
+ n_type = schema.types[schema.get_field_index(n)]
+ arrays.append(asarray(v, type=n_type))
Review comment:
The `asarray` with the type from the schema also gets done inside
`from_arrays`, so it might be unnecessary to do it here as well
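To illustrate the double conversion being flagged, a pure-Python sketch where counting stand-ins replace pyarrow's `asarray` and `from_arrays` (all names here are illustrative, not pyarrow API):

```python
# Stand-ins that count conversion requests, to show the same column is
# converted twice when asarray is called both in _from_pylist and again
# inside from_arrays.
calls = []

def asarray_stub(values, type=None):
    calls.append(type)          # record each conversion request
    return values

def from_arrays_stub(arrays, types):
    # from_arrays also coerces each column to its schema type.
    return [asarray_stub(a, t) for a, t in zip(arrays, types)]

rows = [{'a': 1}, {'a': 2}]
# Current PR code path: convert per column, then again in from_arrays.
cols = [asarray_stub([r.get('a') for r in rows], 'int64')]
from_arrays_stub(cols, ['int64'])
print(len(calls))  # -> 2: the first asarray call is redundant
```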
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]