timsaucer opened a new issue, #647:
URL: https://github.com/apache/datafusion-python/issues/647

   **Describe the bug**
   Calling `show()` on a DataFrame that applies a built-in window function to 
a column with struct elements fails with the error 
`Compute error: concat requires input of at least one array`. Other 
operations on the same DataFrame, such as `count()`, succeed. I am fairly new 
to DataFusion and assumed that `count()` performs a full evaluation as it does 
in PySpark. That assumption may be wrong, in which case the `count()` result 
has no bearing on this error.
   
   **To Reproduce**
   This minimal example shows the window function working on a simple column 
type and failing on a very simple struct column.
   
   ```
   import pyarrow as pa
   from datafusion import SessionContext
   import datafusion.functions as F
   
   # taken from datafusion/tests/test_dataframe.py
   def struct_df():
       ctx = SessionContext()
   
       # create a RecordBatch and a new DataFrame from it
       batch = pa.RecordBatch.from_arrays(
           [pa.array([{"c": 1}, {"c": 2}, {"c": 3}]), pa.array([4, 5, 6])],
           names=["a", "b"],
       )
   
       return ctx.create_dataframe([[batch]])
   
   df = struct_df()
   
   df.show()
   
   df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("b")]).alias("lag_b")).show()
   
   print("Calling count on lag a: ", df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).count())
   
   df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).show()
   ```
   
   Produces the following output:
   
   ```
   DataFrame()
   +--------+---+
   | a      | b |
   +--------+---+
   | {c: 1} | 4 |
   | {c: 2} | 5 |
   | {c: 3} | 6 |
   +--------+---+
   DataFrame()
   +--------+---+-------+
   | a      | b | lag_b |
   +--------+---+-------+
   | {c: 1} | 4 |       |
   | {c: 2} | 5 | 4     |
   | {c: 3} | 6 | 5     |
   +--------+---+-------+
   Calling count on lag a:  3
   Traceback (most recent call last):
     File "/Users/tsaucer/src/arrow-datafusion-python/example_lag_struct.py", line 25, in <module>
       df.select(F.col("a"), F.col("b"), F.window("lag", [F.col("a")]).alias("lag_a")).show()
   Exception: Arrow error: Compute error: concat requires input of at least one array
   ```
   
   While searching the web, I found a similar error in sort operations that 
this older PR resolved: 
https://github.com/apache/arrow/pull/9275/files#diff-3ee8e6ac2472badc7bb448c360f56ed60f06a787d1f45ea589d9e213eaf2ae82
   
   **Expected behavior**
   Calling `show()` on a window function over a struct column should behave 
the same way as it does for simple column types.
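
For reference, the expected contents of `lag_a` can be sketched in plain Python. This is only a model of lag semantics (offset 1, null default), not DataFusion code. The helper name `lag` and its parameters are hypothetical and chosen for illustration. It reproduces the `lag_b` column from the output above and shows the equivalent result for the struct column.

```python
from typing import Any, Optional, Sequence


def lag(values: Sequence[Any], offset: int = 1, default: Optional[Any] = None) -> list:
    """Plain-Python model of LAG: row i takes the value from row i - offset.

    Hypothetical helper for illustration only; assumes offset >= 0.
    """
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


# Same data as the reproduction script: "a" holds structs, "b" holds ints.
col_a = [{"c": 1}, {"c": 2}, {"c": 3}]
col_b = [4, 5, 6]

lag_b = lag(col_b)  # mirrors the lag_b column that show() printed
lag_a = lag(col_a)  # what lag_a would be expected to contain

print(lag_b)
print(lag_a)
```

Under these assumptions the first row of `lag_a` would be null and the remaining rows would hold the previous row's struct, mirroring what `lag_b` already shows for the integer column.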
   
   **Additional context**
   I'm willing to work on this myself, but I'm not familiar with the internals 
of plan execution. I have looked for an obvious cause, but nothing stands out. 
Any directions or pointers would be appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

