zhengruifeng opened a new pull request, #46281:
URL: https://github.com/apache/spark/pull/46281

   ### What changes were proposed in this pull request?
   Cache `DataFrame.isStreaming`
   
   
   ### Why are the changes needed?
   In PS, `DataFrame.isStreaming` is used in the construction of `InternalFrame`
   
   
https://github.com/apache/spark/blob/e01ac581f46aa595e66daf33fe92b56d1328bc78/python/pyspark/pandas/internal.py#L624
   
   it might cause performance issues since a lot of `InternalFrame` will be 
built in even simple logic, such as
   
   ```
   import cProfile, pstats
   import pyspark.pandas as ps
   
   df1 = ps.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3, 
5]}, columns=['lkey', 'value'])
   
   cProfile.run("df1['value2'] = df1['value'] + 123 + 456", 
"/tmp/profile_results")
   
   pstats.Stats("/tmp/profile_results").sort_stats("cumtime").print_stats(.1)
   ```
   
   before:
   ```
      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.000    0.000    0.120    0.120 {built-in method builtins.exec}
           1    0.000    0.000    0.120    0.120 <string>:1(<module>)
          27    0.000    0.000    0.113    0.004 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/client/core.py:1160(_analyze)
          27    0.000    0.000    0.112    0.004 
/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/grpc/_channel.py:1161(__call__)
          27    0.000    0.000    0.112    0.004 
/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/grpc/_channel.py:1124(_blocking)
          27    0.110    0.004    0.110    0.004 {method 'next_event' of 
'grpc._cython.cygrpc.SegregatedCall' objects}
           2    0.000    0.000    0.108    0.054 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/base.py:318(__add__)
           2    0.000    0.000    0.078    0.039 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/data_type_ops/num_ops.py:79(add)
           2    0.000    0.000    0.074    0.037 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/base.py:210(wrapper)
          24    0.000    0.000    0.064    0.003 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/internal.py:1435(copy)
          24    0.000    0.000    0.063    0.003 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/internal.py:535(__init__)
   
   ```
   
   after:
   ```
      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
           1    0.000    0.000    0.082    0.082 {built-in method builtins.exec}
           1    0.000    0.000    0.082    0.082 <string>:1(<module>)
           2    0.000    0.000    0.081    0.041 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/base.py:318(__add__)
           4    0.000    0.000    0.078    0.019 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/client/core.py:1160(_analyze)
           4    0.000    0.000    0.077    0.019 
/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/grpc/_channel.py:1161(__call__)
           4    0.000    0.000    0.077    0.019 
/Users/ruifeng.zheng/.dev/miniconda3/envs/spark_dev_311/lib/python3.11/site-packages/grpc/_channel.py:1124(_blocking)
           4    0.077    0.019    0.077    0.019 {method 'next_event' of 
'grpc._cython.cygrpc.SegregatedCall' objects}
          26    0.000    0.000    0.058    0.002 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/dataframe.py:1783(schema)
           3    0.000    0.000    0.058    0.019 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/connect/client/core.py:1031(schema)
           2    0.000    0.000    0.057    0.029 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/data_type_ops/num_ops.py:79(add)
           2    0.000    0.000    0.057    0.028 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/base.py:210(wrapper)
          24    0.000    0.000    0.026    0.001 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/internal.py:1435(copy)
          21    0.000    0.000    0.025    0.001 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/internal.py:1419(select_column)
          19    0.000    0.000    0.025    0.001 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/series.py:438(_internal)
          24    0.000    0.000    0.025    0.001 
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/internal.py:535(__init__)
   ```
   
   There are 24 `InternalFrame.__init__` invocations in this example, the 
number of RPC is reduced from 27 to 4.
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   ci
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to