Tian Gao created SPARK-54384:
--------------------------------
Summary: Modernize the _batched method for BatchedSerializer
Key: SPARK-54384
URL: https://issues.apache.org/jira/browse/SPARK-54384
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.1.0
Reporter: Tian Gao
We have `itertools` utilities which could make the iterator operations much
faster and less verbose.
{code:java}
import itertools
import time


def batch_original(iterator, batch_size):
    # Current approach: accumulate items one at a time and count manually.
    items = []
    count = 0
    for item in iterator:
        items.append(item)
        count += 1
        if count == batch_size:
            yield items
            items = []
            count = 0
    if items:
        yield items


def batch_list(iterator, batch_size):
    # Slice-based approach: only works for sized, sliceable inputs such as lists.
    n = len(iterator)
    for i in range(0, n, batch_size):
        yield iterator[i : i + batch_size]


def batch_after(iterator, batch_size):
    # Proposed approach: let islice pull up to batch_size items per batch.
    it = iter(iterator)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch


def do_test(iterator, batch):
    result = []
    start = time.perf_counter_ns()
    for b in batch(iterator, 10000):
        result.append(b)
    end = time.perf_counter_ns()
    print(f"Batching {batch.__name__} took {(end - start)/1e9:.4f} seconds")
    return result


if __name__ == "__main__":
    data = range(10000005)
    result_original = do_test(data, batch_original)
    result_after = do_test(data, batch_after)
    assert result_original == result_after

    data = list(range(10000005))
    result_list = do_test(data, batch_list)
    result_after = do_test(data, batch_after)
    assert result_list == result_after
{code}
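As a side note on the available {{itertools}} utilities, here is a minimal sketch of a variant that uses {{itertools.batched}} when it exists. This is an assumption for illustration, not part of the proposal: {{itertools.batched}} is only available from Python 3.12, it yields tuples rather than lists, and the name {{batch_builtin}} is hypothetical.

```python
import itertools
import sys


def batch_builtin(iterator, batch_size):
    """Hypothetical variant: use itertools.batched where available (Python 3.12+)."""
    if sys.version_info >= (3, 12):
        # itertools.batched yields tuples; convert to lists to match the other variants.
        for chunk in itertools.batched(iterator, batch_size):
            yield list(chunk)
    else:
        # Fallback for older interpreters: the islice-based loop.
        it = iter(iterator)
        while batch := list(itertools.islice(it, batch_size)):
            yield batch
```

Because of the version requirement and the extra tuple-to-list conversion, the {{islice}} loop is probably the more portable choice.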
Notice that {{__getslice__}} was *removed* in Python 3.0, so the optimization
for known-size iterables like lists never takes effect. There is now no simple
way to know whether an iterator supports slicing. The most straightforward
check is to try it, e.g. {{iterator[:1]}}. I don't know how often we deal with
lists; if the input is often a list, we can add that check. The raw {{[:]}}
slicing is 22% faster than the {{batch_after}} implementation.
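The "try it out" check described above could look roughly like the sketch below. It is only an illustration under stated assumptions: {{supports_slicing}} and {{batch_auto}} are hypothetical names, not existing PySpark functions, and treating any exception from the probe as "not sliceable" is a simplification.

```python
import itertools


def supports_slicing(obj):
    """Hypothetical probe: True if obj can be sliced and has a length."""
    try:
        obj[:1]
        len(obj)
    except Exception:
        # Generators, sets, dicts, etc. fail the probe; fall back to the generic path.
        return False
    return True


def batch_auto(iterator, batch_size):
    """Hypothetical dispatcher: slice when possible, otherwise use islice."""
    if supports_slicing(iterator):
        # Fast path: each batch is a slice, so it has the same type as the input.
        for i in range(0, len(iterator), batch_size):
            yield iterator[i : i + batch_size]
    else:
        it = iter(iterator)
        while batch := list(itertools.islice(it, batch_size)):
            yield batch
```

Note that the fast path yields slices of the input's own type (a list for a list, a tuple for a tuple), so callers that need lists specifically would have to restrict the probe or convert.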