Tian Gao created SPARK-54384:
--------------------------------

             Summary: Modernize the _batched method for BatchedSerializer 
                 Key: SPARK-54384
                 URL: https://issues.apache.org/jira/browse/SPARK-54384
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 4.1.0
            Reporter: Tian Gao


We have `itertools` utilities which could make the iterator operations much 
faster and less verbose.
{code:python}
import itertools
import time

def batch_original(iterator, batch_size):
    items = []
    count = 0
    for item in iterator:
        items.append(item)
        count += 1
        if count == batch_size:
            yield items
            items = []
            count = 0
    if items:
        yield items

def batch_list(iterator, batch_size):
    n = len(iterator)
    for i in range(0, n, batch_size):
        yield iterator[i : i + batch_size]

def batch_after(iterator, batch_size):
    it = iter(iterator)
    while batch := list(itertools.islice(it, batch_size)):
        yield batch


def do_test(iterator, batch):
    result = []
    start = time.perf_counter_ns()
    for b in batch(iterator, 10000):
        result.append(b)
    end = time.perf_counter_ns()
    print(f"Batching {batch.__name__} took {(end - start)/1e9:.4f} seconds")
    return result

if __name__ == "__main__":
    data = range(10000005)

    result_original = do_test(data, batch_original)
    result_after = do_test(data, batch_after)

    assert result_original == result_after

    data = list(range(10000005))
    result_list = do_test(data, batch_list)
    result_after = do_test(data, batch_after)
    assert result_list == result_after
{code}
Note that {{__getslice__}} was *removed* in Python 3.0, so the optimization 
for known-size iterables such as lists does not work at all. There is now no 
simple way to tell whether an iterable supports slicing. The most 
straightforward check is to try it, e.g. {{iterator[:1]}}. I don't know how 
often we deal with lists; if the input is often a list, then we can do that. 
The raw {{[:]}} operation is 22% faster than this implementation 
({{batch_after}}).
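
For illustration only, the "try it out" idea could be sketched as below. This 
is a minimal sketch, not the PySpark implementation: {{supports_slicing}} and 
{{batch_with_probe}} are hypothetical names, and the set of exceptions caught 
is an assumption about which failures the probe should tolerate.

```python
import itertools


def supports_slicing(obj):
    """Hypothetical helper: probe whether obj has a length and can be sliced."""
    try:
        len(obj)
        obj[:1]
    except (TypeError, KeyError):
        # Generators and plain iterators have no len(); sets are not
        # subscriptable; dicts reject a slice key with TypeError or KeyError
        # depending on the Python version.
        return False
    return True


def batch_with_probe(iterator, batch_size):
    """Yield batches, slicing when possible and using islice otherwise."""
    if supports_slicing(iterator):
        # Fast path: each batch is a slice, so it has the input's own type.
        for i in range(0, len(iterator), batch_size):
            yield iterator[i : i + batch_size]
    else:
        # General path: consume the iterator batch_size items at a time.
        it = iter(iterator)
        while batch := list(itertools.islice(it, batch_size)):
            yield batch
```

One caveat with probing: objects such as {{str}}, {{bytes}} and {{range}} also 
pass the check, so the fast path would yield slices of those types rather than 
lists. Whether that is acceptable depends on what the callers expect.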



