Github user tpoterba commented on the issue:

    https://github.com/apache/spark/pull/18005
  
    I used this script to generate random CSV files:
    ```python
    import uuid
    import sys
    
    try:
        print('args = ' + str(sys.argv))
        filename = sys.argv[1]
        cols = int(sys.argv[2])
        rows = int(sys.argv[3])
        if len(sys.argv) != 4 or cols <= 0 or rows <= 0:
            raise RuntimeError()
    except Exception:
        raise RuntimeError('Usage: gen_text_file.py <filename> <cols> <rows>')
    
    rand_to_gen = (cols + 7) // 8  # integer division so range() gets an int
    
    
    with open(filename, 'w') as f:
        f.write(','.join('col%d' % i for i in range(cols)))
        f.write('\n')
        for i in range(rows):
            if (i % 10000 == 0):
                print('wrote %d lines' % i)
            # split each 32-char uuid hex into eight non-overlapping 4-char cells
            rands = [x[4 * j:4 * j + 4]
                     for x in (uuid.uuid4().hex for _ in range(rand_to_gen))
                     for j in range(8)]
            f.write(','.join(rands[:cols]))
            f.write('\n')
    ```
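    As an aside from me (not part of the original comment): the row-generation step draws `(cols + 7) // 8` uuids and splits each 32-character `uuid4().hex` into eight 4-character cells, then truncates to `cols`. A minimal self-contained sketch of that chunking:
    ```python
    import uuid

    def uuid_cells(n_cells):
        """Return n_cells random 4-char hex cells, drawing one uuid per 8 cells."""
        n_uuids = (n_cells + 7) // 8  # ceil(n_cells / 8) uuids
        cells = [x[4 * j:4 * j + 4]  # eight non-overlapping 4-char slices per uuid
                 for x in (uuid.uuid4().hex for _ in range(n_uuids))
                 for j in range(8)]
        return cells[:n_cells]

    row = uuid_cells(10)
    assert len(row) == 10 and all(len(c) == 4 for c in row)
    ```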
    
    I generated files that were all the same size on disk, with different dimensions (cols x rows):
    10x18M
    20x9M
    30x6M
    60x3M
    150x1200K
    300x600K
    
    Here's what I tried to do to them:
    ```python
    >>> spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
    ```
    
    The 10-, 20-, and 30-column files all took between 40s and 1m to complete on 2 cores of my laptop. 60 columns and up never completed, and actually crashed the Java process -- I had to kill it with `kill -9`.
    
    At one point with the 60-column table, I got a "GC overhead limit exceeded" OOM from the Parquet writer (the error suggested that Parquet was doing something silly, trying to use dictionary encoding for random values, but I haven't figured out how to turn that off). I could be conflating this crash with one we encountered a few months ago, where Spark crashed because Catalyst generated bytecode larger than 64KB for DataFrames with a large schema.
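    A note from me, not verified against this workload: Parquet's dictionary encoding is controlled by the Hadoop/Parquet property `parquet.enable.dictionary`, so it may be possible to turn it off for a single write via the writer options, or globally via the Hadoop configuration. A sketch, assuming the option is propagated to the Parquet output format:
    ```python
    # Per-write attempt (assumption: writer options reach the Hadoop conf):
    spark.read.csv(text_file) \
        .write.mode('overwrite') \
        .option('parquet.enable.dictionary', 'false') \
        .parquet(parquet_path)

    # Alternatively, set it on the Hadoop configuration before writing:
    spark.sparkContext._jsc.hadoopConfiguration() \
        .set('parquet.enable.dictionary', 'false')
    ```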
    


