[ https://issues.apache.org/jira/browse/SPARK-27228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16800621#comment-16800621 ]

Lukas Waldmann commented on SPARK-27228:
----------------------------------------

That's a good question :)

What the code does: it runs up to several hundred SQL queries with different parameters and unions the results before writing them to a Hive table.

The inputs are Hive tables with up to several hundred million rows.

The code looks something like this:
{code:java}
void process(String dbName, String environment) {
    // For each metadata item, run its SQL snippet and union the results per product
    List<Metadata> mds = ...;
    Map<String, Dataset<Row>> res = new LinkedHashMap<>();
    mds.forEach(md -> {
        try (InputStream is = getClass().getResourceAsStream("/" + md.query_id)) {
            String snippet = IOUtils.toString(is);
            Dataset<Row> ds = spark.sql(snippet);
            String key = md.product;
            res.put(key, res.get(key) == null ? ds : ds.union(res.get(key)));
        } catch (IOException ex) {
            Logger.getLogger(SparkMainApp.class.getName()).log(Level.SEVERE, null, ex);
        }
    });

    String name = dbName + "." + table;
    res.values().forEach(result ->
        result.repartition(result.col(PRODUCT.toString()), result.col(PROTOCOL.toString()))
              .write()
              .mode(SaveMode.Overwrite)
              .insertInto(name)
    );
}
{code}
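For reference, the per-key accumulation in the snippet (each new result unioned in front of the running value) can be sketched without a Spark cluster, using plain lists standing in for Datasets and {{addAll}} standing in for {{union}}; the class name and the list stand-in are illustrative only, not Spark API:

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UnionByKey {
    // Mirrors res.put(key, res.get(key) == null ? ds : ds.union(res.get(key)))
    // with List<String> standing in for Dataset<Row>.
    static Map<String, List<String>> accumulate(List<String[]> keyedRows) {
        Map<String, List<String>> res = new LinkedHashMap<>();
        for (String[] kr : keyedRows) {       // kr[0] = product key, kr[1] = a "row"
            List<String> ds = new ArrayList<>(List.of(kr[1]));
            List<String> prev = res.get(kr[0]);
            if (prev != null) {
                ds.addAll(prev);              // newest query result goes first
            }
            res.put(kr[0], ds);
        }
        return res;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"p1", "row1"},
                new String[]{"p1", "row2"},
                new String[]{"p2", "row3"});
        System.out.println(accumulate(rows)); // {p1=[row2, row1], p2=[row3]}
    }
}
{code}

One thing worth noting about the real version: chaining hundreds of unions per key builds a correspondingly deep lineage, which the driver has to tear down at shutdown.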

> Spark long delay on close, possible problem with killing executors
> ------------------------------------------------------------------
>
>                 Key: SPARK-27228
>                 URL: https://issues.apache.org/jira/browse/SPARK-27228
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 2.3.0
>            Reporter: Lukas Waldmann
>            Priority: Major
>         Attachments: log.html
>
>
> When using dynamic allocation, after all jobs finish Spark delays for 
> several minutes before finally finishing. The log suggests that executors 
> are not cleaned up properly.
> See the attachment for log
>  
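For anyone hitting the same delay: under dynamic allocation, executor teardown timing is governed by the idle-timeout settings. The property names below are standard Spark configuration; the values are only illustrative, not a confirmed fix for this issue:

{code}
spark.dynamicAllocation.enabled=true
# Idle executors are released after this timeout (default 60s)
spark.dynamicAllocation.executorIdleTimeout=60s
# Executors holding cached blocks are never released by default; lowering this
# keeps them from lingering until application shutdown
spark.dynamicAllocation.cachedExecutorIdleTimeout=120s
{code}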



