[ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402110#comment-15402110
 ] 

Sean Owen commented on SPARK-16826:
-----------------------------------

The contention is in the JDK class java.net.URL itself, and it looks like an 
unfortunate bottleneck that dates back a long way. URL keeps its stream 
handlers in a static Hashtable, whose methods are all synchronized, so every 
thread in a JVM contends on it. That explains why running multiple JVMs 
(executors) alleviates the problem. We can't fix that directly. See also 
http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck

We also probably have to rely on the behavior of this class to parse a URL; I 
don't see a good alternative here. You can avoid the lookup by passing in the 
URLStreamHandler manually, but that would mean reimplementing some of the same 
logic that is in URL, and it adds the overhead of looking up a SecurityManager 
instead.
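
To make the handler option concrete, here is a minimal sketch, not a proposed 
patch. The class name HandlerUrlSketch, the PARSE_ONLY handler and the host() 
helper are hypothetical names invented for illustration. It assumes the 
three-argument constructor URL(URL, String, URLStreamHandler), which skips the 
shared handler lookup when a non-null handler is supplied. Note that with a 
custom handler URL no longer rejects unknown protocols, which is part of the 
"reimplementing some of the same logic" cost.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

class HandlerUrlSketch {
    // Hypothetical parse-only handler: it inherits the default parseURL()
    // logic from URLStreamHandler and refuses to open connections.
    private static final URLStreamHandler PARSE_ONLY = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("parse-only handler cannot open connections");
        }
    };

    // Extracts the host without going through URL's shared handler table.
    // Because a handler is supplied, any protocol is accepted here; callers
    // would have to validate the scheme themselves (an assumption of this
    // sketch, not something URL does for us on this path).
    static String host(String spec) {
        try {
            return new URL(null, spec, PARSE_ONLY).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("malformed URL: " + spec, e);
        }
    }
}
```

When a SecurityManager is installed, this constructor performs a permission 
check on every call, which is the overhead mentioned above.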

What about simply avoiding a large number of calls to parseURL at once? Is 
that at all reasonable, for example by rearranging the pipeline to filter or 
remove duplicates earlier upstream?
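
As a rough illustration of the deduplication idea, the sketch below parses 
each distinct URL string only once, so repeated values never reach 
java.net.URL. The class DedupHostsSketch and the method hostsOfDistinct are 
hypothetical names, and mapping malformed input to null is an assumption made 
for the example. In a real job the same effect would come from 
deduplicating or filtering rows before parse_url is applied.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupHostsSketch {
    // Returns one entry per distinct URL string, in first-seen order.
    // Each distinct string is passed to the URL constructor exactly once.
    static Map<String, String> hostsOfDistinct(List<String> urls) {
        Map<String, String> hosts = new LinkedHashMap<>();
        for (String spec : urls) {
            if (hosts.containsKey(spec)) {
                continue; // duplicate: skip the contended URL constructor
            }
            try {
                hosts.put(spec, new URL(spec).getHost());
            } catch (MalformedURLException e) {
                hosts.put(spec, null); // assumption: malformed input maps to null
            }
        }
        return hosts;
    }
}
```

This only reduces the number of calls; the remaining calls still go through 
the same synchronized lookup.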

> java.util.Hashtable limits the throughput of PARSE_URL()
> --------------------------------------------------------
>
>                 Key: SPARK-16826
>                 URL: https://issues.apache.org/jira/browse/SPARK-16826
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Sylvain Zimmer
>
> Hello!
> I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
> {{parse_url(url, "host")}} in Spark SQL.
> Unfortunately it seems that there is an internal thread-safe cache in there, 
> and the instances end up being 90% idle.
> When I view the thread dump for my executors, most of the executor threads 
> are "BLOCKED", in that state:
> {code}
> java.util.Hashtable.get(Hashtable.java:362)
> java.net.URL.getURLStreamHandler(URL.java:1135)
> java.net.URL.<init>(URL.java:599)
> java.net.URL.<init>(URL.java:490)
> java.net.URL.<init>(URL.java:439)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
> org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> org.apache.spark.scheduler.Task.run(Task.scala:85)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> However, when I switch from 1 executor with 36 cores to 9 executors with 4 
> cores, throughput is almost 10x higher and the CPUs are back at ~100% use.
> Thanks!


