[ https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773457#comment-16773457 ]
Parth Gandhi edited comment on SPARK-26947 at 2/20/19 10:43 PM:
----------------------------------------------------------------

I am unable to attach the dummy dataset as the size of the data(90 MB) exceeds the maximum allowed size 60 MB. Have attached it to Drive.

https://drive.google.com/file/d/1GlHQmwFD2VB9PUi5mDaXdZNXI50dnPYs/view?usp=sharing

was (Author: pgandhi):
I am unable to attach the dummy dataset as the size of the data(90 MB) exceeds the maximum allowed size 60 MB.

> Pyspark KMeans Clustering job fails on large values of k
> --------------------------------------------------------
>
>                 Key: SPARK-26947
>                 URL: https://issues.apache.org/jira/browse/SPARK-26947
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>   Affects Versions: 2.4.0
>            Reporter: Parth Gandhi
>            Priority: Minor
>         Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering
> was failing for large values of k. I was able to reproduce the same issue
> with dummy dataset. I have attached the code as well as the data in the JIRA.
> The stack trace is printed below from Java:
>
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>       at java.util.Arrays.copyOf(Arrays.java:3332)
>       at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>       at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>       at java.lang.StringBuilder.append(StringBuilder.java:202)
>       at py4j.Protocol.getOutputCommand(Protocol.java:328)
>       at py4j.commands.CallCommand.execute(CallCommand.java:81)
>       at py4j.GatewayConnection.run(GatewayConnection.java:238)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
>     raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
>     response = connection.send_command(command)
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
>     "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in <module>
>     main(args)
>   File "clustering_app.py", line 145, in main
>     run_clustering(sc, args.input_path, args.output_path, args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
>     clustersTable, cluster_Centers = clustering(sc, documents, output_path, k, max_iter)
>   File "clustering_app.py", line 68, in clustering
>     cluster_Centers = km_model.clusterCenters()
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py", line 337, in clusterCenters
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py", line 55, in _call_java
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py", line 109, in _java2py
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
>   File "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster \
>   --conf spark.executor.memory=20g --conf spark.driver.memory=20g \
>   --conf spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g \
>   --conf spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g \
>   ~/clustering_app.py --input_path hdfs:///user/username/part-v001x \
>   --output_path hdfs:///user/username --num_clusters_list 10000
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory
> to both driver and executor is close to 20 GB. This only happens for large
> values of k.
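
Editorial note, not part of the original report: the Java trace fails while py4j.Protocol.getOutputCommand appends the return value of MLSerDe.dumps to a StringBuilder, so the size of the serialized cluster centers is probably what matters, not the 90 MB input. The sketch below is a rough back-of-the-envelope estimate only. It assumes the centers are dense double-precision vectors and that Py4J sends the serialized byte array to Python as Base64 text. It ignores pickle overhead. The feature dimension `dim` is hypothetical, since the report does not state it, and the helper names are made up for illustration.

```python
def raw_centers_bytes(k, dim, bytes_per_value=8):
    """Lower bound on the size of k dense centers with dim float64 features each."""
    return k * dim * bytes_per_value


def base64_chars(n_bytes):
    """Number of characters in a padded Base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


if __name__ == "__main__":
    k = 10000  # value passed as --num_clusters_list in the report
    # Hypothetical feature dimensions; the real one is not given in the report.
    for dim in (1_000, 10_000, 100_000):
        raw = raw_centers_bytes(k, dim)
        text = base64_chars(raw)
        print(f"dim={dim}: raw >= {raw / 1e6:.0f} MB, Base64 text >= {text / 1e6:.0f} M chars")
```

Under these assumptions the payload grows linearly in both k and the feature dimension, and the driver has to hold the serialized bytes and their text encoding at the same time. That would be consistent with the failure showing up only for large k.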