[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-03-05 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784831#comment-16784831
 ] 

Parth Gandhi commented on SPARK-26947:
--

[~srowen] Yes your suggestion to limit the vocab size helps. Closing this JIRA. 
Thank you.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-03-04 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783858#comment-16783858
 ] 

Sean Owen commented on SPARK-26947:
---

That doesn't sound "very big" but how big are the vectors you cluster? It looks 
like you're applying CountVectorizer with no vocabSize, so if your input are 
many different unique strings, your vectors have hundreds of thousands of 
dimensions. Ten thousand of them plus all the overhead could really add up to 
challenge even tens of GB of heap. Here it seems to be running out of memory 
while transferring a copy to/from the Python process.

I'd definitely limit vocabSize or else reconsider how you're clustering. This 
doesn't look like a particular Spark problem.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf 

[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-03-04 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783849#comment-16783849
 ] 

Parth Gandhi commented on SPARK-26947:
--

[~srowen] for this particular case, k is set to 1. Input data size is 90 MB 
and memory is set to 20g(both driver and executor). [~mgaido] I will try doing 
that and let you know.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by 

[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-03-01 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782076#comment-16782076
 ] 

Sean Owen commented on SPARK-26947:
---

How big is k? yes, you're going to run out of memory eventually if you have 
enough centroids and they're large enough. I'm not sure that's a bug, unless 
this is a really surprisingly small k.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent 

[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-26 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1665#comment-1665
 ] 

Marco Gaido commented on SPARK-26947:
-

Cloud you also please provide the heap dump of the JVM? You can use 
{{-XX:+HeapDumpOnOutOfMemoryError}} in order to achieve that, passing it to the 
java options.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-26947) Pyspark KMeans Clustering job fails on large values of k

2019-02-20 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16773457#comment-16773457
 ] 

Parth Gandhi commented on SPARK-26947:
--

I am unable to attach the dummy dataset as the size of the data(90 MB) exceeds 
the maximum allowed size 60 MB.

> Pyspark KMeans Clustering job fails on large values of k
> 
>
> Key: SPARK-26947
> URL: https://issues.apache.org/jira/browse/SPARK-26947
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.4.0
>Reporter: Parth Gandhi
>Priority: Minor
> Attachments: clustering_app.py
>
>
> We recently had a case where a user's pyspark job running KMeans clustering 
> was failing for large values of k. I was able to reproduce the same issue 
> with dummy dataset. I have attached the code as well as the data in the JIRA. 
> The stack trace is printed below from Java:
>  
> {code:java}
> Exception in thread "Thread-10" java.lang.OutOfMemoryError: Java heap space
>   at java.util.Arrays.copyOf(Arrays.java:3332)
>   at 
> java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
>   at 
> java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:649)
>   at java.lang.StringBuilder.append(StringBuilder.java:202)
>   at py4j.Protocol.getOutputCommand(Protocol.java:328)
>   at py4j.commands.CallCommand.execute(CallCommand.java:81)
>   at py4j.GatewayConnection.run(GatewayConnection.java:238)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> Python:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 985, in send_command
> response = connection.send_command(command)
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Traceback (most recent call last):
>   File "clustering_app.py", line 154, in 
> main(args)
>   File "clustering_app.py", line 145, in main
> run_clustering(sc, args.input_path, args.output_path, 
> args.num_clusters_list)
>   File "clustering_app.py", line 136, in run_clustering
> clustersTable, cluster_Centers = clustering(sc, documents, output_path, 
> k, max_iter)
>   File "clustering_app.py", line 68, in clustering
> cluster_Centers = km_model.clusterCenters()
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/clustering.py",
>  line 337, in clusterCenters
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/wrapper.py",
>  line 55, in _call_java
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/ml/common.py",
>  line 109, in _java2py
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/grid/2/tmp/yarn-local/usercache/user/appcache/application_xxx/container_xxx/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 336, in get_return_value
> py4j.protocol.Py4JError: An error occurred while calling 
> z:org.apache.spark.ml.python.MLSerDe.dumps
> {code}
> The command with which the application was launched is given below:
> {code:java}
> $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster --conf 
> spark.executor.memory=20g --conf spark.driver.memory=20g --conf 
> spark.executor.memoryOverhead=4g --conf spark.driver.memoryOverhead=4g --conf 
> spark.kryoserializer.buffer.max=2000m --conf spark.driver.maxResultSize=12g 
> ~/clustering_app.py --input_path hdfs:///user/username/part-v001x 
> --output_path hdfs:///user/username --num_clusters_list 1
> {code}
> The input dataset is approximately 90 MB in size and the assigned heap memory 
> to both driver and executor is close to 20 GB. This only happens for large 
> values of k.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)