[ https://issues.apache.org/jira/browse/SPARK-20071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15938475#comment-15938475 ]
Barry Becker commented on SPARK-20071:
--------------------------------------

Yes. I agree. I wanted to report the issue, but wasn't sure if it should be a bug or enhancement request.
The main problem for me was that it took a fair bit of debugging to discover what the problem was. It might be nice to provide a warning or more info in the exception. I will find a way to work around it.

> StringIndexer overflows Kryo serialization buffer when run on column with many long distinct values
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20071
>                 URL: https://issues.apache.org/jira/browse/SPARK-20071
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Barry Becker
>            Priority: Minor
>
> I marked this as minor because there are workarounds.
> I have a 2 million row dataset with a string column that is mostly unique and contains many very long values.
> Most of the values are between 1,000 and 40,0000 characters long.
> I am using KryoSerializer and increased spark.kryoserializer.buffer.max to 256m.
> If I try to run StringIndexer.fit on this column, I get an OutOfMemory exception or, more likely, a buffer overflow error like
> {code}
> org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 23.
> To avoid this, increase spark.kryoserializer.buffer.max value.
>     org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> {code}
> This result is not that surprising given that we are trying to index a column like this. However, I can think of some suggestions that would help avoid the error and maybe help performance.
> These possible enhancements to StringIndexer might be hard, but I thought I would suggest them anyway, just in case they are not.
> 1) Add param for Top N values. I know that StringIndexer gives lower indices to the more commonly occurring values. It would be great if one could specify that only the top N values should be indexed, with everything else lumped into a special "Other" value.
> 2) Add param for label length limit. Only consider the first L characters of labels when doing the indexing.
> Either of these enhancements would work, but I suppose they can also be implemented with additional work as steps preceding the indexer in the pipeline. Perhaps topByKey could be used to replace the column with one that has the top values and "Other" as suggested in 1).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
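
For illustration only, the semantics of suggestions 1) and 2) can be sketched in plain Python as a preprocessing step that would run before any indexing. This is not Spark code and not an existing StringIndexer option: the function names, the `top_n` and `max_len` parameters, and the alphabetical tie-breaking rule are all assumptions made for the sketch. The only behavior taken from the issue is that more frequent labels get lower indices.

```python
from collections import Counter

OTHER = "Other"  # hypothetical catch-all label, named after suggestion 1)


def preprocess_labels(values, top_n, max_len):
    """Truncate each label to max_len characters (suggestion 2), then keep
    only the top_n most frequent truncated labels and replace all the rest
    with OTHER (suggestion 1)."""
    truncated = [v[:max_len] for v in values]
    counts = Counter(truncated)
    # Descending count; ties broken alphabetically (an assumption) so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    keep = {label for label, _ in ranked[:top_n]}
    return [v if v in keep else OTHER for v in truncated]


def index_labels(labels):
    """Map labels to indices so the most frequent label gets 0.0, the next
    1.0, and so on. Returns the indexed column and the label-to-index map."""
    counts = Counter(labels)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    mapping = {label: float(i) for i, (label, _) in enumerate(ranked)}
    return [mapping[label] for label in labels], mapping


values = ["aaaa1", "aaaa2", "bbbb", "cccc", "bbbb", "aaaa3"]
reduced = preprocess_labels(values, top_n=2, max_len=4)
indices, mapping = index_labels(reduced)
```

With this preprocessing the set of distinct labels is bounded by `top_n + 1` entries of at most `max_len` characters each, which is the property the two suggestions are after. One caveat of the sketch: a real value that happens to equal "Other" after truncation would be merged with the catch-all label.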