Barry Becker created SPARK-20071:
------------------------------------

             Summary: StringIndexer overflows Kryo serialization buffer when 
run on column with many long distinct values
                 Key: SPARK-20071
                 URL: https://issues.apache.org/jira/browse/SPARK-20071
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.1.0
            Reporter: Barry Becker
            Priority: Minor


I marked this as minor because there are workarounds.
I have a 2 million row dataset with a string column that is mostly unique and 
contains many very long values.
Most of the values are between 1,000 and 40,0000 characters long.
I am using KryoSerializer and have increased spark.kryoserializer.buffer.max to 
256m.

If I try to run StringIndexer.fit on this column, I get an OutOfMemory 
exception or, more likely, a buffer overflow error like this:
{code}
org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. 
Available: 0, required: 23. 
To avoid this, increase spark.kryoserializer.buffer.max value.
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:315)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
{code}
This result is not that surprising given the kind of column we are trying to 
index. However, I can think of some suggestions that would help avoid the 
error and maybe help performance.

These possible enhancements to StringIndexer might be hard, but I thought I 
would suggest them anyway, just in case they are not.
1) Add a param for top N values. I know that StringIndexer gives lower indices 
to the more commonly occurring values. It would be great if one could specify 
that only the top N values should be indexed, with everything else lumped into 
a special "Other" value.
2) Add param for label length limit. Only consider the first L characters of 
labels when doing the indexing.
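
To make the two suggestions concrete, here is a minimal plain-Python sketch of the semantics I have in mind. It is not Spark code and not a proposed implementation. The function names and the parameters top_n and max_len are made up for illustration, and the alphabetical tie-break is an assumption chosen only to keep the output deterministic.

```python
from collections import Counter


def fit_labels(values, top_n=None, max_len=None):
    """Return a {label: index} map, most frequent label first.

    top_n and max_len are hypothetical parameters mirroring
    suggestions 1) and 2).
    """
    if max_len is not None:
        # Suggestion 2): only the first max_len characters count.
        values = (v[:max_len] for v in values)
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    if top_n is not None:
        # Suggestion 1): keep only the top_n most frequent labels.
        ordered = ordered[:top_n]
    return {label: i for i, label in enumerate(ordered)}


def index_value(value, labels, max_len=None):
    """Map one value to its index; all other values share one index."""
    if max_len is not None:
        value = value[:max_len]
    # The shared "Other" bucket takes the next index after the kept labels.
    return labels.get(value, len(labels))


column = ["a" * 50, "a" * 50, "b" * 50, "b" * 50, "b" * 50, "c" * 50, "d" * 50]
labels = fit_labels(column, top_n=2, max_len=8)
indexed = [index_value(v, labels, max_len=8) for v in column]
```

With both limits in place, the label map holds at most top_n strings of at most max_len characters each, so the amount of data that has to be serialized no longer grows with the number or length of distinct values in the column.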

Either of these enhancements would work, but I suppose they can also be 
implemented with additional work as steps preceding the indexer in the 
pipeline. Perhaps topByKey could be used to replace the column with one that 
has the top values and "Other" as suggested in 1).




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
