[ https://issues.apache.org/jira/browse/SPARK-20339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971054#comment-15971054 ]
Nischay commented on SPARK-20339:
---------------------------------

Sure, I will not add redundant code in the future, and I will use u...@spark.apache.org.

"For such a huge sequence of generating columns you are probably much better off constructing a Row directly in a transformation"

We do not understand this suggestion. Could you please explain it in detail?

We tried a UDF instead, but we get a "Task not serializable" exception:

    UDF1 removeSpecialCharaters = new UDF1<String, String>() {
        public String call(final String types) throws Exception {
            while (names.hasMoreElements()) {
                String str = (String) names.nextElement();
                types.replaceAll(str, manufacturerNames.get(str).toString());
            }
            return types;
        }
    };
    sqlContext.udf().register("removeSpecialCharatersUDF", removeSpecialCharaters, DataTypes.StringType);
    dataFileContent.createOrReplaceTempView("DataSetOfinterest");
    Dataset<Row> newDF = sqlContext.sql("select removeSpecialCharatersUDF(ManufacturerSource) FROM DataSetOfinterest");

> Issue in regex_replace in Apache Spark Java
> -------------------------------------------
>
>                 Key: SPARK-20339
>                 URL: https://issues.apache.org/jira/browse/SPARK-20339
>             Project: Spark
>          Issue Type: Question
>          Components: Java API, Spark Core, SQL
>    Affects Versions: 2.1.0
>            Reporter: Nischay
>
> We are currently facing a couple of issues:
> 1. "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
> 2. "java.lang.StackOverflowError"
> The first issue is reported as a Major bug in the Apache Spark Jira:
> https://issues.apache.org/jira/browse/SPARK-18492
> We hit these issues with the following program. We are trying to replace each
> manufacturer name with its equivalent alternate name.
> These issues occur only when we have a huge number of alternate names to
> replace; for a small number of replacements it works with no issues.
>
> dataFileContent=dataFileContent.withColumn("ManufacturerSource",
>     regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
>
> Kindly suggest us an alternative method or a solution to go around this
> problem.
> {code}
> Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> manufacturerNames.put("Jacobs","Apex Tool Group");
> manufacturerNames.put("Jobox","Apex Tool Group");
> ...about 100 more ...
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Show all balances in hash table.
> names = manufacturerNames.keys();
> Dataset<Row> dataFileContent = sqlContext.load("com.databricks.spark.csv", options);
>
> while(names.hasMoreElements()) {
>     str = (String) names.nextElement();
>     dataFileContent=dataFileContent.withColumn("ManufacturerSource",
>         regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> {code}
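
A minimal sketch of one possible reading of the suggestion quoted in the comment, offered as an assumption and not as the commenter's actual method: apply every replacement to a value inside a single function, instead of chaining one withColumn call per manufacturer name. The class name ManufacturerNormalizer and its normalize method are hypothetical and do not come from the original post. The sketch differs from the UDF in the comment in two ways: it keeps the lookup table in a serializable map and does not capture an Enumeration (which is exhausted after one pass and, as far as we know, is not serializable), and it keeps the value returned by replaceAll, because Java strings are immutable.

```java
import java.io.Serializable;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;

// Hypothetical helper (not from the original post): holds the replacement
// table and applies all replacements to one value in a single call.
final class ManufacturerNormalizer implements Serializable {
    private static final long serialVersionUID = 1L;

    // LinkedHashMap is Serializable and keeps a stable iteration order.
    private final LinkedHashMap<String, String> replacements;

    ManufacturerNormalizer(Map<String, String> replacements) {
        this.replacements = new LinkedHashMap<>(replacements);
    }

    // Applies each (pattern -> alternate name) pair once, in insertion order.
    // Keys are treated as regular expressions, as regexp_replace does.
    String normalize(String value) {
        if (value == null) {
            return null;
        }
        String result = value;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            // replaceAll returns a new String; the result must be reassigned.
            // quoteReplacement keeps '$' and '\' in the new name literal.
            result = result.replaceAll(
                    entry.getKey(),
                    Matcher.quoteReplacement(entry.getValue()));
        }
        return result;
    }
}
```

With Spark, normalize could presumably be called from the body of a UDF1<String, String> registered as in the comment, so that the query plan contains one expression rather than one per manufacturer name. This has not been tested against Spark 2.1.0.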