I am facing a major issue on replacement of Synonyms in my DataSet. I am trying to replace the synonym of the Brand names to its equivalent names. I have tried 2 methods to solve this issue. Method 1 (regexp_replace) Here i am using the regexp_replace method. Hashtable manufacturerNames = new Hashtable(); Enumeration names; String str; double bal; manufacturerNames.put("Allen","Apex Tool Group"); manufacturerNames.put("Armstrong","Apex Tool Group"); manufacturerNames.put("Campbell","Apex Tool Group"); manufacturerNames.put("Lubriplate","Apex Tool Group"); manufacturerNames.put("Delta","Apex Tool Group"); manufacturerNames.put("Gearwrench","Apex Tool Group"); manufacturerNames.put("H.K. Porter","Apex Tool Group"); /*....100 MORE....*/ manufacturerNames.put("Stanco","Stanco Mfg"); manufacturerNames.put("Stanco","Stanco Mfg"); manufacturerNames.put("Standard Safety","Standard Safety Equipment Company"); manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
// Show all balances in hash table. names = manufacturerNames.keys(); Dataset<Row> dataFileContent = sqlContext.load("com.databricks.spark.csv", options); while(names.hasMoreElements()) { str = (String) names.nextElement(); dataFileContent=dataFileContent.withColumn("ManufacturerSource", regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString())); } dataFileContent.show(); I got to know that the amount of data is too huge for regexp_replace so got a solution to use UDFhttp://stackoverflow.com/questions/43413513/issue-in-regex-replace-in-apache-spark-java Method 2 (UDF) List<Row> data2 = Arrays.asList( RowFactory.create("Allen", "Apex Tool Group"), RowFactory.create("Armstrong","Apex Tool Group"), RowFactory.create("DeWALT","StanleyBlack") ); StructType schema2 = new StructType(new StructField[] { new StructField("label2", DataTypes.StringType, false, Metadata.empty()), new StructField("sentence2", DataTypes.StringType, false, Metadata.empty()) }); Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2); UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() { private static final long serialVersionUID = -5239951370238629896L; @Override public Boolean call(String t1, String t2) throws Exception { return t1.contains(t2); } }; spark.udf().register("contains", contains, DataTypes.BooleanType); UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() { private static final long serialVersionUID = -2882956931420910207L; @Override public String call(String t1, String t2, String t3) throws Exception { return t1.replaceAll(t2, t3); } }; spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType); Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, callUDF("contains", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"))) .withColumn("sentence_replaced", callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2"))) .select(col("sentence_replaced")); joined.show(false);} Input Allen Armstrong nishanth hemanth Allenshivu Armstrong nishanthshree shivu DeWALT Replacement of wordsThe word in LHS has to replace with the words in RHS given in the input sentenceAllen => Apex Tool GroupArmstrong => Apex Tool GroupDeWALT => StanleyBlack Output +-----+----------------------------------+---------+---------------+------------+ |label|sentence_replaced +-----+----------------------------------+---------+---------------+------------+ |0 |Apex Tool Group Armstrong nishanth hemanth Apex Tool Group |0 |Allen Apex Tool Group nishanth hemanth Allen |1 |shivu Apex Tool Group nishanth |2 |shree shivu StanleyBlack +-----+----------------------------------+---------+---------------+------------+ Expected Output +-----+----------------------------------+---------+---------------+------------+ |label| sentence_replaced +-----+----------------------------------+---------+---------------+------------+ |0 |Apex Tool Group Apex Tool Group nishanth hemanth Apex Tool Group |1 |shivu Apex Tool Group nishanth |2 |shree shivu StanleyBlack +-----+----------------------------------+---------+---------------+------------+ I get such output when there is multiple replacements to do in a row.Only Allen has been replaced with Apex Tool GroupBut not in Armstrong with Apex Tool Group in first row.Label 0 should merge to get a single row as an output in the data.So that no redundancy should exist within the same Dataset RowIs there any other method which i must follow to get the proper output.?Or is this is limitation of UDF Since return t1.contains(t2); Searches the Input sentence with the string in L.H.S though it matches and replaces. But it does not traverse to second word in the L.H.S.in the given sentence. Although it again searches and replaces the word in the L.H.S if it's matches, that's the reason for getting 2 rows instead of 1 row UDF is executed for cells as 1-1. It does not have context to the whole context of column in your case. That is why with UDF you can replace only 1 word at a time. I would suggest reconsider wrapping replace strings as dataframe - make it as map and do replacement in your "replace with" using that map of values. Kindly help us with this issue.