Synonym handling replacement issue with UDF in Apache Spark

Nishanth Thu, 27 Apr 2017 23:13:42 -0700

I am facing a major issue on replacement of Synonyms in my DataSet.
I am trying to replace the synonym of the Brand names to its equivalent names.
I have tried 2 methods to solve this issue.
Method 1 (regexp_replace)
Here i am using the regexp_replace method.
 Hashtable manufacturerNames = new Hashtable();          Enumeration names;     
     String str;          double bal;
          manufacturerNames.put("Allen","Apex Tool Group");          
manufacturerNames.put("Armstrong","Apex Tool Group");          
manufacturerNames.put("Campbell","Apex Tool Group");          
manufacturerNames.put("Lubriplate","Apex Tool Group");          
manufacturerNames.put("Delta","Apex Tool Group");          
manufacturerNames.put("Gearwrench","Apex Tool Group");          
manufacturerNames.put("H.K. Porter","Apex Tool Group");          /*....100 
MORE....*/          manufacturerNames.put("Stanco","Stanco Mfg");          
manufacturerNames.put("Stanco","Stanco Mfg");          
manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");   
       manufacturerNames.put("Standard Safety","Standard Safety Equipment 
Company");



          // Show all balances in hash table.          names = 
manufacturerNames.keys();          Dataset<Row> dataFileContent = 
sqlContext.load("com.databricks.spark.csv", options);

          while(names.hasMoreElements()) {             str = (String) 
names.nextElement();             
dataFileContent=dataFileContent.withColumn("ManufacturerSource", 
regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).toString()));
          }                  dataFileContent.show();
I got to know that the amount of data is too huge for regexp_replace so got a 
solution to use 
UDFhttp://stackoverflow.com/questions/43413513/issue-in-regex-replace-in-apache-spark-java

Method 2 (UDF)
List<Row> data2 = Arrays.asList(        RowFactory.create("Allen", "Apex Tool 
Group"),        RowFactory.create("Armstrong","Apex Tool Group"),        
RowFactory.create("DeWALT","StanleyBlack")    );
    StructType schema2 = new StructType(new StructField[] {        new 
StructField("label2", DataTypes.StringType, false, Metadata.empty()),        
new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())     
});    Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
    UDF2<String, String, Boolean> contains = new UDF2<String, String, 
Boolean>() {        private static final long serialVersionUID = 
-5239951370238629896L;
        @Override        public Boolean call(String t1, String t2) throws 
Exception {            return t1.contains(t2);        }    };    
spark.udf().register("contains", contains, DataTypes.BooleanType);
    UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, 
String, String, String>() {        private static final long serialVersionUID = 
-2882956931420910207L;
        @Override        public String call(String t1, String t2, String t3) 
throws Exception {            return t1.replaceAll(t2, t3);        }    };    
spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);
    Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, 
callUDF("contains", sentenceDataFrame.col("sentence"), 
sentenceDataFrame2.col("label2")))                            
.withColumn("sentence_replaced", callUDF("replaceWithTerm", 
sentenceDataFrame.col("sentence"), sentenceDataFrame2.col("label2"), 
sentenceDataFrame2.col("sentence2")))                            
.select(col("sentence_replaced"));
    joined.show(false);}

Input
Allen Armstrong nishanth hemanth Allenshivu Armstrong nishanthshree shivu DeWALT
Replacement of wordsThe word in LHS has to replace with the words in RHS given 
in the input sentenceAllen => Apex Tool GroupArmstrong => Apex Tool GroupDeWALT 
=> StanleyBlack
       Output
      
+-----+----------------------------------+---------+---------------+------------+
      |label|sentence_replaced                                                  
   
+-----+----------------------------------+---------+---------------+------------+
      |0    |Apex Tool Group Armstrong nishanth hemanth Apex Tool Group         
   |0    |Allen Apex Tool Group nishanth hemanth Allen                          
                 |1    |shivu Apex Tool Group nishanth                          
                                 |2    |shree shivu StanleyBlack                
                                           
+-----+----------------------------------+---------+---------------+------------+
      Expected Output      
+-----+----------------------------------+---------+---------------+------------+
      |label| sentence_replaced                                                 
    
+-----+----------------------------------+---------+---------------+------------+
      |0    |Apex Tool Group Apex Tool Group nishanth hemanth Apex Tool Group   
     |1    |shivu Apex Tool Group nishanth                                      
        |2    |shree shivu StanleyBlack                                         
         
+-----+----------------------------------+---------+---------------+------------+
I get such output when there is multiple replacements to do in a row.Only Allen 
has been replaced with Apex Tool GroupBut not in Armstrong with Apex Tool Group 
in first row.Label 0 should merge to get a single row as an output in the 
data.So that no redundancy should exist within the same Dataset RowIs there any 
other method which i must follow to get the proper output.?Or is this is 
limitation of UDF

Since return t1.contains(t2); Searches the Input sentence with the string in 
L.H.S though it matches and replaces. But it does not traverse to second word 
in the L.H.S.in the given sentence. Although it again searches and replaces the 
word in the L.H.S if it's matches, that's the reason for getting 2 rows instead 
of 1 row

 UDF is executed for cells as 1-1. It does not have context to the whole 
context of column in your case. That is why with UDF you can replace only 1 
word at a time. I would suggest reconsider wrapping replace strings as 
dataframe - make it as map and do replacement in your "replace with" using that 
map of values.

Kindly help us with this issue.

Synonym handling replacement issue with UDF in Apache Spark

Reply via email to