Re: Synonym handling replacement issue with UDF in Apache Spark

Yanbo Liang Thu, 27 Apr 2017 07:47:21 -0700

What about JOIN your table with a map table?

On Thu, Apr 27, 2017 at 9:58 PM, Nishanth <nishanth...@yahoo.com.invalid>
wrote:


> I am facing a major issue on replacement of Synonyms in my DataSet.
>
> I am trying to replace the synonym of the Brand names to its equivalent
> names.
>
> I have tried 2 methods to solve this issue.
>
> Method 1 (regexp_replace)
>
> Here i am using the regexp_replace method.
>
> Hashtable manufacturerNames = new Hashtable();
>           Enumeration names;
>           String str;
>           double bal;
>
>           manufacturerNames.put("Allen","Apex Tool Group");
>           manufacturerNames.put("Armstrong","Apex Tool Group");
>           manufacturerNames.put("Campbell","Apex Tool Group");
>           manufacturerNames.put("Lubriplate","Apex Tool Group");
>           manufacturerNames.put("Delta","Apex Tool Group");
>           manufacturerNames.put("Gearwrench","Apex Tool Group");
>           manufacturerNames.put("H.K. Porter","Apex Tool Group");
>           /*....100 MORE....*/
>           manufacturerNames.put("Stanco","Stanco Mfg");
>           manufacturerNames.put("Stanco","Stanco Mfg");
>           manufacturerNames.put("Standard Safety","Standard Safety
> Equipment Company");
>           manufacturerNames.put("Standard Safety","Standard Safety
> Equipment Company");
>
>
>
>           // Show all balances in hash table.
>           names = manufacturerNames.keys();
>           Dataset<Row> dataFileContent = 
> sqlContext.load("com.databricks.spark.csv",
> options);
>
>
>           while(names.hasMoreElements()) {
>              str = (String) names.nextElement();
>              dataFileContent=dataFileContent.withColumn("ManufacturerSource",
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).
> toString()));
>           }
>           dataFileContent.show();
>
> I got to know that the amount of data is too huge for regexp_replace so
> got a solution to use UDF
> http://stackoverflow.com/questions/43413513/issue-in-
> regex-replace-in-apache-spark-java
>
>
> Method 2 (UDF)
>
> List<Row> data2 = Arrays.asList(
>         RowFactory.create("Allen", "Apex Tool Group"),
>         RowFactory.create("Armstrong","Apex Tool Group"),
>         RowFactory.create("DeWALT","StanleyBlack")
>     );
>
>     StructType schema2 = new StructType(new StructField[] {
>         new StructField("label2", DataTypes.StringType, false,
> Metadata.empty()),
>         new StructField("sentence2", DataTypes.StringType, false,
> Metadata.empty())
>     });
>     Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2,
> schema2);
>
>     UDF2<String, String, Boolean> contains = new UDF2<String, String,
> Boolean>() {
>         private static final long serialVersionUID = -5239951370238629896L;
>
>         @Override
>         public Boolean call(String t1, String t2) throws Exception {
>             return t1.contains(t2);
>         }
>     };
>     spark.udf().register("contains", contains, DataTypes.BooleanType);
>
>     UDF3<String, String, String, String> replaceWithTerm = new
> UDF3<String, String, String, String>() {
>         private static final long serialVersionUID = -2882956931420910207L;
>
>         @Override
>         public String call(String t1, String t2, String t3) throws
> Exception {
>             return t1.replaceAll(t2, t3);
>         }
>     };
>     spark.udf().register("replaceWithTerm", replaceWithTerm,
> DataTypes.StringType);
>
>     Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2,
> callUDF("contains", sentenceDataFrame.col("sentence"),
> sentenceDataFrame2.col("label2")))
>                             .withColumn("sentence_replaced",
> callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"),
> sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
>                             .select(col("sentence_replaced"));
>
>     joined.show(false);
> }
>
>
> Input
>
> Allen Armstrong nishanth hemanth Allen
> shivu Armstrong nishanth
> shree shivu DeWALT
>
> Replacement of words
> The word in LHS has to replace with the words in RHS given in the input
> sentence
> Allen => Apex Tool Group
> Armstrong => Apex Tool Group
> DeWALT => StanleyBlack
>
>        Output
>
>       +-----+----------------------------------+---------+--------
> -------+------------+
>       |label|sentence_replaced
>       |
>       +-----+----------------------------------+---------+--------
> -------+------------+
>       |0    |Apex Tool Group Armstrong nishanth hemanth Apex Tool Group
>             |
>       |0    |Allen Apex Tool Group nishanth hemanth Allen
>             |
>       |1    |shivu Apex Tool Group nishanth
>             |
>       |2    |shree shivu StanleyBlack
>             |
>       +-----+----------------------------------+---------+--------
> -------+------------+
>
>       Expected Output
>       +-----+----------------------------------+---------+--------
> -------+------------+
>       |label| sentence_replaced
>        |
>       +-----+----------------------------------+---------+--------
> -------+------------+
>       |0    |Apex Tool Group Apex Tool Group nishanth hemanth Apex Tool
> Group         |
>       |1    |shivu Apex Tool Group nishanth
>             |
>       |2    |shree shivu StanleyBlack
>             |
>       +-----+----------------------------------+---------+--------
> -------+------------+
>
> I get such output when there is multiple replacements to do in a row.
> Only Allen has been replaced with Apex Tool Group
> But not in Armstrong with Apex Tool Group in first row.
> Label 0 should merge to get a single row as an output in the data.
> So that no redundancy should exist within the same Dataset Row
> Is there any other method which i must follow to get the proper output.?
> Or is this is limitation of UDF
>
>
> Since return t1.contains(t2); Searches the Input sentence with the string
> in L.H.S
> though it matches and replaces. But it does not traverse to second word in
> the L.H.S.in
> the given sentence. Although it again searches and replaces the word in
> the L.H.S if it's matches,
> that's the reason for getting 2 rows instead of 1 row
>
>
> UDF is executed for cells as 1-1. It does not have context to the whole
> context of column in your case.
> That is why with UDF you can replace only 1 word at a time. I would
> suggest reconsider wrapping replace
> strings as dataframe - make it as map and do replacement in your "replace
> with" using that map of values.
>
>
> Kindly help us with this issue.
>
> I'm looking forward to here from you as soon as possible.
>
>
> Thanking You ,
> Nishanth J
>

Re: Synonym handling replacement issue with UDF in Apache Spark

Reply via email to