What about JOIN your table with a map table? On Thu, Apr 27, 2017 at 9:58 PM, Nishanth <nishanth...@yahoo.com.invalid> wrote:
> I am facing a major issue on replacement of Synonyms in my DataSet. > > I am trying to replace the synonym of the Brand names to its equivalent > names. > > I have tried 2 methods to solve this issue. > > Method 1 (regexp_replace) > > Here i am using the regexp_replace method. > > Hashtable manufacturerNames = new Hashtable(); > Enumeration names; > String str; > double bal; > > manufacturerNames.put("Allen","Apex Tool Group"); > manufacturerNames.put("Armstrong","Apex Tool Group"); > manufacturerNames.put("Campbell","Apex Tool Group"); > manufacturerNames.put("Lubriplate","Apex Tool Group"); > manufacturerNames.put("Delta","Apex Tool Group"); > manufacturerNames.put("Gearwrench","Apex Tool Group"); > manufacturerNames.put("H.K. Porter","Apex Tool Group"); > /*....100 MORE....*/ > manufacturerNames.put("Stanco","Stanco Mfg"); > manufacturerNames.put("Stanco","Stanco Mfg"); > manufacturerNames.put("Standard Safety","Standard Safety > Equipment Company"); > manufacturerNames.put("Standard Safety","Standard Safety > Equipment Company"); > > > > // Show all balances in hash table. > names = manufacturerNames.keys(); > Dataset<Row> dataFileContent = > sqlContext.load("com.databricks.spark.csv", > options); > > > while(names.hasMoreElements()) { > str = (String) names.nextElement(); > dataFileContent=dataFileContent.withColumn("ManufacturerSource", > regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str). > toString())); > } > dataFileContent.show(); > > I got to know that the amount of data is too huge for regexp_replace so > got a solution to use UDF > http://stackoverflow.com/questions/43413513/issue-in- > regex-replace-in-apache-spark-java > > > Method 2 (UDF) > > List<Row> data2 = Arrays.asList( > RowFactory.create("Allen", "Apex Tool Group"), > RowFactory.create("Armstrong","Apex Tool Group"), > RowFactory.create("DeWALT","StanleyBlack") > ); > > StructType schema2 = new StructType(new StructField[] { > new StructField("label2", DataTypes.StringType, false, > Metadata.empty()), > new StructField("sentence2", DataTypes.StringType, false, > Metadata.empty()) > }); > Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, > schema2); > > UDF2<String, String, Boolean> contains = new UDF2<String, String, > Boolean>() { > private static final long serialVersionUID = -5239951370238629896L; > > @Override > public Boolean call(String t1, String t2) throws Exception { > return t1.contains(t2); > } > }; > spark.udf().register("contains", contains, DataTypes.BooleanType); > > UDF3<String, String, String, String> replaceWithTerm = new > UDF3<String, String, String, String>() { > private static final long serialVersionUID = -2882956931420910207L; > > @Override > public String call(String t1, String t2, String t3) throws > Exception { > return t1.replaceAll(t2, t3); > } > }; > spark.udf().register("replaceWithTerm", replaceWithTerm, > DataTypes.StringType); > > Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2, > callUDF("contains", sentenceDataFrame.col("sentence"), > sentenceDataFrame2.col("label2"))) > .withColumn("sentence_replaced", > callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"), > sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2"))) > .select(col("sentence_replaced")); > > joined.show(false); > } > > > Input > > Allen Armstrong nishanth hemanth Allen > shivu Armstrong nishanth > shree shivu DeWALT > > Replacement of words > The word in LHS has to replace with the words in RHS given in the input > sentence > Allen => Apex Tool Group > Armstrong => Apex Tool Group > DeWALT => StanleyBlack > > Output > > +-----+----------------------------------+---------+-------- > -------+------------+ > |label|sentence_replaced > | > +-----+----------------------------------+---------+-------- > -------+------------+ > |0 |Apex Tool Group Armstrong nishanth hemanth Apex Tool Group > | > |0 |Allen Apex Tool Group nishanth hemanth Allen > | > |1 |shivu Apex Tool Group nishanth > | > |2 |shree shivu StanleyBlack > | > +-----+----------------------------------+---------+-------- > -------+------------+ > > Expected Output > +-----+----------------------------------+---------+-------- > -------+------------+ > |label| sentence_replaced > | > +-----+----------------------------------+---------+-------- > -------+------------+ > |0 |Apex Tool Group Apex Tool Group nishanth hemanth Apex Tool > Group | > |1 |shivu Apex Tool Group nishanth > | > |2 |shree shivu StanleyBlack > | > +-----+----------------------------------+---------+-------- > -------+------------+ > > I get such output when there is multiple replacements to do in a row. > Only Allen has been replaced with Apex Tool Group > But not in Armstrong with Apex Tool Group in first row. > Label 0 should merge to get a single row as an output in the data. > So that no redundancy should exist within the same Dataset Row > Is there any other method which i must follow to get the proper output.? > Or is this is limitation of UDF > > > Since return t1.contains(t2); Searches the Input sentence with the string > in L.H.S > though it matches and replaces. But it does not traverse to second word in > the L.H.S.in > the given sentence. Although it again searches and replaces the word in > the L.H.S if it's matches, > that's the reason for getting 2 rows instead of 1 row > > > UDF is executed for cells as 1-1. It does not have context to the whole > context of column in your case. > That is why with UDF you can replace only 1 word at a time. I would > suggest reconsider wrapping replace > strings as dataframe - make it as map and do replacement in your "replace > with" using that map of values. > > > Kindly help us with this issue. > > I'm looking forward to here from you as soon as possible. > > > Thanking You , > Nishanth J >