[ https://issues.apache.org/jira/browse/SPARK-20491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-20491:
------------------------------
    Target Version/s:   (was: 2.0.2)

> Synonym handling replacement issue in Apache Spark
> --------------------------------------------------
>
>                 Key: SPARK-20491
>                 URL: https://issues.apache.org/jira/browse/SPARK-20491
>             Project: Spark
>          Issue Type: Question
>          Components: Examples, ML
>    Affects Versions: 2.0.2
>         Environment: Eclipse LUNA, Spring Boot
>            Reporter: Nishanth J
>              Labels: maven
>
> I am facing a major issue replacing synonyms in my Dataset.
> I am trying to replace synonyms of brand names with their equivalent names.
> I have tried 2 methods to solve this issue.
> Method 1 (regexp_replace)
> Here I am using the regexp_replace method.
> Hashtable manufacturerNames = new Hashtable();
> Enumeration names;
> String str;
> double bal;
> manufacturerNames.put("Allen","Apex Tool Group");
> manufacturerNames.put("Armstrong","Apex Tool Group");
> manufacturerNames.put("Campbell","Apex Tool Group");
> manufacturerNames.put("Lubriplate","Apex Tool Group");
> manufacturerNames.put("Delta","Apex Tool Group");
> manufacturerNames.put("Gearwrench","Apex Tool Group");
> manufacturerNames.put("H.K. Porter","Apex Tool Group");
> /*....100 MORE....*/
> manufacturerNames.put("Stanco","Stanco Mfg");
> manufacturerNames.put("Stanco","Stanco Mfg");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> manufacturerNames.put("Standard Safety","Standard Safety Equipment Company");
> // Iterate over all synonym keys in the hash table.
> names = manufacturerNames.keys();
> Dataset<Row> dataFileContent =
>     sqlContext.load("com.databricks.spark.csv", options);
> while (names.hasMoreElements()) {
>     str = (String) names.nextElement();
>     dataFileContent = dataFileContent.withColumn("ManufacturerSource",
>         regexp_replace(col("ManufacturerSource"), str, manufacturerNames.get(str).toString()));
> }
> dataFileContent.show();
> I learned that the amount of data is too large for regexp_replace, so I was pointed to a
> solution that uses a UDF:
> http://stackoverflow.com/questions/43413513/issue-in-regex-replace-in-apache-spark-java
> Method 2 (UDF)
> List<Row> data2 = Arrays.asList(
>     RowFactory.create("Allen", "Apex Tool Group"),
>     RowFactory.create("Armstrong","Apex Tool Group"),
>     RowFactory.create("DeWALT","StanleyBlack")
> );
> StructType schema2 = new StructType(new StructField[] {
>     new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
>     new StructField("sentence2", DataTypes.StringType, false, Metadata.empty())
> });
> Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
> UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
>     private static final long serialVersionUID = -5239951370238629896L;
>     @Override
>     public Boolean call(String t1, String t2) throws Exception {
>         return t1.contains(t2);
>     }
> };
> spark.udf().register("contains", contains, DataTypes.BooleanType);
> UDF3<String, String, String, String> replaceWithTerm = new UDF3<String, String, String, String>() {
>     private static final long serialVersionUID = -2882956931420910207L;
>     @Override
>     public String call(String t1, String t2, String t3) throws Exception {
>         return t1.replaceAll(t2, t3);
>     }
> };
> spark.udf().register("replaceWithTerm", replaceWithTerm, DataTypes.StringType);
> Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2,
>         callUDF("contains", sentenceDataFrame.col("sentence"),
>             sentenceDataFrame2.col("label2")))
>     .withColumn("sentence_replaced",
>         callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"),
>             sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
>     .select(col("sentence_replaced"));
> joined.show(false);
> }
> I get the following output when a row needs more than one replacement.
> Input-
> Allen Armstrong jeevi pramod Allen
> sandesh Armstrong jeevi
> harsha nischay DeWALT
> Output-
> Apex Tool Group Armstrong jeevi pramod Apex Tool Group
> Allen Apex Tool Group jeevi pramod Allen
> sandesh Apex Tool Group jeevi
> harsha nischay StanleyBlack
> Expected Output-
> Apex Tool Group Apex Tool Group jeevi pramod Apex Tool Group
> sandesh Apex Tool Group jeevi
> harsha nischay StanleyBlack
> Is there another method that should be followed to get the proper output?
> Or is this a limitation of UDFs?
> Kindly help us with this issue.
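
A note on the output in Method 2: the join emits one row for every (sentence, synonym) pair where the sentence contains the synonym. A sentence with two different synonyms therefore appears twice, and each copy has only one of them replaced. The sketch below shows one possible alternative, which is to apply every synonym to a sentence in a single pass. It is plain Java with no Spark dependency. The class name SynonymReplacer and its replace method are hypothetical names chosen for illustration, and the three-entry map in the usage note stands in for the reporter's full table.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper (not part of Spark): replaces every known synonym
// in a sentence in one pass over the text.
final class SynonymReplacer {
    private final Map<String, String> synonyms;
    private final Pattern pattern;

    SynonymReplacer(Map<String, String> synonyms) {
        this.synonyms = synonyms;
        // Build one alternation over all keys. Pattern.quote keeps keys such as
        // "H.K. Porter" literal; longer keys go first so they win over shorter ones.
        String alternation = synonyms.keySet().stream()
            .sorted((a, b) -> Integer.compare(b.length(), a.length()))
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        this.pattern = Pattern.compile(alternation);
    }

    String replace(String sentence) {
        Matcher m = pattern.matcher(sentence);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Look up the canonical name for the matched synonym.
            String canonical = synonyms.getOrDefault(m.group(), m.group());
            // quoteReplacement stops '$' and '\' in names from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(canonical));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a map holding Allen, Armstrong and DeWALT as above, replace("Allen Armstrong jeevi pramod Allen") yields "Apex Tool Group Apex Tool Group jeevi pramod Apex Tool Group". Already replaced text is not rescanned, so one replacement cannot trigger another. To use this from Spark, the method could presumably be wrapped in a UDF1<String, String> and registered with spark.udf().register, in which case the helper would need to be Serializable; that wiring is an assumption and is not shown here.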