aggregateByKey on PairRDD

2016-03-29 Thread Suniti Singh
Hi All, I have an RDD with data in the following form: tempRDD: RDD[(String, (String, String))], i.e. (brand, (product, key)): ("amazon",("book1","tech")) ("eBay",("book1","tech")) ("barns",("book","tech")) ("amazon",("book2","tech")). I would like to group the data by brand and would
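
The question is cut off by the preview, but assuming the goal is to collect all (product, key) pairs per brand, a minimal aggregateByKey sketch (names and the exact aggregation are assumptions, not the thread's final answer) could look like this:

    import org.apache.spark.{SparkConf, SparkContext}

    object AggregateByBrand {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("aggregateByKey-example").setMaster("local[*]"))

        // (brand, (product, key)) pairs from the question
        val tempRDD = sc.parallelize(Seq(
          ("amazon", ("book1", "tech")),
          ("eBay",   ("book1", "tech")),
          ("barns",  ("book",  "tech")),
          ("amazon", ("book2", "tech"))))

        // aggregateByKey takes a zero value plus a within-partition combiner
        // and a cross-partition combiner; here both simply build up a List.
        val grouped = tempRDD.aggregateByKey(List.empty[(String, String)])(
          (acc, v) => v :: acc,
          (acc1, acc2) => acc1 ::: acc2)

        grouped.collect().foreach(println)
        // e.g. (amazon,List((book2,tech), (book1,tech)))

        sc.stop()
      }
    }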

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
the further processing. I am kind of stuck. On Tue, Mar 15, 2016 at 10:50 AM, Suniti Singh <suniti.si...@gmail.com> wrote: > Is it always the case that one title is a substring of another ? -- Not > always. One title can have values like D.O.C, doctor_{areacode}, > doc_{dep,areacode}

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
Is it always the case that one title is a substring of another ? > > On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh <suniti.si...@gmail.com> > wrote: > >> Hi All, >> >> I have two tables with same schema but different data. I have to join the >> tables base

Compare a column in two different tables/find the distance between column data

2016-03-14 Thread Suniti Singh
Hi All, I have two tables with the same schema but different data. I have to join the tables on one column and then group by that same column. Now, the data in that column might or might not match exactly between the two tables. (e.g. the column name is "title": Table1.title = "doctor" and Table2.
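
One way to join on titles that do not match exactly is an approximate string measure such as edit distance; Spark SQL ships a levenshtein() function for this. The sketch below only illustrates that idea (table names, the threshold of 3, and the choice of edit distance at all are assumptions, not what the thread settled on), written for the spark-shell where sqlContext already exists:

    import org.apache.spark.sql.functions.{levenshtein, lower}

    val t1 = sqlContext.table("table1")   // assumed to have a "title" column
    val t2 = sqlContext.table("table2")   // assumed to have a "title" column

    // Join rows whose titles are within an edit distance of 3 of each other;
    // variants like "doc_{dep,areacode}" would need extra normalization
    // (e.g. stripping suffixes) before the comparison.
    val joined = t1.join(t2,
      levenshtein(lower(t1("title")), lower(t2("title"))) <= 3)

    joined.groupBy(t1("title")).count().show()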

Re: spark 1.6.0 connect to hive metastore

2016-03-09 Thread Suniti Singh
hive 1.6.0 in embedded mode doesn't connect to the metastore -- https://issues.apache.org/jira/browse/SPARK-9686 https://forums.databricks.com/questions/6512/spark-160-not-able-to-connect-to-hive-metastore.html On Wed, Mar 9, 2016 at 10:48 AM, Suniti Singh <suniti.si...@gmail.com> wrote: >

Re: spark 1.6.0 connect to hive metastore

2016-03-09 Thread Suniti Singh
Hi, I am able to reproduce this error only when using spark 1.6.0 and hive 1.6.0. The hive-site.xml is in the classpath, but somehow spark does not pick it up and starts using the default Derby metastore. 16/03/09 10:37:52 INFO MetaStoreDirectSql: Using direct SQL,
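
For reference, a minimal sketch of the setup being described (the file location is the usual convention, not something confirmed in this thread): hive-site.xml normally has to be visible to the driver, e.g. by copying it into $SPARK_HOME/conf, and if Spark cannot find it the HiveContext silently falls back to a local embedded Derby metastore:

    import org.apache.spark.sql.hive.HiveContext

    // sc is an existing SparkContext (e.g. the one provided by spark-shell).
    val hiveContext = new HiveContext(sc)

    // If hive-site.xml was picked up, this lists the databases from the
    // configured metastore; with the Derby fallback you typically see only
    // "default", and a metastore_db directory appears in the working dir.
    hiveContext.sql("show databases").show()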

Re: Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Suniti Singh
Please check the documentation for the configuration: http://spark.apache.org/docs/latest/job-scheduling.html#configuration-and-setup On Tue, Mar 8, 2016 at 10:14 AM, Silvio Fiorito <silvio.fior...@granturing.com> wrote: > You’ve started the external shuffle service on all worker nodes, correct? >
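
For context, the application-side settings from that configuration guide boil down to two properties; a minimal SparkConf sketch is below (the executor bounds are illustrative, and in standalone mode the workers themselves must also be started with spark.shuffle.service.enabled=true so the external shuffle service is running on each node, as Silvio points out):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("dynamic-allocation-example")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      // optional bounds on how far the executor count may scale
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "10")

    val sc = new SparkContext(conf)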

Re: Adding hive context gives error

2016-03-07 Thread Suniti Singh
-hive" % "1.5.0") > > > > However, I do note that you are using Spark-sql include and the Spark > version you use is 1.6.0. Can you please try with 1.5.0 to see if it works? > I havent yet tried Spark 1.6.0. > > > On 08/03/16 00:15, Suniti Singh wrote: >

Adding hive context gives error

2016-03-07 Thread Suniti Singh
Hi All, I am trying to create a HiveContext in a Scala program in Eclipse, as follows. Note -- I have added the Maven dependencies for spark-core, spark-hive, and spark-sql. import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.rdd.RDD.rddToPairRDDFunctions object
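
The program itself is cut off after "object", so the following is only a guess at its shape: a minimal standalone Scala app that creates a HiveContext (names are placeholders; the spark-hive dependency discussed in the reply above must be on the classpath for the import to resolve):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveContextExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("hive-context-example").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Requires spark-hive on the classpath; otherwise the import above fails.
        val hiveContext = new HiveContext(sc)
        hiveContext.sql("show databases").show()

        sc.stop()
      }
    }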