Re: Compare a column in two different tables/find the distance between column data

2016-03-14 Thread Wail Alkowaileet
I think you need some sort of fuzzy join?
Is it always the case that one title is a substring of the other?
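For example, a rough (untested) sketch of what I mean, assuming Spark 1.5+ so the
built-in levenshtein function is available, and with t1/t2 as stand-ins for your
two tables:

import sqlContext.implicits._
import org.apache.spark.sql.functions.{levenshtein, lower}

val t1 = Seq(("doctor", 1)).toDF("title", "id1")
val t2 = Seq(("doc", 2)).toDF("title", "id2")

// match when one title contains the other (e.g. "doctor" vs "doc")
val bySubstring = t1.join(t2,
  lower(t1("title")).contains(lower(t2("title"))) ||
  lower(t2("title")).contains(lower(t1("title"))))

// or match when the edit distance between the titles is small
val byDistance = t1.join(t2,
  levenshtein(lower(t1("title")), lower(t2("title"))) <= 3)

Either condition is a non-equi join, so on TBs of data it degenerates into a
Cartesian-product comparison; at that scale you would usually normalize the
titles to a canonical key first and then join and group on that key.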

On Tue, Mar 15, 2016 at 6:46 AM, Suniti Singh 
wrote:

> Hi All,
>
> I have two tables with the same schema but different data. I have to join the
> tables on one column and then group by that same column.
>
> Now the data in that column in the two tables might or might not match exactly.
> (Ex - the column name is "title": Table1.title = "doctor" and Table2.title
> = "doc".) "doctor" and "doc" are actually the same title.
>
> From a performance point of view, where the data volume is in TB, I am not
> sure I can achieve this with a SQL statement. What would be the best
> approach to solving this problem? Should I look at the MLlib APIs?
>
> Spark gurus, any pointers?
>
> Thanks,
> Suniti
>
>
>


-- 

*Regards,*
Wail Alkowaileet


Re: Dataset throws: Task not serializable

2016-01-11 Thread Wail Alkowaileet
Hello Michael,

Sorry for the late reply .. I was crossing the world the last few days.
I actually tried both ... the REPL and a Spark app. The reported exception was
in the app.

Unfortunately, the data I have is not for distribution ... sorry about that.
I saw it has been resolved .. I will try to reproduce the same error with
dummy data.
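For reference, the kind of dummy-data reproduction I have in mind is something
like the following (just a sketch with shortened field names; I haven't verified
yet that it triggers the same exception):

import sqlContext.implicits._

case class Inner(full_name: String, role: String)
case class Outer(count: String, name: Array[Inner])

// build a tiny nested DataFrame from an in-memory JSON string
val df = sqlContext.read.json(sc.parallelize(Seq(
  """{"count": "1", "name": [{"full_name": "John Smith", "role": "author"}]}""")))

val ds = df.as[Outer]
println(ds.first().count)          // first() alone works fine
println(ds.map(_.count).first())   // map() then first() is where the exception appeared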

Thanks!

On Thu, Jan 7, 2016 at 2:03 PM, Michael Armbrust 
wrote:

> Were you running in the REPL?
>
> On Thu, Jan 7, 2016 at 10:34 AM, Michael Armbrust 
> wrote:
>
>> Thanks for providing a great description.  I've opened
>> https://issues.apache.org/jira/browse/SPARK-12696
>>
>> I'm actually getting a different error (running in notebooks though).
>> Something seems wrong either way.
>>
>>>
>>> *P.S.* Mapping by name with case classes doesn't work if the order of
>>> the fields of a case class doesn't match the order of the DataFrame's
>>> schema.
>>
>>
>> We have tests for reordering
>> (<https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L97>).
>> Can you provide a smaller reproduction of this problem?
>>
>> On Wed, Jan 6, 2016 at 10:27 PM, Wail Alkowaileet 
>> wrote:
>>
>>> Hey,
>>>
>>> I got an error when trying to map a Dataset (df.as[CLASS]) when I have
>>> some nested case classes.
>>> I'm not sure if it's a bug ... or I did something wrong ... or I missed
>>> some configuration.
>>>
>>>
>>> I did the following:
>>>
>>> *input snapshot*
>>>
>>> {
>>>   "count": "string",
>>>   "name": [{
>>> "addr_no": "string",
>>> "dais_id": "string",
>>> "display_name": "string",
>>> "first_name": "string",
>>> "full_name": "string",
>>> "last_name": "string",
>>> "r_id": "string",
>>> "reprint": "string",
>>> "role": "string",
>>> "seq_no": "string",
>>> "suffix": "string",
>>> "wos_standard": "string"
>>>   }]
>>> }
>>>
>>> *Case classes:*
>>>
>>> case class listType1(addr_no: String, dais_id: String, display_name: String,
>>>   first_name: String, full_name: String, last_name: String, r_id: String,
>>>   reprint: String, role: String, seq_no: String, suffix: String,
>>>   wos_standard: String)
>>> case class DatasetType1(count: String, name: Array[listType1])
>>>
>>> *Schema:*
>>> root
>>>  |-- count: string (nullable = true)
>>>  |-- name: array (nullable = true)
>>>  ||-- element: struct (containsNull = true)
>>>  |||-- addr_no: string (nullable = true)
>>>  |||-- dais_id: string (nullable = true)
>>>  |||-- display_name: string (nullable = true)
>>>  |||-- first_name: string (nullable = true)
>>>  |||-- full_name: string (nullable = true)
>>>  |||-- last_name: string (nullable = true)
>>>  |||-- r_id: string (nullable = true)
>>>  |||-- reprint: string (nullable = true)
>>>  |||-- role: string (nullable = true)
>>>  |||-- seq_no: string (nullable = true)
>>>  |||-- suffix: string (nullable = true)
>>>  |||-- wos_standard: string (nullable = true)
>>>
>>> *Scala code:*
>>>
>>> import sqlContext.implicits._
>>>
>>> val ds = df.as[DatasetType1]
>>>
>>> //Taking first() works fine
>>> println(ds.first().count)
>>>
>>> //map() then first throws exception
>>> println(ds.map(x => x.count).first())
>>>
>>>
>>> *Exception Message:*
>>> Exception in thread "main" org.apache.spark.SparkException: Task not
>>> serializable
>>> at
>>> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>>> at
>>> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>>> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>>> at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1