Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Yes, that's it! Thank you, Ayan.



On Tue, Sep 13, 2016 at 5:50 PM, ayan guha wrote:

> >>> df.show()
>
> +----+---+---+-----+-----+-----+
> |city|flg| id|  nbr|price|state|
> +----+---+---+-----+-----+-----+
> |  CA|  0|  1| 1000|  100|    A|
> |  CA|  1|  2| 1010|   96|    A|
> |  CA|  1|  3| 1010|  195|    A|
> |  NY|  0|  4| 2000|  124|    B|
> |  NY|  1|  5| 2001|  128|    B|
> |  NY|  0|  6|30000|   24|    C|
> |  NY|  1|  7|30100|   27|    C|
> |  NY|  0|  8|30200|   29|    C|
> |  NY|  1|  9|33000|   39|    C|
> +----+---+---+-----+-----+-----+
>
>
> >>> flg0 = df.filter(df.flg==0)
> >>> flg1 = df.filter(df.flg!=0)
> >>> flg0.registerTempTable("t_flg0")
> >>> flg1.registerTempTable("t_flg1")
>
> >>> j = sqlContext.sql("select *, rank() over (partition by id0 order by
> dist) r from (select *, x.id as id0, y.id as id1, abs(x.nbr/1000 -
> y.nbr/1000) + abs(x.price/100 - y.price/100) as dist from t_flg0 x inner
> join t_flg1 y on (x.city = y.city and x.state = y.state)) x")
>
>
> >>> j.show()
>
> city flg id   nbr price state  city flg id   nbr price state id0 id1  dist r
>   CA   0  1  1000   100     A    CA   1  2  1010    96     A   1   2  0.05 1
>   CA   0  1  1000   100     A    CA   1  3  1010   195     A   1   3  0.96 2
>   NY   0  4  2000   124     B    NY   1  5  2001   128     B   4   5 0.041 1
>   NY   0  6 30000    24     C    NY   1  7 30100    27     C   6   7  0.13 1
>   NY   0  6 30000    24     C    NY   1  9 33000    39     C   6   9  3.15 2
>   NY   0  8 30200    29     C    NY   1  7 30100    27     C   8   7  0.12 1
>   NY   0  8 30200    29     C    NY   1  9 33000    39     C   8   9   2.9 2
> Wherever r=1, you got a match.
>
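> For reference, roughly the same thing with the DataFrame API and a Window
> spec; an untested sketch (the x_/y_ column prefixes and the variable names
> are just for disambiguation after the join):
>
> >>> from pyspark.sql import Window
> >>> from pyspark.sql import functions as F
> >>> x = flg0.select(*[flg0[c].alias("x_" + c) for c in flg0.columns])
> >>> y = flg1.select(*[flg1[c].alias("y_" + c) for c in flg1.columns])
> >>> pairs = x.join(y, (x.x_city == y.y_city) & (x.x_state == y.y_state))
> >>> pairs = pairs.withColumn("dist", F.abs(pairs.x_nbr/1000 - pairs.y_nbr/1000)
> ...     + F.abs(pairs.x_price/100 - pairs.y_price/100))
> >>> w = Window.partitionBy("x_id").orderBy("dist")
> >>> nearest = pairs.withColumn("r", F.rank().over(w)).filter("r = 1")
> >>> nearest.select(F.col("x_id").alias("id0"), F.col("y_id").alias("id1")).show()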
>
>
> On Wed, Sep 14, 2016 at 5:45 AM, Mobius ReX wrote:
>
>> Hi Sean,
>>
>> Great!
>>
>> Is there any sample code implementing Locality Sensitive Hashing with
>> Spark, in either Scala or Python?
>>
>> "However, if your rule is really like 'must match column A and B and
>> then closest value in column C', then just ordering everything by A, B,
>> C lets you pretty much read off the answer from the result set
>> directly. Everything is closest to one of its two neighbors."
>>
>> This is interesting, since we could use the Lead/Lag windowing functions
>> if we had only one continuous column. However, our rule is "must match
>> columns A and B, then take the closest values in columns C and D": for
>> any ID with column E = 0, find the closest ID with column E = 1.
>> The distance metric between ID1 (with column E = 0) and ID2 (with column
>> E = 1) is defined as
>> abs(C1/C1 - C2/C1) + abs(D1/D1 - D2/D1)
>> and one cannot collapse this to
>> abs((C1/C1 + D1/D1) - (C2/C1 + D2/D1))
>> because deviations of opposite sign would cancel.
>>
>>
>> Any further tips?
>>
>> Best,
>> Rex
>>
>>
>>
>> On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen wrote:
>>
>>> The key is really to specify the distance metric that defines
>>> "closeness" for you. You have features that aren't on the same scale,
>>> and some that aren't continuous. You might look to clustering for
>>> ideas here, though mostly you just want to normalize the scale of
>>> dimensions to make them comparable.
>>>
>>> You can find nearest neighbors by brute force. If speed really matters
>>> you can consider locality sensitive hashing, which isn't that hard to
>>> implement and can give a lot of speed for a small cost in accuracy.
>>>
>>> However, if your rule is really like "must match column A and B and
>>> then closest value in column C", then just ordering everything by A, B,
>>> C lets you pretty much read off the answer from the result set
>>> directly. Everything is closest to one of its two neighbors.
>>>
>>> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX wrote:
>>> > Given a table
>>> >
>>> >> $cat data.csv
>>> >>
>>> >> ID,State,City,Price,Number,Flag
>>> >> 1,CA,A,100,1000,0
>>> >> 2,CA,A,96,1010,1
>>> >> 3,CA,A,195,1010,1
>>> >> 4,NY,B,124,2000,0
>>> >> 5,NY,B,128,2001,1
>>> >> 6,NY,C,24,30000,0
>>> >> 7,NY,C,27,30100,1
>>> >> 8,NY,C,29,30200,0
>>> >> 9,NY,C,39,33000,1
>>> >
>>> >
>>> > Expected Result:
>>> >
>>> > ID0, ID1
>>> > 1,2
>>> > 4,5
>>> > 6,7
>>> > 8,7
>>> >
>>> > For each ID with Flag=0 above, we want to find another ID with Flag=1
>>> > that has the same "State" and "City" and the nearest Price and Number,
>>> > normalized by the corresponding values of the ID with Flag=0.
>>> >
>>> > For example, ID=1 and ID=2 have the same State and City but different
>>> > Flag.
>>> > After normalizing Price and Number (Price divided by 100, Number
>>> > divided by 1000), the distance between ID=1 and ID=2 is defined as:
>>> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
>>> >
>>> >
>>> > What's the best way to find such a nearest neighbor? Any valuable tips
>>> > will be greatly appreciated!
>>> >
>>> >
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Hi Sean,

Now let's assume columns C and D are already normalized, so the metric
simplifies to
abs(C1 - C2) + abs(D1 - D2)

Can we get a performance benefit from LSH in that case?

Thank you again!

Best,
Rex


On Tue, Sep 13, 2016 at 12:47 PM, Sean Owen wrote:

> Given the nature of your metric, I don't think you can use things like
> LSH, which more or less depend on a continuous metric space. This is
> too specific to fit usefully into a general framework, I think, but
> you can solve this directly with some code without much trouble.
>
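> To make "solve this directly" concrete, here is a rough, untested sketch
> of the idea on the sample rows: plain Python on the driver, using the
> example's /1000 and /100 normalization constants. The same grouping would
> distribute as a groupBy on (state, city), and it assumes every group has
> at least one Flag=1 row.
>
> from collections import defaultdict
>
> rows = [  # (id, state, city, price, number, flag), from data.csv
>     (1, "CA", "A", 100, 1000, 0), (2, "CA", "A", 96, 1010, 1),
>     (3, "CA", "A", 195, 1010, 1), (4, "NY", "B", 124, 2000, 0),
>     (5, "NY", "B", 128, 2001, 1), (6, "NY", "C", 24, 30000, 0),
>     (7, "NY", "C", 27, 30100, 1), (8, "NY", "C", 29, 30200, 0),
>     (9, "NY", "C", 39, 33000, 1)]
>
> groups = defaultdict(lambda: ([], []))  # (state, city) -> (flag0, flag1)
> for r in rows:
>     groups[(r[1], r[2])][r[5]].append(r)
>
> for zeros, ones in groups.values():
>     for z in zeros:
>         # nearest Flag=1 row under the normalized L1 metric
>         best = min(ones, key=lambda o: abs(z[4] - o[4]) / 1000.0
>                    + abs(z[3] - o[3]) / 100.0)
>         print("%d %d" % (z[0], best[0]))  # yields the pairs 1-2, 4-5, 6-7, 8-7
>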
> On Tue, Sep 13, 2016 at 8:45 PM, Mobius ReX wrote:
> > Hi Sean,
> >
> > Great!
> >
> > Is there any sample code implementing Locality Sensitive Hashing with
> > Spark, in either Scala or Python?
> >
> > "However, if your rule is really like 'must match column A and B and
> > then closest value in column C', then just ordering everything by A, B,
> > C lets you pretty much read off the answer from the result set
> > directly. Everything is closest to one of its two neighbors."
> >
> > This is interesting, since we could use the Lead/Lag windowing functions
> > if we had only one continuous column. However, our rule is "must match
> > columns A and B, then take the closest values in columns C and D": for
> > any ID with column E = 0, find the closest ID with column E = 1.
> > The distance metric between ID1 (with column E = 0) and ID2 (with column
> > E = 1) is defined as
> > abs(C1/C1 - C2/C1) + abs(D1/D1 - D2/D1)
> > and one cannot collapse this to
> > abs((C1/C1 + D1/D1) - (C2/C1 + D2/D1))
> > because deviations of opposite sign would cancel.
> >
> >
> > Any further tips?
> >
> > Best,
> > Rex
> >
> >
> >
> > On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen wrote:
> >>
> >> The key is really to specify the distance metric that defines
> >> "closeness" for you. You have features that aren't on the same scale,
> >> and some that aren't continuous. You might look to clustering for
> >> ideas here, though mostly you just want to normalize the scale of
> >> dimensions to make them comparable.
> >>
> >> You can find nearest neighbors by brute force. If speed really matters
> >> you can consider locality sensitive hashing, which isn't that hard to
> >> implement and can give a lot of speed for a small cost in accuracy.
> >>
> >> However, if your rule is really like "must match column A and B and
> >> then closest value in column C", then just ordering everything by A, B,
> >> C lets you pretty much read off the answer from the result set
> >> directly. Everything is closest to one of its two neighbors.
> >>
> >> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX wrote:
> >> > Given a table
> >> >
> >> >> $cat data.csv
> >> >>
> >> >> ID,State,City,Price,Number,Flag
> >> >> 1,CA,A,100,1000,0
> >> >> 2,CA,A,96,1010,1
> >> >> 3,CA,A,195,1010,1
> >> >> 4,NY,B,124,2000,0
> >> >> 5,NY,B,128,2001,1
> >> >> 6,NY,C,24,30000,0
> >> >> 7,NY,C,27,30100,1
> >> >> 8,NY,C,29,30200,0
> >> >> 9,NY,C,39,33000,1
> >> >
> >> >
> >> > Expected Result:
> >> >
> >> > ID0, ID1
> >> > 1,2
> >> > 4,5
> >> > 6,7
> >> > 8,7
> >> >
> >> > For each ID with Flag=0 above, we want to find another ID with Flag=1
> >> > that has the same "State" and "City" and the nearest Price and Number,
> >> > normalized by the corresponding values of the ID with Flag=0.
> >> >
> >> > For example, ID=1 and ID=2 have the same State and City but different
> >> > Flag.
> >> > After normalizing Price and Number (Price divided by 100, Number
> >> > divided by 1000), the distance between ID=1 and ID=2 is defined as:
> >> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
> >> >
> >> >
> >> > What's the best way to find such a nearest neighbor? Any valuable tips
> >> > will be greatly appreciated!
> >> >
> >> >
> >
> >
>


Re: What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Hi Sean,

Great!

Is there any sample code implementing Locality Sensitive Hashing with
Spark, in either Scala or Python?

"However, if your rule is really like 'must match column A and B and
then closest value in column C', then just ordering everything by A, B,
C lets you pretty much read off the answer from the result set
directly. Everything is closest to one of its two neighbors."

This is interesting, since we could use the Lead/Lag windowing functions if
we had only one continuous column. However, our rule is "must match columns
A and B, then take the closest values in columns C and D": for any ID with
column E = 0, find the closest ID with column E = 1.
The distance metric between ID1 (with column E = 0) and ID2 (with column E
= 1) is defined as
abs(C1/C1 - C2/C1) + abs(D1/D1 - D2/D1)
and one cannot collapse this to
abs((C1/C1 + D1/D1) - (C2/C1 + D2/D1))
because deviations of opposite sign would cancel.

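A quick numeric check of why the collapsed form differs, using ID=1 vs ID=2
from the sample data (taking C = Price and D = Number):

C1, C2, D1, D2 = 100.0, 96.0, 1000.0, 1010.0
per_dim = abs(C1/C1 - C2/C1) + abs(D1/D1 - D2/D1)     # 0.04 + 0.01 = 0.05
collapsed = abs((C1/C1 + D1/D1) - (C2/C1 + D2/D1))    # |2.0 - 1.97| = 0.03
# The two deviations have opposite signs and partly cancel in the collapsed
# form, so the per-dimension absolute values have to be summed separately.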

Any further tips?

Best,
Rex



On Tue, Sep 13, 2016 at 11:09 AM, Sean Owen wrote:

> The key is really to specify the distance metric that defines
> "closeness" for you. You have features that aren't on the same scale,
> and some that aren't continuous. You might look to clustering for
> ideas here, though mostly you just want to normalize the scale of
> dimensions to make them comparable.
>
> You can find nearest neighbors by brute force. If speed really matters
> you can consider locality sensitive hashing, which isn't that hard to
> implement and can give a lot of speed for a small cost in accuracy.
>
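> (For a flavor of the implementation: a toy, driver-side sketch of
> grid-based LSH for an L1-style metric. Hash each point to a randomly
> offset grid cell and only compare points that share a cell. The function
> name and cell width are made up, and a real version would use several
> hash tables to reduce misses near cell boundaries.)
>
> import random
> from collections import defaultdict
>
> def lsh_buckets(points, cell_width, seed=0):
>     # points: list of (id, vector); vectors within about one cell_width
>     # per dimension tend to land in the same grid cell.
>     rng = random.Random(seed)
>     dims = len(points[0][1])
>     offsets = [rng.uniform(0, cell_width) for _ in range(dims)]
>     buckets = defaultdict(list)
>     for pid, vec in points:
>         cell = tuple(int((v + o) // cell_width) for v, o in zip(vec, offsets))
>         buckets[cell].append((pid, vec))
>     return buckets
>
> # Exact distances are then computed only within each bucket, turning the
> # all-pairs scan into many small local scans.
>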
> However, if your rule is really like "must match column A and B and
> then closest value in column C", then just ordering everything by A, B,
> C lets you pretty much read off the answer from the result set
> directly. Everything is closest to one of its two neighbors.
>
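> (Read in Spark terms for a single continuous column C after matching on A
> and B, that could look like the following untested sketch; df and the
> column names A, B, C are placeholders:)
>
> from pyspark.sql import Window
> from pyspark.sql import functions as F
>
> w = Window.partitionBy("A", "B").orderBy("C")
> neighbors = (df
>     .withColumn("prev_C", F.lag("C").over(w))
>     .withColumn("next_C", F.lead("C").over(w)))
> # Each row's nearest neighbor on C is whichever of prev_C / next_C is
> # closer; a null on one side means the row sits at a partition edge.
>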
> On Tue, Sep 13, 2016 at 6:18 PM, Mobius ReX wrote:
> > Given a table
> >
> >> $cat data.csv
> >>
> >> ID,State,City,Price,Number,Flag
> >> 1,CA,A,100,1000,0
> >> 2,CA,A,96,1010,1
> >> 3,CA,A,195,1010,1
> >> 4,NY,B,124,2000,0
> >> 5,NY,B,128,2001,1
> >> 6,NY,C,24,30000,0
> >> 7,NY,C,27,30100,1
> >> 8,NY,C,29,30200,0
> >> 9,NY,C,39,33000,1
> >
> >
> > Expected Result:
> >
> > ID0, ID1
> > 1,2
> > 4,5
> > 6,7
> > 8,7
> >
> > For each ID with Flag=0 above, we want to find another ID with Flag=1
> > that has the same "State" and "City" and the nearest Price and Number,
> > normalized by the corresponding values of the ID with Flag=0.
> >
> > For example, ID=1 and ID=2 have the same State and City but different
> > Flag.
> > After normalizing Price and Number (Price divided by 100, Number
> > divided by 1000), the distance between ID=1 and ID=2 is defined as:
> > abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
> >
> >
> > What's the best way to find such a nearest neighbor? Any valuable tips
> > will be greatly appreciated!
> >
> >
>


What's the best way to find the nearest neighbor in Spark? Any windowing function?

2016-09-13 Thread Mobius ReX
Given a table

> $cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,30000,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1


Expected Result:

ID0, ID1
1,2
4,5
6,7
8,7

For each ID with Flag=0 above, we want to find another ID with Flag=1 that
has the same "State" and "City" and the nearest Price and Number, normalized
by the corresponding values of the ID with Flag=0.

For example, ID=1 and ID=2 have the same State and City but different
Flag.
After normalizing Price and Number (Price divided by 100, Number divided
by 1000), the distance between ID=1 and ID=2 is defined as:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05


What's the best way to find such a nearest neighbor? Any valuable tips will
be greatly appreciated!


What's the best way to detect and remove outliers in a table?

2016-09-01 Thread Mobius ReX
Given a table with hundreds of columns mixing categorical and numerical
attributes, where the distribution of values is unknown, what's the best
way to detect outliers?

For example, given a table
Category  Price
A         1
A         1.3
A         100
C         1

If category C above appears rarely, for example in less than 0.1% of rows,
then we should remove all rows with Category=C.

Assuming a continuous distribution, if the Price of Category A is rarely
above 1000, then the 100 above is another outlier.

What's the best scalable way to remove all outliers? It would be laborious
to plot the distribution curve for each numerical column and a histogram
for each categorical column.

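Programmatically, the kind of thing I am picturing is the rough PySpark
sketch below, where df is the input table; the 0.1% cutoff and the
1st/99th percentile bounds are just placeholders, and per-category Price
bounds would need a further groupBy:

from pyspark.sql import functions as F

# Drop rows whose category covers less than 0.1% of all rows.
total = df.count()
common = (df.groupBy("Category").count()
            .filter(F.col("count") >= 0.001 * total)
            .select("Category"))
df = df.join(common, on="Category")

# Drop numeric values outside the approximate 1st/99th percentiles
# (approxQuantile is in Spark 2.0; relativeError=0.01 keeps it cheap).
lo, hi = df.approxQuantile("Price", [0.01, 0.99], 0.01)
df = df.filter((F.col("Price") >= lo) & (F.col("Price") <= hi))
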
Any tips would be greatly appreciated!

Regards
Rex