Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
Hi, interesting problem :) This is where my knowledge is limited, but as I understand it, this is a clustering problem over names. You may want to find which names belong to the same group by computing, say, a distance between them. Spark supports a few clustering algorithms under mllib. Love to know

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Many thanks. This is a challenge of some sort. I did this for my own work. I downloaded my bank account statements for the past few years in CSV format and loaded them into a Hive ORC table with the databricks stuff. All tables are transactional and bucketed. A typical row looks like this (they vary from bank to

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
Hi, here is a quick setup (based on the airlines.txt dataset): from datetime import datetime, timedelta from pyspark.sql.types import *

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
It is true that whatever an analytic function does can also be done in standard SQL, with joins and subqueries. But the same routine done with an analytic function is always faster, or at least as fast, when compared to standard SQL. I will try to see if I can do analytic functions with Spark FP on Data

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
The point is, window functions are designed to be faster because they do the calculation in one pass, instead of the two passes needed in the max case. DataFrames support window functions (using sql.Window), so instead of writing SQL you can use those as well. Best, Ayan. On Sun, Jul 31, 2016 at 7:48 PM, Mich
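To make the comparison concrete, here is a minimal sketch of the two query shapes discussed in this thread: the two-step "max date, then filter" form and the single rank() window form. It uses Python's built-in sqlite3 module purely as a stand-in for Hive/Spark SQL (an assumption, and it needs SQLite 3.25 or later for window functions). The sample rows are made up; the table and column names are the ones used elsewhere in the thread. It only shows that both forms return the same row and says nothing about performance.

```python
import sqlite3

# In-memory SQLite database standing in for the Hive/Spark table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table ll_18740868 ("
    "transactiondate text, transactiondescription text, debitamount real)"
)

# Made-up sample rows, for illustration only.
rows = [
    ("2015-11-02", "XYZ LTD", 10.0),
    ("2015-12-15", "XYZ LTD", 25.5),
    ("2015-12-20", "OTHER SHOP", 7.0),
]
conn.executemany("insert into ll_18740868 values (?, ?, ?)", rows)

# Two-step form: a subquery finds the latest date, the outer query fetches the row(s).
two_pass = conn.execute("""
    select transactiondate, transactiondescription, debitamount
    from ll_18740868
    where transactiondescription like '%XYZ%'
      and transactiondate = (select max(transactiondate)
                             from ll_18740868
                             where transactiondescription like '%XYZ%')
""").fetchall()

# Window form: rank the matching rows by date (latest first) and keep rank 1.
windowed = conn.execute("""
    select transactiondate, transactiondescription, debitamount
    from (select transactiondate, transactiondescription, debitamount,
                 rank() over (order by transactiondate desc) r
          from ll_18740868
          where transactiondescription like '%XYZ%') rs
    where r = 1
""").fetchall()

print(two_pass)
print(windowed)
```

Both queries return the single 2015-12-15 row for the sample data. The derived table is aliased `rs` rather than `inner`, since `inner` is reported as a reserved word in Hive in this thread.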

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
yes reserved word issue thanks hive> select * > from (select transactiondate, transactiondescription, debitamount > , rank() over (order by transactiondate desc) r > from accounts.ll_18740868 where transactiondescription like '%HARRODS%' > ) RS > where r=1 > ; Query ID =

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
I think the word "INNER" is reserved in Hive. Please change the alias to something else. Not sure about scala, but essentially it is string replacement. On Sun, Jul 31, 2016 at 7:27 PM, Mich Talebzadeh wrote: > thanks how about scala? > > BTW the same analytic code

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
thanks how about scala? BTW the same analytic code fails in Hive itself:( hive> select * > from (select transactiondate, transactiondescription, debitamount > from (select transactiondate, transactiondescription, debitamount > , rank() over (order by transactiondate desc) r >

Re: How to filter based on a constant value

2016-07-31 Thread ayan guha
Hi, this is because Spark does not provide a way to "bind" variables as Oracle does. Instead, you can build the SQL string, as below (in Python): val = 'XYZ' sqlbase = "select . where col = ''".replace(',val) On Sun, Jul 31, 2016 at 6:25 PM, Mich Talebzadeh
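The string-building idea can be sketched in Python as follows. The placeholder token `{val}`, the template text, and the helper name `build_query` are all hypothetical and chosen for illustration; the table and column names are borrowed from other messages in this thread. The resulting string is what would then be handed to the SQL entry point (for example `sqlContext.sql(...)` in Spark 1.6).

```python
# Hypothetical template; "{val}" is an illustrative placeholder token,
# not something defined by Spark.
SQL_TEMPLATE = (
    "select transactiondate, transactiondescription, debitamount "
    "from ll_18740868 "
    "where transactiondescription like '%{val}%'"
)


def build_query(template, val):
    """Return the template with the placeholder replaced by val (plain text substitution)."""
    return template.replace("{val}", val)


val = "XYZ"
sqltext = build_query(SQL_TEMPLATE, val)
print(sqltext)
```

Because this is plain text substitution with no escaping, it is only reasonable for values you control yourself, such as a constant in your own script.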

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks Ayan. This is the one I used scala> sqltext = """ | select * | from (select transactiondate, transactiondescription, debitamount | , rank() over (order by transactiondate desc) r | from ll_18740868 where transactiondescription like '%XYZ%' | ) inner |

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks all scala> var maxdate = ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")).collect.apply(0).getDate(0) maxdate: java.sql.Date = 2015-12-15 scala> ll_18740868.filter(col("transactiondescription").contains(HASHTAG) && col("transactiondate") ===

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks Nicholas, got it.

Re: How to filter based on a constant value

2016-07-31 Thread Nicholas Hakobian
From the online docs: https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/Row.html#apply(int) java.lang.Object apply(int i) Returns the value at position i. If the value is null, null is returned. The following is a mapping between Spark SQL types and return types: So its

Re: How to filter based on a constant value

2016-07-31 Thread Mich Talebzadeh
Thanks gents. I am trying to understand this better. As I understand it, a DataFrame is basically the equivalent of a table in relational terms. So scala> var maxdate = ll_18740868.filter(col("transactiondescription").contains(HASHTAG)).agg(max("transactiondate")) maxdate: org.apache.spark.sql.DataFrame

Re: How to filter based on a constant value

2016-07-30 Thread Xinh Huynh
Hi Mitch, I think you were missing a step: [your result] maxdate: org.apache.spark.sql.Row = [2015-12-15] Since maxdate is of type Row, you would want to extract the first column of the Row with: >> val maxdateStr = maxdate.getString(0) assuming the column type is String. API doc is here:

Re: How to filter based on a constant value

2016-07-30 Thread ayan guha
select * from (select *, rank() over (order by transactiondate) r from ll_18740868 where transactiondescription='XYZ' ) inner where r=1 Hi Mitch, If using SQL is fine, you can try the code above. You need to register ll_18740868 as temp table. On Sun, Jul 31, 2016 at

How to filter based on a constant value

2016-07-30 Thread Mich Talebzadeh
Hi, I would like to find out when I last paid a company by debit card. This is the way I do it: 1) Find the date when I last paid. 2) Find the rest of the details from the row(s). So var HASHTAG = "XYZ" scala> var maxdate =