kafka + mysql filtering problem

2016-02-29 Thread franco barrientos
Hi all,

I want to read some filtering rules from MySQL (via the JDBC MySQL driver). Specifically, it's a char field containing a field name and a value, which I want to apply to a Kafka streaming input.

The main idea is to drive this from a web UI (Livy server).

Any suggestions or guidelines?

e.g., I have this:

object Streaming {
  def main(args: Array[String]) {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads) = args
    val spc = SparkContext.getOrCreate()
    val ssc = new StreamingContext(spc, Seconds(3))
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topics -> 5)).map(_._2)

    /* TEST MYSQL */
    val sqlContext = new SQLContext(spc)
    val prop = new java.util.Properties
    val url = "jdbc:mysql://52.22.38.81:3306/tmp"
    val tbl_users = "santander_demo_users"
    val tbl_rules = "santander_demo_filters"
    val tbl_campaigns = "santander_demo_campaigns"
    prop.setProperty("user", "root")
    prop.setProperty("password", "Exalitica2014")
    val users = sqlContext.read.jdbc(url, tbl_users, prop)
    val rules = sqlContext.read.jdbc(url, tbl_rules, prop)
    val campaigns = sqlContext.read.jdbc(url, tbl_campaigns, prop)
    val toolbox = currentMirror.mkToolBox()
    val toRemove = "\"".toSet
    var mto = "0"

    def rule_apply(n: Int, t: String, rules: DataFrame): String = {
      // read the rule for id n from the MySQL-backed DataFrame
      val r = rules.filter(rules("CID") === n).select("FILTER_DSC").first()(0).toString()

      // use mkToolBox to evaluate the rule against the incoming value
      toolbox.eval(toolbox.parse("""
        val mto = """ + t + """
        if (""" + r + """) "true" else "false"
      """)).toString()
    }
    /* TEST MYSQL */

    lines.map { x =>
      if (x.split(",").length > 1) {
        // read the 6th field from the kafka input
        mto = x.split(",")(5).filterNot(toRemove)
      }
    }
    val msg = rule_apply(1, mto, rules)
    val word = lines.map(x => msg)
    word.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

The problem is that the mto variable always reverts to the "0" value after mapping over the lines DStream. I tried to move rule_apply inside the map, but then I get a "not serializable" error for the mkToolBox class.
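
A minimal sketch of one possible workaround, assuming the rest of the main method above stays as-is: collect the rule text once on the driver, broadcast it as a plain string, and build the toolbox inside mapPartitions so it is created on each executor rather than serialized from the driver. The names ruleBc and verdicts are mine, not from the original code:

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Broadcast only plain strings; DataFrames and toolboxes are not serializable.
val ruleText = rules.filter(rules("CID") === 1)
  .select("FILTER_DSC").first().getString(0)
val ruleBc = spc.broadcast(ruleText)

val verdicts = lines.mapPartitions { iter =>
  // Built here, on the executor, once per partition per batch,
  // so the toolbox never has to cross the wire.
  val toolbox = currentMirror.mkToolBox()
  iter.map { x =>
    val fields = x.split(",")
    if (fields.length > 5) {
      val mto = fields(5).filterNot("\"".toSet)
      val snippet = "val mto = " + mto + "\n" +
        "if (" + ruleBc.value + ") \"true\" else \"false\""
      toolbox.eval(toolbox.parse(snippet)).toString
    } else "false"
  }
}
verdicts.print()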

Thanks in advance.

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com 

www.exalitica.com




TF-IDF Question

2015-06-04 Thread franco barrientos
Hi all!,

I have a .txt file where each row is the collection of terms of one document, separated by spaces. For example:

1 "Hola spark"
2 ...

I followed this example from the Spark site,
https://spark.apache.org/docs/latest/mllib-feature-extraction.html, and I get
something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector =
(1048576,[35587,884670],[3.458767233,3.458767233])

I think the following:

1. The first parameter, 1048576, I don't know what it is, but it's always the same number (maybe the number of terms).
2. The second parameter, [35587,884670], I think these are the terms of the first line in my .txt file.
3. The third parameter, [3.458767233,3.458767233], I think these are the tf-idf values for my terms.

Does anyone know the exact interpretation of this? And regarding the second point, if those values are the terms, how can I match them back to the original term strings ([35587=Hola, 884670=spark])?
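
For what it's worth, a sketch assuming the default HashingTF from the guide linked above: 1048576 is 2^20, the default size of the hashed feature space, so the indices are hash buckets rather than term ids. The hash is one-way, but you can rebuild an index-to-term map for a known vocabulary with indexOf:

import org.apache.spark.mllib.feature.HashingTF

// Same default dimension (2^20 = 1048576) as in the linked guide.
val hashingTF = new HashingTF()

// Hash each known term and invert the mapping; two terms can collide
// into the same bucket, in which case they share an index.
val indexToTerm: Map[Int, String] =
  Seq("Hola", "spark")
    .map(term => hashingTF.indexOf(term) -> term)
    .toMap

// e.g. indexToTerm.get(35587) recovers the term hashed to 35587, if any.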

Regards and thanks in advance.

Franco Barrientos
Data Scientist
Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com




null Error in ALS model predict

2014-12-24 Thread Franco Barrientos
Hi all!,

 

I have an RDD[(Int, Int, Double, Double)] where the first two Int values are an id
and a product, respectively. I trained an implicit ALS model and want to
make predictions from this RDD. I tried two things which I think should be
equivalent:

 

1. Convert this RDD to RDD[(Int, Int)] and use
model.predict(RDD[(Int, Int)]); this works for me!

2. Map over the RDD and apply model.predict(int, int) per element, for example:

val ratings = RDD[(Int, Int, Double, Double)].map { case (id, rubro, rating, resp) =>
  model.predict(id, rubro)
}

where ratings is an RDD[Double].

 

Now, with the second approach, when I call ratings.first() I get the following error:

[inline error not preserved in the archive; per the subject line, a null error]

Why does this happen? I need to use this second approach.
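
The usual cause, and a sketch of the join-back pattern (the helper name scoreAll is made up here): MatrixFactorizationModel keeps its factor matrices in RDDs, which are only reachable from the driver, so calling model.predict inside a transformation leaves the executors with null internals. Predicting on the whole RDD of pairs and joining the scores back avoids this:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

// Predict for every (id, product) pair in one call on the driver,
// then join the predicted rating back to the original rows by key.
def scoreAll(model: MatrixFactorizationModel,
             data: RDD[(Int, Int, Double, Double)]): RDD[(Int, Int, Double, Double, Double)] = {
  val pairs = data.map { case (id, product, _, _) => (id, product) }
  val preds = model.predict(pairs).map(r => ((r.user, r.product), r.rating))
  data.map { case (id, product, rating, resp) => ((id, product), (rating, resp)) }
    .join(preds)
    .map { case ((id, product), ((rating, resp), pred)) =>
      (id, product, rating, resp, pred)
    }
}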

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com



 



RE: Effects problems in logistic regression

2014-12-22 Thread Franco Barrientos
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!

 

From: Franco Barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, December 18, 2014 4:42 PM
To: 'DB Tsai'
CC: 'Sean Owen'; user@spark.apache.org
Subject: RE: Effects problems in logistic regression

 

Thanks I will try.

 

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, December 18, 2014 4:24 PM
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Can you try LogisticRegressionWithLBFGS? I verified that it converges to the 
same result as a model trained by R's glmnet package without regularization. The 
problem with LogisticRegressionWithSGD is that it's very slow to converge, 
and a lot of the time it's very sensitive to the step size, which can lead to a 
wrong answer.

 

The regularization logic in MLlib is not entirely correct: it penalizes 
the intercept. In general, with really high regularization, all the 
coefficients will be zero except the intercept. In logistic regression, the 
non-zero intercept can be understood as the prior probability of each class, 
and in linear regression it will be the mean of the response. I'll have a PR to 
fix this issue.
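
For reference, a minimal sketch of the suggested switch, assuming MLlib 1.x and a binary label; trainLBFGS is just an illustrative wrapper name:

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// LBFGS chooses its own step sizes, avoiding SGD's step-size sensitivity.
def trainLBFGS(training: RDD[LabeledPoint]): LogisticRegressionModel = {
  new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .setIntercept(true) // fit the intercept separately, as in R's glm
    .run(training)
}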





Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

 

On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos 
<franco.barrien...@exalitica.com> wrote:

Yes, without the “amounts” variables the results are similar. When I put in other 
variables it's fine.

 

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, December 18, 2014 2:22 PM
To: Franco Barrientos
CC: user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Are you sure this is an apples-to-apples comparison? For example, does your SAS 
process normalize or otherwise transform the data first?

 

Is the optimization configured similarly in both cases -- same regularization, 
etc.?

 

Are you sure you are pulling out the intercept correctly? It is a separate 
value from the logistic regression model in Spark.

 

On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos 
<franco.barrien...@exalitica.com> wrote:

Hi all!,

 

I have a problem with LogisticRegressionWithSGD. When I train a data set with 
one variable (which is the amount of an item) plus an intercept, I get weights of
(-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to 
me, because when I run a logistic regression on the same data in SAS I get the 
weights (-2.6604, 0.000245).

 

The range of this variable is 0 to 59102, with a mean of 1158.

 

The problem is when I want to calculate the probabilities for each user in the 
data set: in many cases this probability is near zero or exactly zero, because when 
Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a huge number 
(in fact, infinity for Spark).

 

How can I treat this variable? Or why did this happen?

 

Thanks,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com



 



Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Hi all!,

 

I have a problem with LogisticRegressionWithSGD. When I train a data set
with one variable (which is the amount of an item) plus an intercept, I get weights
of
(-0.4021, -207.1749) for the two features, respectively. This doesn't make sense
to me, because when I run a logistic regression on the same data in SAS I get
the weights (-2.6604, 0.000245).

 

The range of this variable is 0 to 59102, with a mean of 1158.

 

The problem is when I want to calculate the probabilities for each user in the
data set: in many cases this probability is near zero or exactly zero, because
when Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a huge
number (in fact, infinity for Spark).

 

How can I treat this variable? Or why did this happen?
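
Not from this thread, but one common way to treat a feature with this kind of range before SGD is to standardize it with MLlib's StandardScaler, so a single step size can serve both coefficients; a sketch:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Rescale each feature to zero mean and unit variance; raw amounts up to
// 59102 otherwise dwarf the intercept column and destabilize SGD.
def standardize(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(data.map(_.features))
  data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
}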

 

Thanks,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com



 



RE: Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Thanks I will try.

 

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, December 18, 2014 4:24 PM
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Can you try LogisticRegressionWithLBFGS? I verified that it converges to the 
same result as a model trained by R's glmnet package without regularization. The 
problem with LogisticRegressionWithSGD is that it's very slow to converge, 
and a lot of the time it's very sensitive to the step size, which can lead to a 
wrong answer.

 

The regularization logic in MLlib is not entirely correct: it penalizes 
the intercept. In general, with really high regularization, all the 
coefficients will be zero except the intercept. In logistic regression, the 
non-zero intercept can be understood as the prior probability of each class, 
and in linear regression it will be the mean of the response. I'll have a PR to 
fix this issue.





Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

 

On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos 
<franco.barrien...@exalitica.com> wrote:

Yes, without the “amounts” variables the results are similar. When I put in other 
variables it's fine.

 

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, December 18, 2014 2:22 PM
To: Franco Barrientos
CC: user@spark.apache.org
Subject: Re: Effects problems in logistic regression

 

Are you sure this is an apples-to-apples comparison? For example, does your SAS 
process normalize or otherwise transform the data first?

 

Is the optimization configured similarly in both cases -- same regularization, 
etc.?

 

Are you sure you are pulling out the intercept correctly? It is a separate 
value from the logistic regression model in Spark.

 

On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos 
<franco.barrien...@exalitica.com> wrote:

Hi all!,

 

I have a problem with LogisticRegressionWithSGD. When I train a data set with 
one variable (which is the amount of an item) plus an intercept, I get weights of
(-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to 
me, because when I run a logistic regression on the same data in SAS I get the 
weights (-2.6604, 0.000245).

 

The range of this variable is 0 to 59102, with a mean of 1158.

 

The problem is when I want to calculate the probabilities for each user in the 
data set: in many cases this probability is near zero or exactly zero, because when 
Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a huge number 
(in fact, infinity for Spark).

 

How can I treat this variable? Or why did this happen?

 

Thanks,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com



 



Percentile

2014-11-27 Thread Franco Barrientos
Hi folks!,

 

Does anyone know how I can calculate, for each element of a variable in an RDD,
its percentile? I tried to compute it through Spark SQL with subqueries, but I
think that is impossible in Spark SQL. Any idea will be welcome.
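
A minimal RDD-only sketch of one approach (withPercentile is a made-up helper): sort the values, number them with zipWithIndex, and divide each rank by the total count:

import org.apache.spark.rdd.RDD

// Pair every value with its percentile in (0, 100]; the rank is the
// value's position in sorted order, so ties get distinct ranks here.
def withPercentile(values: RDD[Double]): RDD[(Double, Double)] = {
  val n = values.count()
  values.sortBy(identity)
    .zipWithIndex()
    .map { case (v, rank) => (v, (rank + 1).toDouble / n * 100.0) }
}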

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com



 



join 2 tables

2014-11-12 Thread Franco Barrientos
I have 2 tables in a Hive context, and I want to select one field from each
table where the ids of the two tables are equal. For example,

 

val tmp2 = sqlContext.sql("select a.ult_fecha, b.pri_fecha from
fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id = b.id")

 

but I get an error:

[error output not preserved in the archive]

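The error text did not survive the archive, but for reference, a sketch of the same query with an explicit JOIN, which the Hive query parser also accepts:

val tmp2 = sqlContext.sql("""
  SELECT a.ult_fecha, b.pri_fecha
  FROM fecha_ult_compra_u3m a
  JOIN fecha_pri_compra_u3m b
    ON a.id = b.id
""")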
 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com



 



S3 table to spark sql

2014-11-11 Thread Franco Barrientos
How can I create a date field in Spark SQL? I have an S3 table and I load it
into an RDD.

 

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

import sqlContext.createSchemaRDD

 

case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int,
sku: String, unidades: Double, monto: Double)

 

val tabla = sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt")
  .map(_.split(","))
  .map(p => trx_u3m(p(0).trim.toString, p(1).trim.toString, p(2).trim.toString,
    p(3).trim.toInt, p(4).trim.toString, p(5).trim.toDouble, p(6).trim.toDouble))

tabla.registerTempTable("trx_u3m")

 

Now my problem is: how can I transform the string variable (fechau3m) into a
date variable?
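
A minimal sketch, assuming fechau3m looks like "2014-11-11" (adjust the pattern to the real format): parse the string while building the case class, using java.sql.Timestamp, which Spark SQL's Scala reflection maps to a native temporal column type. TrxU3m and parseFecha are illustrative names:

import java.sql.Timestamp
import java.text.SimpleDateFormat

// Parse the raw string so the inferred schema gets a real temporal
// column instead of a plain string.
def parseFecha(s: String): Timestamp = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  new Timestamp(fmt.parse(s.trim).getTime)
}

case class TrxU3m(id: String, local: String, fechau3m: Timestamp, rubro: Int,
  sku: String, unidades: Double, monto: Double)

// then: .map(p => TrxU3m(p(0).trim, p(1).trim, parseFecha(p(2)),
//   p(3).trim.toInt, p(4).trim, p(5).trim.toDouble, p(6).trim.toDouble))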

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com

