Iterating all columns in a pyspark dataframe

2020-09-04 Thread Devi P.V
Hi all,
What is the best approach for iterating over all the columns of a PySpark
dataframe? I want to apply some conditions on all columns in the dataframe.
Currently I am using a for loop for the iteration. Is that a good practice
with Spark? I am using Spark 3.0.
Please advise.

Thanks,
Devi


FP growth - Items in a transaction must be unique

2017-02-01 Thread Devi P.V
Hi all,

I am trying to run the FP-growth algorithm using Spark and Scala. A sample of
the input dataframe is the following:

+---------------------------------------------------------------------------------------------+
|productName                                                                                    |
+---------------------------------------------------------------------------------------------+
|Apple Iphone 7 128GB Jet Black with Facetime                                                   |
|Levi’s Blue Slim Fit Jeans- L5112,Rimmel London Lasting Finish Matte by Kate Moss 101 Dusky    |
|Iphone 6 Plus (5.5",Limited Stocks, TRA Oman Approved)                                         |
+---------------------------------------------------------------------------------------------+

Each row contains unique items.

I converted it into an RDD as follows:

val transactions = names.as[String].rdd.map(s => s.split(","))

val fpg = new FPGrowth().
  setMinSupport(0.3).
  setNumPartitions(100)


val model = fpg.run(transactions)

But I got the following error:

WARN TaskSetManager: Lost task 2.0 in stage 27.0 (TID 622, localhost):
org.apache.spark.SparkException:
Items in a transaction must be unique but got WrappedArray(
Huawei GR3 Dual Sim 16GB 13MP 5Inch 4G,
 Huawei G8 Gold 32GB,  4G,
5.5 Inches, HTC Desire 816 (Dual Sim, 3G, 8GB),
 Samsung Galaxy S7 Single Sim - 32GB,  4G LTE,
Gold, Huawei P8 Lite 16GB,  4G LTE, Huawei Y625,
Samsung Galaxy Note 5 - 32GB,  4G LTE,
Samsung Galaxy S7 Dual Sim - 32GB)


How to solve this?
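
The repeated tokens in the error above (e.g. " 4G" and " 4G LTE" appearing
several times) suggest the split on "," is also breaking apart product names
that themselves contain commas. A minimal sketch, assuming it is acceptable to
trim whitespace and de-duplicate the items of each transaction before running
FP-growth:

import org.apache.spark.mllib.fpm.FPGrowth

// Split each row, trim whitespace, drop empty tokens, and keep each item only
// once, since FPGrowth requires the items of a transaction to be unique.
val transactions = names.as[String].rdd.map { s =>
  s.split(",").map(_.trim).filter(_.nonEmpty).distinct
}

val fpg = new FPGrowth()
  .setMinSupport(0.3)
  .setNumPartitions(100)

val model = fpg.run(transactions)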


Thanks


How to find unique values after groupBy() in spark dataframe ?

2016-12-08 Thread Devi P.V
Hi all,

I have a dataframe like the following:

+---------+----------+
|client_id|Date      |
+---------+----------+
|a        |2016-11-23|
|b        |2016-11-18|
|a        |2016-11-23|
|a        |2016-11-23|
|a        |2016-11-24|
+---------+----------+

I want to find the unique dates for each client_id using a Spark dataframe.

Expected output:

a  (2016-11-23, 2016-11-24)
b   2016-11-18

I tried df.groupBy("client_id"), but I don't know how to find the distinct
values after groupBy().
How can I do this?
Are there any other efficient methods available for doing this?
I am using Scala 2.11.8 & Spark 2.0.
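
A minimal sketch, assuming the goal is the set of distinct dates per client and
that collecting them into an array column is acceptable:

import org.apache.spark.sql.functions.collect_set

// collect_set gathers the distinct Date values of each client_id into an array column
val uniqueDates = df.groupBy("client_id")
  .agg(collect_set("Date").as("dates"))

uniqueDates.show(false)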


Thanks


Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Devi P.V
Thanks. It works.

On Mon, Dec 5, 2016 at 2:03 PM, Michal Šenkýř <mike.sen...@gmail.com> wrote:

> Yet another approach:
> scala> val df1 = df.selectExpr("client_id",
>   "from_unixtime(ts/1000,'yyyy-MM-dd') as ts")
>
> Mgr. Michal Šenkýřmike.sen...@gmail.com
> +420 605 071 818
>
> On 5.12.2016 09:22, Deepak Sharma wrote:
>
> Another simpler approach would be:
> scala> val findf = sqlContext.sql(
>   "select client_id, from_unixtime(ts/1000,'yyyy-MM-dd') ts from ts")
> findf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string]
>
> scala> findf.show
> +--------------------+----------+
> |           client_id|        ts|
> +--------------------+----------+
> |cd646551-fceb-416...|2016-11-01|
> |3bc61951-0f49-43b...|2016-11-01|
> |688acc61-753f-4a3...|2016-11-23|
> |5ff1eb6c-14ec-471...|2016-11-23|
> +--------------------+----------+
>
> I registered a temp table from the original DF.
> Thanks
> Deepak
>
> On Mon, Dec 5, 2016 at 1:49 PM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> This is the correct way to do it. The timestamp conversion mentioned earlier was not
>> correct:
>>
>> scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd")
>> ts1: org.apache.spark.sql.Column = fromunixtime((ts / 1000),yyyy-MM-dd)
>>
>> scala> val finaldf = df.withColumn("ts1",ts1)
>> finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string,
>> ts1: string]
>>
>> scala> finaldf.show
>> +--------------------+-------------+----------+
>> |           client_id|           ts|       ts1|
>> +--------------------+-------------+----------+
>> |cd646551-fceb-416...|1477989416803|2016-11-01|
>> |3bc61951-0f49-43b...|1477983725292|2016-11-01|
>> |688acc61-753f-4a3...|1479899459947|2016-11-23|
>> |5ff1eb6c-14ec-471...|1479901374026|2016-11-23|
>> +--------------------+-------------+----------+
>>
>>
>> Thanks
>> Deepak
>>
>> On Mon, Dec 5, 2016 at 1:46 PM, Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>>
>>> This is how you can do it in Scala:
>>> scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd")
>>> ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd)
>>>
>>> scala> val finaldf = df.withColumn("ts1",ts1)
>>> finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts:
>>> string, ts1: string]
>>>
>>> scala> finaldf.show
>>> +--------------------+-------------+-----------+
>>> |           client_id|           ts|        ts1|
>>> +--------------------+-------------+-----------+
>>> |cd646551-fceb-416...|1477989416803|48805-08-14|
>>> |3bc61951-0f49-43b...|1477983725292|48805-06-09|
>>> |688acc61-753f-4a3...|1479899459947|48866-02-22|
>>> |5ff1eb6c-14ec-471...|1479901374026|48866-03-16|
>>> +--------------------+-------------+-----------+
>>>
>>> The year comes out wrong here. Maybe the input timestamp is not
>>> correct; not sure.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Mon, Dec 5, 2016 at 1:34 PM, Devi P.V <devip2...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for replying to my question.
>>>> I am using Scala.
>>>>
>>>> On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>>  In Python you can use datetime.fromtimestamp(..).strftime('%Y%m%d')
>>>>> Which spark API are you using?
>>>>> Kr
>>>>>
>>>>> On 5 Dec 2016 7:38 am, "Devi P.V" <devip2...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a dataframe like following,
>>>>>>
>>>>>> +--------------------------+-------------+
>>>>>> |client_id                 |timestamp    |
>>>>>> +--------------------------+-------------+
>>>>>> |cd646551-fceb-4166-acbc-b9|1477989416803|
>>>>>> |3bc61951-0f49-43bf-9848-b2|1477983725292|
>>>>>> |688acc61-753f-4a33-a034-bc|1479899459947|
>>>>>> |5ff1eb6c-14ec-4716-9798-00|1479901374026|
>>>>>> +--------------------------+-------------+
>>>>>>
>>>>>> I want to convert the timestamp column into yyyy-MM-dd format.
>>>>>> How to do this?
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>
>
>


Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-05 Thread Devi P.V
Hi,

Thanks for replying to my question.
I am using Scala.

On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hi
>  In Python you can use datetime.fromtimestamp(..).strftime('%Y%m%d')
> Which spark API are you using?
> Kr
>
> On 5 Dec 2016 7:38 am, "Devi P.V" <devip2...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a dataframe like following,
>>
>> +--------------------------+-------------+
>> |client_id                 |timestamp    |
>> +--------------------------+-------------+
>> |cd646551-fceb-4166-acbc-b9|1477989416803|
>> |3bc61951-0f49-43bf-9848-b2|1477983725292|
>> |688acc61-753f-4a33-a034-bc|1479899459947|
>> |5ff1eb6c-14ec-4716-9798-00|1479901374026|
>> +--------------------------+-------------+
>>
>> I want to convert the timestamp column into yyyy-MM-dd format.
>> How to do this?
>>
>>
>> Thanks
>>
>


How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-04 Thread Devi P.V
Hi all,

I have a dataframe like the following:

+--------------------------+-------------+
|client_id                 |timestamp    |
+--------------------------+-------------+
|cd646551-fceb-4166-acbc-b9|1477989416803|
|3bc61951-0f49-43bf-9848-b2|1477983725292|
|688acc61-753f-4a33-a034-bc|1479899459947|
|5ff1eb6c-14ec-4716-9798-00|1479901374026|
+--------------------------+-------------+

I want to convert the timestamp column into yyyy-MM-dd format.
How to do this?


Thanks


what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-15 Thread Devi P.V
Hi all,

I have 4 dataframes, each with three columns:

client_id,product_id,interest

I want to combine these 4 dataframes into one dataframe. I used union as
follows:

df1.union(df2).union(df3).union(df4)

But it is time consuming for big data. What is the optimized way of doing
this using Spark 2.0 & Scala?
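
For what it is worth, union itself is lazy and cheap; most of the time usually
goes into whatever action follows it. A minimal sketch that folds any number of
same-schema dataframes into one, assuming all four share the same schema:

import org.apache.spark.sql.DataFrame

// Fold the dataframes pairwise into a single union.
val combined: DataFrame = Seq(df1, df2, df3, df4).reduce(_ union _)

// If the result is reused several times, caching avoids recomputing the inputs.
combined.cache()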


Thanks


Re: Couchbase-Spark 2.0.0

2016-10-17 Thread Devi P.V
Hi,
I tried the following code:

import com.couchbase.spark._

val conf = new SparkConf()
  .setAppName(this.getClass.getName)
  .setMaster("local[*]")
  .set("com.couchbase.bucket.bucketName", "password")
  .set("com.couchbase.nodes", "node")
  .set("com.couchbase.queryEnabled", "true")

val sc = new SparkContext(conf)

I need the full documents from the bucket, so I gave a query like this:

val query = "SELECT META(`bucketName`).id as id FROM `bucketName` "
 sc
  .couchbaseQuery(Query.simple(query))
  .map(_.value.getString("id"))
  .couchbaseGet[JsonDocument]()
  .collect()
  .foreach(println)

But it can't resolve Query.simple(query).

I used libraryDependencies += "com.couchbase.client" %
"spark-connector_2.11" % "1.2.1" in build.sbt.
Is my query wrong, or do I need another import?


Please help.

On Sun, Oct 16, 2016 at 8:23 PM, Rodrick Brown <rodr...@orchardplatform.com>
wrote:

>
>
> On Sun, Oct 16, 2016 at 10:51 AM, Devi P.V <devip2...@gmail.com> wrote:
>
>> Hi all,
>> I am trying to read data from Couchbase using Spark 2.0.0. I need to fetch
>> the complete data from a bucket as an RDD. How can I do this? Does Spark 2.0.0
>> support Couchbase? Please help.
>>
>> Thanks
>>
> https://github.com/couchbase/couchbase-spark-connector
>
>
> --
>
> Rodrick Brown / DevOps
>
> 9174456839 / rodr...@orchardplatform.com
>
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY
>
>


Couchbase-Spark 2.0.0

2016-10-16 Thread Devi P.V
Hi all,
I am trying to read data from Couchbase using Spark 2.0.0. I need to fetch
the complete data from a bucket as an RDD. How can I do this? Does Spark 2.0.0
support Couchbase? Please help.

Thanks


Re: How to write data into CouchBase using Spark & Scala?

2016-09-07 Thread Devi P.V
Thanks. Now it is working.

On Thu, Sep 8, 2016 at 12:57 AM, aka.fe2s <aka.f...@gmail.com> wrote:

> Most likely you are missing an import statement that enables some Scala
> implicits. I haven't used this connector, but it looks like you need "import
> com.couchbase.spark._".
>
> --
> Oleksiy Dyagilev
>
> On Wed, Sep 7, 2016 at 9:42 AM, Devi P.V <devip2...@gmail.com> wrote:
>
>> I am a newbie in Couchbase. I am trying to write data into Couchbase. My
>> sample code is the following:
>>
>> val cfg = new SparkConf()
>>   .setAppName("couchbaseQuickstart")
>>   .setMaster("local[*]")
>>   .set("com.couchbase.bucket.MyBucket","pwd")
>>
>> val sc = new SparkContext(cfg)
>> val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", 
>> "content"))
>> val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", 
>> "content", "in", "here"))
>>
>> val data = sc
>>   .parallelize(Seq(doc1, doc2))
>>
>> But I can't access data.saveToCouchbase().
>>
>> I am using Spark 1.6.1 & Scala 2.11.8
>>
>> I gave the following dependencies in build.sbt:
>>
>> libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.1"
>> libraryDependencies += "com.couchbase.client" % "spark-connector_2.11" % 
>> "1.2.1"
>>
>>
>> How can I write data into CouchBase using Spark & Scala?
>>
>>
>>
>>
>>
>


How to write data into CouchBase using Spark & Scala?

2016-09-07 Thread Devi P.V
I am a newbie in Couchbase. I am trying to write data into Couchbase. My
sample code is the following:

val cfg = new SparkConf()
  .setAppName("couchbaseQuickstart")
  .setMaster("local[*]")
  .set("com.couchbase.bucket.MyBucket","pwd")

val sc = new SparkContext(cfg)
val doc1 = JsonDocument.create("doc1",
  JsonObject.create().put("some", "content"))
val doc2 = JsonArrayDocument.create("doc2",
  JsonArray.from("more", "content", "in", "here"))

val data = sc
  .parallelize(Seq(doc1, doc2))

But I can't access data.saveToCouchbase().

I am using Spark 1.6.1 & Scala 2.11.8

I gave the following dependencies in build.sbt:

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.1"
libraryDependencies += "com.couchbase.client" % "spark-connector_2.11" % "1.2.1"


How can I write data into CouchBase using Spark & Scala?
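
Following the suggestion in the reply above, a minimal sketch assuming the only
missing piece is the connector import that brings the saveToCouchbase implicits
into scope:

import com.couchbase.spark._   // enables saveToCouchbase on RDDs of documents

val data = sc.parallelize(Seq(doc1, doc2))
data.saveToCouchbase()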


Re: How to install spark with s3 on AWS?

2016-08-26 Thread Devi P.V
The following piece of code works for me to read data from S3 using Spark.

val conf = new SparkConf().setAppName("Simple
Application").setMaster("local[*]")

val sc = new SparkContext(conf)
val hadoopConf=sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native
.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId",AccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey",SecretKey)
var jobInput = sc.textFile("s3://path to bucket")

Thanks


On Fri, Aug 26, 2016 at 5:16 PM, kant kodali  wrote:

> Hi guys,
>
> Are there any instructions on how to setup spark with S3 on AWS?
>
> Thanks!
>
>


Re: Spark MLlib:Collaborative Filtering

2016-08-24 Thread Devi P.V
Thanks a lot. I solved the problem using StringIndexer.
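
For reference, a minimal sketch of the StringIndexer approach, assuming a
dataframe named ratings with the UserID/ProductID string columns from the
original mail (the index column names are assumptions):

import org.apache.spark.ml.feature.StringIndexer

// Map the string UserID values to a numeric index column.
val withUserIdx = new StringIndexer()
  .setInputCol("UserID")
  .setOutputCol("userIndex")
  .fit(ratings)
  .transform(ratings)

// Do the same for ProductID.
val indexed = new StringIndexer()
  .setInputCol("ProductID")
  .setOutputCol("productIndex")
  .fit(withUserIdx)
  .transform(withUserIdx)

// The index columns are doubles; cast them to Int when building the ALS Ratings.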

On Wed, Aug 24, 2016 at 3:40 PM, Praveen Devarao <praveen...@in.ibm.com>
wrote:

> You could use the string indexer to convert your string user ids and
> product ids to numeric values:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>
> Thanking You
> ----------------------------------------------------------------
> Praveen Devarao
> IBM India Software Labs
> ----------------------------------------------------------------
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From: glen <cng...@126.com>
> To: "Devi P.V" <devip2...@gmail.com>
> Cc: "user@spark.apache.org" <user@spark.apache.org>
> Date: 24/08/2016 02:10 pm
> Subject: Re: Spark MLlib: Collaborative Filtering
> ------------------------------------------------------------
>
>
>
> Hash it to int
>
>
>
> On 2016-08-24 16:28 , *Devi P.V* <devip2...@gmail.com> Wrote:
>
> Hi all,
> I am a newbie in collaborative filtering. I want to implement a collaborative
> filtering algorithm (I need to find the top 10 recommended products) using Spark
> and Scala. I have a rating dataset where userID & ProductID are of String type.
>
> UserID   ProductID Rating
> b3a68043-c1  p1-160ff5fDS-f74   1
> b3a68043-c2  p5-160ff5fDS-f74   1
> b3a68043-c0  p9-160ff5fDS-f74   1
>
>
> I tried the ALS algorithm from Spark MLlib, but it supports only Integer types
> for the rating's userID & productID. How can I solve this problem?
>
> Thanks In Advance
>
>
>
>
>


Spark MLlib:Collaborative Filtering

2016-08-24 Thread Devi P.V
Hi all,
I am a newbie in collaborative filtering. I want to implement a collaborative
filtering algorithm (I need to find the top 10 recommended products) using Spark
and Scala. I have a rating dataset where userID & ProductID are of String type.

UserID   ProductID Rating
b3a68043-c1  p1-160ff5fDS-f74   1
b3a68043-c2  p5-160ff5fDS-f74   1
b3a68043-c0  p9-160ff5fDS-f74   1


I tried the ALS algorithm from Spark MLlib, but it supports only Integer types
for the rating's userID & productID. How can I solve this problem?

Thanks In Advance


What are the configurations needs to connect spark and ms-sql server?

2016-08-08 Thread Devi P.V
Hi all,

I am trying to write a Spark dataframe into MS SQL Server. I have tried
the following code:

val sqlprop = new java.util.Properties
sqlprop.setProperty("user", "uname")
sqlprop.setProperty("password", "pwd")
sqlprop.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

val url = "jdbc:sqlserver://samplesql.amazonaws.com:1433/dbName"
val dfWriter = df.write
dfWriter.jdbc(url, "tableName", sqlprop)

But I got the following error:

Exception in thread "main" java.lang.ClassNotFoundException:
com.microsoft.sqlserver.jdbc.SQLServerDriver

What configurations are needed to connect to MS SQL Server? I have not found
any library dependencies for connecting Spark and MS SQL.
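
The ClassNotFoundException usually just means the Microsoft JDBC driver jar is
not on the classpath of the driver and executors. A minimal sketch of one way to
pull it in; the Maven coordinates and version below are assumptions and should be
checked against what Microsoft currently publishes:

// build.sbt (artifact name and version are assumptions)
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "6.2.2.jre8"

Alternatively, the driver jar can be passed explicitly at submit time with
spark-submit --jars /path/to/the-sqlserver-jdbc-driver.jar.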

Thanks


How to connect Power BI to Apache Spark on local machine?

2016-08-04 Thread Devi P.V
Hi all,
I am a newbie in Power BI. What configurations are needed to connect Power
BI to Spark on my local machine? I found some documents that mention Spark
over Azure's HDInsight, but I didn't find any reference material for
connecting to Spark on a remote machine. Is it possible?

The following is the previously mentioned link that describes the steps for
connecting to Spark over Azure's HDInsight:

https://powerbi.microsoft.com/en-us/documentation/powerbi-spark-on-hdinsight-with-direct-connect/

Thanks


Optimized way to multiply two large matrices and save output using Spark and Scala

2016-01-13 Thread Devi P.V
I want to multiply two large matrices (from CSV files) using Spark and Scala
and save the output. I use the following code:

// Build a RowMatrix from the first file (one sparse vector per line).
val rows = file1.coalesce(1, false).map { x =>
  val line = x.split(delimiter).map(_.toDouble)
  Vectors.sparse(line.length,
    line.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
}

val rmat = new RowMatrix(rows)

// Read the second file as dense vectors, one per line.
val dm = file2.coalesce(1, false).map { x =>
  val line = x.split(delimiter).map(_.toDouble)
  Vectors.dense(line)
}

// Collect the second matrix to the driver and build a local dense matrix
// (Matrices.dense expects the values in column-major order, hence the transpose).
val ma = dm.map(_.toArray).take(dm.count.toInt)
val localMat = Matrices.dense(
  dm.count.toInt,
  dm.take(1)(0).size,
  transpose(ma).flatten)

// Multiply the distributed matrix by the local matrix and save the result.
val s = rmat.multiply(localMat).rows
s.map(x => x.toArray.mkString(delimiter)).saveAsTextFile(OutputPath)

def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
  (for {
    c <- m(0).indices
  } yield m.map(_(c))).toArray
}

When I save the file it takes a long time and the output file is very large
in size. What is the optimized way to multiply two large matrices and save
the output to a text file?
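
If the second matrix is too large to collect to the driver, a fully distributed
block multiplication is one alternative worth considering. A minimal sketch,
assuming both inputs are the same CSV-style files as above (the 1024x1024 block
size is an arbitrary choice):

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

// Parse each CSV line into non-zero MatrixEntry(row, col, value) records.
def toEntries(file: RDD[String]): RDD[MatrixEntry] =
  file.zipWithIndex.flatMap { case (lineStr, i) =>
    lineStr.split(delimiter).zipWithIndex.collect {
      case (v, j) if v.toDouble != 0.0 => MatrixEntry(i, j, v.toDouble)
    }
  }

// Convert both inputs to block matrices and cache them for the multiply.
val blockA: BlockMatrix = new CoordinateMatrix(toEntries(file1)).toBlockMatrix(1024, 1024).cache()
val blockB: BlockMatrix = new CoordinateMatrix(toEntries(file2)).toBlockMatrix(1024, 1024).cache()

// Distributed multiply; neither operand has to fit on the driver.
val product = blockA.multiply(blockB)

// Save the non-zero result entries as "row col value" lines.
product.toCoordinateMatrix().entries
  .map(e => s"${e.i} ${e.j} ${e.value}")
  .saveAsTextFile(OutputPath)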


Count of distinct values in each column

2015-07-29 Thread Devi P.V
Hi All,

I have a 5GB CSV dataset with 69 columns. I need to find the count of each
distinct value in every column. What is the optimized way to do this
using Spark and Scala?

Example CSV format :

a,b,c,d
a,c,b,a
b,b,c,d
b,b,c,a
c,b,b,a

Expected output:

(a,2),(b,2),(c,1) #- First column distinct count
(b,4),(c,1)   #- Second column distinct count
(c,3),(b,2)   #- Third column distinct count
(d,2),(a,3)   #- Fourth column distinct count
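
A minimal sketch with plain RDD operations, assuming the file has no header row
and that per-column value frequencies (as in the expected output above) are what
is wanted (the input path is hypothetical):

val lines = sc.textFile("path/to/data.csv")

// Key every cell by (column index, value) and count occurrences in one shuffle.
val columnValueCounts = lines
  .flatMap(_.split(",").zipWithIndex.map { case (value, col) => ((col, value), 1L) })
  .reduceByKey(_ + _)

// Regroup by column index to print one line of counts per column.
columnValueCounts
  .map { case ((col, value), count) => (col, (value, count)) }
  .groupByKey()
  .sortByKey()
  .collect()
  .foreach { case (col, counts) => println(s"column $col: ${counts.mkString(", ")}") }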


Thanks in Advance