Iterating all columns in a pyspark dataframe
Hi all,

What is the best approach for iterating over all columns in a PySpark dataframe? I want to apply some conditions to every column in the dataframe. Currently I am using a for loop for the iteration. Is that good practice with Spark? I am using Spark 3.0.

Please advise.

Thanks,
Devi
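A driver-side for loop that only builds column expressions is generally fine: each `when`/`withColumn` is lazy and no data moves until an action runs. What to avoid is triggering an action inside the loop. A minimal sketch of the expression-building pattern (shown in Scala; the pyspark functions `when`/`col`/`lit` are analogous — the condition and replacement value here are illustrative, not from the question):

```scala
import org.apache.spark.sql.functions.{col, lit, when}

// Build a single select that applies a condition to every column.
// Nothing executes until an action (show, write, count, ...) is called.
val transformed = df.select(df.columns.map { c =>
  when(col(c).isNull, lit("missing")).otherwise(col(c)).as(c)
}: _*)
```

One `select` over all columns keeps the query plan flat, whereas chaining many `withColumn` calls in a loop can make the plan deep and slow to analyze.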
FP growth - Items in a transaction must be unique
Hi all,

I am trying to run the FP-growth algorithm using Spark and Scala. A sample of the input dataframe follows:

+-------------------------------------------------------------------------------------------+
|productName                                                                                |
+-------------------------------------------------------------------------------------------+
|Apple Iphone 7 128GB Jet Black with Facetime                                               |
|Levi's Blue Slim Fit Jeans- L5112,Rimmel London Lasting Finish Matte by Kate Moss 101 Dusky|
|Iphone 6 Plus (5.5",Limited Stocks, TRA Oman Approved)                                     |
+-------------------------------------------------------------------------------------------+

Each row contains unique items. I converted it into an RDD like the following:

val transactions = names.as[String].rdd.map(s => s.split(","))

val fpg = new FPGrowth().
  setMinSupport(0.3).
  setNumPartitions(100)
val model = fpg.run(transactions)

But I got this error:

WARN TaskSetManager: Lost task 2.0 in stage 27.0 (TID 622, localhost): org.apache.spark.SparkException: Items in a transaction must be unique but got WrappedArray(Huawei GR3 Dual Sim 16GB 13MP 5Inch 4G, Huawei G8 Gold 32GB, 4G, 5.5 Inches, HTC Desire 816 (Dual Sim, 3G, 8GB), Samsung Galaxy S7 Single Sim - 32GB, 4G LTE, Gold, Huawei P8 Lite 16GB, 4G LTE, Huawei Y625, Samsung Galaxy Note 5 - 32GB, 4G LTE, Samsung Galaxy S7 Dual Sim - 32GB)

How can I solve this?

Thanks
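The error message itself hints at the cause: splitting on `","` also splits *inside* product names (e.g. "32GB, 4G LTE"), so fragments like "4G LTE" can repeat within one transaction, and MLlib's FPGrowth requires items in a transaction to be unique. A hedged sketch of the minimal workaround, assuming `names` is the Dataset[String] from the question:

```scala
import org.apache.spark.mllib.fpm.FPGrowth

// De-duplicate tokens within each transaction before running FP-growth.
val transactions = names.as[String].rdd
  .map(_.split(",").map(_.trim).distinct)

val fpg = new FPGrowth()
  .setMinSupport(0.3)
  .setNumPartitions(100)
val model = fpg.run(transactions)
```

The more robust fix may be to delimit transactions with a character that cannot occur inside product names, since splitting on a comma mangles names such as `Iphone 6 Plus (5.5",Limited Stocks, TRA Oman Approved)` into several bogus items.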
How to find unique values after groupBy() in a Spark dataframe?
Hi all,

I have a dataframe like the following:

+---------+----------+
|client_id|Date      |
+---------+----------+
|        a|2016-11-23|
|        b|2016-11-18|
|        a|2016-11-23|
|        a|2016-11-23|
|        a|2016-11-24|
+---------+----------+

I want to find the unique dates for each client_id using the dataframe API. Expected output:

a (2016-11-23, 2016-11-24)
b (2016-11-18)

I tried df.groupBy("client_id"), but I don't know how to find the distinct values after groupBy(). How can I do this? Is there a more efficient method? I am using Scala 2.11.8 & Spark 2.0.

Thanks
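One way to do this directly in the dataframe API is `collect_set`, which gathers the distinct values per group in a single aggregation. A minimal sketch, assuming the dataframe from the question is bound to `df`:

```scala
import org.apache.spark.sql.functions.collect_set

// collect_set keeps only distinct Date values within each client_id group.
val uniqueDates = df.groupBy("client_id")
  .agg(collect_set("Date").as("dates"))
```

For client `a` this yields a single row whose `dates` array holds 2016-11-23 and 2016-11-24, matching the expected output.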
Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?
Thanks. It works.

On Mon, Dec 5, 2016 at 2:03 PM, Michal Šenkýř wrote:
> Yet another approach:
>
> scala> val df1 = df.selectExpr("client_id",
>          "from_unixtime(ts/1000,'yyyy-MM-dd') as ts")
>
> Mgr. Michal Šenkýř
> mike.sen...@gmail.com
> +420 605 071 818
>
> On 5.12.2016 09:22, Deepak Sharma wrote:
> > Another simpler approach will be:
> >
> > scala> val findf = sqlContext.sql("select client_id, from_unixtime(ts/1000,'yyyy-MM-dd') ts from ts")
> > findf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string]
> >
> > scala> findf.show
> > +--------------------+----------+
> > |           client_id|        ts|
> > +--------------------+----------+
> > |cd646551-fceb-416...|2016-11-01|
> > |3bc61951-0f49-43b...|2016-11-01|
> > |688acc61-753f-4a3...|2016-11-23|
> > |5ff1eb6c-14ec-471...|2016-11-23|
> > +--------------------+----------+
> >
> > I registered a temp table out of the original DF.
> >
> > Thanks
> > Deepak
> >
> > On Mon, Dec 5, 2016 at 1:49 PM, Deepak Sharma wrote:
> >> This is the correct way to do it. The timestamp that you mentioned was not
> >> correct:
> >>
> >> scala> val ts1 = from_unixtime($"ts"/1000, "yyyy-MM-dd")
> >> ts1: org.apache.spark.sql.Column = fromunixtime((ts / 1000),yyyy-MM-dd)
> >>
> >> scala> val finaldf = df.withColumn("ts1", ts1)
> >> finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string, ts1: string]
> >>
> >> scala> finaldf.show
> >> +--------------------+-------------+----------+
> >> |           client_id|           ts|       ts1|
> >> +--------------------+-------------+----------+
> >> |cd646551-fceb-416...|1477989416803|2016-11-01|
> >> |3bc61951-0f49-43b...|1477983725292|2016-11-01|
> >> |688acc61-753f-4a3...|1479899459947|2016-11-23|
> >> |5ff1eb6c-14ec-471...|1479901374026|2016-11-23|
> >> +--------------------+-------------+----------+
> >>
> >> Thanks
> >> Deepak
> >>
> >> On Mon, Dec 5, 2016 at 1:46 PM, Deepak Sharma wrote:
> >>> This is how you can do it in Scala:
> >>>
> >>> scala> val ts1 = from_unixtime($"ts", "yyyy-MM-dd")
> >>> ts1: org.apache.spark.sql.Column = fromunixtime(ts,yyyy-MM-dd)
> >>>
> >>> scala> val finaldf = df.withColumn("ts1", ts1)
> >>> finaldf: org.apache.spark.sql.DataFrame = [client_id: string, ts: string, ts1: string]
> >>>
> >>> scala> finaldf.show
> >>> +--------------------+-------------+-----------+
> >>> |           client_id|           ts|        ts1|
> >>> +--------------------+-------------+-----------+
> >>> |cd646551-fceb-416...|1477989416803|48805-08-14|
> >>> |3bc61951-0f49-43b...|1477983725292|48805-06-09|
> >>> |688acc61-753f-4a3...|1479899459947|48866-02-22|
> >>> |5ff1eb6c-14ec-471...|1479901374026|48866-03-16|
> >>> +--------------------+-------------+-----------+
> >>>
> >>> The year is returned wrong here. Maybe the input timestamp is not correct
> >>> (from_unixtime expects seconds, and these values look like milliseconds).
> >>>
> >>> Thanks
> >>> Deepak
> >>>
> >>> On Mon, Dec 5, 2016 at 1:34 PM, Devi P.V wrote:
> >>>> Hi,
> >>>>
> >>>> Thanks for replying to my question. I am using Scala.
> >>>>
> >>>> On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni wrote:
> >>>>> Hi
> >>>>> In Python you can use datetime.fromtimestamp(..).strftime('%Y%m%d').
> >>>>> Which Spark API are you using?
> >>>>> Kr
> >>>>>
> >>>>> On 5 Dec 2016 7:38 am, "Devi P.V" wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I have a dataframe like the following:
> >>>>>>
> >>>>>> +--------------------------+-------------+
> >>>>>> |client_id                 |timestamp    |
> >>>>>> +--------------------------+-------------+
> >>>>>> |cd646551-fceb-4166-acbc-b9|1477989416803|
> >>>>>> |3bc61951-0f49-43bf-9848-b2|1477983725292|
> >>>>>> |688acc61-753f-4a33-a034-bc|1479899459947|
> >>>>>> |5ff1eb6c-14ec-4716-9798-00|1479901374026|
> >>>>>> +--------------------------+-------------+
> >>>>>>
> >>>>>> I want to convert the timestamp column into yyyy-MM-dd format.
> >>>>>> How to do this?
> >>>>>>
> >>>>>> Thanks

-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?
Hi,

Thanks for replying to my question. I am using Scala.

On Mon, Dec 5, 2016 at 1:20 PM, Marco Mistroni wrote:
> Hi
> In Python you can use datetime.fromtimestamp(..).strftime('%Y%m%d').
> Which Spark API are you using?
> Kr
>
> On 5 Dec 2016 7:38 am, "Devi P.V" wrote:
>> Hi all,
>>
>> I have a dataframe like the following:
>>
>> +--------------------------+-------------+
>> |client_id                 |timestamp    |
>> +--------------------------+-------------+
>> |cd646551-fceb-4166-acbc-b9|1477989416803|
>> |3bc61951-0f49-43bf-9848-b2|1477983725292|
>> |688acc61-753f-4a33-a034-bc|1479899459947|
>> |5ff1eb6c-14ec-4716-9798-00|1479901374026|
>> +--------------------------+-------------+
>>
>> I want to convert the timestamp column into yyyy-MM-dd format.
>> How to do this?
>>
>> Thanks
How to convert a unix timestamp column into date format(yyyy-MM-dd) ?
Hi all,

I have a dataframe like the following:

+--------------------------+-------------+
|client_id                 |timestamp    |
+--------------------------+-------------+
|cd646551-fceb-4166-acbc-b9|1477989416803|
|3bc61951-0f49-43bf-9848-b2|1477983725292|
|688acc61-753f-4a33-a034-bc|1479899459947|
|5ff1eb6c-14ec-4716-9798-00|1479901374026|
+--------------------------+-------------+

I want to convert the timestamp column into yyyy-MM-dd format. How to do this?

Thanks
What is the optimized way to combine multiple dataframes into one dataframe?
Hi all,

I have 4 dataframes, each with three columns: client_id, product_id, interest. I want to combine these 4 dataframes into one. I used union like the following:

df1.union(df2).union(df3).union(df4)

But it is time consuming for big data. What is the optimized way to do this using Spark 2.0 & Scala?

Thanks
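Worth noting that `union` itself is a lazy, metadata-only operation; the cost shows up in whatever action runs afterwards. Folding over a sequence keeps the code tidy for any number of inputs, and checking the partition count after the union is one thing that can help downstream stages. A hedged sketch (assumes df1..df4 share the same schema, as in the question; the partition count is illustrative):

```scala
// Chain the unions in one expression; still lazy, nothing is copied yet.
val combined = Seq(df1, df2, df3, df4).reduce(_ union _)

// union concatenates partitions, so the result carries the sum of all input
// partitions; repartitioning can rebalance work for later stages.
val balanced = combined.repartition(200)
```

If the slowness is really in writing or aggregating the combined data rather than in the union, the optimization effort belongs there instead.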
Re: Couchbase-Spark 2.0.0
Hi,

I tried the following code:

import com.couchbase.spark._

val conf = new SparkConf()
  .setAppName(this.getClass.getName)
  .setMaster("local[*]")
  .set("com.couchbase.bucket.bucketName", "password")
  .set("com.couchbase.nodes", "node")
  .set("com.couchbase.queryEnabled", "true")
val sc = new SparkContext(conf)

I need the full documents from the bucket, so I wrote a query like this:

val query = "SELECT META(`bucketName`).id as id FROM `bucketName`"

sc
  .couchbaseQuery(Query.simple(query))
  .map(_.value.getString("id"))
  .couchbaseGet[JsonDocument]()
  .collect()
  .foreach(println)

But it can't resolve Query.simple(query). I used

libraryDependencies += "com.couchbase.client" % "spark-connector_2.11" % "1.2.1"

in build.sbt. Is my query wrong, or does anything else need to be imported? Please help.

On Sun, Oct 16, 2016 at 8:23 PM, Rodrick Brown wrote:
> On Sun, Oct 16, 2016 at 10:51 AM, Devi P.V wrote:
>> Hi all,
>> I am trying to read data from Couchbase using Spark 2.0.0. I need to fetch
>> the complete data from a bucket as an RDD. How can I solve this? Does Spark
>> 2.0.0 support Couchbase? Please help.
>>
>> Thanks
>
> https://github.com/couchbase/couchbase-spark-connector
>
> Rodrick Brown / DevOPs
> rodr...@orchardplatform.com
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY
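One likely cause of `Query.simple` not resolving: in Couchbase Java SDK 2.2+ the `Query` class was renamed `N1qlQuery`. If the SDK version on the classpath is 2.2 or later, the call would look like the sketch below (hedged; bucket name and query are the ones from the question, and the exact connector/SDK pairing should be checked against the connector's release notes):

```scala
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.query.N1qlQuery
import com.couchbase.spark._

// Fetch all document ids via N1QL, then load the full documents by key.
val query = "SELECT META(`bucketName`).id AS id FROM `bucketName`"

sc
  .couchbaseQuery(N1qlQuery.simple(query))
  .map(_.value.getString("id"))
  .couchbaseGet[JsonDocument]()
  .collect()
  .foreach(println)
```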
Couchbase-Spark 2.0.0
Hi all,
I am trying to read data from Couchbase using Spark 2.0.0. I need to fetch the complete data from a bucket as an RDD. How can I solve this? Does Spark 2.0.0 support Couchbase? Please help.

Thanks
Re: How to write data into CouchBase using Spark & Scala?
Thanks. Now it is working.

On Thu, Sep 8, 2016 at 12:57 AM, aka.fe2s wrote:
> Most likely you are missing an import statement that enables some Scala
> implicits. I haven't used this connector, but it looks like you need
>
> import com.couchbase.spark._
>
> --
> Oleksiy Dyagilev
>
> On Wed, Sep 7, 2016 at 9:42 AM, Devi P.V wrote:
>> I am a newbie in Couchbase. I am trying to write data into Couchbase. My
>> sample code is the following:
>>
>> val cfg = new SparkConf()
>>   .setAppName("couchbaseQuickstart")
>>   .setMaster("local[*]")
>>   .set("com.couchbase.bucket.MyBucket", "pwd")
>>
>> val sc = new SparkContext(cfg)
>> val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", "content"))
>> val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", "content", "in", "here"))
>>
>> val data = sc.parallelize(Seq(doc1, doc2))
>>
>> But I can't access data.saveToCouchbase().
>>
>> I am using Spark 1.6.1 & Scala 2.11.8. I gave the following dependencies
>> in build.sbt:
>>
>> libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.1"
>> libraryDependencies += "com.couchbase.client" % "spark-connector_2.11" % "1.2.1"
>>
>> How can I write data into Couchbase using Spark & Scala?
How to write data into CouchBase using Spark & Scala?
I am a newbie in Couchbase. I am trying to write data into Couchbase. My sample code is the following:

val cfg = new SparkConf()
  .setAppName("couchbaseQuickstart")
  .setMaster("local[*]")
  .set("com.couchbase.bucket.MyBucket", "pwd")

val sc = new SparkContext(cfg)
val doc1 = JsonDocument.create("doc1", JsonObject.create().put("some", "content"))
val doc2 = JsonArrayDocument.create("doc2", JsonArray.from("more", "content", "in", "here"))

val data = sc.parallelize(Seq(doc1, doc2))

But I can't access data.saveToCouchbase().

I am using Spark 1.6.1 & Scala 2.11.8. I gave the following dependencies in build.sbt:

libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.6.1"
libraryDependencies += "com.couchbase.client" % "spark-connector_2.11" % "1.2.1"

How can I write data into Couchbase using Spark & Scala?
Re: How to install spark with s3 on AWS?
The following piece of code works for me to read data from S3 using Spark:

val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", AccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", SecretKey)
var jobInput = sc.textFile("s3://path to bucket")

Thanks

On Fri, Aug 26, 2016 at 5:16 PM, kant kodali wrote:
> Hi guys,
>
> Are there any instructions on how to set up Spark with S3 on AWS?
>
> Thanks!
Re: Spark MLlib:Collaborative Filtering
Thanks a lot. I solved the problem using StringIndexer.

On Wed, Aug 24, 2016 at 3:40 PM, Praveen Devarao wrote:
> You could use the string indexer to convert your string userids and
> product ids to numeric values:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>
> Thanking You
>
> Praveen Devarao
> IBM India Software Labs
>
> From: glen
> To: "Devi P.V"
> Cc: "user@spark.apache.org"
> Date: 24/08/2016 02:10 pm
> Subject: Re: Spark MLlib:Collaborative Filtering
>
> Hash it to int
>
> On 2016-08-24 16:28, Devi P.V wrote:
>> Hi all,
>> I am a newbie in collaborative filtering. I want to implement a
>> collaborative filtering algorithm (need to find the top 10 recommended
>> products) using Spark and Scala. I have a rating dataset where userID &
>> productID are String type:
>>
>> UserID       ProductID         Rating
>> b3a68043-c1  p1-160ff5fDS-f74  1
>> b3a68043-c2  p5-160ff5fDS-f74  1
>> b3a68043-c0  p9-160ff5fDS-f74  1
>>
>> I tried the ALS algorithm from Spark MLlib, but it supports only Integer
>> userID & productID in ratings. How can I solve this problem?
>>
>> Thanks in advance
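A minimal sketch of the StringIndexer approach suggested above, assuming a ratings dataframe `ratings` with columns userID, productID, rating (the column names are illustrative):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Map each string id to a numeric index; ALS can then consume the
// indexed columns in place of the original strings.
val userIndexer = new StringIndexer()
  .setInputCol("userID")
  .setOutputCol("userIndex")
val productIndexer = new StringIndexer()
  .setInputCol("productID")
  .setOutputCol("productIndex")

val withUsers = userIndexer.fit(ratings).transform(ratings)
val indexed   = productIndexer.fit(withUsers).transform(withUsers)
```

Keep the fitted indexer models around (or use IndexToString) so the numeric indices in the top-10 recommendations can be mapped back to the original string ids.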
Spark MLlib:Collaborative Filtering
Hi all,
I am a newbie in collaborative filtering. I want to implement a collaborative filtering algorithm (need to find the top 10 recommended products) using Spark and Scala. I have a rating dataset where userID & productID are String type:

UserID       ProductID         Rating
b3a68043-c1  p1-160ff5fDS-f74  1
b3a68043-c2  p5-160ff5fDS-f74  1
b3a68043-c0  p9-160ff5fDS-f74  1

I tried the ALS algorithm from Spark MLlib, but it supports only Integer userID & productID in ratings. How can I solve this problem?

Thanks in advance
What configurations are needed to connect Spark and MS SQL Server?
Hi all,

I am trying to write a Spark dataframe into MS SQL Server. I tried the following code:

val sqlprop = new java.util.Properties
sqlprop.setProperty("user", "uname")
sqlprop.setProperty("password", "pwd")
sqlprop.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
val url = "jdbc:sqlserver://samplesql.amazonaws.com:1433/dbName"
val dfWriter = df.write
dfWriter.jdbc(url, "tableName", sqlprop)

But I got the following error:

Exception in thread "main" java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver

What configurations are needed to connect to MS SQL Server? I have not found any library dependencies for connecting Spark and MS SQL.

Thanks
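The ClassNotFoundException means the SQL Server JDBC driver jar is simply not on the classpath; Spark does not bundle it. A hedged sketch of one way to add it (the artifact coordinates and version are assumptions to verify against Maven Central):

```scala
// build.sbt — Microsoft's JDBC driver, published to Maven Central as mssql-jdbc
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "6.4.0.jre8"
```

When running with spark-submit, the driver jar must also reach the executors, e.g. via `--jars` or `--driver-class-path`. Separately, SQL Server JDBC URLs usually name the database as `jdbc:sqlserver://host:1433;databaseName=dbName` rather than with a trailing slash, so that is worth checking once the driver loads.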
How to connect Power BI to Apache Spark on local machine?
Hi all,

I am a newbie in Power BI. What configurations are needed to connect Power BI to Spark on my local machine? I found some documents that cover Spark over Azure HDInsight, but I didn't find any reference material for connecting Power BI to Spark on a different machine. Is it possible?

The following is the previously mentioned link covering Spark over Azure HDInsight:
https://powerbi.microsoft.com/en-us/documentation/powerbi-spark-on-hdinsight-with-direct-connect/

Thanks
Optimized way to multiply two large matrices and save output using Spark and Scala
I want to multiply two large matrices (from CSV files) using Spark and Scala and save the output. I use the following code:

val rows = file1.coalesce(1, false).map(x => {
  val line = x.split(delimiter).map(_.toDouble)
  Vectors.sparse(line.length,
    line.zipWithIndex.map(e => (e._2, e._1)).filter(_._2 != 0.0))
})
val rmat = new RowMatrix(rows)

val dm = file2.coalesce(1, false).map(x => {
  val line = x.split(delimiter).map(_.toDouble)
  Vectors.dense(line)
})
val ma = dm.map(_.toArray).take(dm.count.toInt)
val localMat = Matrices.dense(dm.count.toInt, dm.take(1)(0).size,
  transpose(ma).flatten)

// Multiply two matrices
val s = rmat.multiply(localMat).rows
s.map(x => x.toArray.mkString(delimiter)).saveAsTextFile(OutputPath)

def transpose(m: Array[Array[Double]]): Array[Array[Double]] = {
  (for {
    c <- m(0).indices
  } yield m.map(_(c))).toArray
}

When I save the file it takes a long time, and the output file is very large. What is the optimized way to multiply two large files and save the output to a text file?
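Two things stand out in the code above: `coalesce(1, false)` forces each file through a single partition, removing all parallelism, and `RowMatrix.multiply` requires collecting the second matrix to the driver as a local matrix. When both operands are large, MLlib's BlockMatrix supports a fully distributed multiply. A hedged sketch, assuming `rows1` and `rows2` are RDD[IndexedRow] built from the two files (those names and the block sizes are illustrative):

```scala
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Distributed multiply via BlockMatrix: neither operand is collected
// to the driver, and the work is spread over 1024x1024 blocks.
val left  = new IndexedRowMatrix(rows1).toBlockMatrix(1024, 1024).cache()
val right = new IndexedRowMatrix(rows2).toBlockMatrix(1024, 1024).cache()

val product = left.multiply(right)

product.toIndexedRowMatrix.rows
  .map(r => r.vector.toArray.mkString(delimiter))
  .saveAsTextFile(OutputPath)
```

The large output size is expected for a dense product matrix; writing with compression (or a binary format such as Parquet) would shrink it if plain text is not a hard requirement.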
Count of distinct values in each column
Hi All,

I have a 5GB CSV dataset with 69 columns. I need to find the count of each distinct value in every column. What is the optimized way to do this using Spark and Scala?

Example CSV format:

a,b,c,d
a,c,b,a
b,b,c,d
b,b,c,a
c,b,b,a

Expected output:

(a,2),(b,2),(c,1)  # First column value counts
(b,4),(c,1)        # Second column value counts
(c,3),(b,2)        # Third column value counts
(d,2),(a,3)        # Fourth column value counts

Thanks in advance
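One way to get all 69 columns in a single pass over the data is to key every cell by its column index and reduce, instead of running 69 separate groupBy jobs. A minimal sketch, assuming `lines` is an RDD[String] of the CSV rows and the file has no quoted fields containing commas:

```scala
// ((columnIndex, value), count) pairs — one shuffle over the whole file.
val counts = lines
  .flatMap(_.split(",").zipWithIndex.map { case (v, i) => ((i, v), 1L) })
  .reduceByKey(_ + _)

// Regroup per column to match the expected output shape.
val perColumn = counts
  .map { case ((i, v), n) => (i, (v, n)) }
  .groupByKey()
  .sortByKey()
```

For the sample data, column 0 would come back with (a,2), (b,2), (c,1), and so on for the other columns. The per-column value lists are only as large as the number of distinct values, so the final groupByKey stays small even though the input is 5GB.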