Renaming nested columns in dataframe
Hi,

How can I rename nested columns in a DataFrame through the Scala API? For example, given the following schema:

|-- site: struct (nullable = false)
|    |-- site_id: string (nullable = true)
|    |-- site_name: string (nullable = true)
|    |-- site_domain: string (nullable = true)
|    |-- site_cat: array (nullable = true)
|    |    |-- element: string (containsNull = false)
|    |-- publisher: struct (nullable = false)
|    |    |-- site_pub_id: string (nullable = true)
|    |    |-- site_pub_name: string (nullable = true)

I want to change it to:

|-- site: struct (nullable = false)
|    |-- id: string (nullable = true)
|    |-- name: string (nullable = true)
|    |-- domain: string (nullable = true)
|    |-- cat: array (nullable = true)
|    |    |-- element: string (containsNull = false)
|    |-- publisher: struct (nullable = false)
|    |    |-- id: string (nullable = true)
|    |    |-- name: string (nullable = true)

Regards
Prashant
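One way to do this (a minimal sketch, assuming df holds the DataFrame with the schema above) is to rebuild the site struct with struct() and alias each field to its new name:

import org.apache.spark.sql.functions._

// rebuild the nested struct, renaming each field via an alias
val renamed = df.withColumn("site", struct(
  col("site.site_id").as("id"),
  col("site.site_name").as("name"),
  col("site.site_domain").as("domain"),
  col("site.site_cat").as("cat"),
  struct(
    col("site.publisher.site_pub_id").as("id"),
    col("site.publisher.site_pub_name").as("name")
  ).as("publisher")
))

Note that the nullability flags of the rebuilt struct may differ slightly from the original, since struct() infers them from its inputs.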
Re: Creating Nested dataframe from flat data.
Thank you. That's exactly what I was looking for.

Regards
Prashant

On Fri, May 13, 2016 at 9:38 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:
> Hi Prashant,
>
> You can create struct columns using the struct() function in
> org.apache.spark.sql.functions --
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
>
> val df = sc.parallelize(List(("a", "b", "c"))).toDF("A", "B", "C")
>
> import org.apache.spark.sql.functions._
> df.withColumn("D", struct($"a", $"b", $"c")).show()
>
> +---+---+---+-------+
> |  A|  B|  C|      D|
> +---+---+---+-------+
> |  a|  b|  c|[a,b,c]|
> +---+---+---+-------+
>
> You can repeat this to get the inner nesting.
>
> Xinh
>
> On Fri, May 13, 2016 at 4:51 AM, Prashant Bhardwaj <prashant2006s...@gmail.com> wrote:
>> Hi
>>
>> Let's say I have a flat dataframe with 6 columns like:
>>
>> {
>>   "a": "somevalue",
>>   "b": "somevalue",
>>   "c": "somevalue",
>>   "d": "somevalue",
>>   "e": "somevalue",
>>   "f": "somevalue"
>> }
>>
>> Now I want to convert this dataframe to contain nested columns like:
>>
>> {
>>   "nested_obj1": {
>>     "a": "somevalue",
>>     "b": "somevalue"
>>   },
>>   "nested_obj2": {
>>     "c": "somevalue",
>>     "d": "somevalue",
>>     "nested_obj3": {
>>       "e": "somevalue",
>>       "f": "somevalue"
>>     }
>>   }
>> }
>>
>> How can I achieve this? I'm using Spark SQL in Scala.
>>
>> Regards
>> Prashant
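Spelling out the "repeat to get the inner nesting" step for the shape asked about in the question (a sketch, assuming a flat DataFrame df with columns a through f):

import org.apache.spark.sql.functions._
import sqlContext.implicits._

// nest struct() calls to build the two-level structure
val nested = df.select(
  struct($"a", $"b").as("nested_obj1"),
  struct(
    $"c", $"d",
    struct($"e", $"f").as("nested_obj3")
  ).as("nested_obj2")
)

Calling nested.toJSON.first() should then print the nested JSON shown in the question.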
Creating Nested dataframe from flat data.
Hi,

Let's say I have a flat dataframe with 6 columns like:

{
  "a": "somevalue",
  "b": "somevalue",
  "c": "somevalue",
  "d": "somevalue",
  "e": "somevalue",
  "f": "somevalue"
}

Now I want to convert this dataframe to contain nested columns like:

{
  "nested_obj1": {
    "a": "somevalue",
    "b": "somevalue"
  },
  "nested_obj2": {
    "c": "somevalue",
    "d": "somevalue",
    "nested_obj3": {
      "e": "somevalue",
      "f": "somevalue"
    }
  }
}

How can I achieve this? I'm using Spark SQL in Scala.

Regards
Prashant
Re: Filtering records based on empty value of column in SparkSql
Already tried it, but I'm getting the following error:

overloaded method value filter with alternatives:
  (conditionExpr: String)org.apache.spark.sql.DataFrame
  (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)

Also tried:

val req_logs_with_dpid = req_logs.filter(req_logs("req_info.dpid").toString.length != 0)

But I get the same error.

On Wed, Dec 9, 2015 at 6:45 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" )
>
> Azuryy Yu
> Sr. Infrastructure Engineer
>
> cel: 158-0164-9103
> wechat: azuryy
>
> On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <prashant2006s...@gmail.com> wrote:
>> Hi
>>
>> I have two columns in my json which can have null, empty and non-empty
>> strings as values.
>> I know how to filter records which have a non-null value using the following:
>>
>> val req_logs = sqlContext.read.json(filePath)
>>
>> val req_logs_with_dpid = req_logs.filter("req_info.dpid is not null or req_info.dpid_sha1 is not null")
>>
>> But how to filter if the value of a column is an empty string?
>> --
>> Regards
>> Prashant

-- Regards Prashant
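A sketch of why the compiler rejects both attempts, assuming Spark 1.x (where !== was the Column inequality operator; later releases renamed it =!=): Scala's != is defined on Any and returns a plain Boolean, which matches neither filter overload, while !== builds a Column expression that filter accepts.

import org.apache.spark.sql.Column

val asBoolean: Boolean = req_logs("req_info.dpid") != ""  // plain object inequality: always a Boolean
val nonEmpty: Column   = req_logs("req_info.dpid") !== "" // Column expression, usable in filter
val filtered = req_logs.filter(nonEmpty)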
Filtering records based on empty value of column in SparkSql
Hi,

I have two columns in my json which can have null, empty, and non-empty strings as values. I know how to filter records which have a non-null value using the following:

val req_logs = sqlContext.read.json(filePath)

val req_logs_with_dpid = req_logs.filter("req_info.dpid is not null or req_info.dpid_sha1 is not null")

But how to filter if the value of a column is an empty string?

--
Regards
Prashant
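One way (a sketch that extends the SQL-expression filter already shown, assuming the same req_logs DataFrame) is to add the empty-string check directly to the predicate string:

val req_logs_with_dpid = req_logs.filter(
  "(req_info.dpid is not null and req_info.dpid != '') or " +
  "(req_info.dpid_sha1 is not null and req_info.dpid_sha1 != '')")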
Re: Filtering records based on empty value of column in SparkSql
Anyway, I got it. I have to use !== instead of !=. Thanks BTW.

On Wed, Dec 9, 2015 at 9:39 PM, Prashant Bhardwaj <prashant2006s...@gmail.com> wrote:
> I have to do the opposite of what you're doing. I have to filter non-empty
> records.
>
> On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D <email2...@gmail.com> wrote:
>> Hello Prashant -
>>
>> Can you please try like this:
>>
>> For instance, the input file name is "student_detail.txt" and
>>
>> ID,Name,Sex,Age
>> ===
>> 101,Alfred,Male,30
>> 102,Benjamin,Male,31
>> 103,Charlie,Female,30
>> 104,Julie,Female,30
>> 105,Maven,Male,30
>> 106,Dexter,Male,30
>> 107,Lundy,Male,32
>> 108,Rita,Female,30
>> 109,Aster,Female,30
>> 110,Harrison,Male,15
>> 111,Rita,,30
>> 112,Aster,,30
>> 113,Harrison,,15
>> 114,Rita,Male,20
>> 115,Aster,,30
>> 116,Harrison,,20
>>
>> [image: Inline image 2]
>>
>> Output:
>>
>> Total No. of Records without SEX: 5
>> [111,Rita,,30]
>> [112,Aster,,30]
>> [113,Harrison,,15]
>> [115,Aster,,30]
>> [116,Harrison,,20]
>>
>> Total No. of Records with AGE <= 15: 2
>> [110,Harrison,Male,15]
>> [113,Harrison,,15]
>>
>> Thanks & Regards,
>> Gokula Krishnan (Gokul)
>> Contact: +1 980-298-1740
>>
>> On Wed, Dec 9, 2015 at 8:24 AM, Prashant Bhardwaj <prashant2006s...@gmail.com> wrote:
>>> Already tried it, but getting the following error:
>>>
>>> overloaded method value filter with alternatives:
>>>   (conditionExpr: String)org.apache.spark.sql.DataFrame
>>>   (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
>>> cannot be applied to (Boolean)
>>>
>>> Also tried:
>>>
>>> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.dpid").toString.length != 0)
>>>
>>> But getting the same error.
>>>
>>> On Wed, Dec 9, 2015 at 6:45 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
>>>> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" )
>>>>
>>>> Azuryy Yu
>>>> Sr. Infrastructure Engineer
>>>>
>>>> cel: 158-0164-9103
>>>> wechat: azuryy
>>>>
>>>> On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <prashant2006s...@gmail.com> wrote:
>>>>> Hi
>>>>>
>>>>> I have two columns in my json which can have null, empty and non-empty
>>>>> strings as values.
>>>>> I know how to filter records which have a non-null value using the following:
>>>>>
>>>>> val req_logs = sqlContext.read.json(filePath)
>>>>>
>>>>> val req_logs_with_dpid = req_logs.filter("req_info.dpid is not null or req_info.dpid_sha1 is not null")
>>>>>
>>>>> But how to filter if the value of a column is an empty string?
>>>>> --
>>>>> Regards
>>>>> Prashant
>
> --
> Regards
> Prashant

--
Regards
Prashant
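Putting the fix together (a sketch, assuming Spark 1.x, where !== was the Column inequality operator; later releases renamed it =!=):

// keep rows whose dpid is non-null and non-empty
val req_logs_with_dpid = req_logs.filter(
  req_logs("req_info.dpid").isNotNull && (req_logs("req_info.dpid") !== ""))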
Re: Filtering records based on empty value of column in SparkSql
I have to do the opposite of what you're doing. I have to filter non-empty records.

On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D <email2...@gmail.com> wrote:
> Hello Prashant -
>
> Can you please try like this:
>
> For instance, the input file name is "student_detail.txt" and
>
> ID,Name,Sex,Age
> ===
> 101,Alfred,Male,30
> 102,Benjamin,Male,31
> 103,Charlie,Female,30
> 104,Julie,Female,30
> 105,Maven,Male,30
> 106,Dexter,Male,30
> 107,Lundy,Male,32
> 108,Rita,Female,30
> 109,Aster,Female,30
> 110,Harrison,Male,15
> 111,Rita,,30
> 112,Aster,,30
> 113,Harrison,,15
> 114,Rita,Male,20
> 115,Aster,,30
> 116,Harrison,,20
>
> [image: Inline image 2]
>
> Output:
>
> Total No. of Records without SEX: 5
> [111,Rita,,30]
> [112,Aster,,30]
> [113,Harrison,,15]
> [115,Aster,,30]
> [116,Harrison,,20]
>
> Total No. of Records with AGE <= 15: 2
> [110,Harrison,Male,15]
> [113,Harrison,,15]
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
> Contact: +1 980-298-1740
>
> On Wed, Dec 9, 2015 at 8:24 AM, Prashant Bhardwaj <prashant2006s...@gmail.com> wrote:
>> Already tried it, but getting the following error:
>>
>> overloaded method value filter with alternatives:
>>   (conditionExpr: String)org.apache.spark.sql.DataFrame
>>   (condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
>> cannot be applied to (Boolean)
>>
>> Also tried:
>>
>> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.dpid").toString.length != 0)
>>
>> But getting the same error.
>>
>> On Wed, Dec 9, 2015 at 6:45 PM, Fengdong Yu <fengdo...@everstring.com> wrote:
>>> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" )
>>>
>>> Azuryy Yu
>>> Sr. Infrastructure Engineer
>>>
>>> cel: 158-0164-9103
>>> wechat: azuryy
>>>
>>> On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <prashant2006s...@gmail.com> wrote:
>>>> Hi
>>>>
>>>> I have two columns in my json which can have null, empty and non-empty
>>>> strings as values.
>>>> I know how to filter records which have a non-null value using the following:
>>>>
>>>> val req_logs = sqlContext.read.json(filePath)
>>>>
>>>> val req_logs_with_dpid = req_logs.filter("req_info.dpid is not null or req_info.dpid_sha1 is not null")
>>>>
>>>> But how to filter if the value of a column is an empty string?
>>>> --
>>>> Regards
>>>> Prashant

--
Regards
Prashant
Spark and Kafka Integration
Hi,

Some background: we have a Kafka cluster with ~45 topics. Some topics contain logs in JSON format and some in PSV (pipe-separated value) format. I want to consume these logs using Spark Streaming and store them in Parquet format in HDFS.

Now my questions are:

1. Can we create an InputDStream per topic in the same application? Since the schema of the logs might differ per topic, I want to process some topics differently and store the logs in different output directories based on the topic name.
2. How do I partition the logs based on timestamp? (A sketch covering both questions follows below.)

--
Regards
Prashant
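A minimal sketch of both pieces, assuming the Spark 1.x direct Kafka API (spark-streaming-kafka), a hypothetical broker address and topic name, and a hypothetical epoch-seconds "timestamp" field in the JSON logs:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sc = new SparkContext(new SparkConf().setAppName("kafka-to-parquet"))
val ssc = new StreamingContext(sc, Seconds(60))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker list

// (1) one DStream per topic, so each topic's format gets its own parsing logic
val jsonStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("json_topic")) // hypothetical topic name

jsonStream.map(_._2).foreachRDD { rdd =>
  val df = sqlContext.read.json(rdd)
  // (2) derive a date column from the epoch-seconds timestamp and let
  // partitionBy create one HDFS directory per day under the topic's path
  df.withColumn("dt", to_date(from_unixtime($"timestamp")))
    .write.mode("append").partitionBy("dt").parquet("hdfs:///logs/json_topic")
}

ssc.start()
ssc.awaitTermination()

A PSV topic would get its own createDirectStream call, with a map that splits on "|" and applies that topic's schema before the same partitioned Parquet write.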