Renaming nested columns in dataframe

2016-05-16 Thread Prashant Bhardwaj
Hi

How can I rename nested columns in a dataframe through the Scala API? For example, given the following schema:

> |-- site: struct (nullable = false)
> |    |-- site_id: string (nullable = true)
> |    |-- site_name: string (nullable = true)
> |    |-- site_domain: string (nullable = true)
> |    |-- site_cat: array (nullable = true)
> |    |    |-- element: string (containsNull = false)
> |    |-- publisher: struct (nullable = false)
> |    |    |-- site_pub_id: string (nullable = true)
> |    |    |-- site_pub_name: string (nullable = true)

I want to change it to:

> |-- site: struct (nullable = false)
> |    |-- id: string (nullable = true)
> |    |-- name: string (nullable = true)
> |    |-- domain: string (nullable = true)
> |    |-- cat: array (nullable = true)
> |    |    |-- element: string (containsNull = false)
> |    |-- publisher: struct (nullable = false)
> |    |    |-- id: string (nullable = true)
> |    |    |-- name: string (nullable = true)

Regards
Prashant
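
One way to do this (a sketch, not from the thread; it assumes a DataFrame named df with the schema above and the Spark 1.x Column API) is to rebuild the struct with every field aliased to its new name:

import org.apache.spark.sql.functions._

// Rebuild the "site" struct, aliasing each field (and the nested publisher
// struct) to its new name. Nullability flags may differ on the rebuilt struct.
val renamed = df.withColumn("site", struct(
  col("site.site_id").as("id"),
  col("site.site_name").as("name"),
  col("site.site_domain").as("domain"),
  col("site.site_cat").as("cat"),
  struct(
    col("site.publisher.site_pub_id").as("id"),
    col("site.publisher.site_pub_name").as("name")
  ).as("publisher")
))

renamed.printSchema()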


Re: Creating Nested dataframe from flat data.

2016-05-13 Thread Prashant Bhardwaj
Thank you. That's exactly what I was looking for.

Regards
Prashant

On Fri, May 13, 2016 at 9:38 PM, Xinh Huynh <xinh.hu...@gmail.com> wrote:

> Hi Prashant,
>
> You can create struct columns using the struct() function in
> org.apache.spark.sql.functions --
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
>
> val df = sc.parallelize(List(("a", "b", "c"))).toDF("A", "B", "C")
>
> import org.apache.spark.sql.functions._
> df.withColumn("D", struct($"a", $"b", $"c")).show()
>
> +---+---+---+-------+
> |  A|  B|  C|      D|
> +---+---+---+-------+
> |  a|  b|  c|[a,b,c]|
> +---+---+---+-------+
>
> You can repeat this to build up the inner nesting.
>
> Xinh
>
> On Fri, May 13, 2016 at 4:51 AM, Prashant Bhardwaj <
> prashant2006s...@gmail.com> wrote:
>
>> Hi
>>
>> Let's say I have a flat dataframe with 6 columns like.
>> {
>> "a": "somevalue",
>> "b": "somevalue",
>> "c": "somevalue",
>> "d": "somevalue",
>> "e": "somevalue",
>> "f": "somevalue"
>> }
>>
>> Now I want to convert this dataframe to contain nested column like
>>
>> {
>> "nested_obj1": {
>> "a": "somevalue",
>> "b": "somevalue"
>> },
>> "nested_obj2": {
>> "c": "somevalue",
>> "d": "somevalue",
>> "nested_obj3": {
>> "e": "somevalue",
>> "f": "somevalue"
>> }
>> }
>> }
>>
>> How can I achieve this? I'm using Spark-sql in scala.
>>
>> Regards
>> Prashant
>>
>
>
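
Following the struct() approach above, a sketch of the full nesting for the shape asked about (untested; it assumes a spark-shell style sc/sqlContext and placeholder values) could be:

import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Hypothetical flat DataFrame with the six columns from the question.
val flatDF = sc.parallelize(List(("v1", "v2", "v3", "v4", "v5", "v6")))
  .toDF("a", "b", "c", "d", "e", "f")

// Compose struct() calls to build the nested shape in a single select.
val nestedDF = flatDF.select(
  struct($"a", $"b").as("nested_obj1"),
  struct(
    $"c",
    $"d",
    struct($"e", $"f").as("nested_obj3")
  ).as("nested_obj2")
)

nestedDF.printSchema()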


Creating Nested dataframe from flat data.

2016-05-13 Thread Prashant Bhardwaj
Hi

Let's say I have a flat dataframe with 6 columns, like:
{
"a": "somevalue",
"b": "somevalue",
"c": "somevalue",
"d": "somevalue",
"e": "somevalue",
"f": "somevalue"
}

Now I want to convert this dataframe to contain nested columns, like:

{
  "nested_obj1": {
    "a": "somevalue",
    "b": "somevalue"
  },
  "nested_obj2": {
    "c": "somevalue",
    "d": "somevalue",
    "nested_obj3": {
      "e": "somevalue",
      "f": "somevalue"
    }
  }
}

How can I achieve this? I'm using Spark SQL in Scala.

Regards
Prashant


Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
Already tried it, but I'm getting the following error:

overloaded method value filter with alternatives: (conditionExpr:
String)org.apache.spark.sql.DataFrame  (condition:
org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame cannot be
applied to (Boolean)

Also tried:

val req_logs_with_dpid = req_logs.filter(req_logs("req_info.dpid").toString.length != 0)

But getting same error.


On Wed, Dec 9, 2015 at 6:45 PM, Fengdong Yu <fengdo...@everstring.com>
wrote:

> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" )
>
> Azuryy Yu
> Sr. Infrastructure Engineer
>
> cel: 158-0164-9103
> wetchat: azuryy
>
>
> On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <
> prashant2006s...@gmail.com> wrote:
>
>> Hi
>>
>> I have two columns in my json which can have null, empty and non-empty
>> string as value.
>> I know how to filter records which have non-null value using following:
>>
>> val req_logs = sqlContext.read.json(filePath)
>>
>> val req_logs_with_dpid = req_log.filter("req_info.dpid is not null or
>> req_info.dpid_sha1 is not null")
>>
>> But how to filter if value of column is empty string?
>> --
>> Regards
>> Prashant
>>
>
>


-- 
Regards
Prashant


Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
Hi

I have two columns in my JSON which can have null, empty, or non-empty strings as values.
I know how to filter for records that have a non-null value using the following:

val req_logs = sqlContext.read.json(filePath)

val req_logs_with_dpid = req_logs.filter("req_info.dpid is not null or req_info.dpid_sha1 is not null")

But how do I filter if the value of a column is an empty string?
-- 
Regards
Prashant
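
One option (an untested sketch, not from the thread) is to extend the SQL-string form of filter above with an empty-string comparison, since the expression parser accepts != against a string literal:

// Keep rows where at least one of the two columns is a non-null, non-empty string.
val req_logs_with_dpid = req_logs.filter(
  "(req_info.dpid is not null and req_info.dpid != '') or " +
  "(req_info.dpid_sha1 is not null and req_info.dpid_sha1 != '')")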


Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
Anyway, I got it. I have to use !== instead of ===. Thanks BTW.
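
For reference, a sketch of how that filter might look with the Column API (untested; in Spark 1.x, !== returns a Column, whereas Scala's != returns a Boolean, which is why the earlier attempt did not compile):

// Keep rows where at least one of the two columns is a non-empty string.
val req_logs_with_dpid = req_logs.filter(
  (req_logs("req_info.dpid") !== "") || (req_logs("req_info.dpid_sha1") !== ""))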

On Wed, Dec 9, 2015 at 9:39 PM, Prashant Bhardwaj <
prashant2006s...@gmail.com> wrote:

> I have to do opposite of what you're doing. I have to filter non-empty
> records.
>
> On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D <email2...@gmail.com>
> wrote:
>
>> Hello Prashant -
>>
>> Can you please try like this :
>>
>> For instance, the input file name is "student_detail.txt" and it contains:
>>
>> ID,Name,Sex,Age
>> ===
>> 101,Alfred,Male,30
>> 102,Benjamin,Male,31
>> 103,Charlie,Female,30
>> 104,Julie,Female,30
>> 105,Maven,Male,30
>> 106,Dexter,Male,30
>> 107,Lundy,Male,32
>> 108,Rita,Female,30
>> 109,Aster,Female,30
>> 110,Harrison,Male,15
>> 111,Rita,,30
>> 112,Aster,,30
>> 113,Harrison,,15
>> 114,Rita,Male,20
>> 115,Aster,,30
>> 116,Harrison,,20
>>
>> [image: Inline image 2]
>>
>> *Output:*
>>
>> Total No.of Records without SEX 5
>> [111,Rita,,30]
>> [112,Aster,,30]
>> [113,Harrison,,15]
>> [115,Aster,,30]
>> [116,Harrison,,20]
>>
>> Total No.of Records with AGE <=15 2
>> [110,Harrison,Male,15]
>> [113,Harrison,,15]
>>
>> Thanks & Regards,
>> Gokula Krishnan* (Gokul)*
>> Contact :+1 980-298-1740
>>
>> On Wed, Dec 9, 2015 at 8:24 AM, Prashant Bhardwaj <
>> prashant2006s...@gmail.com> wrote:
>>
>>> Already tried it. But getting following error.
>>>
>>> overloaded method value filter with alternatives: (conditionExpr:
>>> String)org.apache.spark.sql.DataFrame  (condition:
>>> org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame cannot be
>>> applied to (Boolean)
>>>
>>> Also tried:
>>>
>>> val req_logs_with_dpid = 
>>> req_logs.filter(req_logs("req_info.dpid").toString.length
>>> != 0 )
>>>
>>> But getting same error.
>>>
>>>
>>> On Wed, Dec 9, 2015 at 6:45 PM, Fengdong Yu <fengdo...@everstring.com>
>>> wrote:
>>>
>>>> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") !=
>>>> "" )
>>>>
>>>> Azuryy Yu
>>>> Sr. Infrastructure Engineer
>>>>
>>>> cel: 158-0164-9103
>>>> wetchat: azuryy
>>>>
>>>>
>>>> On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <
>>>> prashant2006s...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I have two columns in my json which can have null, empty and non-empty
>>>>> string as value.
>>>>> I know how to filter records which have non-null value using following:
>>>>>
>>>>> val req_logs = sqlContext.read.json(filePath)
>>>>>
>>>>> val req_logs_with_dpid = req_log.filter("req_info.dpid is not null or
>>>>> req_info.dpid_sha1 is not null")
>>>>>
>>>>> But how to filter if value of column is empty string?
>>>>> --
>>>>> Regards
>>>>> Prashant
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards
>>> Prashant
>>>
>>
>>
>
>
> --
> Regards
> Prashant
>



-- 
Regards
Prashant


Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Prashant Bhardwaj
I have to do the opposite of what you're doing. I have to filter for non-empty
records.

On Wed, Dec 9, 2015 at 9:33 PM, Gokula Krishnan D <email2...@gmail.com>
wrote:

> Hello Prashant -
>
> Can you please try like this :
>
> For instance, the input file name is "student_detail.txt" and it contains:
>
> ID,Name,Sex,Age
> ===
> 101,Alfred,Male,30
> 102,Benjamin,Male,31
> 103,Charlie,Female,30
> 104,Julie,Female,30
> 105,Maven,Male,30
> 106,Dexter,Male,30
> 107,Lundy,Male,32
> 108,Rita,Female,30
> 109,Aster,Female,30
> 110,Harrison,Male,15
> 111,Rita,,30
> 112,Aster,,30
> 113,Harrison,,15
> 114,Rita,Male,20
> 115,Aster,,30
> 116,Harrison,,20
>
> [image: Inline image 2]
>
> *Output:*
>
> Total No.of Records without SEX 5
> [111,Rita,,30]
> [112,Aster,,30]
> [113,Harrison,,15]
> [115,Aster,,30]
> [116,Harrison,,20]
>
> Total No.of Records with AGE <=15 2
> [110,Harrison,Male,15]
> [113,Harrison,,15]
>
> Thanks & Regards,
> Gokula Krishnan* (Gokul)*
> Contact :+1 980-298-1740
>
> On Wed, Dec 9, 2015 at 8:24 AM, Prashant Bhardwaj <
> prashant2006s...@gmail.com> wrote:
>
>> Already tried it. But getting following error.
>>
>> overloaded method value filter with alternatives: (conditionExpr:
>> String)org.apache.spark.sql.DataFrame  (condition:
>> org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame cannot be
>> applied to (Boolean)
>>
>> Also tried:
>>
>> val req_logs_with_dpid = 
>> req_logs.filter(req_logs("req_info.dpid").toString.length
>> != 0 )
>>
>> But getting same error.
>>
>>
>> On Wed, Dec 9, 2015 at 6:45 PM, Fengdong Yu <fengdo...@everstring.com>
>> wrote:
>>
>>> val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != ""
>>> )
>>>
>>> Azuryy Yu
>>> Sr. Infrastructure Engineer
>>>
>>> cel: 158-0164-9103
>>> wetchat: azuryy
>>>
>>>
>>> On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj <
>>> prashant2006s...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I have two columns in my json which can have null, empty and non-empty
>>>> string as value.
>>>> I know how to filter records which have non-null value using following:
>>>>
>>>> val req_logs = sqlContext.read.json(filePath)
>>>>
>>>> val req_logs_with_dpid = req_log.filter("req_info.dpid is not null or
>>>> req_info.dpid_sha1 is not null")
>>>>
>>>> But how to filter if value of column is empty string?
>>>> --
>>>> Regards
>>>> Prashant
>>>>
>>>
>>>
>>
>>
>> --
>> Regards
>> Prashant
>>
>
>


-- 
Regards
Prashant
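
The code in Gokula's reply was attached as an inline image and is not recoverable here; a purely illustrative sketch of what it might have looked like (assuming spark-shell with sc/sqlContext and the student_detail.txt data above) is:

import sqlContext.implicits._

// Hypothetical case class for the CSV-style records shown above.
case class Student(id: Int, name: String, sex: String, age: Int)

val students = sc.textFile("student_detail.txt")
  .filter(line => !line.startsWith("ID") && !line.startsWith("="))  // skip the header lines
  .map(_.split(",", -1))
  .map(f => Student(f(0).toInt, f(1), f(2), f(3).toInt))
  .toDF()

val noSex = students.filter(students("sex") === "")
println("Total No.of Records without SEX " + noSex.count())
noSex.collect().foreach(println)

val young = students.filter(students("age") <= 15)
println("Total No.of Records with AGE <=15 " + young.count())
young.collect().foreach(println)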


Spark and Kafka Integration

2015-12-07 Thread Prashant Bhardwaj
Hi

Some Background:
We have a Kafka cluster with ~45 topics. Some of the topics contain logs in
JSON format and some in PSV (pipe-separated value) format. Now I want to
consume these logs using Spark Streaming and store them in Parquet format
in HDFS.

Now my questions are:
1. Can we create an InputDStream per topic in the same application?

Since the schema of the logs might differ from topic to topic, I want to process
some topics differently, and I want to store the logs in a different output
directory based on the topic name.

2. Also, how do I partition the logs based on timestamp?

-- 
Regards
Prashant
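
For (1), one possible shape is a separate direct stream per topic (or per group of topics sharing a schema); for (2), a date column derived from the record timestamp can drive partitionBy on the Parquet writer. A rough, untested sketch, assuming the Spark 1.x direct Kafka API, a spark-shell style sc/sqlContext, and illustrative broker addresses, topic names, paths, and an epoch-seconds "timestamp" field:

import kafka.serializer.StringDecoder
import org.apache.spark.sql.functions._
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// One direct stream per topic; repeat for each topic (or group) that needs its own handling.
val jsonTopic = "some_json_topic"
val jsonStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set(jsonTopic))

jsonStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Message values are JSON strings; infer the schema per batch.
    val df = sqlContext.read.json(rdd.map(_._2))
    // Derive a date column from the assumed epoch-seconds "timestamp" field and
    // partition the Parquet output by it, under a per-topic directory.
    df.withColumn("dt", to_date(from_unixtime(col("timestamp"))))
      .write.mode("append").partitionBy("dt")
      .parquet(s"/data/logs/$jsonTopic")
  }
}

// A PSV topic would get its own stream, its own line parser, and its own output directory.

ssc.start()
ssc.awaitTermination()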