RE: Difference between dataset and dataframe

2019-02-18 Thread Lunagariya, Dhaval
It does for dataframes also. Please try this example:

df1 = spark.range(2, 1000, 2)
df2 = spark.range(2, 1000, 4)
step1 = df1.repartition(5)
step12 = df2.repartition(6)
step2 = step1.selectExpr("id * 5 as id")
step3 = step2.join(step12, ["id"])
step4 = step3.selectExpr("sum(id)")
step4.collect()

step4._jdf.queryExecution().debug().codegen()

You will see the generated code.
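For anyone on the Scala side, a sketch of the equivalent using the developer `debug` implicits (an internal API, so the exact surface may vary across versions; the pipeline below just mirrors the PySpark snippet above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._

object CodegenDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("codegen").getOrCreate()

    val df1 = spark.range(2, 1000, 2)
    val df2 = spark.range(2, 1000, 4)
    val step4 = df1.repartition(5)
      .selectExpr("id * 5 as id")
      .join(df2.repartition(6), "id")
      .selectExpr("sum(id)")

    // Prints the whole-stage generated Java source for each codegen stage
    step4.debugCodegen()
    spark.stop()
  }
}
```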

Regards,
Dhaval

From: [External] Akhilanand 
Sent: Tuesday, February 19, 2019 10:29 AM
To: Koert Kuipers 
Cc: user 
Subject: Re: Difference between dataset and dataframe

Thanks for the reply. But can you please tell why dataframes are more performant 
than datasets? Any specifics would be helpful.

Also, could you comment on the Tungsten code-gen part of my question?

On Feb 18, 2019, at 10:47 PM, Koert Kuipers <ko...@tresata.com> wrote:
in the api DataFrame is just Dataset[Row]. so this makes you think Dataset is 
the generic api. interestingly enough under the hood everything is really 
Dataset[Row], so DataFrame is really the "native" language for spark sql, not 
Dataset.

i find DataFrame to be significantly more performant. in general if you use 
Dataset you miss out on some optimizations. also Encoders are not very pleasant 
to work with.

On Mon, Feb 18, 2019 at 9:09 PM Akhilanand <akhilanand...@gmail.com> wrote:

Hello,

I have recently been exploring datasets and dataframes. I would really 
appreciate it if someone could answer these questions:

1) Is there any difference in terms of performance when we use datasets over 
dataframes? Is it significant enough to choose one over the other? I do realise 
there would be some overhead due to case classes, but how significant is that? 
Are there any other implications?

2) Is the Tungsten code generation done only for datasets, or is there any 
internal process to generate bytecode for dataframes as well? Since it's related 
to the JVM, I think it's just for datasets, but I couldn't find anything that 
says so specifically. If it's just for datasets, does that mean we miss out on 
the Project Tungsten optimisation for dataframes?



Regards,
Akhilanand BV

-----------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Recall: Difference between dataset and dataframe

2019-02-18 Thread Lunagariya, Dhaval
Lunagariya, Dhaval [CCC-OT] would like to recall the message, "Difference 
between dataset and dataframe".




Re: Difference between dataset and dataframe

2019-02-18 Thread Akhilanand
Thanks for the reply. But can you please tell why dataframes are more performant 
than datasets? Any specifics would be helpful.

Also, could you comment on the Tungsten code-gen part of my question?


> On Feb 18, 2019, at 10:47 PM, Koert Kuipers  wrote:
> 
> in the api DataFrame is just Dataset[Row]. so this makes you think Dataset is 
> the generic api. interestingly enough under the hood everything is really 
> Dataset[Row], so DataFrame is really the "native" language for spark sql, not 
> Dataset.
> 
> i find DataFrame to be significantly more performant. in general if you use 
> Dataset you miss out on some optimizations. also Encoders are not very 
> pleasant to work with.
> 
>> On Mon, Feb 18, 2019 at 9:09 PM Akhilanand  wrote:
>> 
>> Hello, 
>> 
>> I have recently been exploring datasets and dataframes. I would really 
>> appreciate it if someone could answer these questions:
>> 
>> 1) Is there any difference in terms of performance when we use datasets over 
>> dataframes? Is it significant enough to choose one over the other? I do 
>> realise there would be some overhead due to case classes, but how significant 
>> is that? Are there any other implications?
>> 
>> 2) Is the Tungsten code generation done only for datasets, or is there any 
>> internal process to generate bytecode for dataframes as well? Since it's 
>> related to the JVM, I think it's just for datasets, but I couldn't find 
>> anything that says so specifically. If it's just for datasets, does that mean 
>> we miss out on the Project Tungsten optimisation for dataframes?
>> 
>> 
>> 
>> Regards,
>> Akhilanand BV
>> 
>> 


Re: Difference between dataset and dataframe

2019-02-18 Thread Koert Kuipers
in the api DataFrame is just Dataset[Row]. so this makes you think Dataset
is the generic api. interestingly enough under the hood everything is
really Dataset[Row], so DataFrame is really the "native" language for spark
sql, not Dataset.

i find DataFrame to be significantly more performant. in general if you use
Dataset you miss out on some optimizations. also Encoders are not very
pleasant to work with.
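
A minimal Scala sketch of the point about missed optimizations (an illustration, not from the original message): a typed Dataset filter takes an opaque JVM lambda, which forces rows to be deserialized into objects and hides the predicate from Catalyst, while the equivalent column expression stays inside the optimized, whole-stage-code-generated plan. Comparing the two `explain()` outputs shows the difference:

```scala
import org.apache.spark.sql.SparkSession

object TypedVsUntyped {
  case class Rec(id: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("demo").getOrCreate()
    import spark.implicits._

    val ds = spark.range(0, 1000).as[Rec]

    // Typed API: the lambda is a black box to Catalyst, so rows must be
    // deserialized into Rec objects before the predicate can run.
    ds.filter(r => r.id % 2 == 0).explain()

    // Untyped/DataFrame API: the predicate is a Catalyst expression and
    // participates in whole-stage codegen and predicate pushdown.
    ds.filter($"id" % 2 === 0).explain()

    spark.stop()
  }
}
```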

On Mon, Feb 18, 2019 at 9:09 PM Akhilanand  wrote:

>
> Hello,
>
> I have recently been exploring datasets and dataframes. I would
> really appreciate it if someone could answer these questions:
>
> 1) Is there any difference in terms of performance when we use datasets over
> dataframes? Is it significant enough to choose one over the other? I do
> realise there would be some overhead due to case classes, but how significant
> is that? Are there any other implications?
>
> 2) Is the Tungsten code generation done only for datasets, or is there any
> internal process to generate bytecode for dataframes as well? Since it's
> related to the JVM, I think it's just for datasets, but I couldn't find
> anything that says so specifically. If it's just for datasets, does that mean
> we miss out on the Project Tungsten optimisation for dataframes?
>
>
>
> Regards,
> Akhilanand BV
>
>
>


Difference between dataset and dataframe

2019-02-18 Thread Akhilanand


Hello, 

I have recently been exploring datasets and dataframes. I would really 
appreciate it if someone could answer these questions:

1) Is there any difference in terms of performance when we use datasets over 
dataframes? Is it significant enough to choose one over the other? I do realise 
there would be some overhead due to case classes, but how significant is that? 
Are there any other implications?

2) Is the Tungsten code generation done only for datasets, or is there any 
internal process to generate bytecode for dataframes as well? Since it's related 
to the JVM, I think it's just for datasets, but I couldn't find anything that 
says so specifically. If it's just for datasets, does that mean we miss out on 
the Project Tungsten optimisation for dataframes?



Regards,
Akhilanand BV




Streaming Tab in Kafka Structured Streaming

2019-02-18 Thread KhajaAsmath Mohammed
Hi,

I am new to structured streaming but have used DStreams a lot. One difference I
see in the Spark UI is the Streaming tab that exists for DStreams.

Is there a way to know how many records and batches were executed in
structured streaming, and is there any option to see a streaming tab?
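
The thread never got an answer here, so for reference: I believe the Spark 2.x UI has no tab for structured streaming (one arrived in later releases), but per-batch record counts are exposed programmatically through `StreamingQuery`. A minimal Scala sketch (the `rate` source and the sleep are illustrative stand-ins for a real Kafka source and a long-running job):

```scala
import org.apache.spark.sql.SparkSession

object ProgressDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[2]").appName("progress").getOrCreate()

    // "rate" source just generates rows continuously; stands in for Kafka here
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
    val query = stream.writeStream.format("console").outputMode("append").start()

    Thread.sleep(5000)
    // Metrics for the most recent micro-batch:
    // batchId, numInputRows, processedRowsPerSecond, ...
    println(query.lastProgress)
    // Rolling history of recent batches
    query.recentProgress.foreach(println)
    // Whether the query is active / waiting for data
    println(query.status)

    query.stop(); spark.stop()
  }
}
```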

Thanks,
Asmath


Avoiding MUltiple GroupBy

2019-02-18 Thread Kumar sp
Can we avoid multiple group-bys? I have a million records and it's a
performance concern.

Below is my query. Even with window functions I guess there is a
performance hit; can you please advise if there is a better alternative?
I need to get the max number of equipments for each house across a list of dates.

 ds.groupBy("house", "date").agg(countDistinct("equiId") as "count").
  drop("date").groupBy("house").agg(max("count") as "noOfEquipments")
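
One possible rewrite, sketched below with made-up sample rows (illustrative only): replace the second groupBy with a window over `house`. Note the window still shuffles by `house`, so this is not guaranteed to be faster than the double groupBy; comparing the two physical plans with `explain()` is the way to decide:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{countDistinct, max}

object MaxEquipments {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[1]").appName("maxEquip").getOrCreate()
    import spark.implicits._

    // Tiny stand-in for the real data
    val ds = Seq(
      ("h1", "2019-02-01", "e1"), ("h1", "2019-02-01", "e2"),
      ("h1", "2019-02-02", "e1"), ("h2", "2019-02-01", "e3")
    ).toDF("house", "date", "equiId")

    val perHouseDate = ds.groupBy("house", "date")
      .agg(countDistinct("equiId") as "count")

    // Window over house replaces the second groupBy; it still
    // repartitions by "house", so compare both plans first
    val byHouse = Window.partitionBy("house")
    perHouseDate
      .withColumn("noOfEquipments", max("count").over(byHouse))
      .select("house", "noOfEquipments")
      .distinct()
      .show()

    spark.stop()
  }
}
```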

Regards,
Kumar


Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Wenchen Fan
great job!

On Mon, Feb 18, 2019 at 4:24 PM Hyukjin Kwon  wrote:

> Yay! Good job Takeshi!
>
> On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro wrote:
>> We are happy to announce the availability of Spark 2.3.3!
>>
>> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3
>> maintenance branch of Spark. We strongly recommend all 2.3.x users to
>> upgrade to this stable release.
>>
>> To download Spark 2.3.3, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-3-3.html
>>
>> We would like to acknowledge all community members for contributing to
>> this release. This release would not have been possible without you.
>>
>> Best,
>> Takeshi
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Hyukjin Kwon
Yay! Good job Takeshi!

On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro wrote:
> We are happy to announce the availability of Spark 2.3.3!
>
> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3
> maintenance branch of Spark. We strongly recommend all 2.3.x users to
> upgrade to this stable release.
>
> To download Spark 2.3.3, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-3-3.html
>
> We would like to acknowledge all community members for contributing to
> this release. This release would not have been possible without you.
>
> Best,
> Takeshi
>
> --
> ---
> Takeshi Yamamuro
>