Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead. The ambition is to say "divide the data into partitions, but make sure you don't move it in doing so".
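
A minimal sketch of the distinction, assuming a stand-in DataFrame named df (untested):

    # repartition(n) redistributes rows across n partitions and
    # always triggers a full shuffle (an Exchange in the plan)
    df.repartition(10).explain()

    # coalesce(n) only merges existing partitions when shrinking
    # the count, so no full shuffle is needed
    df.coalesce(10).explain()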

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If this is actually happening, it's just wasteful overhead.

Re: [External Sender] Pitfalls of partitioning by host?

2018-08-28 Thread Jayesh Lalwani
If you group by the host that you have computed using the UDF, Spark is always going to shuffle your dataset, even if the end result is that all the new partitions look exactly like the old partitions, just placed on different nodes. Remember the hostname will probably hash differently than the
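
The shuffle Jayesh describes shows up as an Exchange in the physical plan; a rough sketch, where df and host_udf are hypothetical (untested):

    from pyspark.sql import functions as F

    # host_udf is a made-up UDF that computes a host for each row
    with_host = df.withColumn("host", host_udf(F.col("id")))

    # grouping hashes the 'host' values to pick target partitions,
    # so expect Exchange hashpartitioning(host, ...) in this plan
    with_host.groupBy("host").count().explain()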

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Sonal Goyal
Hi Patrick, Sorry, is there something here that helps you beyond repartition(number of partitions) or calling your UDF in foreachPartition? If your data is on disk, Spark is already partitioning it for you by rows. How is adding the host info helping? Thanks, Sonal Nube Technologies
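
For concreteness, a bare-bones sketch of the foreachPartition route, with process_partition as a made-up handler (untested):

    def process_partition(rows):
        # rows is an iterator over one partition's rows; work done
        # here runs once per partition with no extra shuffle
        for row in rows:
            pass  # e.g. push the row to an external sink

    df.rdd.foreachPartition(process_partition)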

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
Mostly I'm guessing that it adds efficiency to a job where partitioning is required but shuffling is not. For example, if I want to apply a UDF to 1TB of records on disk, I might need to repartition(5) to bring the task size down to something acceptable for my cluster. If I don't care that it's
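
A sketch of that pattern, with the partition count and my_udf purely illustrative (untested):

    from pyspark.sql import functions as F

    # split the input into more, smaller partitions so each task
    # fits the cluster; the repartition itself is one full shuffle
    resized = df.repartition(5000)

    # my_udf is a hypothetical row-wise UDF applied after resizing
    result = resized.withColumn("out", my_udf(F.col("value")))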

RDD Collect Issue

2018-08-28 Thread Aakash Basu
Hi, I configured a new system: Spark 2.3.0, Python 3.6.0. DataFrame reads and other operations work as expected, but RDD collect is failing:

    distFile = spark.sparkContext.textFile("/Users/aakash/Documents/Final_HOME_ORIGINAL/Downloads/PreloadedDataset/breast-cancer-wisconsin.csv")
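
A minimal form of the failing pattern, with the path shortened (untested):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # textFile returns an RDD of lines; collect() materializes the
    # whole RDD on the driver, which DataFrame reads never do
    distFile = spark.sparkContext.textFile("breast-cancer-wisconsin.csv")
    print(distFile.take(5))  # take(n) is a gentler first check than collect()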

Re: How do I generate current UTC timestamp in raw spark sql?

2018-08-28 Thread Nikita Goyal
Hi, One way is to write your own UDF and use the UTC zone inside it. Something like:

    import java.util.Date
    import java.sql.Timestamp
    import org.joda.time.{DateTime, DateTimeZone}
    import org.apache.spark.sql.functions.udf

    val getCurrentTimestampUTC = udf(() => {
      new Timestamp(new DateTime(new Date()).withZone(DateTimeZone.UTC).getMillis)
    })

Note: I've not tested

How do I generate current UTC timestamp in raw spark sql?

2018-08-28 Thread kant kodali
Hi All, How do I generate the current UTC timestamp using Spark SQL? When I do current_timestamp() it gives me local time. to_utc_timestamp(current_timestamp(), ) takes a timezone as the second parameter, and I see no UDF that can give me the current timezone. when I do
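
One built-in angle that may help: since Spark 2.2 the session time zone is configurable via spark.sql.session.timeZone, and timestamps are rendered in that zone. A sketch (untested):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # timestamps are rendered in the session time zone, so setting
    # it to UTC makes current_timestamp() display UTC
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    spark.sql("SELECT current_timestamp() AS utc_now").show(truncate=False)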