I'm not 100% sure, but a naive repartition() seems to cause a shuffle. If
this is actually happening, it's just wasteful overhead. The ambition is to
say "divide the data into partitions, but make sure you don't move it in
doing so".
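A minimal sketch of that distinction (df is an illustrative DataFrame, not
from this thread): coalesce() can lower the partition count without the full
shuffle that repartition() performs.

// repartition(50) always performs a full shuffle to build 50 new partitions
val shuffled = df.repartition(50)

// coalesce(50) only merges existing partitions, so it reduces the partition
// count without moving rows across the network
val merged = df.coalesce(50)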
If you group by the host that you have computed using the UDF, Spark is
always going to shuffle your dataset, even if the end result is that all
the new partitions look exactly like the old partitions, just placed on
different nodes. Remember, the hostname will probably hash differently
than the key your data is currently partitioned on.
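A sketch of that point, reusing the illustrative df and a hypothetical UDF
that tags each row with the host it currently lives on:

import org.apache.spark.sql.functions.{col, udf}

// runs on the executors, so each row records its current host
val hostUdf = udf(() => java.net.InetAddress.getLocalHost.getHostName)
val withHost = df.withColumn("host", hostUdf())

// placement is decided by hashing the new key, which rarely matches the
// current layout, so this triggers a full exchange regardless
val byHost = withHost.repartition(col("host"))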
Hi Patrick,
Sorry, is there something here that helps you beyond repartition(number of
partitions) or calling your UDF on foreachPartition? If your data is on
disk, Spark is already partitioning it for you by rows. How is adding the
host info helping?
Thanks,
Sonal
Nube Technologies
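A rough sketch of the foreachPartition route Sonal mentions, with println
standing in for the real per-row work:

// iterate each existing partition in place; no repartition, no shuffle
df.rdd.foreachPartition { rows =>
  rows.foreach(row => println(row))
}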
Mostly I'm guessing that it adds efficiency to a job where partitioning is
required but shuffling is not.
For example, if I want to apply a UDF to 1 TB of records on disk, I might
need to repartition(5) to get the task size down to an acceptable size
for my cluster. If I don't care that it's
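On the on-disk point: if the records are read straight from files, a sketch
of bounding task size at read time instead (the config key is real; the
value and path are illustrative):

// split input files into ~64 MB chunks per task (default is 128 MB),
// bounding task size without any repartition() and its shuffle
spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
val records = spark.read.parquet("/path/to/records")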
Hi,
I configured a new system: Spark 2.3.0, Python 3.6.0. DataFrame reads and
other operations are working as expected, but RDD collect is failing:

distFile = spark.sparkContext.textFile("/Users/aakash/Documents/Final_HOME_ORIGINAL/Downloads/PreloadedDataset/breast-cancer-wisconsin.csv")
Hi,
One way is to write your own UDF and use the UTC zone inside it.
Something like:

import java.util.Date
import java.sql.Timestamp
import org.joda.time.{DateTime, DateTimeZone}
import org.apache.spark.sql.functions.udf

// wrap the current instant in a java.sql.Timestamp via Joda-Time's UTC zone
val getCurrentTimestampUTC = udf(() => {
  new Timestamp(new DateTime(new Date()).withZone(DateTimeZone.UTC).getMillis)
})

Note: I've not tested this.
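Hypothetical usage, given some DataFrame df:

// appends a column holding the UDF's timestamp for every row
val stamped = df.withColumn("created_utc", getCurrentTimestampUTC())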
Hi All,
How do I generate the current UTC timestamp using Spark SQL?
When I do current_timestamp() it is giving me local time.
to_utc_timestamp(current_timestamp(), <timezone>) takes a timezone as the
second parameter, and I see no UDF that can give me the current timezone.
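For what it's worth, a sketch of one alternative in Spark 2.2+ (illustrative,
not from this thread): the session time zone controls how timestamps are
rendered, so setting it to UTC makes current_timestamp() display UTC.

spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("SELECT current_timestamp()").show(false)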