Hi Vikas,
He suggested using the select() function after your withColumn call:
import org.apache.spark.sql.functions.lit
val ds1 = ds.select("Col1", "Col3")
  .withColumn("Col2", lit("sample"))
  .select("Col1", "Col2", "Col3")
Thanks,
Subash
On Thu, Nov 12, 2020 at 9:19 PM Vikas Garg wrote:
> I am deriving the col2 using with
Looks like he had a very bad appraisal this year... Fun fact: the coming
year will be too :)
On Thu, 16 Apr 2020 at 12:07, Qi Kang wrote:
> Well man, check your attitude, you’re way over the line
>
>
> On Apr 16, 2020, at 13:26, jane thorpe
> wrote:
>
> F*U*C*K O*F*F
> C*U*N*T*S
Hi Team,
I have two questions regarding Arrow and Spark integration,
1. I am joining two huge tables (1 PB each) - will there be a big performance
gain if I use the Arrow format before shuffling? Will the
serialization/deserialization cost improve significantly?
2. Can we store the final data in
What is the number of part files in that big table? And what is the
distribution of the request ID? Is the variance of that column low or high?
The partitionBy clause will move all data with the same request ID to one
executor, so if the data for a single request ID is huge it might put a heavy
load on that executor. A quick way to check the distribution is sketched below.
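Something like this (just a rough sketch - the DataFrame name df, the column
name "request_id", and the output path are placeholders, not from your setup):

import org.apache.spark.sql.functions.col

// Inspect how many rows each request ID has before partitioning by it.
val keyCounts = df.groupBy("request_id").count()
keyCounts.orderBy(col("count").desc).show(20)

// Only partition the output by request_id if the distribution is reasonably
// even; otherwise the executors handling the heavy keys do most of the work.
df.write.partitionBy("request_id").parquet("/tmp/output_by_request_id")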
On Sun, 25 Aug 2019 at
When you say process, do you mean two separate Spark jobs, or two stages
within the same Spark code?
Thanks
Subash
On Wed, 28 Aug 2019 at 19:06, wrote:
> Take a look at this article
>
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html
>
> *From:* Tzahi File
I had a similar issue reading an external Parquet table. In my case I had a
permission issue in one partition, so I added a filter to exclude that
partition, but Spark still didn't prune it. Then I read that in order for
Spark to be aware of all the partitions, it first reads the folders and
then
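Roughly what I mean (the paths, the partition column "dt", and the specific
partition values are made-up placeholders):

import org.apache.spark.sql.functions.col

// What I tried: filter on the partition column and rely on pruning
// (Spark still listed every folder during partition discovery).
val pruned = spark.read.parquet("/data/events").filter(col("dt") =!= "2019-08-01")

// Alternative: point the reader only at the partitions it is allowed to read,
// so the unreadable folder is never listed at all.
val readable = spark.read
  .option("basePath", "/data/events")
  .parquet("/data/events/dt=2019-08-02", "/data/events/dt=2019-08-03")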
Hi,
I am running the Spark DataFrame NTILE window function over huge data - it
spills a lot of data while sorting and eventually fails.
The data size is roughly 80 million records totalling about 4 GB (not sure
whether that is serialized or deserialized) - I am calculating NTILE(10) over
all these records.
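Roughly what the job does (the DataFrame df and the ordering column "score"
are placeholders here):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}

// A window with orderBy but no partitionBy pulls every row into a single
// partition, which is what drives the big sort and the spill.
val w = Window.orderBy(col("score"))
val withDeciles = df.withColumn("decile", ntile(10).over(w))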
Hi,
I have a series of queries to extract data from multiple tables in Hive and
then do feature engineering on the extracted final data. I can run the queries
using Spark SQL and use MLlib to perform the feature transformations I need.
The question is: do you use any kind of tool to perform this
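To make it concrete, this is the kind of pipeline I mean - the Hive table,
column names, and transformer are only an illustrative sketch, not my actual
feature set:

import org.apache.spark.ml.feature.VectorAssembler

// Extract with Spark SQL (hypothetical table and columns) ...
val extracted = spark.sql("SELECT f1, f2, label FROM db.features_src")

// ... then run an MLlib feature transformation on the result.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
val prepared = assembler.transform(extracted)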
Hey Rajat,
The documentation page is self-explanatory. You can refer to it for more configs:
https://spark.apache.org/docs/2.0.0/configuration.html
or to the same page in any other version of the Spark documentation.
Thanks.
Subash
On Sat, 20 Apr 2019 at 16:04, rajat kumar
wrote:
> Hi,
>
> Can anyone pls explain ?
Hi All,
I have a question about checkpointing versus persisting/saving.
Say we have one RDD containing huge data:
1. We checkpoint it and perform a join
2. We persist it as StorageLevel.MEMORY_AND_DISK and perform a join
3. We save that intermediate RDD and perform a join (using the same RDD - saving
is to just
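A minimal sketch of what I mean by options 1 and 2 - the paths, keys, and the
second RDD are made-up placeholders:

import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("/tmp/checkpoints")

val big   = sc.textFile("/data/huge").map(l => (l.split(",")(0), l))
val other = sc.textFile("/data/dim").map(l => (l.split(",")(0), l))

big.checkpoint()                              // option 1: truncates lineage, materialised to the checkpoint dir
// big.persist(StorageLevel.MEMORY_AND_DISK) // option 2: keeps blocks on executors, lineage retained

val joined = big.join(other)
joined.count()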
Hi,
While saving as a text file in Spark 2, I see an encoded/hash value attached to
the part file names along with the part number. I am curious to know what that
value is.
Example:
ds.write.mode(SaveMode.Overwrite).option("compression", "gzip").text(path)
Produces,