Hi All,
I'm using Spark 2.4 and trying to convert a timestamp column to a unix time
with milliseconds using the unix_timestamp function.
I tried casting the result to double: cast(unix_timestamp(timestamp) as
double).
I also tried using the timestamp format "yyyy-MM-dd HH:mm:ss.sss", and no
matter
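
A minimal PySpark sketch of one common workaround, assuming a timestamp
column named "ts" (the name is a placeholder): unix_timestamp truncates to
whole seconds, but casting the timestamp column itself to double keeps the
fractional part, so epoch milliseconds can be derived from it.

from pyspark.sql import functions as F

# cast(timestamp as double) returns epoch seconds with a fractional part;
# multiplying by 1000 and casting to long gives epoch milliseconds.
df_ms = df.withColumn(
    "epoch_millis",
    (F.col("ts").cast("double") * 1000).cast("long"))
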
> See <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html>
> for optimal use of S3A.
>
> Thanks,
> Hariharan
>
>
>
> On Wed, Apr 7, 2021 at 12:15 AM Tzahi File wrote:
Hi All,
We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge instances.
The Spark job running on that cluster reads from an S3 bucket and writes to
the same bucket.
The bucket and the EC2 instances run in the same region.
As part of our efforts to reduce the runtime of our Spark jobs we found
there's serious
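
Since the reply above points at the S3A performance guide, here is a hedged
sketch of the kind of S3A settings that guide covers; the values are
placeholders, not recommendations, and should be tuned per workload.

from pyspark.sql import SparkSession

# Illustrative S3A tuning knobs (all values are placeholders):
spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.connection.maximum", "200")  # parallel S3 connections
         .config("spark.hadoop.fs.s3a.threads.max", "64")          # upload/copy thread pool
         .config("spark.hadoop.fs.s3a.fast.upload", "true")        # stream uploads instead of staging whole files
         .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")  # suits columnar (Parquet) reads
         .getOrCreate())
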
Hi,
We have many Spark jobs that create multiple small files. We would like to
improve read performance for our analysts, so I'm testing for the optimal
Parquet file size.
I've found that the optimal file size should be around 1 GB, and not less
than 128 MB, depending on the size of the data.
I took
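
A short sketch of the two usual knobs for steering Parquet output toward
that size range; the numbers below are placeholders to be derived from your
own average row size.

# Option 1: cap the rows per output file (available since Spark 2.2);
# pick the value so rows * avg_row_size lands between ~128 MB and ~1 GB.
(df.write
   .option("maxRecordsPerFile", 5000000)
   .parquet("s3://bucket/path/"))          # hypothetical output path

# Option 2: control the number of output files directly.
(df.repartition(200)                       # placeholder partition count
   .write
   .parquet("s3://bucket/path/"))
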
al terabytes is bad.
>
> It depends on your use case but you might look also at partitions etc.
>
> On 31.08.2020 at 16:17, Tzahi File wrote:
>
>
> Hi,
>
> I would like to develop a process that merges parquet files.
> My first intention was to develop it with PySpa
Hi,
I'm using PySpark to write a df to S3 in Parquet.
I would like the partition columns to be written inside the data files as well.
What is the best way to do this?
e.g. df.write.partitionBy('day','hour')
desired file columns -> day, hour, time, name
and not just time, name
Thanks!
Tzahi
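
A hedged sketch of the usual workaround: partitionBy moves the partition
columns out of the data files and into the directory names, so duplicating
them under different names keeps one copy in the layout and one inside each
file. Column and path names are placeholders.

from pyspark.sql import functions as F

# day_part/hour_part drive the directory layout; the original day/hour
# columns stay inside every Parquet file.
(df.withColumn("day_part", F.col("day"))
   .withColumn("hour_part", F.col("hour"))
   .write
   .partitionBy("day_part", "hour_part")
   .parquet("s3://bucket/output/"))        # hypothetical output path
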
Hi,
I would like to develop a process that merges parquet files.
My first intention was to develop it with PySpark using coalesce(1) - to
create only 1 file.
This process is going to run on a huge number of files.
I wanted your advice on what is the best way to implement it (PySpark isn't
a
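
A minimal sketch of one way to compact a partition without coalesce(1),
which would funnel the whole write through a single task; paths and the
target file count are placeholders.

# Read the small files and rewrite them as a handful of larger ones.
src = "s3://bucket/table/day=2020-08-31/"      # hypothetical input partition
out = "s3://bucket/table_compacted/day=2020-08-31/"
n_files = 32   # placeholder: roughly input_size / desired_file_size

(spark.read.parquet(src)
      .repartition(n_files)        # full shuffle, spreads the write across tasks
      .write.mode("overwrite")
      .parquet(out))
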
I don't want to run a distinct query on the partition columns; the df
contains over 1 billion records.
I just want to know which partitions were created.
On Thu, Jun 25, 2020 at 4:04 PM Jörn Franke wrote:
> By doing a select on the df ?
>
> On 25.06.2020 at 14:52, T
Hi,
I'm using PySpark to write a df to S3, using the following command:
"df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)".
Is there any way to get the list of partitions that were created?
e.g.
day=2020-06-20/hour=1/country=US
day=2
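
One sketch of a way to get this without re-scanning the billion-row df:
list the partition directories the write produced through the Hadoop
FileSystem API (reached here via Spark's internal JVM gateway). s3_output
and the nesting depth of 3 (day/hour/country) follow the command above.

def list_partitions(spark, root_path, depth=3):
    """Walk the output path and return partition directories such as
    .../day=2020-06-20/hour=1/country=US, without reading any data."""
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    root = jvm.org.apache.hadoop.fs.Path(root_path)
    fs = root.getFileSystem(conf)

    def walk(path, level):
        if level == 0:
            return [path.toString()]
        dirs = [s.getPath() for s in fs.listStatus(path) if s.isDirectory()]
        return [p for d in dirs for p in walk(d, level - 1)]

    return walk(root, depth)

partitions = list_partitions(spark, s3_output)
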
Hi,
This is a general question regarding moving a Spark SQL query to PySpark; if
needed I will add more detail from the error logs and the query syntax.
I'm trying to move a Spark SQL query to run through PySpark.
The query syntax and Spark configuration are the same.
For some reason the query failed to
Hi,
I saw that Spark has an option to adapt the join and shuffle configuration.
For example: "spark.sql.adaptive.shuffle.targetPostShuffleInputSize"
I wanted to know if you have any experience with such configuration and how
it changed performance.
Another question is whether along Spark SQL
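
For reference, a minimal sketch of how these adaptive-execution settings are
usually switched on in Spark 2.x; the target size is a placeholder to tune
against your shuffle volumes.

from pyspark.sql import SparkSession

# Adaptive execution coalesces post-shuffle partitions toward a target input
# size per reducer; 256 MB here is only an illustrative value.
spark = (SparkSession.builder
         .config("spark.sql.adaptive.enabled", "true")
         .config("spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
                 str(256 * 1024 * 1024))
         .getOrCreate())
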
Hi All,
I'm using one Spark cluster that contains 50 nodes of type i3.4xlarge
(16 vCores).
I'm trying to run 4 Spark SQL queries simultaneously.
The data is split into 10 even partitions, and the 4 queries run on the same
data but on different partitions. I have tried to configure the cluster so
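
Since the question is about running several queries at once inside one
application, here is a hedged sketch of Spark's fair-scheduler pools, the
usual mechanism for sharing executors between concurrent queries; pool
names, the query list, and the output paths are illustrative.

import threading
from pyspark.sql import SparkSession

# FAIR scheduling lets concurrent jobs share executors instead of running FIFO.
spark = (SparkSession.builder
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

def run_query(sql, pool):
    # setLocalProperty is thread-local, so each thread pins its jobs to a pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
    spark.sql(sql).write.mode("overwrite").parquet("s3://bucket/out/" + pool + "/")

queries = ["SELECT ...", "SELECT ...", "SELECT ...", "SELECT ..."]  # the 4 queries
threads = [threading.Thread(target=run_query, args=(q, "pool" + str(i)))
           for i, q in enumerate(queries)]
for t in threads: t.start()
for t in threads: t.join()
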
No, there are 100M records of both even and odd values.
On Tue, Dec 17, 2019 at 8:13 PM Russell Spitzer
wrote:
> Is there a chance your data is all even or all odd?
>
> On Tue, Dec 17, 2019 at 11:01 AM Tzahi File
> wrote:
>
>> I have in my spark sql query a calculated fiel
I have in my Spark SQL query a calculated field that gets the value of
field1 % 3.
I'm using this field as a partition key, so I expected to get 3 partitions
in the mentioned case, and I do. The issue happened with even moduli
(instead of 3 - e.g. 4 or 2).
When I tried to use even numbers, for
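
A small sketch of how such skew can be inspected: when a DataFrame is
repartitioned on an expression, each row goes to hash(value) mod
numPartitions, so two bucket values can collide into one partition and leave
another empty; counting rows per physical partition makes that visible.
Column names and the modulus are placeholders.

from pyspark.sql import functions as F

# Bucket rows by field1 % 4 and repartition on that bucket.
bucketed = df.withColumn("bucket", F.col("field1") % 4)
parts = bucketed.repartition(4, "bucket")

# Rows per physical partition: an empty partition alongside a doubled one
# means two bucket values hashed into the same partition.
print(parts.rdd.glom().map(len).collect())

# Rows per bucket value, for comparison (independent of hashing):
bucketed.groupBy("bucket").count().show()
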
on the data frame. Are you currently using a user-defined function
>> for this task? Because I bet that's what's slowing you down.
>>
>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File
>> wrote:
>>
>>> Hi,
>>>
>>> Currently, I'm using hive
Hi,
Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a
percentile function. I'm trying to improve this job by moving it to run
with Spark SQL.
Any suggestions on how to use a percentile function in Spark?
Thanks,
--
Tzahi File
Data Engineer
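
A hedged sketch of the built-in options: Spark SQL ships percentile (exact,
memory-heavy) and percentile_approx, and the DataFrame API has
approxQuantile; table and column names below are placeholders.

from pyspark.sql import functions as F

df = spark.table("events")   # hypothetical table with a numeric "latency" column

# Approximate percentile via a SQL expression (single pass, bounded memory):
df.agg(F.expr("percentile_approx(latency, 0.95)").alias("p95")).show()

# Exact percentile (buffers all values per group in memory -- use with care):
df.agg(F.expr("percentile(latency, 0.95)").alias("p95_exact")).show()

# DataFrame API alternative; the last argument is the relative error.
p50, p95 = df.approxQuantile("latency", [0.5, 0.95], 0.01)
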
>> Take a look at this article
>>
>>
>>
>>
>> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html
>>
>>
>>
>> *From:* Tzahi File
>> *Sent:* Wednesday, August 28, 2019 5:18 AM
>> *To:* user
>> *Subject:* Caching t
Hi,
I'm looking for your advice on a question.
I have 2 different processes that read from the same raw data table (around
1.5 TB).
Is there a way to read this data once, cache it somehow, and use this
data in both processes?
Thanks
--
Tzahi File
Data Engineer
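
A hedged sketch: within a single Spark application the table can be read
once and persisted for both pipelines; a cache does not survive across two
separate applications, in which case the usual approach is to materialize an
intermediate table instead. Table names, transformations, and paths below
are placeholders.

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read once and persist; both pipelines reuse the cached data as long as they
# run inside this one application.
raw = spark.table("raw_data")                        # hypothetical 1.5 TB table
raw.persist(StorageLevel.MEMORY_AND_DISK)

result_a = raw.groupBy("country").count()                        # stand-in for process 1
result_b = raw.filter("revenue > 0").agg({"revenue": "sum"})     # stand-in for process 2
result_a.write.mode("overwrite").parquet("s3://bucket/out_a/")
result_b.write.mode("overwrite").parquet("s3://bucket/out_b/")

raw.unpersist()
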
Hi,
I encountered some issues running a Spark SQL query, and would be happy to
get some advice.
I'm trying to run a query on a very big data set (around 1.5 TB) and it
keeps failing on all of my tries. A template of the query is as below:
insert overwrite table partition(part)
select /*+ BROADCAST(c) */
Sengupta
wrote:
> Hi Tzahi,
>
> I think that SPARK automatically broadcasts with the latest versions, but
> you might have to check with your version. Did you try filtering first and
> then doing the LEFT JOIN?
>
> Regards,
> Gourav Sengupta
>
> On Sun, Jan 13, 201
a smaller data
> set by having joined on the device_id before as a subquery or separate
> query. And when you are writing the output of the JOIN between csv_file and
> raw_e to ORDER BY the output based on campaign_ID.
>
> Thanks and Regards,
> Gourav Sengupta
>
>
> On Thu,
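
A hedged sketch of the pattern suggested above: shrink the big side first,
broadcast the ~100-row side, then order the join output before writing.
The table, column, and path names (raw_e, csv_file, device_id, campaign_id)
are borrowed from the thread and may not match the real schema.

from pyspark.sql import functions as F

small = spark.table("csv_file")                        # ~100 records per the thread
big = (spark.table("raw_e")
            .where(F.col("day") == "2019-01-01")       # hypothetical pre-filter
            .select("device_id", "campaign_id", "revenue"))

joined = big.join(F.broadcast(small), "device_id", "left")

(joined.orderBy("campaign_id")
       .write.mode("overwrite")
       .parquet("s3://bucket/joined_output/"))         # hypothetical sink
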
Gourav Sengupta
>
> On Wed, Jan 9, 2019 at 1:53 AM 大啊 wrote:
>
>> What is your performance issue?
>>
>>
>>
>>
>>
>> At 2019-01-08 22:09:24, "Tzahi File" wrote:
>>
>> Hello,
>>
>> I have some performance issue ru
Hello,
I have a performance issue running a SQL query on Spark.
The query involves one Parquet table partitioned by date, where
each partition is about 200 GB, and a simple table with about 100 records. The
Spark cluster is of type m5.2xlarge - 8 cores. I'm using the Qubole interface
for