Convert timestamp to Unix milliseconds

2021-08-04 Thread Tzahi File
Hi All, I'm using Spark 2.4 and trying to convert a timestamp column to Unix time with milliseconds using the unix_timestamp function. I tried to convert the result to double: cast(unix_timestamp(timestamp) as double). I also tried using the timestamp format "yyyy-MM-dd HH:mm:ss.sss", and no matter
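
In Spark 2.4, unix_timestamp() truncates to whole seconds regardless of the format string, so a common workaround is to cast the timestamp itself to double (epoch seconds with a fractional part) and scale to milliseconds. A minimal sketch, assuming a DataFrame df with a hypothetical TimestampType column named ts:

    from pyspark.sql import functions as F

    # Casting a timestamp to double yields epoch seconds including the fraction;
    # multiply by 1000 and cast to long to get milliseconds.
    df_ms = df.withColumn(
        "ts_unix_ms",
        (F.col("ts").cast("double") * 1000).cast("long")
    )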

Re: Spark performance over S3

2021-04-07 Thread Tzahi File
> <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-performance.html> > for optimal use of S3A. > > Thanks, > Hariharan > > > > On Wed, Apr 7, 2021 at 12:15 AM Tzahi File wrote: > >> Hi All, >> >> We have
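
For reference, a sketch of a few S3A options the linked page discusses, set through Spark's "spark.hadoop." prefix. The values are illustrative assumptions, not recommendations:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("s3a-tuning-sketch")
        # More parallel connections to S3 for heavy read/write jobs.
        .config("spark.hadoop.fs.s3a.connection.maximum", "200")
        # "random" favors columnar formats such as Parquet; "sequential" favors full scans.
        .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
        # Larger multipart upload chunks (value in bytes, ~128 MB here).
        .config("spark.hadoop.fs.s3a.multipart.size", "134217728")
        .getOrCreate()
    )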

Spark performance over S3

2021-04-06 Thread Tzahi File
Hi All, We have a Spark cluster on AWS EC2 that has 60 x i3.4xlarge nodes. The Spark job running on that cluster reads from an S3 bucket and writes to that bucket. The bucket and the EC2 instances run in the same region. As part of our efforts to reduce the runtime of our Spark jobs we found there's serious

Spark Parquet file size

2020-11-10 Thread Tzahi File
Hi, We have many Spark jobs that create multiple small files. We would like to improve analyst reading performance, so I'm testing for the optimal Parquet file size. I've found that the optimal file size should be around 1 GB, and not less than 128 MB, depending on the size of the data. I took
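
One way to steer toward a target file size is to pick the number of output partitions from an estimate of the data volume. A minimal sketch, assuming df is the DataFrame being written and that its output size can be estimated up front (both are assumptions):

    import math

    target_file_bytes = 1024 ** 3                    # aim for ~1 GB files, per the thread
    estimated_output_bytes = 250 * 1024 ** 3         # assumption: ~250 GB of compressed output
    num_files = max(1, math.ceil(estimated_output_bytes / target_file_bytes))

    (
        df.repartition(num_files)
          .write.mode("overwrite")
          .parquet("s3://bucket/optimized/")         # hypothetical output path
    )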

Re: Merging Parquet Files

2020-08-31 Thread Tzahi File
al terabytes is bad. > > It depends on your use case but you might also look at partitions etc. > > On 31.08.2020 at 16:17, Tzahi File wrote: > > > Hi, > > I would like to develop a process that merges Parquet files. > My first intention was to develop it with PySpa

Adding Partitioned Field to the File

2020-08-31 Thread Tzahi File
Hi, I'm using PySpark to write a df to S3 in Parquet. I would like to add the partitioned columns to the data files as well. What is the best way to do this? e.g. df.write.partitionBy('day','hour') desired file outcome -> day,hour,time,name and not time,name Thanks! Tzahi
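
partitionBy() drops the partition columns from the data files and keeps them only in the directory names, so one workaround is to duplicate the partition columns and partition on the copies. A minimal sketch, assuming df has the columns day, hour, time, name:

    from pyspark.sql import functions as F

    # Partition on copies of day/hour so the originals stay inside the files;
    # the directory layout becomes day_part=.../hour_part=... instead.
    (
        df.withColumn("day_part", F.col("day"))
          .withColumn("hour_part", F.col("hour"))
          .write.partitionBy("day_part", "hour_part")
          .mode("overwrite")
          .parquet("s3://bucket/output/")            # hypothetical path
    )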

Merging Parquet Files

2020-08-31 Thread Tzahi File
Hi, I would like to develop a process that merges Parquet files. My first intention was to develop it with PySpark, using coalesce(1) to create only 1 file. This process is going to run on a huge number of files. I wanted your advice on what is the best way to implement it (PySpark isn't a
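
A minimal sketch of such a merge pass, assuming the small files sit under one input prefix and that a handful of larger output files per run is acceptable (both are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-merge-sketch").getOrCreate()

    df = spark.read.parquet("s3://bucket/small-files/")   # hypothetical input prefix
    (
        df.coalesce(8)       # fewer, larger files without a full shuffle; 1 forces a single file
          .write.mode("overwrite")
          .parquet("s3://bucket/merged/")                 # hypothetical output prefix
    )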

Re: Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
I don't want to query with a distinct on the partitioned columns; the df contains over 1 billion records. I just want to know the partitions that were created. On Thu, Jun 25, 2020 at 4:04 PM Jörn Franke wrote: > By doing a select on the df? > > On 25.06.2020 at 14:52, T

Getting PySpark Partitions Locations

2020-06-25 Thread Tzahi File
Hi, I'm using PySpark to write a df to S3, using the following command: "df.write.partitionBy("day","hour","country").mode("overwrite").parquet(s3_output)". Is there any way to get the partitions created? e.g. day=2020-06-20/hour=1/country=US day=2
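
One way to get the created partitions without scanning the data again is to walk the day=/hour=/country= directories through Hadoop's FileSystem API. A sketch under the assumption that listing the output path after the write is acceptable; spark and s3_output are the session and output path from the snippet above:

    hadoop_conf = spark._jsc.hadoopConfiguration()
    Path = spark._jvm.org.apache.hadoop.fs.Path

    root = Path(s3_output)
    fs = root.getFileSystem(hadoop_conf)

    partitions = []
    for day_dir in fs.listStatus(root):
        if not day_dir.isDirectory():          # skip _SUCCESS and other marker files
            continue
        for hour_dir in fs.listStatus(day_dir.getPath()):
            for country_dir in fs.listStatus(hour_dir.getPath()):
                partitions.append(country_dir.getPath().toString())

    # Each entry looks like .../day=2020-06-20/hour=1/country=US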

Issue with PySpark query

2020-06-10 Thread Tzahi File
Hi, This is a general question regarding moving a Spark SQL query to PySpark; if needed I will add some more from the error logs and query syntax. I'm trying to move a Spark SQL query to run through PySpark. The query syntax and Spark configuration are the same. For some reason the query failed to

Spark Adaptive configuration

2020-04-22 Thread Tzahi File
Hi, I saw that Spark has an option to adapt the join and shuffle configuration. For example: "spark.sql.adaptive.shuffle.targetPostShuffleInputSize". I wanted to know if you have any experience with such configuration and how it changed the performance. Another question is whether along Spark SQL
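
For reference, a sketch of how these adaptive settings are applied on a Spark 2.x session; the size value is an illustrative assumption, not tuning advice:

    # Assumes an existing SparkSession named spark.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Target size per post-shuffle partition, in bytes (~128 MB here).
    spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", str(128 * 1024 * 1024))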

Splitting resources in a Spark cluster

2019-12-29 Thread Tzahi File
Hi All, I'm using one Spark cluster that contains 50 nodes of type i3.4xlarge (16 vCores). I'm trying to run 4 Spark SQL queries simultaneously. The data is split into 10 even partitions and the 4 queries run on the same data, but on different partitions. I have tried to configure the cluster so
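
One common way to run several queries side by side in a single application, named here as an assumption since the thread does not say which route was taken, is to submit them from separate threads with the FAIR scheduler enabled:

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("concurrent-queries-sketch")
        .config("spark.scheduler.mode", "FAIR")   # share executors fairly across concurrent jobs
        .getOrCreate()
    )

    queries = {
        "q1": "SELECT ...",   # placeholders for the four real queries
        "q2": "SELECT ...",
        "q3": "SELECT ...",
        "q4": "SELECT ...",
    }

    def run(name, sql):
        # Hypothetical per-query output locations.
        spark.sql(sql).write.mode("overwrite").parquet(f"s3://bucket/out/{name}/")

    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = [pool.submit(run, name, sql) for name, sql in queries.items()]
        for f in futures:
            f.result()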

Re: Issue With mod function in Spark SQL

2019-12-17 Thread Tzahi File
No, there are 100M records, both even and odd. On Tue, Dec 17, 2019 at 8:13 PM Russell Spitzer wrote: > Is there a chance your data is all even or all odd? > > On Tue, Dec 17, 2019 at 11:01 AM Tzahi File > wrote: >> I have in my Spark SQL query a calculated fiel

Issue With mod function in Spark SQL

2019-12-17 Thread Tzahi File
I have in my Spark SQL query a calculated field that gets the value of field1 % 3. I'm using this field as a partition, so I expected to get 3 partitions in the mentioned case, and I do. The issue happened with even numbers (instead of 3 - 4, 2 ...). When I tried to use even numbers, for
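
A likely explanation worth checking: repartitioning on a column uses hash partitioning, so the distinct values of field1 % N do not map one-to-one onto partitions, and for some N several values can hash into the same partition while others stay empty. A minimal sketch to inspect the mapping (column names are assumptions):

    from pyspark.sql import functions as F

    # Bucket by field1 % 4 and see which Spark partition each bucket lands in.
    df2 = df.withColumn("bucket", F.col("field1") % 4).repartition(4, "bucket")

    (
        df2.withColumn("pid", F.spark_partition_id())
           .groupBy("pid", "bucket")
           .count()
           .orderBy("pid", "bucket")
           .show()
    )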

Re: Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
on the data frame. Are you currently using a user-defined function >> for this task? Because I bet that's what's slowing you down. >> >> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File >> wrote: >> >>> Hi, >>> >>> Currently, I'm using hive

Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Hi, Currently, I'm using a huge Hive cluster (m5.24xlarge x 40 workers) to run a percentile function. I'm trying to improve this job by moving it to run with Spark SQL. Any suggestions on how to use a percentile function in Spark? Thanks, -- Tzahi File Data Engineer
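
Spark SQL ships both an exact percentile and percentile_approx; the approximate version is usually the cheaper choice on large data. A minimal sketch (table and column names are assumptions):

    # 50th/90th/99th percentiles of a numeric column, grouped by a key.
    pcts = spark.sql("""
        SELECT country,
               percentile_approx(revenue, array(0.5, 0.9, 0.99)) AS pcts
        FROM events
        GROUP BY country
    """)
    pcts.show()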

Re: Caching tables in Spark

2019-08-28 Thread Tzahi File
> Take a look at this article >> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-caching.html >> *From:* Tzahi File >> *Sent:* Wednesday, August 28, 2019 5:18 AM >> *To:* user >> *Subject:* Caching t

Caching tables in Spark

2019-08-28 Thread Tzahi File
Hi, I'm looking for your advice on a question. I have 2 different processes that read from the same raw data table (around 1.5 TB). Is there a way to read this data once, cache it somehow, and use this data in both processes? Thanks -- Tzahi File Data Engineer
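
If the two consumers can run inside one Spark application, caching works; an in-memory cache cannot be shared across separate applications, where writing an intermediate table is the usual fallback. A minimal sketch of the single-application case (the table name is an assumption):

    from pyspark import StorageLevel

    raw = spark.table("raw_events")
    # MEMORY_AND_DISK, since 1.5 TB is unlikely to fit in memory alone.
    raw.persist(StorageLevel.MEMORY_AND_DISK)
    raw.count()                                   # materialize the cache once

    result_a = raw.groupBy("day").count()         # first process's work (placeholder)
    result_b = raw.filter("country = 'US'")       # second process's work (placeholder)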

[Spark SQL] failure in query

2019-08-25 Thread Tzahi File
Hi, I encountered an issue running a Spark SQL query and would be happy to get some advice. I'm trying to run a query on a very big data set (around 1.5 TB) and it fails on all of my tries. A template of the query is as below: insert overwrite table partition(part) select /*+ BROADCAST(c) */
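
For readers unfamiliar with the shape of such a query, a hypothetical skeleton in the spirit of the template above; every table, column, and partition name here is an assumption, not the thread's real query:

    # Dynamic partition inserts into a Hive table often need nonstrict mode.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    spark.sql("""
        INSERT OVERWRITE TABLE target_table PARTITION (part)
        SELECT /*+ BROADCAST(c) */
               e.event_time, e.campaign_id, c.campaign_name, e.part
        FROM big_events e
        LEFT JOIN small_campaigns c
          ON e.campaign_id = c.campaign_id
    """)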

Re: Performance Issue

2019-01-13 Thread Tzahi File
Sengupta wrote: > Hi Tzahi, > > I think that SPARK automatically broadcasts with the latest versions, but > you might have to check with your version. Did you try filtering first and > then doing the LEFT JOIN? > > Regards, > Gourav Sengupta > > On Sun, Jan 13, 201

Re: Performance Issue

2019-01-13 Thread Tzahi File
a smaller data > set by having joined on the device_id before as a subquery or separate > query. And when you are writing the output of the JOIN between csv_file and > raw_e to ORDER BY the output based on campaign_ID. > > Thanks and Regards, > Gourav Sengupta > > > On Thu,

Re: Performance Issue

2019-01-10 Thread Tzahi File
av Sengupta > > On Wed, Jan 9, 2019 at 1:53 AM 大啊 wrote: > >> What is your performance issue? >> >> >> >> >> >> At 2019-01-08 22:09:24, "Tzahi File" wrote: >> >> Hello, >> >> I have some performance issue ru

Performance Issue

2019-01-08 Thread Tzahi File
Hello, I have a performance issue running a SQL query on Spark. The query contains one partitioned Parquet table (partitioned by date), where each partition is about 200 GB, and a simple table with about 100 records. The Spark cluster is of type m5.2xlarge - 8 cores. I'm using the Qubole interface for
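
With a ~100-row table on one side, explicitly broadcasting it is the usual first thing to try, so the join with the large partitioned table avoids shuffling the big side. A minimal sketch (table, column, and key names are assumptions):

    from pyspark.sql import functions as F

    # Prune the date partition first, then broadcast the tiny table.
    big = spark.table("big_partitioned_table").where(F.col("date") == "2019-01-08")
    small = spark.table("small_table")

    joined = big.join(F.broadcast(small), on="some_key", how="left")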