Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
icies overlap, the shorter expiration policy is honored so that data is not stored for longer than expected. Likewise, if two transition policies overlap, S3 Lifecycle transitions your objects to the lower-cost storage class." On Thu, Apr 13, 2023, 12:29 "Yuri Oleynikov (‫יורי אולי

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention will solve the issue. Best regards > On 13 Apr 2023, at 11:52, Yuval Itzchakov wrote: > >  > Hi everyone, > > I am using Spark's FileStreamSink in order to write files to S3. On the S3 > bucket, I
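As a sketch, the pair of overlapping lifecycle rules discussed in this thread might look like the following configuration (the prefixes and retention periods here are made up for illustration):

```json
{
  "Rules": [
    {
      "ID": "expire-data-files",
      "Filter": { "Prefix": "output/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    },
    {
      "ID": "keep-spark-metadata-longer",
      "Filter": { "Prefix": "output/_spark_metadata/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Note that, per the AWS documentation quoted earlier in this thread, when two expiration policies overlap the shorter one is honored, so the longer _spark_metadata rule would not necessarily override the bucket-wide rule.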

Re: Data ingestion

2022-08-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
If you are on AWS, you can use RDS + AWS DMS to save data to S3 and then read the streaming data with Spark Structured Streaming from S3 into Hive. Best regards > On 17 Aug 2022, at 20:51, Akash Vellukai wrote: > >  > Dear Sir, > > > How could we do data ingestion from MySQL to Hive with the

Re: When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Hi Sean, Persisting/caching is useful when you're going to reuse a DataFrame, so in your case no persisting/caching is required. That covers the "when". The "where" is usually the closest point before the calculations/transformations are reused. Btw, I'm not sure caching is useful when you
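The principle in this reply (materialize once at the closest point before reuse, so later actions don't recompute the same lineage) can be sketched outside Spark with plain-Python memoization; the function below is hypothetical and merely stands in for an expensive DataFrame transformation chain:

```python
from functools import lru_cache

# Hypothetical expensive transformation; a stand-in for a chain of
# DataFrame transformations that would otherwise be recomputed per action.
@lru_cache(maxsize=None)
def expensive_transform(x: int) -> int:
    # Imagine a costly shuffle/aggregation here.
    return x * x

# "Cache" at the closest point before reuse: the first call computes,
# the second call (a second "action") reuses the materialized result.
first = expensive_transform(10)
second = expensive_transform(10)  # served from the cache, not recomputed
```

In Spark the equivalent step would be a `.cache()`/`.persist()` call on the reused DataFrame just before the first of several actions that consume it.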

Unsubscribe

2021-09-08 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Unsubscribe

2021-09-03 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Unsubscribe

Unsubscribe

2021-09-03 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Benchmarks on Spark running on Yarn versus Spark on K8s

2021-07-05 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Not a big expert on Spark, but I don't really understand how you are going to compare, and what? Reading/writing to and from HDFS? How is that related to YARN and K8s? These are resource managers (YARN: Yet Another Resource Negotiator): what and how much to allocate, and when (CPU, RAM). Local disk

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
ss, > damage or destruction of data or any other property which may arise from > relying on this email's technical content is explicitly disclaimed. The > author will in no case be liable for any monetary damages arising from such > loss, damage or destruction. > > > &g

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
You can do the enrichment with a stream (events) to static (device table) join. When the device table is a slowly changing dimension (say, it changes once a day) and it's in Delta format, then for every micro-batch of the stream-static join the device table will be rescanned and up-to-date device data will

Re: Is it enable to use Multiple UGIs in One Spark Context?

2021-03-25 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Assuming that all tables have the same schema, you can make one global table partitioned by some column, then apply specific UGO permissions/ACLs per partition subdirectory. > On 25 Mar 2021, at 15:13, Kwangsun Noh wrote: > >  > Hi, Spark users. > > Currently I have to make multiple tables
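A minimal sketch of the layout this reply suggests, using POSIX permission bits on local directories as a stand-in for real HDFS ACLs (the table name, partition column, and modes are made up):

```python
import os
import stat
import tempfile

# Hypothetical: one global table partitioned by a "team" column, with
# different permissions applied per partition subdirectory.
table = tempfile.mkdtemp(prefix="global_table_")
for partition, mode in [("team=alpha", 0o770), ("team=beta", 0o750)]:
    path = os.path.join(table, partition)
    os.mkdir(path)
    os.chmod(path, mode)  # restrict each partition to its own group

alpha_mode = stat.S_IMODE(os.stat(os.path.join(table, "team=alpha")).st_mode)
beta_mode = stat.S_IMODE(os.stat(os.path.join(table, "team=beta")).st_mode)
```

On HDFS the same idea would use `hdfs dfs -setfacl` on each partition directory rather than `chmod`.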

Re: Rdd - zip with index

2021-03-23 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
repartition with no luck... > On 24 Mar 2021, at 03:47, KhajaAsmath Mohammed > wrote: > > So spark by default doesn’t split the large 10gb file when loaded? > > Sent from my iPhone > >> On Mar 23, 2021, at 8:44 PM, Yuri Oleynikov (‫יורי אולייניקוב‬‎) >

Re: Rdd - zip with index

2021-03-23 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Hi Mohammed, I think the reason that only one executor is running with a single partition is that you have a single file that might be read/loaded into memory. To achieve better parallelism I'd suggest splitting the CSV file. Another question: why are you using RDDs?
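The pre-splitting idea from this reply can be sketched in plain Python (the helper below is hypothetical; in practice each chunk would be written out as its own file so separate tasks can read them in parallel):

```python
import csv
import io

# Hypothetical helper: split one CSV into N roughly equal chunks,
# repeating the header in each chunk so every piece is independently readable.
def split_csv(text: str, num_chunks: int) -> list[str]:
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    size = -(-len(body) // num_chunks)  # ceiling division
    chunks = []
    for i in range(0, len(body), size):
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerows([header] + body[i:i + size])
        chunks.append(out.getvalue())
    return chunks

chunks = split_csv("id,val\n1,a\n2,b\n3,c\n4,d\n", 2)
```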

Re: configuring .sparkStaging with group rwx

2021-02-25 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
spark-submit --conf spark.hadoop.fs.permissions.umask-mode=007. You may also set the sticky bit on the staging dir. Sent from my iPhone > On 26 Feb 2021, at 03:29, Bulldog20630405 wrote: > >  > > we have a spark cluster running on with multiple users... > when running with the user owning the cluster
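A umask of 007 clears only the "other" permission bits, leaving full access for the owning user and group; the arithmetic can be checked directly:

```python
# umask 007 clears only the "other" bits: a default directory mode of
# 0o777 becomes 0o770 (rwxrwx---), and a default file mode of 0o666
# becomes 0o660 (rw-rw----).
umask = 0o007
dir_mode = 0o777 & ~umask
file_mode = 0o666 & ~umask
```

That is why this setting gives the group read/write access to .sparkStaging while still keeping other users out.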

Re: Dynamic Spark metrics creation

2021-01-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
cek Laskowski > > https://about.me/JacekLaskowski > "The Internals Of" Online Books > Follow me on https://twitter.com/jaceklaskowski > > > > ‪On Sat, Jan 16, 2021 at 2:21 PM ‫Yuri Oleynikov (יורי אולייניקוב‬‎ > wrote:‬ >> Hi a

Re: Caching

2020-12-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
You are using the same CSV twice? Sent from my iPhone > On 7 Dec 2020, at 18:32, Amit Sharma wrote: > >  > Hi All, I am using caching in my code. I have a DF like > val DF1 = read csv. > val DF2 = DF1.groupBy().agg().select(.) > > val DF3 = read csv .join(DF1).join(DF2) > DF3 .save.

Re: Spark Structured streaming - Kakfa - slowness with query 0

2020-10-21 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
I think maxOffsetsPerTrigger in the Spark + Kafka integration docs would meet your requirement. Sent from my iPhone > On 21 Oct 2020, at 12:36, KhajaAsmath Mohammed > wrote: > > Thanks. Do we have an option to limit the number of records? Like process only > 1 or the property we pass? This
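For reference, that option is set on the Kafka source. A configuration sketch, not runnable on its own: the broker address, topic name, and limit are hypothetical, and an existing `spark` session is assumed:

```python
# Cap how many Kafka offsets each micro-batch reads.
stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .option("maxOffsetsPerTrigger", "10000")           # at most 10k offsets per batch
    .load())
```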

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
It seems that the thread has turned into a holy war that has nothing to do with the original question. If it has, that's super disappointing. Sent from my iPhone > On 17 Oct 2020, at 15:53, Molotch wrote: > > I would say the pros and cons of Python vs Scala are both down to Spark, the > languages in

Re: Hive on Spark in Kubernetes.

2020-10-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Thank you very much! Sent from my iPhone > On 7 Oct 2020, at 17:38, mykidong wrote: > > Hi all, > > I have recently written a blog about Hive on Spark in a Kubernetes > environment: > - https://itnext.io/hive-on-spark-in-kubernetes-115c8e9fa5c1 > > In this blog, you can find how to run

Re: Arbitrary stateful aggregation: updating state without setting timeout

2020-10-06 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
t; set the timeout duration every time the function is called, otherwise there >> will not be any timeout set. > > Simply saying, you'd want to always set timeout unless you remove state for > the group (key). > > Hope this helps. > > Thanks, > Jungtaek Lim (HeartSaVioR)