Re: [pyspark 2.3+] how to dynamically determine DataFrame partitions while writing

2019-05-22 Thread Rishi Shah
Hi All, any idea about this? Thanks, Rishi

On Tue, May 21, 2019 at 11:29 PM Rishi Shah wrote:
> Hi All,
>
> What is the best way to determine the partitions of a dataframe dynamically
> before writing to disk?
>
> 1) statically determine based on data and use coalesce or repartition
> while
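One way to approach option 1 is to size the write from a rough estimate of the data's footprint. Below is a minimal sketch of that idea; the input/output paths, the 128 MB-per-file target, and the assumed average row size are all illustrative placeholders, not something prescribed in this thread.

```python
# A minimal sketch of sizing output partitions dynamically before a write.
# All paths and constants here are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/input")  # hypothetical input path

target_bytes = 128 * 1024 * 1024  # assumed target size per output file
est_row_bytes = 512               # assumed average on-disk row size; tune per dataset

rows = df.count()  # note: this triggers a full scan of the input
num_partitions = max(1, (rows * est_row_bytes) // target_bytes)

df.repartition(int(num_partitions)).write.mode("overwrite").parquet("/data/output")
```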

Re: [pyspark 2.3+] repartition followed by window function

2019-05-22 Thread Shraddha Shah
Any suggestions?

On Wed, May 22, 2019 at 6:32 AM Rishi Shah wrote:
> Hi All,
>
> If a dataframe is repartitioned in memory by the (date, id) columns, and I
> then use multiple window functions whose partition by clause uses the same
> (date, id) columns --> we can avoid the shuffle/sort again, I believe..
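One way to check this empirically (a sketch of my own, not from the thread; the path and the ts column are hypothetical) is to read the physical plan: if no extra Exchange appears between the repartition and the Window, the shuffle was indeed reused.

```python
# Sketch: verify via the physical plan whether the window reuses the
# in-memory repartitioning. Input path and column names are hypothetical.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")

w = Window.partitionBy("date", "id").orderBy("ts")
out = df.repartition("date", "id").withColumn("rn", F.row_number().over(w))

# Look for Exchange operators above the Window node; none should appear
# for the window itself if its partitioning matches the repartition.
out.explain()
```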

Should similar narrow transformations be chained?

2019-05-22 Thread nicks29
I have a query regarding narrow transformations: is it OK to write ad-hoc logic in a single transformation, or should we split it into multiple transformations? For example, if I would like to explode one of the lists, then use the values of that list to explode again, and then use its values again
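As a side note, here is a minimal sketch (with a made-up schema) of the chained-explode case. Since explode involves no shuffle, Spark pipelines the chained calls into a single stage either way, so splitting them into separate transformations is mostly a readability choice.

```python
# Sketch: chaining two explodes over a nested list; the schema is made up.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [[10, 11], [20]])],
    "id INT, groups ARRAY<ARRAY<INT>>",
)

# Both explodes run in the same stage; no shuffle happens between them.
out = (
    df.withColumn("group", F.explode("groups"))  # explode the outer list
      .withColumn("value", F.explode("group"))   # explode each inner list
)
out.show()
```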

Specifying YARN queue w/ Livy Controller Service

2019-05-22 Thread Varun Rao
Hello, in NiFi, the ExecuteSparkInteractive processor needs a LivySessionController in order to run. When we start a Livy controller service, we do not see where to specify the YARN queue. Is there a way to do this? Thanks
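For reference, Livy itself accepts a YARN queue when a session is created, both as a top-level "queue" field and as "spark.yarn.queue" in the "conf" map of POST /sessions. Whether NiFi's LivySessionController exposes either of these is exactly the open question here; the endpoint and queue name in this sketch are hypothetical.

```python
# Sketch: creating a Livy session on a specific YARN queue via the REST API
# directly (outside NiFi). URL and queue name are hypothetical.
import json
import requests

payload = {
    "kind": "pyspark",
    "queue": "analytics",                       # Livy's top-level queue field
    "conf": {"spark.yarn.queue": "analytics"},  # equivalent Spark setting
}
resp = requests.post(
    "http://livy-host:8998/sessions",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```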

Java heap error

2019-05-22 Thread Kumar sp
Hi, I am getting java.lang.OutOfMemoryError: Java heap space. I have increased my driver memory and executor memory, but I am still facing this issue. I am using r4 instances for the driver and core nodes (16). How can we see which step caused it, or whether it is related to GC? Can we pinpoint it to a single point in the code
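One common way to narrow this down (a sketch under assumed memory sizes, not a fix specific to this job) is to turn on GC logging and then correlate the per-task GC time shown in the Spark UI's Stages tab with the failing stage.

```python
# Sketch: enable executor GC logging so heap pressure can be matched to a
# stage. Memory sizes are placeholders; note that driver memory only takes
# effect if set before the driver JVM starts (e.g. via spark-submit).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "16g")
    .config("spark.driver.memory", "16g")
    .config(
        "spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
    )
    .getOrCreate()
)
# The Stages tab of the Spark UI then shows GC time per task, which points
# to the offending stage rather than a single line of code.
```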

Re: Structured Streaming Error

2019-05-22 Thread KhajaAsmath Mohammed
I was able to resolve the error. Initially I was giving a custom name for subscribe but the actual topic name in the topic option; giving the same name in both places worked. I am confused now about the difference between the topic and subscribe values here. Do you have any suggestions?
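For context, a sketch of the options in question (broker and topic names are made up): for the Kafka source, "subscribe" is where the topic name goes, and any topic names embedded in other options, such as the startingOffsets JSON, must match it exactly, which is presumably why mismatched names failed.

```python
# Sketch: Kafka source options; "subscribe" carries the topic name, and any
# topic referenced elsewhere must use the same name. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")  # must be the actual Kafka topic name
    .load()
)
```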

Re: [spark on yarn] spark on yarn without DFS

2019-05-22 Thread Gourav Sengupta
Just wondering, what is the advantage of doing this? Regards, Gourav Sengupta

On Wed, May 22, 2019 at 3:01 AM Huizhe Wang wrote:
> Hi Hari,
> Thanks :) I tried to do it as you said. It works ;)
>
> Hariharan wrote on Mon, May 20, 2019 at 3:54 PM:
>
>> Hi Huizhe,
>>
>> You can set the "fs.defaultFS" field in
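For readers following along, the suggestion amounts to pointing the default filesystem somewhere other than HDFS. A rough sketch below, assuming an S3 bucket via s3a; in a real YARN cluster this usually also involves cluster-side core-site.xml changes and a staging directory Spark can write to.

```python
# Sketch: run against S3 as the default filesystem instead of HDFS.
# The bucket name is hypothetical; cluster-side config is also needed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.defaultFS", "s3a://my-bucket")
    .getOrCreate()
)
spark.range(10).write.mode("overwrite").parquet("s3a://my-bucket/tmp/check")
```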

[Spark K8] Kube2Iam Annotation Support

2019-05-22 Thread Chandu Kavar
Hi, we have started using Spark on Kubernetes, and in most cases all our jobs use AWS S3 to read/write data. We are setting up the AWS key and secret using these properties: spark.kubernetes.driver.secretKeyRef.[EnvName] and spark.kubernetes.executor.secretKeyRef.[EnvName]. But we already have
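If the goal is to let kube2iam supply credentials instead of static keys, one hedged sketch is to attach the role annotation to driver and executor pods through Spark's pod-annotation settings; the role ARN below is hypothetical.

```python
# Sketch: annotate driver/executor pods for kube2iam via Spark's
# spark.kubernetes.*.annotation.* confs. The role ARN is hypothetical.
from pyspark.sql import SparkSession

role = "arn:aws:iam::123456789012:role/spark-s3-access"
spark = (
    SparkSession.builder
    .config("spark.kubernetes.driver.annotation.iam.amazonaws.com/role", role)
    .config("spark.kubernetes.executor.annotation.iam.amazonaws.com/role", role)
    .getOrCreate()
)
```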

Re: Structured Streaming Error

2019-05-22 Thread Gabor Somogyi
Have you tried what the exception suggests? If startingOffsets contains specific offsets, you must specify all TopicPartitions. BR, G

On Tue, May 21, 2019 at 9:16 PM KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:
> Hi,
>
> I am getting the below error when running a sample streaming app.
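Concretely, what the exception asks for looks like this (topic name and partition count are made up): when startingOffsets is a JSON map, every partition of every subscribed topic must appear, with -2 meaning earliest and -1 meaning latest.

```python
# Sketch: startingOffsets as JSON must list every partition of the topic.
# -2 = earliest, -1 = latest. Topic/partition layout is hypothetical.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

starting = json.dumps({"events": {"0": -2, "1": -2, "2": -2}})

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .option("startingOffsets", starting)
    .load()
)
```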

[pyspark 2.3+] repartition followed by window function

2019-05-22 Thread Rishi Shah
Hi All, if a dataframe is repartitioned in memory by the (date, id) columns, and I then use multiple window functions whose partition by clause uses the same (date, id) columns --> we can avoid the shuffle/sort again, I believe.. Can someone confirm this? However, what happens when the dataframe repartition was
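To make the two cases concrete, here is a sketch (the path and the ts column are hypothetical) contrasting a window whose partitioning matches the repartition with one that does not; the mismatched case forces Spark to shuffle again, which shows up as a new Exchange in the plan.

```python
# Sketch: matching vs. mismatched window partitioning after a repartition.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
part = spark.read.parquet("/data/events").repartition("date", "id")

# Same columns as the repartition -> no additional Exchange expected.
w_match = Window.partitionBy("date", "id").orderBy("ts")
part.withColumn("rn", F.row_number().over(w_match)).explain()

# Partitions only by "date" -> the plan shows a fresh Exchange, since
# hash-partitioning on (date, id) does not cluster rows by date alone.
w_mismatch = Window.partitionBy("date").orderBy("ts")
part.withColumn("rn", F.row_number().over(w_mismatch)).explain()
```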

How does the number of partitions in a DataFrame get decided while reading from Hive?

2019-05-22 Thread Shivam Sharma
Hi all, I just need to know how Spark decides how many partitions should be created while reading a table from Hive. Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Email: 28shivamsha...@gmail.com
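For tables that Spark reads through its native file source, the split-sizing configs below usually determine the count; Hive SerDe tables instead follow Hadoop input splits. A sketch with an assumed table name:

```python
# Sketch: the configs that typically drive partition counts for file-based
# Hive tables, plus a direct way to inspect the result. Table name is made up.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .enableHiveSupport()
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    .config("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))
    .getOrCreate()
)

df = spark.table("db.some_table")
print(df.rdd.getNumPartitions())
```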