Kafka Zeppelin integration

2020-06-19 Thread silavala
Hi, here is my question: Spark code run on Zeppelin is unable to find the Kafka source even though a dependency is specified. Is there any way to fix this? The Zeppelin version is 0.9.0, the Spark version is 2.4.6, and the Kafka version is 2.4.1. I have specified the dependency in the packages and add a
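
For Spark 2.4.6 the Kafka source ships as a separate package. A minimal sketch of one way to wire it into Zeppelin via a %spark.conf paragraph (the coordinates assume the default Scala 2.11 build; the broker address and topic name are placeholders, not from the original thread):

    %spark.conf
    spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.6

    %spark.pyspark
    # Placeholder broker and topic. The %spark.conf paragraph must run
    # before the Spark interpreter starts; otherwise restart the interpreter.
    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

Note that spark.jars.packages only takes effect if it is set before the interpreter process launches, which is a common reason the dependency appears to be ignored.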

Re: [pyspark 2.3+] read/write huge data with smaller block size (128MB per block)

2020-06-19 Thread Rishi Shah
Thanks Sean! To combat the skew I do have another column I partition by, and that has worked well (like below). However, in the image I attached in my original email it looks like 2 tasks processed nothing; am I reading the Spark UI task table right? All 4 dates have data - 2 dates have ~200MB & other
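
For reference, a hypothetical sketch of the partition-by-a-second-column approach described above (date_col and salt_col are placeholder column names, not from the original thread):

    # Repartition by the date column plus a higher-cardinality column so that
    # skewed dates are spread across multiple tasks before writing.
    (df.repartition("date_col", "salt_col")
       .write
       .partitionBy("date_col")
       .mode("overwrite")
       .parquet("s3://bucket/output"))   # placeholder path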

Re: [pyspark 2.3+] read/write huge data with smaller block size (128MB per block)

2020-06-19 Thread Sean Owen
Yes you'll generally get 1 partition per block, and 1 task per partition. The amount of RAM isn't directly relevant; it's not loaded into memory. But you may nevertheless get some improvement with larger partitions / tasks, though typically only if your tasks are very small and very fast right now
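One knob that controls this for file-based sources is spark.sql.files.maxPartitionBytes (default 128 MB). A small sketch, with an illustrative value and a placeholder path:

    # Ask Spark to pack up to 512 MB of input into each partition/task.
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))
    df = spark.read.parquet("s3://bucket/input")   # placeholder path
    print(df.rdd.getNumPartitions())               # fewer, larger partitions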

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Thanks, you meant in a for loop? Could you please put pseudocode in Spark. On Fri, Jun 19, 2020 at 8:39 AM Jörn Franke wrote: > Make every JSON object a line and then read it as JSON Lines, not as multiline > > On 19.06.2020 at 14:37, Chetan Khatri wrote: > >  > All transactions are in JSON; it is
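
A minimal PySpark sketch of the suggestion (the path is a placeholder): once every JSON object sits on its own line, Spark can split the input and parse it in parallel, because multiLine defaults to false:

    # JSON Lines input: one self-contained JSON object per line.
    df = spark.read.json("s3://bucket/transactions/")   # placeholder path

    # By contrast, multiLine=True forces whole-file parsing, which does not
    # scale to TB-sized input:
    # df = spark.read.option("multiLine", True).json("s3://bucket/transactions/")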

Re: Reading TB of JSON file

2020-06-19 Thread Jörn Franke
Make every JSON object a line and then read it as JSON Lines, not as multiline. > On 19.06.2020 at 14:37, Chetan Khatri wrote: > >  > All transactions are in JSON; it is not a single array. > >> On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner >> wrote: >> It's an interesting problem. What is the

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
All transactions are in JSON; it is not a single array. On Thu, Jun 18, 2020 at 12:55 PM Stephan Wehner wrote: > It's an interesting problem. What is the structure of the file? One big > array? One hash with many key-value pairs? > > Stephan > > On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri >

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Yes. On Thu, Jun 18, 2020 at 12:34 PM Gourav Sengupta wrote: > Hi, > So you have a single JSON record spanning multiple lines? > And all the 50 GB is in one file? > > Regards, > Gourav > > On Thu, 18 Jun 2020, 14:34 Chetan Khatri, > wrote: > >> It is dynamically generated and written to an s3 bucket, not

Re: Hey good looking toPandas ()

2020-06-19 Thread Anwar AliKhan
I got an illegal argument error with 2.4.6. I then pointed my Jupyter notebook to the 3.0 version and it worked as expected, using the same .ipynb file. I was following this machine learning example: “Your First Apache Spark ML Model” by Favio Vázquez

Re: Hey good looking toPandas ()

2020-06-19 Thread Stephen Boesch
Afaik it has been there since Spark 2.0 in 2016. Not certain about Spark 1.5/1.6. On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan wrote: > I first ran the command > df.show() > > as a sanity check of my DataFrame. > > I wasn't impressed with the display. > > I then ran > df.toPandas() in Jupyter

Hey good looking toPandas ()

2020-06-19 Thread Anwar AliKhan
I first ran the command df.show() as a sanity check of my DataFrame. I wasn't impressed with the display. I then ran df.toPandas() in Jupyter Notebook. Now the display is really good looking. Is toPandas() a new function that became available in Spark 3.0?
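
For context, the comparison looks roughly like this (a sketch; note that toPandas() collects the entire DataFrame to the driver, so it only suits data that fits in driver memory):

    df.show()      # plain fixed-width text rendering
    df.toPandas()  # pandas DataFrame; Jupyter renders it as a rich HTML table

As the reply above notes, toPandas() long predates Spark 3.0; the nicer display comes from Jupyter's rich rendering of pandas objects, not from a new Spark function.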