Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Thanks, you meant in a for loop? Could you please put pseudocode in Spark?
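
A minimal sketch of what that conversion might look like, assuming the file is a concatenation of top-level JSON objects (not one giant array). It runs outside Spark on a single machine and uses Jackson's streaming MappingIterator so only one object is in memory at a time; the file names are hypothetical:

```scala
import java.io.{File, PrintWriter}
import com.fasterxml.jackson.databind.{JsonNode, MappingIterator, ObjectMapper}

object ToJsonLines {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    // Stream root-level JSON objects one at a time instead of loading the whole file.
    val it: MappingIterator[JsonNode] =
      mapper.readerFor(classOf[JsonNode]).readValues(new File("transactions.json"))
    val out = new PrintWriter(new File("transactions.jsonl"))
    try {
      while (it.hasNext) {
        // writeValueAsString emits compact, single-line JSON for each object.
        out.println(mapper.writeValueAsString(it.next()))
      }
    } finally {
      out.close()
      it.close()
    }
  }
}
```

The resulting JSON Lines file can then be read in parallel with spark.read.json("transactions.jsonl"), since Spark treats each line as an independent record.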

Re: Reading TB of JSON file

2020-06-19 Thread Jörn Franke
Make every JSON object a line and then read it as JSON Lines, not as multiline.

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
All transactions are in JSON; it is not a single array.

Re: Reading TB of JSON file

2020-06-19 Thread Chetan Khatri
Yes.

Re: Reading TB of JSON file

2020-06-18 Thread Stephan Wehner
It's an interesting problem. What is the structure of the file? One big array? One hash with many key-value pairs? Stephan

Re: Reading TB of JSON file

2020-06-18 Thread Gourav Sengupta
Hi, So you have a single JSON record spread over multiple lines? And all 50 GB is in one file? Regards, Gourav

Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
It is dynamically generated and written to an S3 bucket, not historical data, so I guess it doesn't have the JSON Lines format.

Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
The file is available in an S3 bucket.

Re: Reading TB of JSON file

2020-06-18 Thread nihed mbarek
Hi, What is the size of one JSON document? There is also the scan of your JSON to infer the schema; that overhead can be huge. Two solutions: define a schema and use it directly during the load, or ask Spark to analyse only a small part of the JSON file (I don't remember how to do it). Regards,
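
A hedged sketch of both options; the field names and S3 path are hypothetical. The second option uses the JSON source's samplingRatio option, which tells Spark to infer the schema from only a fraction of the records:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("read-json").getOrCreate()

// Option 1: supply the schema up front so Spark skips the inference scan entirely.
val txnSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("amount", DoubleType),
  StructField("ts", TimestampType)
))
val withSchema = spark.read.schema(txnSchema).json("s3a://my-bucket/transactions/")

// Option 2: infer the schema from roughly 1% of the input records.
val sampled = spark.read.option("samplingRatio", "0.01").json("s3a://my-bucket/transactions/")
```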

Re: Reading TB of JSON file

2020-06-18 Thread Jörn Franke
It depends on the data types you use. Do you have it in the JSON Lines format? Then the amount of memory plays much less of a role. Otherwise, if it is one large object or array, I would not recommend it.
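
For illustration, the difference between the two read modes (paths are hypothetical):

```scala
// JSON Lines: one object per line, so the input can be split and parsed in
// parallel across executors without any single machine holding the whole file.
val jsonLines = spark.read.json("s3a://my-bucket/transactions.jsonl")

// One large multi-line object or array: Spark has to parse each file as a
// single record, which is what makes memory a problem for very large files.
val multiLine = spark.read.option("multiLine", "true").json("s3a://my-bucket/transactions.json")
```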

Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
Assuming that the file can be easily split, I would divide it into a number of pieces and move those pieces to HDFS before using Spark at all, using `hdfs dfs` or similar. At that point you can use your executors to perform the reading instead of the driver.
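
A minimal sketch of the read once the pieces are on HDFS, assuming they have been copied with something like `hdfs dfs -put` into a hypothetical directory and each piece is independently parseable:

```scala
// The executors read the pieces in parallel; the driver only tracks metadata.
val df = spark.read.json("hdfs:///data/transactions_parts/*")

// Persist as Parquet so downstream transformations don't re-parse the JSON.
df.write.parquet("hdfs:///data/transactions_parquet/")
```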

Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
Hi Spark Users, I have a 50 GB JSON file that I would like to read and persist to HDFS so it can be taken into the next transformation. I am trying to read it as spark.read.json(path), but this gives an out-of-memory error on the driver. Obviously, I can't afford to have 50 GB of driver memory. In general,