Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Takeshi Yamamuro
Congrats, all! Bests, Takeshi On Fri, Jun 19, 2020 at 1:16 PM Felix Cheung wrote: > Congrats > > -- > *From:* Jungtaek Lim > *Sent:* Thursday, June 18, 2020 8:18:54 PM > *To:* Hyukjin Kwon > *Cc:* Mridul Muralidharan ; Reynold Xin < > r...@databricks.com>; dev ;

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Felix Cheung
Congrats From: Jungtaek Lim Sent: Thursday, June 18, 2020 8:18:54 PM To: Hyukjin Kwon Cc: Mridul Muralidharan ; Reynold Xin ; dev ; user Subject: Re: [ANNOUNCE] Apache Spark 3.0.0 Great, thanks all for your efforts on the huge step forward! On Fri, Jun 19,

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Jungtaek Lim
Great, thanks all for your efforts on the huge step forward! On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon wrote: > Yay! > > On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan wrote: > >> Great job everyone ! Congratulations :-) >> >> Regards, >> Mridul >> >> On Thu, Jun 18, 2020 at 10:21 AM Reynold

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Hyukjin Kwon
Yay! On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan wrote: > Great job everyone ! Congratulations :-) > > Regards, > Mridul > > On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin wrote: > >> Hi all, >> >> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on >> many of the innovations

Re: java.lang.ClassNotFoundException for s3a committer

2020-06-18 Thread Stephen Coy
Hi Murat Migdisoglu, Unfortunately you need the secret sauce to resolve this. It is necessary to check out the Apache Spark source code and build it with the right command line options. This is what I have been using: dev/make-distribution.sh --name my-spark --tgz -Pyarn -Phadoop-3.2 -Pyarn
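
For reference, a minimal sketch of the session configuration that usually goes with such a build, assuming the resulting distribution includes the spark-hadoop-cloud module that provides the committer binding classes below:

    import org.apache.spark.sql.SparkSession

    // A sketch, not a definitive recipe: these are the documented settings for
    // the S3A "directory" staging committer, and they only take effect if the
    // spark-hadoop-cloud classes are actually on the classpath.
    val spark = SparkSession.builder()
      .appName("s3a-committer-example") // hypothetical app name
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      .getOrCreate()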

Re: java.lang.ClassNotFoundException for s3a committer

2020-06-18 Thread murat migdisoglu
Hi all, I've upgraded my test cluster to Spark 3 and changed my committer to directory, and I still get this error. The documentation is somewhat obscure on that. Do I need to add a third-party jar to support the new committers? java.lang.ClassNotFoundException:

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Mridul Muralidharan
Great job everyone ! Congratulations :-) Regards, Mridul On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin wrote: > Hi all, > > Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many > of the innovations from Spark 2.x, bringing new ideas as well as continuing > long-term

Custom Metrics

2020-06-18 Thread Bryan Jeffrey
Hello. We're using Spark 2.4.4. We have a custom metrics sink consuming the Spark-produced metrics (e.g. heap free, etc.). I am trying to determine a good mechanism to pass the Spark application name into the metrics sink. Currently the application ID is included, but not the application name. Is
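
One possible approach, sketched below under the assumption that renaming the metric keys is acceptable: Spark's spark.metrics.namespace setting, which defaults to the application ID, can be pointed at spark.app.name, so the application name becomes part of every metric key the sink receives.

    import org.apache.spark.sql.SparkSession

    // A hedged sketch: make the application name, rather than the application
    // ID, the metrics namespace. The sink can then recover the name from the
    // metric key itself.
    val spark = SparkSession.builder()
      .appName("my-streaming-job") // hypothetical name
      .config("spark.metrics.namespace", "${spark.app.name}")
      .getOrCreate()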

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Gaetano Fabiano
Congratulations 🎉 Celebrating 🎉 Sent from my iPhone > On 18 Jun 2020, at 20:38, Gourav Sengupta wrote: > > CELEBRATIONS!!! > >> On Thu, Jun 18, 2020 at 6:21 PM Reynold Xin wrote: >> Hi all, >> >> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many >> of

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Gourav Sengupta
CELEBRATIONS!!! On Thu, Jun 18, 2020 at 6:21 PM Reynold Xin wrote: > Hi all, > > Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many > of the innovations from Spark 2.x, bringing new ideas as well as continuing > long-term projects that have been in development.

[ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Reynold Xin
Hi all, Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of the innovations from Spark 2.x, bringing new ideas as well as continuing long-term projects that have been in development. This release resolves more than 3400 tickets. We'd like to thank our contributors

Re: Reading TB of JSON file

2020-06-18 Thread Stephan Wehner
It's an interesting problem. What is the structure of the file? One big array? One hash with many key-value pairs? Stephan On Thu, Jun 18, 2020 at 6:12 AM Chetan Khatri wrote: > Hi Spark Users, > > I have a 50GB of JSON file, I would like to read and persist at HDFS so it > can be taken into

Re: Reading TB of JSON file

2020-06-18 Thread Gourav Sengupta
Hi, So you have a single JSON record spanning multiple lines? And all 50 GB is in one file? Regards, Gourav On Thu, 18 Jun 2020, 14:34 Chetan Khatri, wrote: > It is dynamically generated and written at s3 bucket not historical data > so I guess it doesn't have jsonlines format > > On Thu, Jun

Re: GPU Acceleration for spark-3.0.0

2020-06-18 Thread Bobby Evans
"So if I am going to use GPU in my job running on the spark , I still need to code the map and reduce function in cuda or in c++ and then invoke them throught jni or something like GPUEnabler , is that right ?" Sort of. You could go through all of that work yourself, or you could use the plugin

Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
It is dynamically generated and written to an S3 bucket, not historical data, so I guess it doesn't have the jsonlines format On Thu, Jun 18, 2020 at 9:16 AM Jörn Franke wrote: > Depends on the data types you use. > > Do you have in jsonlines format? Then the amount of memory plays much less > a role. >

Re: Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
The file is available in an S3 bucket. On Thu, Jun 18, 2020 at 9:15 AM Patrick McCarthy wrote: > Assuming that the file can be easily split, I would divide it into a > number of pieces and move those pieces to HDFS before using spark at all, > using `hdfs dfs` or similar. At that point you can use

Re: Reading TB of JSON file

2020-06-18 Thread nihed mbarek
Hi, What is the size of one JSON document? There is also the scan of your JSON to infer the schema; the overhead can be huge. Two solutions: define a schema and use it directly during the load, or ask Spark to analyse a small part of the JSON file (I don't remember how to do it) Regards, On Thu,
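
Both options sketched in code, assuming an existing SparkSession named spark; the path and field names are hypothetical:

    import org.apache.spark.sql.types._

    // (1) Supply the schema up front so Spark skips the inference scan entirely.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("payload", StringType)
    ))
    val withSchema = spark.read.schema(schema).json("s3a://bucket/data.json")

    // (2) Or let Spark infer the schema from only a fraction of the records.
    val sampled = spark.read.option("samplingRatio", "0.1").json("s3a://bucket/data.json")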

Re: Reading TB of JSON file

2020-06-18 Thread Jörn Franke
It depends on the data types you use. Do you have it in jsonlines format? Then the amount of memory plays much less of a role. Otherwise, if it is one large object or array, I would not recommend it. > On 18.06.2020 at 15:12, Chetan Khatri wrote: > > Hi Spark Users, > > I have a 50GB of JSON
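
The distinction in code, assuming an existing SparkSession named spark and hypothetical paths:

    // jsonlines (one document per line) is splittable and is the default mode:
    val lines = spark.read.json("s3a://bucket/data.jsonl")

    // One large object or array requires multiLine mode, which forces each file
    // to be parsed as a whole by a single task; hence the memory pressure.
    val whole = spark.read.option("multiLine", "true").json("s3a://bucket/data.json")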

Re: Reading TB of JSON file

2020-06-18 Thread Patrick McCarthy
Assuming that the file can be easily split, I would divide it into a number of pieces and move those pieces to HDFS before using Spark at all, using `hdfs dfs` or similar. At that point you can use your executors to perform the reading instead of the driver. On Thu, Jun 18, 2020 at 9:12 AM Chetan

Reading TB of JSON file

2020-06-18 Thread Chetan Khatri
Hi Spark Users, I have a 50 GB JSON file that I would like to read and persist to HDFS so it can be taken into the next transformation. I am trying to read it with spark.read.json(path), but this is giving an out-of-memory error on the driver. Obviously, I can't afford having 50 GB of driver memory. In general,

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-18 Thread Jacek Laskowski
Hi Rachana, > Should I go backward and use Spark Streaming DStream based. No. Never. It's no longer supported (and should really be removed from the codebase once and for all - dreaming...). Spark focuses on Spark SQL and Spark Structured Streaming as user-facing modules for batch and streaming
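
For anyone weighing the move, a minimal Structured Streaming sketch; the socket source and console sink are toy choices for illustration only:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("structured-streaming-example")
      .getOrCreate()

    import spark.implicits._

    // Streaming queries use the same DataFrame/Dataset API as batch jobs.
    val words = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
      .flatMap(_.split(" "))

    val counts = words.groupBy("value").count()

    // A running word count, recomputed for each micro-batch of input.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()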