Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread DB Tsai
If it's standalone mode, it's even easier. You should be able to connect to a Hadoop 2.6 HDFS using the 3.2 client. In your k8s cluster, just don't put Hadoop 2.6 into your classpath.
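A minimal sketch of what that looks like in practice, assuming a Spark 3.0 distribution that bundles the Hadoop 3.2 client; the master URL, host names, and paths below are hypothetical, not taken from the thread:

    # Sketch (hypothetical hosts/paths): Spark 3.0 with its bundled Hadoop 3.2
    # client reading from a Hadoop 2.6 HDFS. The k8s image carries only the
    # Spark distribution -- no Hadoop 2.6 jars on the classpath.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark3-to-hdfs26")
        .master("spark://spark-master.example.com:7077")  # standalone master, not YARN
        .getOrCreate()
    )

    # The 3.2 client speaks a wire protocol the 2.6 NameNode still understands.
    df = spark.read.parquet("hdfs://namenode26.example.com:8020/warehouse/events")
    df.show()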

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread Ashika Umanga Umagiliya
Hello "spark.yarn.populateHadoopClasspath" is used in YARN mode correct? However our Spark cluster is standalone cluster not using YARN. We only connect to HDFS/Hive to access data.Computation is done on our spark cluster running on K8s (not Yarn) On Mon, Jul 20, 2020 at 2:04 PM DB Tsai wrote:

Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread Prashant Sharma
Hi Ashika, Hadoop 2.6 is no longer supported, and since it has not been maintained in the last two years, it may have unpatched security issues. From Spark 3.0 onwards we no longer support it; in other words, we have modified our codebase in a way that Hadoop 2.6 won't work.

Re: Spark UI

2020-07-19 Thread Piyush Acharya
https://www.youtube.com/watch?v=YgQgJceojJY (Xiao's video)

Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread Ashika Umanga
Greetings, Hadoop 2.6 support has been removed according to this ticket: https://issues.apache.org/jira/browse/SPARK-25016. We run our Spark cluster on K8s in standalone mode. We access HDFS/Hive running on a Hadoop 2.6 cluster. We've been using Spark 2.4.5 and are planning to upgrade to Spark 3.0.0.

Re: Spark UI

2020-07-19 Thread Xiao Li
https://spark.apache.org/docs/3.0.0/web-ui.html is the official doc for the Spark UI. Xiao

Re: Overwrite Mode not Working Correctly in spark 3.0.0

2020-07-19 Thread anbutech
Hi, when I'm using option 1, it completely overwrites the whole table, which is not expected here; I'm running this for multiple tables with different hours. When I'm using option 2, I'm getting the following error: Predicate references non-partition column 'json_feeds_flatten_data'. Only the partition columns may be referenced.
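For reference, a hedged sketch of the fix that error points at: with Delta on Spark 3.0.0, a replaceWhere predicate may reference only the table's partition columns. The path and the date/hour partition columns below are hypothetical, suggested by the "different hours" remark:

    # Sketch (hypothetical names): scope the overwrite with a predicate on
    # partition columns only -- e.g. `date`/`hour` -- never on a data column
    # such as `json_feeds_flatten_data`.
    (df.write.format("delta")
       .mode("overwrite")
       .option("replaceWhere", "date = '2020-07-19' AND hour = 10")
       .save("/delta/json_feeds"))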

Spark UI

2020-07-19 Thread venkatadevarapu
Hi, I'm looking for a tutorial/video/material which explains the content of the various tabs in the Spark Web UI. Can someone direct me to the relevant info? Thanks

Re: Schedule/Orchestrate spark structured streaming job

2020-07-19 Thread Piyush Acharya
Some options for workflow engines: https://medium.com/@xunnan.xu/workflow-processing-engine-overview-2018-airflow-vs-azkaban-vs-conductor-vs-oozie-vs-amazon-step-90affc54d53b. A streaming query is an infinitely running job, so you only have to trigger it once, unless you are running it with Trigger.Once.
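To illustrate the Trigger.Once pattern: a minimal PySpark sketch with hypothetical paths. Each scheduled run drains whatever input is new, then exits, so a scheduler like Airflow can treat it as a batch job:

    # Sketch: structured streaming with Trigger.Once (hypothetical paths).
    # The checkpoint tracks progress between runs, so each re-submission
    # picks up where the previous one stopped.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("once-demo").getOrCreate()

    df = spark.readStream.format("text").load("s3a://my-bucket/input/")

    query = (
        df.writeStream
          .format("parquet")
          .option("path", "s3a://my-bucket/output/")
          .option("checkpointLocation", "s3a://my-bucket/chk/")
          .trigger(once=True)
          .start()
    )
    query.awaitTermination()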

Re: Overwrite Mode not Working Correctly in spark 3.0.0

2020-07-19 Thread Piyush Acharya
Can you please send the error message? It would be very helpful for getting to the root cause.

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Piyush Acharya
Please try the maxBytesPerTrigger option; probably the files are big enough to crash the JVM. Please give some info on the executors and the files (size etc.). Regards, Piyush
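A hedged sketch of the rate-limiting idea: for the built-in file sources the knob I'm aware of is maxFilesPerTrigger (maxBytesPerTrigger is, as far as I know, a Delta-source option), so the sketch below uses the former; the schema, cap, and path are hypothetical:

    # Sketch (hypothetical schema/paths): cap how much each micro-batch ingests
    # so one trigger doesn't pull thousands of gzipped files into memory at once.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StringType

    spark = SparkSession.builder.appName("rate-limit").getOrCreate()

    json_schema = StructType().add("id", StringType()).add("payload", StringType())

    inputDS = (
        spark.readStream
             .format("json")
             .schema(json_schema)               # streaming file sources need an explicit schema
             .option("maxFilesPerTrigger", 50)  # hypothetical cap; tune to executor memory
             .load("s3a://my-bucket/input/")
    )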

Schedule/Orchestrate spark structured streaming job

2020-07-19 Thread anbutech
Hi Team, I'm very new to Spark Structured Streaming. Could you please guide me on how to schedule/orchestrate a Spark Structured Streaming job? Any scheduler similar to Airflow? I know Airflow doesn't support streaming jobs. Thanks, Anbu

Overwrite Mode not Working Correctly in spark 3.0.0

2020-07-19 Thread anbutech
Hi Team, I'm facing weird behavior with a PySpark DataFrame (Databricks Delta, Spark 3.0.0 supported). I have tried the below two options to write the processed DataFrame into a Delta table with respect to the partition columns in the table. Actually, overwrite mode completely overwrites the whole table.
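For context, a hedged sketch of why that happens: on a Delta table, a plain mode("overwrite") replaces the entire table even when it is partitioned; scoping the write needs an explicit replaceWhere predicate, as sketched earlier in this thread. Names below are hypothetical:

    # Sketch (hypothetical names): plain overwrite on a partitioned Delta table
    # replaces ALL partitions, not just those present in df -- matching the
    # behavior described above.
    (df.write.format("delta")
       .mode("overwrite")
       .partitionBy("date", "hour")
       .save("/delta/json_feeds"))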

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Sanjeev Mishra
Can you reduce maxFilesPerTrigger further and see if the OOM still persists? If it does, then the problem may be somewhere else.

Re: OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Jungtaek Lim
Please provide logs and a heap dump file for the OOM case; otherwise no one can say what the cause is. Add JVM options to the driver/executor => -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="...dir..."
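One way to wire those flags in, as a sketch under the assumption that the job is launched with spark-submit: the executor-side conf can be set when the session is built, but the driver JVM is already running by then, so its flag belongs on the launch command. The dump directory is hypothetical:

    # Sketch: enable heap dumps on OOM (hypothetical dump directory).
    # The driver-side flag is best passed at launch time, e.g.:
    #   spark-submit --conf "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps" app.py
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("oom-debug")
        .config("spark.executor.extraJavaOptions",
                "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps")
        .getOrCreate()
    )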

OOM while processing read/write to S3 using Spark Structured Streaming

2020-07-19 Thread Rachana Srivastava
Issue: I am trying to process 5000+ gzipped JSON files periodically from S3 using Structured Streaming code. Here are the key steps:
- Read the JSON schema and broadcast it to the executors
- Read the stream: Dataset inputDS = sparkSession.readStream() .format("text") .option("inf
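The snippet is cut off, but a plausible reconstruction of the described pattern, as a hedged PySpark sketch (the original appears to be Java; the bucket names, per-trigger cap, and parquet sink are hypothetical, not the original code):

    # Sketch of the described pipeline: schema derived once up front, then the
    # gzipped JSON streamed as text and parsed with that schema.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import from_json, col

    spark = SparkSession.builder.appName("s3-json-stream").getOrCreate()

    # One-off batch read of a sample to obtain the schema (.gz is read transparently).
    json_schema = spark.read.json("s3a://my-bucket/input/sample/").schema

    raw = (
        spark.readStream
             .format("text")
             .option("maxFilesPerTrigger", 100)  # bound each micro-batch
             .load("s3a://my-bucket/input/")
    )
    parsed = raw.select(from_json(col("value"), json_schema).alias("j")).select("j.*")

    query = (
        parsed.writeStream
              .format("parquet")
              .option("path", "s3a://my-bucket/output/")
              .option("checkpointLocation", "s3a://my-bucket/chk/")
              .start()
    )
    query.awaitTermination()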