Re: Control Sqoop job from Spark job

2019-08-30 Thread Chris Teoh
I'd say this is an uncommon approach. Could you use a workflow/scheduling system to call Sqoop outside of Spark? Spark is usually multiprocess and distributed, so putting this Sqoop job in the Spark code seems to imply you want to run Sqoop first, then Spark. If you're really insistent on this, call

Re: [pyspark 2.4.3] small input csv ~3.4GB gets 40K tasks created

2019-08-30 Thread Chris Teoh
Look at your DAG. Are there lots of CSV files? Does your input CSV dataframe have lots of partitions to start with? Bear in mind that a cross join makes the dataset much larger, so expect more tasks. On Fri, 30 Aug 2019 at 14:11, Rishi Shah wrote: > Hi All, > > I am scratching my head against
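
A rough illustration of the cross join point in the reply above. This is only a sketch under an assumption: that the join is planned as a cartesian product, in which case the output has one partition (and so one task) per pair of input partitions. The function name and the partition counts are hypothetical, not taken from the thread; the real counts would come from `df.rdd.getNumPartitions()` on each side.

```python
def cartesian_task_count(left_partitions: int, right_partitions: int) -> int:
    """Estimate tasks in a cartesian-product stage.

    Assumption: the output partition count is the product of the two
    inputs' partition counts, as with a plain cartesian product. A
    broadcast-based join plan would not follow this rule.
    """
    if left_partitions < 0 or right_partitions < 0:
        raise ValueError("partition counts must be non-negative")
    return left_partitions * right_partitions


# Hypothetical inputs: both sides have 200 partitions.
estimate = cartesian_task_count(200, 200)
print(estimate)
```

If the estimate is in the same range as the task count shown in the Spark UI, reducing the partition count on one or both sides before the join (for example with `coalesce`) is one thing to try.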

[VOTE][RESULT] Spark 2.4.4 (RC3)

2019-08-30 Thread Dongjoon Hyun
Hi, All. The vote passes. Thanks to all who helped with this release 2.4.4! It was a very intensive vote with +11 (including +8 PMC votes) and no -1. I'll follow up later with a release announcement once everything is published. +1 (* = binding): Dongjoon Hyun Kazuaki Ishizaki Sean Owen* Wenchen

Read ORC file with subset of schema

2019-08-30 Thread Isabelle Phan
Hello, When reading an older ORC file where the schema is a subset of the current schema, the reader throws an error. Please see sample code below (run on Spark 2.1). The same commands on a Parquet file do not error out; they return the new column with null values. Is there a setting to add to the

EMR Spark 2.4.3 executor hang

2019-08-30 Thread Daniel Zhang
Hi, All: We are testing EMR and comparing it with our on-premise HDP solution. We use one application as the test: EMR (5.21.1) with Hadoop 2.8.5 + Spark 2.4.3 vs HDP (2.6.3) with Hadoop 2.7.3 + Spark 2.2.0. The application is very simple: it just reads a raw Parquet file, then does a