I'd say this is an uncommon approach; could you use a workflow/scheduling
system to call Sqoop outside of Spark? Spark usually runs as a distributed,
multi-process application, so putting this Sqoop job in the Spark code seems
to imply you want to run Sqoop first, then Spark. If you're really insistent
on this, call
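The "Sqoop first, then Spark" sequencing suggested above could be sketched as a small wrapper that a scheduler (cron, Oozie, Airflow, ...) invokes. This is a minimal sketch, not anything from the original thread: the connection string, table name, paths, and job class are all hypothetical placeholders.

```python
import subprocess

def run_pipeline(sqoop_cmd="sqoop", spark_submit="spark-submit"):
    # Step 1: run the Sqoop import. check=True raises CalledProcessError
    # on a non-zero exit code, so the Spark step below never starts if
    # the import fails. All arguments are illustrative placeholders.
    subprocess.run(
        [sqoop_cmd, "import",
         "--connect", "jdbc:mysql://dbhost/mydb",
         "--table", "my_table",
         "--target-dir", "/data/raw/my_table"],
        check=True,
    )
    # Step 2: only reached if the import succeeded.
    subprocess.run(
        [spark_submit, "--class", "com.example.MyJob",
         "my-job.jar", "/data/raw/my_table"],
        check=True,
    )
```

A real scheduler gives you the same "step 2 only after step 1 succeeds" guarantee, plus retries and alerting, which is why it is usually preferable to embedding Sqoop in the Spark job itself.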
Look at your DAG. Are there lots of CSV files? Does your input CSV
DataFrame have lots of partitions to start with? Bear in mind that a cross
join makes the dataset much larger, so expect more tasks.
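To see why the cross join inflates the task count, note that its output size is the product of the input sizes. A minimal plain-Python illustration of that cardinality (no Spark required; the row counts are made up):

```python
from itertools import product

# A cross join pairs every row of one input with every row of the other,
# so the output has len(left) * len(right) rows. Two modest inputs of
# 1,000 and 2,000 rows already yield 2 million output rows.
left = range(1000)
right = range(2000)
rows = sum(1 for _ in product(left, right))
print(rows)  # 2000000
```

In Spark the same multiplication happens across partitions, so a cross join over many input partitions produces many more (and larger) tasks downstream.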
On Fri, 30 Aug 2019 at 14:11, Rishi Shah wrote:
> Hi All,
>
> I am scratching my head against
Hi, All.
The vote passes. Thanks to all who helped with the 2.4.4 release!
It was a very active vote, with +11 votes (including +8 PMC votes) and no -1.
I'll follow up later with a release announcement once everything is
published.
+1 (* = binding):
Dongjoon Hyun
Kazuaki Ishizaki
Sean Owen*
Wenchen
Hello,
When reading an older ORC file whose schema is a subset of the current
schema, the reader throws an error. Please see the sample code below (run on
Spark 2.1).
The same commands on a Parquet file do not error out; they return the new
column with null values.
Is there a setting to add to the
Hi, All:
We are testing EMR and comparing it with our on-premise HDP solution, using
one application as the test:
EMR (5.21.1) with Hadoop 2.8.5 + Spark 2.4.3 vs. HDP (2.6.3) with Hadoop 2.7.3 +
Spark 2.2.0
The application is very simple: it just reads raw Parquet files, then does a