Re: Spark DataFrame Creation

2020-07-22 Thread Andrew Melo
Hi Mark, On Wed, Jul 22, 2020 at 4:49 PM Mark Bidewell wrote: > > Sorry if this is the wrong place for this. I am trying to debug an issue > with this library: > https://github.com/springml/spark-sftp > > When I attempt to create a dataframe: > > spark.read. >

Re: Spark DataFrame Creation

2020-07-22 Thread Sean Owen
You'd probably do best to ask that project, but from scanning the source code, that looks like how it's meant to work. It downloads to a temp file on the driver, then copies it to distributed storage, then returns a DataFrame for that. I can't see how it would be implemented directly over sftp as there
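
For illustration only, the download-then-copy flow described in this reply can be sketched with the Python standard library. This is not the spark-sftp implementation: `stage_remote_file`, `fake_sftp_fetch`, and the local directory standing in for distributed storage are all hypothetical names invented for this sketch, and the final Spark read is left as a comment so the example runs without a cluster.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def stage_remote_file(fetch: Callable[[Path], None], shared_dir: Path, name: str) -> Path:
    """Fetch a file to a driver-local temp dir, then copy it to shared storage.

    `fetch` is a hypothetical callable that writes the remote file to the
    path it is given (an SFTP client would go here in a real setup).
    """
    shared_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        local_copy = Path(tmp) / name
        fetch(local_copy)                    # step 1: download to a temp file on the driver
        staged = shared_dir / name
        shutil.copyfile(local_copy, staged)  # step 2: copy to storage the executors can read
    return staged                            # the temp file is removed on exit


def fake_sftp_fetch(dest: Path) -> None:
    """Stand-in for an SFTP download (assumption: no real server is contacted)."""
    dest.write_text("id,value\n1,a\n2,b\n")


# A plain local directory stands in for distributed storage in this sketch.
shared = Path(tempfile.mkdtemp()) / "shared"
staged = stage_remote_file(fake_sftp_fetch, shared, "data.csv")

# step 3 (not executed here): build the DataFrame from the staged copy, e.g.
#   df = spark.read.csv(str(staged), header=True)
```

The point of the intermediate copy is that executors cannot see a temp file that exists only on the driver, which would explain why the file has to land on shared storage before a DataFrame can be returned.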

Spark DataFrame Creation

2020-07-22 Thread Mark Bidewell
Sorry if this is the wrong place for this. I am trying to debug an issue with this library: https://github.com/springml/spark-sftp When I attempt to create a dataframe: spark.read. format("com.springml.spark.sftp"). option("host", "..."). option("username",

Spark dataframe creation through already distributed in-memory data sets

2020-06-16 Thread Tanveer Ahmad - EWI
Hi all, I am new to the Spark community. Please ignore this if the question doesn't make sense. My PySpark DataFrame spends only a fraction of the time (in ms) on sorting, but moving the data is much more expensive (> 14 sec). Explanation: I have a huge collection of Arrow RecordBatches which is equally