How about calling parallelize on each subset and then unioning the results
into one DataFrame?
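
A rough sketch of what I mean, in case it helps. The Record case class, the
sample values, and the local master are all made up for illustration. Note
that parallelize only sees collections living in the driver JVM, so each
subset would still have to be handed to the driver somehow first.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionSketch {
  // Hypothetical record type standing in for whatever each JVM holds.
  final case class Record(id: Long, value: String)

  def main(args: Array[String]): Unit = {
    // local[*] is only for trying this out; a real job would use its own master.
    val spark = SparkSession.builder()
      .appName("union-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumption: the per-JVM subsets have already been brought to the driver.
    val subsets: Seq[Seq[Record]] = Seq(
      Seq(Record(1L, "a"), Record(2L, "b")),
      Seq(Record(3L, "c"))
    )

    // One DataFrame per subset, then fold them together with union.
    val frames: Seq[DataFrame] =
      subsets.map(s => spark.sparkContext.parallelize(s).toDF())
    val combined: DataFrame = frames.reduce(_ union _)

    combined.show()
    spark.stop()
  }
}
```

union matches columns by position, so this assumes every subset has the same
schema.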

On Wed, 1 Mar 2017 at 3:07 am, Sean Owen <so...@cloudera.com> wrote:

> Broadcasts let you send one copy of read-only data to each executor.
> That's not the same as a DataFrame, and its nature means it doesn't make
> sense to think of it as not distributed. But consider things like
> broadcast hash joins, which may be what you are looking for if you really
> mean to join on a small DF efficiently.
>
> On Tue, Feb 28, 2017, 16:03 johndesuv <desu...@gmail.com> wrote:
>
> Hi,
>
> I have an application that runs on a series of JVMs that each contain a
> subset of a large dataset in memory.  I'd like to use this data in Spark
> and am looking at ways to use this as a data source in Spark without
> writing the data to disk as a handoff.
>
> Parallelize doesn't work for me since I need to use the data across all the
> JVMs as one DataFrame.
>
> The only option I've come up with so far is to write a custom DataSource
> that then transmits the data from each of the JVMs over the network.  This
> seems like overkill though.
>
> Is there a simpler solution for getting this data into a DataFrame?
>
> Thanks,
> John
>
> --
Best Regards,
Ayan Guha
