Since the data is in multiple JVMs, only one of them can be the driver. So
I can parallelize the data from one of the JVMs, but I don't have a way to do
the same for the others. Or am I missing something?
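For reference, here is a minimal sketch of the parallelize-then-union approach suggested below. It assumes each JVM's subset has already been shipped to the driver, which is exactly the sticking point; the dataset shapes and names are illustrative, not from the thread.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("union-sketch").getOrCreate()
import spark.implicits._

// Hypothetical: each inner Seq stands in for one JVM's in-memory subset,
// assuming it could be collected at the driver first.
val subsets: Seq[Seq[(Long, String)]] = Seq(
  Seq((1L, "a"), (2L, "b")), // subset held by JVM 1
  Seq((3L, "c"))             // subset held by JVM 2
)

// Parallelize each subset into its own DataFrame, then union them all.
val dfs = subsets.map(s => spark.sparkContext.parallelize(s).toDF("id", "value"))
val combined = dfs.reduce(_ union _)
```

This only works if every subset can reach the driver JVM first, which is the limitation described above.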

On Tue, Feb 28, 2017 at 3:53 PM, ayan guha <guha.a...@gmail.com> wrote:

> How about parallelizing each subset and then unioning all of them into one DataFrame?
>
> On Wed, 1 Mar 2017 at 3:07 am, Sean Owen <so...@cloudera.com> wrote:
>
>> Broadcasts let you send one copy of read-only data to each executor.
>> That's not the same as a DataFrame, and their nature means it doesn't make
>> sense to think of broadcasts as distributed. But consider something like a
>> broadcast hash join, which may be what you are looking for if you really
>> mean to join efficiently on a small DataFrame.
>>
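A minimal sketch of the broadcast hash join Sean mentions, using Spark SQL's `broadcast` hint. The table names and contents here are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical tables: `large` is a big distributed DataFrame,
// `small` is a lookup table that fits in a single executor's memory.
val large = spark.range(0L, 1000000L).toDF("id")
val small = Seq((1L, "x"), (2L, "y")).toDF("id", "label")

// The broadcast() hint tells Spark to replicate `small` to every
// executor and join locally, instead of shuffling `large`.
val joined = large.join(broadcast(small), "id")
```

The hint avoids a shuffle of the large side, which is why it pays off only when the small side genuinely fits in memory.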
>> On Tue, Feb 28, 2017, 16:03 johndesuv <desu...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have an application that runs on a series of JVMs, each holding a
>> subset of a large dataset in memory. I'd like to use this data in Spark,
>> and am looking for a way to use it as a data source without writing the
>> data to disk as a handoff.
>>
>> Parallelize doesn't work for me since I need to use the data across all
>> the
>> JVMs as one DataFrame.
>>
>> The only option I've come up with so far is to write a custom DataSource
>> that transmits the data from each of the JVMs over the network. That
>> seems like overkill, though.
>>
>> Is there a simpler solution for getting this data into a DataFrame?
>>
>> Thanks,
>> John
>>
>>
>>
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> --
> Best Regards,
> Ayan Guha
>
