Re: Shared memory between C++ process and Spark

Robin East Mon, 07 Dec 2015 12:10:19 -0800

I’m not sure what point you’re trying to prove, and I’m not particularly 
interested in getting into a protracted discussion. Here is what you wrote: "The 
architecture of Spark is to run on top of HDFS." I interpreted that as a 
statement implying that Spark has to run on HDFS, which is definitely not the 
case. If you didn’t mean that, then we are in agreement.
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action






> On 7 Dec 2015, at 19:56, Annabel Melongo <[email protected]> wrote:
> 
> Robin,
> 
> To prove my point, this is an unresolved issue still in the implementation 
> stage.
> 
> 
> 
> On Monday, December 7, 2015 2:49 PM, Robin East <[email protected]> 
> wrote:
> 
> 
> Hi Annabel
> 
> I certainly did read your post. My point was that Spark can read from HDFS 
> but is in no way tied to that storage layer. A very interesting use case 
> that sounds very similar to Jia's (as mentioned by another poster) is 
> contained in https://issues.apache.org/jira/browse/SPARK-10399. The comments 
> section provides a specific example of processing very large images using a 
> pre-existing C++ library.
> 
> Robin
> 
> Sent from my iPhone
> 
> On 7 Dec 2015, at 18:50, Annabel Melongo <[email protected] 
> <mailto:[email protected]>> wrote:
> 
>> Jia,
>> 
>> I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
>> What you're requesting, reading and writing to a C++ process, is not part of 
>> that requirement.
>> 
>> 
>> 
>> 
>> 
>> On Monday, December 7, 2015 1:42 PM, Jia <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> 
>> Thanks, Annabel, but I may need to clarify that I have no intention of 
>> writing and running Spark UDFs in C++; I'm just wondering whether Spark can 
>> read and write data to a C++ process with zero copy.
>> 
>> Best Regards,
>> Jia
>>  
>> 
>> 
>> On Dec 7, 2015, at 12:26 PM, Annabel Melongo <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>>> My guess is that Jia wants to run C++ on top of Spark. If that's the case, 
>>> I'm afraid this is not possible. Spark has support for Java, Python, Scala 
>>> and R.
>>> 
>>> The best way to achieve this is to run your application in C++ and use the 
>>> data created by said application for further manipulation within Spark.
>>> 
>>> 
>>> 
>>> On Monday, December 7, 2015 1:15 PM, Jia <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> 
>>> Thanks, Dewful!
>>> 
>>> My impression is that Tachyon is a very nice in-memory file system that can 
>>> connect to multiple storages.
>>> However, because our data is also held in memory, I suspect that connecting 
>>> to Spark directly may be more efficient.
>>> But definitely I need to look at Tachyon more carefully, in case it has a 
>>> very efficient C++ binding mechanism.
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>> On Dec 7, 2015, at 11:46 AM, Dewful <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>>> Maybe looking into something like Tachyon would help. I see some sample 
>>>> C++ bindings, but I'm not sure how much of the current functionality they 
>>>> support...
>>>> 
>>>> Hi, Robin, 
>>>> Thanks for your reply and thanks for copying my question to the user 
>>>> mailing list.
>>>> Yes, we have a distributed C++ application that will store data on each 
>>>> node in the cluster, and we hope to leverage Spark to do fancier 
>>>> analytics on that data. But we need high performance; that’s why we want 
>>>> shared memory.
>>>> Suggestions will be highly appreciated!
>>>> 
>>>> Best Regards,
>>>> Jia
>>>> 
>>>> On Dec 7, 2015, at 10:54 AM, Robin East <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>>> -dev, +user (this is not a question about development of Spark itself so 
>>>>> you’ll get more answers in the user mailing list)
>>>>> 
>>>>> First up, let me say that I don’t really know how this could be done. I’m 
>>>>> sure it would be possible with enough tinkering, but it’s not clear what 
>>>>> you are trying to achieve. Spark is a distributed processing system: it 
>>>>> has multiple JVMs running on different machines, each of which runs a 
>>>>> small part of the overall processing. Unless you have some sort of plan 
>>>>> to have multiple C++ processes colocated with the distributed JVMs, using 
>>>>> named memory-mapped files doesn’t make architectural sense. 
>>>>> 
>>>>>> On 6 Dec 2015, at 20:43, Jia <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> Dear all, for one project, I need to implement something so Spark can 
>>>>>> read data from a C++ process. 
>>>>>> To provide high performance, I really hope to implement this through 
>>>>>> shared memory between the C++ process and the JVM process.
>>>>>> It seems it may be possible to use named memory-mapped files and JNI to 
>>>>>> do this, but I wonder whether there are any existing efforts or a more 
>>>>>> efficient approach to do this?
>>>>>> Thank you very much!
>>>>>> 
>>>>>> Best Regards,
>>>>>> Jia
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
>
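
For readers who want to see what the "named memory-mapped file" idea from the
original question looks like on the JVM side, here is a minimal sketch. It is
not something proposed by anyone in the thread and it is not a Spark API. The
class name SharedRegionSketch, the file layout (an 8-byte little-endian count
followed by that many 8-byte little-endian values), and the use of a temp file
are all assumptions made for illustration. In a real setup the C++ process
would create and fill the mapping, possibly under a memory-backed path such as
/dev/shm on Linux; here a Java method stands in for that producer so the
example runs on its own.

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

public class SharedRegionSketch {
    // Hypothetical layout: 8-byte little-endian count, then `count`
    // 8-byte little-endian values. A C++ producer would have to agree on it.
    static final int HEADER_BYTES = Long.BYTES;

    /** Stands in for the C++ producer: maps the file and writes the values. */
    static void write(Path file, long[] values) throws IOException {
        long size = HEADER_BYTES + (long) values.length * Long.BYTES;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Mapping READ_WRITE beyond the current length grows the file.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buf.order(ByteOrder.LITTLE_ENDIAN);
            buf.putLong(values.length);
            for (long v : values) {
                buf.putLong(v);
            }
            buf.force(); // flush the mapped pages to the backing file
        }
    }

    /** JVM consumer: maps the same file read-only and decodes the values. */
    static long[] read(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.order(ByteOrder.LITTLE_ENDIAN);
            int count = Math.toIntExact(buf.getLong());
            // Copying into a heap array is for demonstration only; code that
            // wants to avoid copies would keep working on the mapped buffer.
            long[] out = new long[count];
            buf.asLongBuffer().get(out);
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("shared-region", ".bin");
        try {
            write(file, new long[] {1L, 2L, 3L});
            System.out.println(Arrays.toString(read(file)));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The byte order is set explicitly because Java buffers default to big-endian,
while a C++ process on x86 writes little-endian. This covers only a single
reader and writer on one machine; synchronisation between the two processes
and colocating each C++ process with the matching Spark executor, the concern
raised earlier in the thread, are left out.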
