Re: Unsubscribe

2020-08-26 Thread Annabel Melongo
 Thanks, Stephen
On Wednesday, August 26, 2020, 07:07:05 PM PDT, Stephen Coy 
 wrote:  
 
The instructions for all Apache mailing lists are in the mail headers:

List-Unsubscribe: <mailto:user-unsubscr...@spark.apache.org>




On 27 Aug 2020, at 7:49 am, Jeff Evans  wrote:
That is not how you unsubscribe.  See here for instructions: 
https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e
On Wed, Aug 26, 2020, 4:22 PM Annabel Melongo 
 wrote:

Please remove me from the mailing list



Unsubscribe

2020-08-26 Thread Annabel Melongo
Please remove me from the mailing list

unsubscribe

2019-12-14 Thread Annabel Melongo
unsubscribe


Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Annabel Melongo
Richard,
In the provided documentation, under the paragraph "Schema Merging", you can actually perform what you want this way:
1. Create a schema that reads the raw JSON, line by line.
2. Create another schema that reads the JSON file and structures it into ("id", "ln", "fn").
3. Merge the two schemas and you'll get what you want.
Thanks 
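A minimal Java sketch of an alternative route to the same result (not the schema-merging approach above; it assumes the get_json_object function from org.apache.spark.sql.functions and the HiveContext shown in Richard's snippet below): read the file as plain text so each line is the raw JSON, then pull the typed columns out of that string while keeping it as a raw_json column.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.hive.HiveContext;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;
    import static org.apache.spark.sql.functions.get_json_object;

    public class RawJsonExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("json_raw_test");
        JavaSparkContext ctx = new JavaSparkContext(conf);
        HiveContext hc = new HiveContext(ctx.sc());

        // Read each line as plain text; the line itself is the raw JSON string.
        JavaRDD<Row> lines = ctx.textFile("files/json/example2.json")
            .map(line -> RowFactory.create(line));
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("raw_json", DataTypes.StringType, false) });
        DataFrame raw = hc.createDataFrame(lines, schema);

        // Extract the typed columns from the raw string and keep the string itself.
        DataFrame df = raw.select(
            get_json_object(raw.col("raw_json"), "$.id").alias("id"),
            get_json_object(raw.col("raw_json"), "$.ln").alias("ln"),
            get_json_object(raw.col("raw_json"), "$.fn").alias("fn"),
            get_json_object(raw.col("raw_json"), "$.age").alias("age"),
            raw.col("raw_json"));
        df.show();
        ctx.stop();
      }
    }

Note that get_json_object returns strings, so a cast (for example .cast("int") on id and age) may still be wanted.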

On Thursday, December 29, 2016 7:18 PM, Richard Xin 
<richardxin...@yahoo.com> wrote:
 

Thanks, I have seen this, but it doesn't cover my question.
What I need is to read the JSON and include the raw JSON as part of my DataFrame.

On Friday, December 30, 2016 10:23 AM, Annabel Melongo 
<melongo_anna...@yahoo.com.INVALID> wrote:
 

Richard,
The documentation below will show you how to create a SparkSession and how to programmatically load data:
Spark SQL and DataFrames - Spark 2.1.0 Documentation

  

 
 

On Thursday, December 29, 2016 5:16 PM, Richard Xin 
<richardxin...@yahoo.com.INVALID> wrote:
 

Say I have the following data in a file:

    {"id":1234,"ln":"Doe","fn":"John","age":25}
    {"id":1235,"ln":"Doe","fn":"Jane","age":22}

Java code snippet:

    final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    HiveContext hc = new HiveContext(ctx.sc());
    DataFrame df = hc.read().json("files/json/example2.json");

What I need is a DataFrame with the columns id, ln, fn, age as well as the raw_json string.
Any advice on the best practice in Java?
Thanks,
Richard


   

   

   

Re: DataFrame to read json and include raw Json in DataFrame

2016-12-29 Thread Annabel Melongo
Richard,
The documentation below will show you how to create a SparkSession and how to programmatically load data:
Spark SQL and DataFrames - Spark 2.1.0 Documentation

  

 
 

On Thursday, December 29, 2016 5:16 PM, Richard Xin 
 wrote:
 

Say I have the following data in a file:

    {"id":1234,"ln":"Doe","fn":"John","age":25}
    {"id":1235,"ln":"Doe","fn":"Jane","age":22}

Java code snippet:

    final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    HiveContext hc = new HiveContext(ctx.sc());
    DataFrame df = hc.read().json("files/json/example2.json");

What I need is a DataFrame with the columns id, ln, fn, age as well as the raw_json string.
Any advice on the best practice in Java?
Thanks,
Richard


   

Re: trouble using eclipse to view spark source code

2016-01-18 Thread Annabel Melongo
Andy,
This has nothing to do with Spark but I guess you don't have the proper Scala 
version. The version you're currently running doesn't recognize a method in 
Scala ArrayOps, namely:          scala.collection.mutable.ArrayOps.$colon$plus 

On Monday, January 18, 2016 7:53 PM, Andy Davidson 
 wrote:
 

Many thanks. I was using a different Scala plug-in; this one seems to work better. I no longer get compile errors, however I get the following stack trace when I try to run my unit tests with mllib open.
I am still using Eclipse Luna.
Andy
java.lang.NoSuchMethodError: scala.collection.mutable.ArrayOps.$colon$plus(Ljava/lang/Object;Lscala/reflect/ClassTag;)Ljava/lang/Object;
    at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:73)
    at org.apache.spark.ml.feature.HashingTF.transformSchema(HashingTF.scala:76)
    at org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:64)
    at com.pws.fantasySport.ml.TDIDFTest.runPipleLineTF_IDF(TDIDFTest.java:52)
    at com.pws.fantasySport.ml.TDIDFTest.test(TDIDFTest.java:36)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

From:  Jakob Odersky 
Date:  Monday, January 18, 2016 at 3:20 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: trouble using eclipse to view spark source code


Have you followed the guide on how to import spark into eclipse 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
 ?

On 18 January 2016 at 13:04, Andy Davidson  
wrote:

Hi 
My project is implemented using Java 8 and Python. Sometimes it's handy to look at the Spark source code. For some unknown reason, if I open a Spark project my Java projects show tons of compiler errors. I think it may have something to do with Scala. If I close the projects my Java code is fine.
Typically I only want to import the machine learning and streaming projects.
I am not sure whether this is an issue or not, but my Java projects are built using Gradle.
In Eclipse preferences -> Scala -> Installations I selected Scala: 2.10.6 (built in).
Any suggestions would be greatly appreciated.
Andy






-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: pre-install 3-party Python package on spark cluster

2016-01-11 Thread Annabel Melongo
When you run spark-submit in either client or cluster mode, you can use the options --packages or --jars to automatically copy your packages to the worker machines.
Thanks 

On Monday, January 11, 2016 12:52 PM, Andy Davidson 
 wrote:
 

 I use https://code.google.com/p/parallel-ssh/ to upgrade all my slaves


From:  "taotao.li" 
Date:  Sunday, January 10, 2016 at 9:50 PM
To:  "user @spark" 
Subject:  pre-install 3-party Python package on spark cluster


I have a Spark cluster, from machine-1 to machine-100, and machine-1 acts as the master.
Then one day my program needs to use a third-party Python package which is not installed on every machine of the cluster.
So here comes my problem: to make that third-party Python package usable on the master and the slaves, should I manually ssh to every machine and use pip to install that package?
I believe there should be some deploy scripts or other things to make this graceful, but I can't find anything after googling.


--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pre-install-3-party-Python-package-on-spark-cluster-tp25930.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




  

Re: Spark job uses only one Worker

2016-01-07 Thread Annabel Melongo
Michael,
I don't know what your environment is, but if it's Cloudera, you should be able to see the link to your master in Hue.
Thanks 

On Thursday, January 7, 2016 5:03 PM, Michael Pisula 
 wrote:
 

  I had tried several parameters, including --total-executor-cores, no effect.
 As for the port, I tried 7077, but if I remember correctly I got some kind of 
error that suggested to try 6066, with which it worked just fine (apart from 
this issue here).
 
 Each worker has two cores. I also tried increasing cores, again no effect. I 
was able to increase the number of cores the job was using on one worker, but 
it would not use any other worker (and it would not start if the number of 
cores the job wanted was higher than the number available on one worker).
 
 On 07.01.2016 22:51, Igor Berman wrote:
  
Read about --total-executor-cores. Not sure why you specify port 6066 in the master URL... usually it's 7077.
Verify in the master UI (usually port 8080) how many cores are there (depends on other configs, but usually workers connect to the master with all their cores).
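For reference, a minimal Java sketch of the in-code equivalent of --total-executor-cores on a standalone cluster (the master URL and core count here are placeholders, not taken from Michael's setup): spark.cores.max caps how many cores the application may claim across all workers.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TotalCoresExample {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("StaticDataAnalysis")
            .setMaster("spark://master-host:7077")   // placeholder standalone master URL
            .set("spark.cores.max", "6");            // total cores across the whole cluster
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
      }
    }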
 On 7 January 2016 at 23:46, Michael Pisula  wrote:
 
  Hi,
 
 I start the cluster using the spark-ec2 scripts, so the cluster is in 
stand-alone mode.
 Here is how I submit my job:
 spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master 
spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar
 
 Cheers,
 Michael  
 
 On 07.01.2016 22:41, Igor Berman wrote:
  
Share how you submit your job and what cluster (YARN, standalone).
 On 7 January 2016 at 23:24, Michael Pisula  wrote:
 
Hi there,
 
 I ran a simple Batch Application on a Spark Cluster on EC2. Despite having 3
 Worker Nodes, I could not get the application processed on more than one
 node, regardless if I submitted the Application in Cluster or Client mode.
 I also tried manually increasing the number of partitions in the code, no
 effect. I also pass the master into the application.
 I verified on the nodes themselves that only one node was active while the
 job was running.
 I pass enough data to make the job take 6 minutes to process.
 The job is simple enough, reading data from two S3 files, joining records on
 a shared field, filtering out some records and writing the result back to
 S3.
 
 Tried all kinds of stuff, but could not make it work. I did find similar
 questions, but had already tried the solutions that worked in those cases.
 Would be really happy about any pointers.
 
 Cheers,
 Michael
 
 
 
 --
 View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
-
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
  
  
 
-- 
Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
  
  
  
 
 -- 
Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
 

  

Re: Date Time Regression as Feature

2016-01-07 Thread Annabel Melongo
Or he can also transform the whole date into a string 

On Thursday, January 7, 2016 2:25 PM, Sujit Pal  
wrote:
 

 Hi Jorge,
Maybe extract things like dd, mm, day of week, time of day from the datetime 
string and use them as features?
-sujit
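A small Java sketch of that idea (illustrative only; the timestamp format string matches the sample rows Jorge posted, and which components to keep is up to the model):

    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    public class DateFeatures {
      // Turn "2015-12-10-10:00" into numeric features: year, month, day, day-of-week, hour.
      static double[] toFeatures(String timestamp) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH:mm");
        LocalDateTime ts = LocalDateTime.parse(timestamp, fmt);
        return new double[] {
            ts.getYear(), ts.getMonthValue(), ts.getDayOfMonth(),
            ts.getDayOfWeek().getValue(), ts.getHour()
        };
      }

      public static void main(String[] args) {
        double[] features = toFeatures("2015-12-10-10:00");
        // These values could be fed into an MLlib vector, e.g. Vectors.dense(features),
        // alongside the label 1200.
        for (double f : features) System.out.println(f);
      }
    }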

On Thu, Jan 7, 2016 at 11:09 AM, Jorge Machado  
wrote:

Hello all,

I'm new to machine learning. I'm trying to predict some electric usage with a decision tree.
The data is :
2015-12-10-10:00, 1200
2015-12-11-10:00, 1150

My question is: what is the best way to turn date and time into features on my vector?

Something like this: Vector(1200, [2015,12,10,10,10])?
I could not find any example of value prediction where the features had dates in them.

Thanks

Jorge Machado

Jorge Machado
jo...@jmachado.me


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





  

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-05 Thread Annabel Melongo
Vijay,
Are you closing the FileInputStream at the end of each loop (in.close())? My guess is those streams aren't closed, and thus the "too many open files" exception.
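A minimal Java sketch of the pattern Annabel is asking about (the paths and the per-line work are hypothetical): with try-with-resources each stream is closed at the end of every loop iteration, so file descriptors don't pile up.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    public class CloseStreamsExample {
      public static void main(String[] args) throws IOException {
        List<String> paths = Arrays.asList("/tmp/part-0", "/tmp/part-1"); // hypothetical inputs
        for (String path : paths) {
          try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
              System.out.println(line.length()); // stand-in for real per-line work
            }
          } // in.close() runs here automatically, even if an exception is thrown
        }
      }
    }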

On Tuesday, January 5, 2016 8:03 AM, Priya Ch 
 wrote:
 

Can someone throw light on this?
Regards,
Padma Ch
On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch  wrote:

Chris, we are using Spark version 1.3.0. We have not set the spark.streaming.concurrentJobs parameter; it takes the default value.
Vijay,
From the stack trace it is evident that org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730) is throwing the exception. I opened the Spark source code and visited the line which is throwing this exception, i.e.

The line which is marked in red is throwing the exception. The file is ExternalSorter.scala in the org.apache.spark.util.collection package.
I went through the following blog http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/ and understood that there is a merge factor which decides the number of on-disk files that could be merged. Is it somehow related to this?
Regards,
Padma CH
On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:

and which version of Spark/Spark Streaming are you using?
are you explicitly setting the spark.streaming.concurrentJobs to something 
larger than the default of 1?  
if so, please try setting that back to 1 and see if the problem still exists.  
this is a dangerous parameter to modify from the default - which is why it's 
not well-documented.

On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge  wrote:

A few indicators -
1) During execution time, check the total number of open files using the lsof command. Needs root permissions. If it is a cluster, not sure much!
2) Which exact line in the code is triggering this error? Can you paste that snippet?

On Wednesday 23 December 2015, Priya Ch  wrote:

ulimit -n 65000
fs.file-max = 65000 (in /etc/sysctl.conf file)
Thanks,
Padma Ch
On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:

Could you share the ulimit for your setup please? - Thanks, via mobile, excuse brevity.
On Dec 22, 2015 6:39 PM, "Priya Ch" wrote:

Jakob,
Increased the settings like fs.file-max in /etc/sysctl.conf and also increased the user limit in /etc/security/limits.conf. But I still see the same issue.
On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky  wrote:

It might be a good idea to see how many files are open and try increasing the 
open file limit (this is done on an os level). In some application use-cases it 
is actually a legitimate need.

If that doesn't help, make sure you close any unused files and streams in your 
code. It will also be easier to help diagnose the issue if you send an 
error-reproducing snippet.








-- 
Regards,Vijay Gharge







-- 

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com





  
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Is Spark 1.6 released?

2016-01-04 Thread Annabel Melongo
[1] http://spark.apache.org/releases/spark-release-1-6-0.html
[2] http://spark.apache.org/downloads.html
 

On Monday, January 4, 2016 2:59 PM, "saif.a.ell...@wellsfargo.com" 
 wrote:
 

Where can I read more about the Dataset API at the user level? I am failing to find an API doc or to understand when to use DataFrame or Dataset, the advantages, etc.

Thanks,
Saif
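For what it's worth, a minimal Java sketch of the difference in question (assuming Spark 1.6's then-experimental Dataset API; the sample data and app name are placeholders): a Dataset is typed and checked at compile time, while a DataFrame is a collection of generic Row objects.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SQLContext;

    public class DatasetVsDataFrame {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setMaster("local[2]").setAppName("ds-demo"));
        SQLContext sqlContext = new SQLContext(sc);

        // Typed: the element types (String/Integer) are known at compile time.
        Dataset<String> ds =
            sqlContext.createDataset(Arrays.asList("a", "bb", "ccc"), Encoders.STRING());
        Dataset<Integer> lengths = ds.map(s -> s.length(), Encoders.INT());
        lengths.show();

        // Untyped: a DataFrame is Row-based; columns are accessed by name or position at runtime.
        DataFrame df = ds.toDF();
        df.show();
        sc.stop();
      }
    }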

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net] 
Sent: Monday, January 04, 2016 2:01 PM
To: user@spark.apache.org
Subject: Re: Is Spark 1.6 released?

It's now OK: Michael published and announced the release.

Sorry for the delay.

Regards
JB

On 01/04/2016 10:06 AM, Jung wrote:
> Hi
> There were Spark 1.6 jars in maven central and github.
> I found it 5 days ago. But it doesn't appear on Spark website now.
> May I regard Spark 1.6 zip file in github as a stable release?
>
> Thanks
> Jung
>

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



  

Re: Stuck with DataFrame df.select("select * from table");

2015-12-29 Thread Annabel Melongo
Eugene,
The example I gave you was in Python. I used it on my end and it works fine. 
Sorry, I don't know Scala.
Thanks 

On Tuesday, December 29, 2015 5:24 AM, Eugene Morozov 
<evgeny.a.moro...@gmail.com> wrote:
 

 Annabel, 
That might work in Scala, but I use Java. Three quotes just don't compile =) If your example is in Scala, then, I believe, the semicolon is not required.
--
Be well!
Jean Morozov
On Mon, Dec 28, 2015 at 8:49 PM, Annabel Melongo <melongo_anna...@yahoo.com> 
wrote:

Jean,
Try this:

    df.select("""select * from tmptable where x1 = '3.0'""").show();

Note: you have to use 3 double quotes as marked.

On Friday, December 25, 2015 11:30 AM, Eugene Morozov 
<evgeny.a.moro...@gmail.com> wrote:
 

Thanks for the comments, although the issue is not in the limit() predicate. It's something with Spark being unable to resolve the expression.

I can do something like this, and it works as it is supposed to:

    df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);

But I think the old-fashioned SQL style has to work also. I have df.registerTempTable("tmptable") and then:

    df.select("select * from tmptable where x1 = '3.0'").show();

    org.apache.spark.sql.AnalysisException: cannot resolve 'select * from tmp where x1 = '1.0'' given input columns x1, x4, x5, x3, x2;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.sca

From the first statement I conclude that my custom datasource is perfectly fine. Just wondering how to fix / work around that.
--
Be well!
Jean Morozov
On Fri, Dec 25, 2015 at 6:13 PM, Igor Berman <igor.ber...@gmail.com> wrote:

sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 is supported)

or use Dmitriy's solution. select() defines your projection; here you've specified an entire query.
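A compact Java sketch of the distinction Igor is pointing at (the column name and the datasource come from Eugene's snippets; the rest is illustrative): a full SQL string goes through sqlContext.sql() on a registered temp table, while select()/where() take column expressions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class SqlVsSelect {
      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setMaster("local[2]").setAppName("sql-vs-select"));
        SQLContext sqlc = new SQLContext(sc);

        // Eugene's custom datasource; the file path is passed in as a placeholder argument.
        DataFrame df = sqlc.load(args[0], "com.epam.parso.spark.ds.DefaultSource");
        df.registerTempTable("tmptable");

        // A full SQL string belongs in sqlContext.sql(...), not in select():
        sqlc.sql("SELECT * FROM tmptable WHERE x1 = '3.0' LIMIT 5").show();

        // select()/where() take column expressions instead:
        df.select(df.col("*")).where(df.col("x1").equalTo("3.0")).limit(5).show();
        sc.stop();
      }
    }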
On 25 December 2015 at 15:42, Василец Дмитрий <pronix.serv...@gmail.com> wrote:

hello
you can try to use df.limit(5).show()
just trick :)

On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> 
wrote:

Hello, I'm basically stuck as I have no idea where to look.
The following simple code, given that my datasource is working, gives me an exception:

    DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
    df.cache();
    df.printSchema();   <-- prints the schema perfectly fine!

    df.show();          <-- works perfectly fine (shows a table with 20 lines)!
    df.registerTempTable("table");
    df.select("select * from table limit 5").show(); <-- gives a weird exception

Exception is:

    AnalysisException: cannot resolve 'select * from table limit 5' given input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS

I can do a collect on a dataframe, but cannot select any specific columns, either "select * from table" or "select VER, CREATED from table".
I use Spark 1.5.2. The same code works perfectly through Zeppelin 0.5.5.
Thanks.
--
Be well!
Jean Morozov







   



  

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Thanks Andrew for this awesome explanation  

On Tuesday, December 29, 2015 5:30 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 Let me clarify a few things for everyone:
There are three cluster managers: standalone, YARN, and Mesos. Each cluster 
manager can run in two deploy modes, client or cluster. In client mode, the 
driver runs on the machine that submitted the application (the client). In 
cluster mode, the driver runs on one of the worker machines in the cluster.
When I say "standalone cluster mode" I am referring to the standalone cluster 
manager running in cluster deploy mode.
Here's how the resources are distributed in each mode (omitting Mesos):

Standalone / YARN client mode. The driver runs on the client machine (i.e. 
machine that ran Spark submit) so it should already have access to the jars. 
The executors then pull the jars from an HTTP server started in the driver.
Standalone cluster mode. Spark submit does not upload your jars to the cluster, 
so all the resources you need must already be on all of the worker machines. 
The executors, however, actually just pull the jars from the driver as in 
client mode instead of finding it in their own local file systems.
YARN cluster mode. Spark submit does upload your jars to the cluster. In 
particular, it puts the jars in HDFS so your driver can just read from there. 
As in other deployments, the executors pull the jars from the driver.

When the docs say "If your application is launched through Spark submit, then 
the application jar is automatically distributed to all worker nodes," it is 
actually saying that your executors get their jars from the driver. This is 
true whether you're running in client mode or cluster mode.
If the docs are unclear (and they seem to be), then we should update them. I 
have filed SPARK-12565 to track this.
Please let me know if there's anything else I can help clarify.
Cheers,-Andrew



2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

Andrew,
Now I see where the confusion lies. Standalone cluster mode, your link, is nothing but a combination of client mode and standalone mode, my link, without YARN.
But I'm confused by this paragraph in your link:
        If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
That can't be true; this is only the case when Spark runs on top of YARN. 
Please correct me, if I'm wrong.
Thanks   

On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

Greg,
Can you please send me a doc describing the standalone cluster mode? Honestly, 
I never heard about it.
The three different modes I've listed appear in the last paragraph of this doc: Running Spark Applications (on www.cloudera.com).


 

On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.

@Annabel That's not true. There is a standalone cluster mode where driver runs 
on one of the workers instead of on the client machine. What you're describing 
is standalone client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

Greg,
The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode, but it can't be both.
With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a jar.
2. YARN cluster mode: client and driver run on separate machines; additionally, the driver runs as a thread in the ApplicationMaster; use the --jars option with a globally visible path to said jar.
3. YARN client mode: client and driver run on the same machine; the driver is NOT a thread in the ApplicationMaster; use --packages to submit a jar.

On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Greg,
The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode, but it can't be both.
With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a jar.
2. YARN cluster mode: client and driver run on separate machines; additionally, the driver runs as a thread in the ApplicationMaster; use the --jars option with a globally visible path to said jar.
3. YARN client mode: client and driver run on the same machine; the driver is NOT a thread in the ApplicationMaster; use --packages to submit a jar.

On Tuesday, December 29, 2015 1:54 PM, Andrew Or  
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I have 
filed https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill :



On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:

>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the workers nodes.
>
>I might have understood wrong, but I though the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.

Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.

Greg


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





  

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Greg,
Can you please send me a doc describing the standalone cluster mode? Honestly, 
I never heard about it.
The three different modes I've listed appear in the last paragraph of this doc: Running Spark Applications (on www.cloudera.com).


 

On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.

@Annabel That's not true. There is a standalone cluster mode where driver runs 
on one of the workers instead of on the client machine. What you're describing 
is standalone client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

Greg,
The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode, but it can't be both.
With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a jar.
2. YARN cluster mode: client and driver run on separate machines; additionally, the driver runs as a thread in the ApplicationMaster; use the --jars option with a globally visible path to said jar.
3. YARN client mode: client and driver run on the same machine; the driver is NOT a thread in the ApplicationMaster; use --packages to submit a jar.

On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I have 
filed https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:



On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:

>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the workers nodes.
>
>I might have understood wrong, but I though the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.

Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.

Greg


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





   



  

Re: Can't submit job to stand alone cluster

2015-12-29 Thread Annabel Melongo
Andrew,
Now I see where the confusion lies. Standalone cluster mode, your link, is nothing but a combination of client mode and standalone mode, my link, without YARN.
But I'm confused by this paragraph in your link:
        If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2).
That can't be true; this is only the case when Spark runs on top of YARN. 
Please correct me, if I'm wrong.
Thanks   

On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

Greg,
Can you please send me a doc describing the standalone cluster mode? Honestly, 
I never heard about it.
The three different modes I've listed appear in the last paragraph of this doc: Running Spark Applications (on www.cloudera.com).


 

On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 
The confusion here is the expression "standalone cluster mode". Either it's 
stand-alone or it's cluster mode but it can't be both.

@Annabel That's not true. There is a standalone cluster mode where driver runs 
on one of the workers instead of on the client machine. What you're describing 
is standalone client mode.
2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>:

Greg,
The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode, but it can't be both.
With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a jar.
2. YARN cluster mode: client and driver run on separate machines; additionally, the driver runs as a thread in the ApplicationMaster; use the --jars option with a globally visible path to said jar.
3. YARN client mode: client and driver run on the same machine; the driver is NOT a thread in the ApplicationMaster; use --packages to submit a jar.

On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> 
wrote:
 

 Hi Greg,
It's actually intentional for standalone cluster mode to not upload jars. One 
of the reasons why YARN takes at least 10 seconds before running any simple 
application is because there's a lot of random overhead (e.g. putting jars in 
HDFS). If this missing functionality is not documented somewhere then we should 
add that.

Also, the packages problem seems legitimate. Thanks for reporting it. I have 
filed https://issues.apache.org/jira/browse/SPARK-12559.
-Andrew
2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:



On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:

>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the workers nodes.
>
>I might have understood wrong, but I though the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  So the problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN which does pass the JAR along to the Driver
automatically, and IMO should probably be fixed in spark-submit.  It's
really confusing for newcomers.

Another problem I ran into that you also might is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't. I ended up having to create a fat JAR
with all of the dependencies to get around that one.

Greg


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





   



   



  

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Annabel Melongo
Additionally, if you already have some valid SQL statements to process said data, instead of reinventing the wheel using RDD functions, you can speed up implementation by using DataFrames along with those existing SQL statements.

On Monday, December 28, 2015 5:37 PM, Darren Govoni  
wrote:
 

  I'll throw a thought in here.
Dataframes are nice if your data is uniform and clean with consistent schema.
However in many big data problems this is seldom the case. 


Sent from my Verizon Wireless 4G LTE smartphone

 Original message 
From: Chris Fregly  
Date: 12/28/2015 5:22 PM (GMT-05:00) 
To: Richard Eggert  
Cc: Daniel Siegmann , Divya Gehlot 
, "user @spark"  
Subject: Re: DataFrame Vs RDDs ... Which one to use When ? 

here's a good article that sums it up, in my opinion: 
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
basically, building apps with RDDs is like building apps with primitive JVM bytecode.  haha.
@richard:  remember that even if you're currently writing RDDs in Java/Scala, 
you're not gaining the code gen/rewrite performance benefits of the Catalyst 
optimizer.
i agree with @daniel who suggested that you start with DataFrames and revert to 
RDDs only when DataFrames don't give you what you need.
the only time i use RDDs directly these days is when i'm dealing with a Spark 
library that has not yet moved to DataFrames - ie. GraphX - and it's kind of 
annoying switching back and forth.
almost everything you need should be in the DataFrame API.
Datasets are similar to RDDs, but give you strong compile-time typing, tabular 
structure, and Catalyst optimizations.
hopefully Datasets is the last API we see from Spark SQL...  i'm getting tired 
of re-writing slides and book chapters!  :)
On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert  
wrote:

One advantage of RDD's over DataFrames is that RDD's allow you to use your own 
data types, whereas DataFrames are backed by RDD's of Record objects, which are 
pretty flexible but don't give you much in the way of compile-time type 
checking. If you have an RDD of case class elements or JSON, then Spark SQL can 
automatically figure out how to convert it into an RDD of Record objects (and 
therefore a DataFrame), but there's no way to automatically go the other way 
(from DataFrame/Record back to custom types).
In general, you can ultimately do more with RDDs than DataFrames, but 
DataFrames give you a lot of niceties (automatic query optimization, table 
joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime 
overhead associated with writing RDD code in a non-JVM language (such as Python 
or R), since the query optimizer is effectively creating the required JVM code 
under the hood. There's little to no performance benefit if you're already 
writing Java or Scala code, however (and RDD-based code may actually perform 
better in some cases, if you're willing to carefully tune your code).
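A short Java sketch of the round trip Richard describes (the Person bean and the sample rows are made up; Spark 1.x API assumed): the bean-to-DataFrame direction is automatic, while going back to the custom type is a manual mapping over Row objects.

    import java.io.Serializable;
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SQLContext;

    public class RddDataFrameRoundTrip {
      public static class Person implements Serializable {
        private String name;
        private int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
      }

      public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-df"));
        SQLContext sqlContext = new SQLContext(sc);

        // RDD of a custom type -> DataFrame: the schema is inferred from the bean.
        JavaRDD<Person> people =
            sc.parallelize(Arrays.asList(new Person("Ann", 30), new Person("Bob", 25)));
        DataFrame df = sqlContext.createDataFrame(people, Person.class);
        df.show();

        // DataFrame -> RDD of generic Row objects: mapping back to Person is manual.
        JavaRDD<Row> rows = df.javaRDD();
        JavaRDD<Person> back =
            rows.map(r -> new Person(r.<String>getAs("name"), r.<Integer>getAs("age")));
        System.out.println(back.count());
        sc.stop();
      }
    }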
On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann  
wrote:

DataFrames are a higher level API for working with tabular data - RDDs are used 
underneath. You can use either and easily convert between them in your code as 
necessary.

DataFrames provide a nice abstraction for many cases, so it may be easier to 
code against them. Though if you're used to thinking in terms of collections 
rather than tables, you may find RDDs more natural. Data frames can also be 
faster, since Spark will do some optimizations under the hood - if you are 
using PySpark, this will avoid the overhead. Data frames may also perform 
better if you're reading structured data, such as a Hive table or Parquet files.

I recommend you prefer data frames, switching over to RDDs as necessary (when 
you need to perform an operation not supported by data frames / Spark SQL).

HOWEVER (and this is a big one), Spark 1.6 will have yet another API - 
datasets. The release of Spark 1.6 is currently being finalized and I would 
expect it in the next few days. You will probably want to use the new API once 
it's available.


On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot  wrote:

Hi,
I am new bee to spark and a bit confused about RDDs and DataFames in Spark.
Can somebody explain me with the use cases which one to use when ?

Would really appreciate the clarification .

Thanks,
Divya 






-- 
Rich



-- 

Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com

  

Re: Stuck with DataFrame df.select("select * from table");

2015-12-28 Thread Annabel Melongo
Jean,
Try this:

    df.select("""select * from tmptable where x1 = '3.0'""").show();

Note: you have to use 3 double quotes as marked.

On Friday, December 25, 2015 11:30 AM, Eugene Morozov 
 wrote:
 

Thanks for the comments, although the issue is not in the limit() predicate. It's something with Spark being unable to resolve the expression.

I can do something like this, and it works as it is supposed to:

    df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);

But I think the old-fashioned SQL style has to work also. I have df.registerTempTable("tmptable") and then:

    df.select("select * from tmptable where x1 = '3.0'").show();

    org.apache.spark.sql.AnalysisException: cannot resolve 'select * from tmp where x1 = '1.0'' given input columns x1, x4, x5, x3, x2;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.sca

From the first statement I conclude that my custom datasource is perfectly fine. Just wondering how to fix / work around that.
--
Be well!
Jean Morozov
On Fri, Dec 25, 2015 at 6:13 PM, Igor Berman  wrote:

sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 is supported)

or use Dmitriy's solution. select() defines your projection; here you've specified an entire query.
On 25 December 2015 at 15:42, Василец Дмитрий  wrote:

hello
you can try to use df.limit(5).show()
just trick :)

On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov  
wrote:

Hello, I'm basically stuck as I have no idea where to look.
The following simple code, given that my datasource is working, gives me an exception:

    DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
    df.cache();
    df.printSchema();   <-- prints the schema perfectly fine!

    df.show();          <-- works perfectly fine (shows a table with 20 lines)!
    df.registerTempTable("table");
    df.select("select * from table limit 5").show(); <-- gives a weird exception

Exception is:

    AnalysisException: cannot resolve 'select * from table limit 5' given input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS

I can do a collect on a dataframe, but cannot select any specific columns, either "select * from table" or "select VER, CREATED from table".
I use Spark 1.5.2. The same code works perfectly through Zeppelin 0.5.5.
Thanks.
--
Be well!
Jean Morozov







  

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin,
Maybe you didn't read my post in which I stated that Spark works on top of HDFS. What Jia wants is to have Spark interact with a C++ process to read and write data.
I've never heard about Jia's use case in Spark. If you know one, please share 
that with me.
Thanks 


On Monday, December 7, 2015 1:57 PM, Robin East <robin.e...@xense.co.uk> 
wrote:
 

 Annabel
Spark works very well with data stored in HDFS but is certainly not tied to it. 
Have a look at the wide variety of connectors to things like Cassandra, HBase, 
etc.
Robin

Sent from my iPhone
On 7 Dec 2015, at 18:50, Annabel Melongo <melongo_anna...@yahoo.com> wrote:


Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
What you're requesting, reading and writing to a C++ process, is not part of 
that requirement.

 


On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote:
 

 Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.
Best Regards,Jia 

On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.


On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
 

 Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:

Maybe looking into something like Tachyon would help, I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs using named memory mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action, by Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process. 
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia
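For what it's worth, a minimal Java-side sketch of the named memory-mapped-file idea (purely illustrative: the path is hypothetical, the C++ writer is assumed to exist, and nothing here is Spark-specific): java.nio can map the shared region without copying it onto the JVM heap.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SharedMemoryReader {
      public static void main(String[] args) throws Exception {
        // Assumes a C++ process has already written binary records into this file.
        try (RandomAccessFile file = new RandomAccessFile("/dev/shm/shared_block", "r");
             FileChannel channel = file.getChannel()) {
          MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
          // Reads go straight against the mapped region; no copy into a Java byte[].
          int firstValue = buf.getInt();
          System.out.println("first int in shared region: " + firstValue);
        }
      }
    }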


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org









   



   


  

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Robin,
To prove my point, this is an unresolved issue still in the implementation 
stage. 


On Monday, December 7, 2015 2:49 PM, Robin East <robin.e...@xense.co.uk> 
wrote:
 

 Hi Annabel
I certainly did read your post. My point was that Spark can read from HDFS but 
is in no way tied to that storage layer . A very interesting use case that 
sounds very similar to Jia's (as mentioned by another poster) is contained in 
https://issues.apache.org/jira/browse/SPARK-10399. The comments section 
provides a specific example of processing very large images using a 
pre-existing c++ library.
Robin
Sent from my iPhone
On 7 Dec 2015, at 18:50, Annabel Melongo <melongo_anna...@yahoo.com.INVALID> 
wrote:


Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
What you're requesting, reading and writing to a C++ process, is not part of 
that requirement.

 


On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote:
 

 Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.
Best Regards,Jia 

On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.


On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
 

 Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:

Maybe looking into something like Tachyon would help, I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs using named memory mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action, by Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process. 
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org









   



   


  

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.


On Monday, December 7, 2015 1:15 PM, Jia  wrote:
 

 Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful  wrote:

Maybe looking into something like Tachyon would help, I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East  wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs using named memory mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action, by Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia  wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process. 
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org









  

Re: Shared memory between C++ process and Spark

2015-12-07 Thread Annabel Melongo
Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. 
What you're requesting, reading and writing to a C++ process, is not part of 
that requirement.

 


On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote:
 

 Thanks, Annabel, but I may need to clarify that I have no intention to write 
and run Spark UDF in C++, I'm just wondering whether Spark can read and write 
data to a C++ process with zero copy.
Best Regards,Jia 

On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm 
afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.


On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote:
 

 Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote:

Maybe looking into something like Tachyon would help, I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll 
get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure 
it would be possible with enough tinkering but it’s not clear what you are 
trying to achieve. Spark is a distributed processing system, it has multiple 
JVMs running on different machines that each run a small part of the overall 
processing. Unless you have some sort of idea to have multiple C++ processes 
collocated with the distributed JVMs using named memory mapped files doesn’t 
make architectural sense. 
---
Robin East
Spark GraphX in Action, by Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote:
Dears, for one project, I need to implement something so Spark can read data 
from a C++ process. 
To provide high performance, I really hope to implement this through shared 
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org