Re: Unsubscribe
Thanks, Stephen

On Wednesday, August 26, 2020, 07:07:05 PM PDT, Stephen Coy wrote: The instructions for all Apache mailing lists are in the mail headers: List-Unsubscribe: <mailto:user-unsubscr...@spark.apache.org>

On 27 Aug 2020, at 7:49 am, Jeff Evans wrote: That is not how you unsubscribe. See here for instructions: https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e

On Wed, Aug 26, 2020, 4:22 PM Annabel Melongo wrote: Please remove me from the mailing list
Unsubscribe
Please remove me from the mailing list
unsubscribe
unsubscribe
Re: DataFrame to read json and include raw Json in DataFrame
Richard,

In the provided documentation, under the paragraph "Schema Merging", you can actually perform what you want this way:
1. Create a schema that reads the raw JSON, line by line.
2. Create another schema that reads the JSON file and structures it in ("id", "ln", "fn").
3. Merge the two schemas and you'll get what you want.
Thanks

On Thursday, December 29, 2016 7:18 PM, Richard Xin <richardxin...@yahoo.com> wrote: Thanks, I have seen this, but it doesn't cover my question. What I need is to read the JSON and include the raw JSON as part of my DataFrame.

On Friday, December 30, 2016 10:23 AM, Annabel Melongo <melongo_anna...@yahoo.com.INVALID> wrote: Richard, the documentation below will show you how to create a SparkSession and how to programmatically load data: Spark SQL and DataFrames - Spark 2.1.0 Documentation

On Thursday, December 29, 2016 5:16 PM, Richard Xin <richardxin...@yahoo.com.INVALID> wrote: Say I have the following data in a file:
{"id":1234,"ln":"Doe","fn":"John","age":25}
{"id":1235,"ln":"Doe","fn":"Jane","age":22}

Java code snippet:
final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext hc = new HiveContext(ctx.sc());
DataFrame df = hc.read().json("files/json/example2.json");

What I need is a DataFrame with columns id, ln, fn, age, as well as a raw_json string. Any advice on the best practice in Java? Thanks, Richard
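A minimal Java sketch of one way to get both the parsed columns and the raw line, reusing the HiveContext hc from Richard's snippet above (DataFrameReader.text() and functions.get_json_object(), available from Spark 1.6 on, plus the JSONPath strings are assumptions of this sketch, not something given in the thread):

    import org.apache.spark.sql.DataFrame;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.get_json_object;

    // Read every line as plain text so the original JSON string survives as a column.
    DataFrame raw = hc.read().text("files/json/example2.json")
                      .withColumnRenamed("value", "raw_json");

    // Pull the individual fields back out of the raw string (extracted values come back as strings).
    DataFrame df = raw
        .withColumn("id",  get_json_object(col("raw_json"), "$.id"))
        .withColumn("ln",  get_json_object(col("raw_json"), "$.ln"))
        .withColumn("fn",  get_json_object(col("raw_json"), "$.fn"))
        .withColumn("age", get_json_object(col("raw_json"), "$.age"));

    df.show();  // columns: raw_json, id, ln, fn, age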
Re: DataFrame to read json and include raw Json in DataFrame
Richard,

The documentation below will show you how to create a SparkSession and how to programmatically load data: Spark SQL and DataFrames - Spark 2.1.0 Documentation

On Thursday, December 29, 2016 5:16 PM, Richard Xin wrote: Say I have the following data in a file:
{"id":1234,"ln":"Doe","fn":"John","age":25}
{"id":1235,"ln":"Doe","fn":"Jane","age":22}

Java code snippet:
final SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("json_test");
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
HiveContext hc = new HiveContext(ctx.sc());
DataFrame df = hc.read().json("files/json/example2.json");

What I need is a DataFrame with columns id, ln, fn, age, as well as a raw_json string. Any advice on the best practice in Java? Thanks, Richard
Re: trouble using eclipse to view spark source code
Andy,

This has nothing to do with Spark, but I guess you don't have the proper Scala version. The version you're currently running doesn't recognize a method in Scala ArrayOps, namely: scala.collection.mutable.ArrayOps.$colon$plus

On Monday, January 18, 2016 7:53 PM, Andy Davidson wrote: Many thanks. I was using a different Scala plug-in; this one seems to work better. I no longer get compile errors; however, I get the following stack trace when I try to run my unit tests with mllib open. I am still using Eclipse Luna. Andy

java.lang.NoSuchMethodError: scala.collection.mutable.ArrayOps.$colon$plus(Ljava/lang/Object;Lscala/reflect/ClassTag;)Ljava/lang/Object;
at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:73)
at org.apache.spark.ml.feature.HashingTF.transformSchema(HashingTF.scala:76)
at org.apache.spark.ml.feature.HashingTF.transform(HashingTF.scala:64)
at com.pws.fantasySport.ml.TDIDFTest.runPipleLineTF_IDF(TDIDFTest.java:52)
at com.pws.fantasySport.ml.TDIDFTest.test(TDIDFTest.java:36)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)

From: Jakob Odersky
Date: Monday, January 18, 2016 at 3:20 PM
To: Andrew Davidson
Cc: "user @spark"
Subject: Re: trouble using eclipse to view spark source code

Have you followed the guide on how to import spark into eclipse https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse ?

On 18 January 2016 at 13:04, Andy Davidson wrote: Hi, my project is implemented using Java 8 and Python. Sometimes it's handy to look at the Spark source code.
For some unknown reason, if I open a Spark project my Java projects show tons of compiler errors; I think it may have something to do with Scala. If I close the Spark projects, my Java code is fine. Typically I only want to import the machine learning and streaming projects. I am not sure whether this matters, but my Java projects are built using Gradle. In Eclipse preferences -> Scala -> Installations I selected Scala: 2.10.6 (built in). Any suggestions would be greatly appreciated. Andy

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: pre-install 3-party Python package on spark cluster
When you run spark-submit in either client or cluster mode, you can use the --packages or --jars options to automatically copy your packages to the worker machines. Thanks

On Monday, January 11, 2016 12:52 PM, Andy Davidson wrote: I use https://code.google.com/p/parallel-ssh/ to upgrade all my slaves.

From: "taotao.li"
Date: Sunday, January 10, 2016 at 9:50 PM
To: "user @spark"
Subject: pre-install 3-party Python package on spark cluster

I have a Spark cluster, from machine-1 to machine-100, and machine-1 acts as the master. Then one day my program needs a third-party Python package which is not installed on every machine of the cluster. So here comes my problem: to make that third-party Python package usable on the master and slaves, should I manually ssh to every machine and use pip to install it? I believe there should be some deploy scripts or other things to make this graceful, but I can't find anything after googling.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pre-install-3-party-Python-package-on-spark-cluster-tp25930.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
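Note that --packages and --jars ship JVM dependencies; for pure-Python dependencies spark-submit also accepts --py-files with a comma-separated list of .py/.zip/.egg files that are shipped to the executors. A minimal sketch of the submit command (master URL and file names are made up for illustration):

    spark-submit \
      --master spark://master-host:7077 \
      --py-files deps/helpers.zip,deps/third_party.egg \
      my_job.py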
Re: Spark job uses only one Worker
Michael,

I don't know what your environment is, but if it's Cloudera you should be able to see the link to your master in Hue. Thanks

On Thursday, January 7, 2016 5:03 PM, Michael Pisula wrote: I had tried several parameters, including --total-executor-cores, no effect. As for the port, I tried 7077, but if I remember correctly I got some kind of error that suggested to try 6066, with which it worked just fine (apart from this issue here). Each worker has two cores. I also tried increasing cores, again no effect. I was able to increase the number of cores the job was using on one worker, but it would not use any other worker (and it would not start if the number of cores the job wanted was higher than the number available on one worker).

On 07.01.2016 22:51, Igor Berman wrote: read about --total-executor-cores; not sure why you specify port 6066 in master...usually it's 7077; verify in the master UI (usually port 8080) how many cores are there (depends on other configs, but usually workers connect to the master with all their cores)

On 7 January 2016 at 23:46, Michael Pisula wrote: Hi, I start the cluster using the spark-ec2 scripts, so the cluster is in stand-alone mode. Here is how I submit my job: spark/bin/spark-submit --class demo.spark.StaticDataAnalysis --master spark://:6066 --deploy-mode cluster demo/Demo-1.0-SNAPSHOT-all.jar Cheers, Michael

On 07.01.2016 22:41, Igor Berman wrote: share how you submit your job, and what cluster (yarn, standalone)

On 7 January 2016 at 23:24, Michael Pisula wrote: Hi there, I ran a simple Batch Application on a Spark Cluster on EC2. Despite having 3 Worker Nodes, I could not get the application processed on more than one node, regardless of whether I submitted the Application in Cluster or Client mode. I also tried manually increasing the number of partitions in the code, no effect. I also pass the master into the application. I verified on the nodes themselves that only one node was active while the job was running. I pass enough data to make the job take 6 minutes to process. The job is simple enough, reading data from two S3 files, joining records on a shared field, filtering out some records and writing the result back to S3. Tried all kinds of stuff, but could not make it work. I did find similar questions, but had already tried the solutions that worked in those cases. Would be really happy about any pointers. Cheers, Michael

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-job-uses-only-one-Worker-tp25909.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

--
Michael Pisula * michael.pis...@tngtech.com * +49-174-3180084
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082
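For reference, a sketch of how the flags discussed above fit into Michael's submit command (host name and numbers are placeholders; this only illustrates the syntax, it is not a claimed fix). On the standalone manager, --total-executor-cores caps the cores for the whole application while --executor-cores sets the cores per executor, so with three 2-core workers the combination below should ask for three 2-core executors spread across the workers:

    spark/bin/spark-submit \
      --class demo.spark.StaticDataAnalysis \
      --master spark://master-host:6066 --deploy-mode cluster \
      --total-executor-cores 6 --executor-cores 2 \
      demo/Demo-1.0-SNAPSHOT-all.jar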
Re: Date Time Regression as Feature
Or he can also transform the whole date into a string.

On Thursday, January 7, 2016 2:25 PM, Sujit Pal wrote: Hi Jorge, Maybe extract things like dd, mm, day of week, time of day from the datetime string and use them as features? -sujit

On Thu, Jan 7, 2016 at 11:09 AM, Jorge Machado wrote: Hello all, I'm new to machine learning. I'm trying to predict some electric usage with a decision tree. The data is:
2015-12-10-10:00, 1200
2015-12-11-10:00, 1150
My question is: what is the best way to turn date and time into features in my Vector? Something like this: Vector (1200, [2015,12,10,10,10])? I could not find any example of value prediction where the features had dates in them. Thanks, Jorge Machado jo...@jmachado.me

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
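A small Java sketch of Sujit's suggestion, turning the timestamp string into numeric features for MLlib (the yyyy-MM-dd-HH:mm pattern is inferred from the sample rows; which components to keep is a modelling choice):

    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd-HH:mm");
    LocalDateTime ts = LocalDateTime.parse("2015-12-10-10:00", fmt);

    // month, day of month, day of week (1 = Monday .. 7 = Sunday) and hour of day as features
    Vector features = Vectors.dense(
        ts.getMonthValue(), ts.getDayOfMonth(),
        ts.getDayOfWeek().getValue(), ts.getHour());

    // label is the usage value from the same row
    LabeledPoint point = new LabeledPoint(1200.0, features);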
Re: java.io.FileNotFoundException(Too many open files) in Spark streaming
Vijay,

Are you closing the FileInputStream at the end of each loop iteration (in.close())? My guess is those streams aren't closed, and thus the "too many open files" exception.

On Tuesday, January 5, 2016 8:03 AM, Priya Ch wrote: Can someone throw light on this? Regards, Padma Ch

On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch wrote: Chris, we are using Spark version 1.3.0. We have not set the spark.streaming.concurrentJobs parameter; it takes the default value. Vijay, from the stack trace it is evident that org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730) is throwing the exception. I opened the Spark source code and visited the line which is throwing this exception, i.e. the line which is marked in red. The file is ExternalSorter.scala in the org.apache.spark.util.collection package. I went through the following blog http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/ and understood that there is a merge factor which decides the number of on-disk files that can be merged. Is it in some way related to this? Regards, Padma Ch

On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly wrote: and which version of Spark/Spark Streaming are you using? are you explicitly setting spark.streaming.concurrentJobs to something larger than the default of 1? if so, please try setting that back to 1 and see if the problem still exists. this is a dangerous parameter to modify from the default - which is why it's not well-documented.

On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge wrote: Few indicators - 1) during execution time, check the total number of open files using the lsof command (needs root permissions; if it is a cluster, not sure how much this helps). 2) which exact line in the code is triggering this error? Can you paste that snippet?

On Wednesday 23 December 2015, Priya Ch wrote: ulimit -n 65000, fs.file-max = 65000 (in the /etc/sysctl.conf file) Thanks, Padma Ch

On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma wrote: Could you share the ulimit for your setup please? - Thanks, via mobile, excuse brevity.

On Dec 22, 2015 6:39 PM, "Priya Ch" wrote: Jakob, increased the settings like fs.file-max in /etc/sysctl.conf and also increased the user limit in /etc/security/limits.conf. But I still see the same issue.

On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky wrote: It might be a good idea to see how many files are open and try increasing the open file limit (this is done on an OS level). In some application use-cases it is actually a legitimate need. If that doesn't help, make sure you close any unused files and streams in your code. It will also be easier to help diagnose the issue if you send an error-reproducing snippet.

--
Regards, Vijay Gharge

--
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
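A small Java sketch of the stream-closing suggestion above, using try-with-resources so the file descriptor is released on every iteration even when an exception is thrown (the file paths are illustrative):

    import java.io.FileInputStream;
    import java.io.IOException;

    String[] paths = {"/data/part-00000", "/data/part-00001"};  // illustrative
    for (String path : paths) {
        // the stream is closed automatically at the end of each iteration, even on error
        try (FileInputStream in = new FileInputStream(path)) {
            // ... read from 'in' ...
        } catch (IOException e) {
            // log and continue, or rethrow
        }
    }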
Re: Is Spark 1.6 released?
[1] http://spark.apache.org/releases/spark-release-1-6-0.html
[2] http://spark.apache.org/downloads.html

On Monday, January 4, 2016 2:59 PM, "saif.a.ell...@wellsfargo.com" wrote: Where can I read more about the Dataset API at the user level? I am failing to find an API doc or to understand when to use DataFrame or Dataset, advantages, etc. Thanks, Saif

-Original Message-
From: Jean-Baptiste Onofré [mailto:j...@nanthrax.net]
Sent: Monday, January 04, 2016 2:01 PM
To: user@spark.apache.org
Subject: Re: Is Spark 1.6 released?

It's now OK: Michael published and announced the release. Sorry for the delay. Regards, JB

On 01/04/2016 10:06 AM, Jung wrote:
> Hi
> There were Spark 1.6 jars in Maven Central and on GitHub.
> I found them 5 days ago. But it doesn't appear on the Spark website now.
> May I regard the Spark 1.6 zip file on GitHub as a stable release?
>
> Thanks
> Jung

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
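On Saif's question about the Dataset API: in 1.6 it hangs off DataFrame.as(Encoder). A tiny Java sketch, assuming an existing SQLContext named sqlContext and an arbitrary text file (treat the exact signatures as a 1.6.0-era recollection rather than gospel):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;

    // DataFrame with a single string column -> strongly typed Dataset<String>
    Dataset<String> lines = sqlContext.read().text("README.md").as(Encoders.STRING());
    lines.show(5);
    System.out.println(lines.count());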
Re: Stuck with DataFrame df.select("select * from table");
Eugene,

The example I gave you was in Python. I used it on my end and it works fine. Sorry, I don't know Scala. Thanks

On Tuesday, December 29, 2015 5:24 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: Annabel, That might work in Scala, but I use Java. Three quotes just don't compile =) If your example is in Scala, then, I believe, the semicolon is not required. -- Be well! Jean Morozov

On Mon, Dec 28, 2015 at 8:49 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote: Jean, Try this: df.select("""select * from tmptable where x1 = '3.0'""").show(); Note: you have to use 3 double quotes as marked

On Friday, December 25, 2015 11:30 AM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: Thanks for the comments, although the issue is not in the limit() predicate. It's something with Spark being unable to resolve the expression. I can do something like this, and it works as it's supposed to:

df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);

But I think the old-fashioned SQL style has to work also. I have df.registerTempTable("tmptable") and then df.select("select * from tmptable where x1 = '3.0'").show();

org.apache.spark.sql.AnalysisException: cannot resolve 'select * from tmp where x1 = '1.0'' given input columns x1, x4, x5, x3, x2;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.sca

From the first statement I conclude that my custom datasource is perfectly fine. Just wonder how to fix / work around that. -- Be well! Jean Morozov

On Fri, Dec 25, 2015 at 6:13 PM, Igor Berman <igor.ber...@gmail.com> wrote: sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 is supported) or use Dmitriy's solution. select() defines your projection when you've specified the entire query

On 25 December 2015 at 15:42, Василец Дмитрий <pronix.serv...@gmail.com> wrote: hello, you can try to use df.limit(5).show() just a trick :)

On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov <evgeny.a.moro...@gmail.com> wrote: Hello, I'm basically stuck as I have no idea where to look. The following simple code, given that my Datasource is working, gives me an exception:

DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
df.cache();
df.printSchema();   <-- prints the schema perfectly fine!
df.show();          <-- works perfectly fine (shows table with 20 lines)!
df.registerTempTable("table");
df.select("select * from table limit 5").show();   <-- gives a weird exception

Exception is: AnalysisException: cannot resolve 'select * from table limit 5' given input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS

I can do a collect on a DataFrame, but cannot select any specific columns, either "select * from table" or "select VER, CREATED from table". I use Spark 1.5.2. The same code perfectly works through Zeppelin 0.5.5. Thanks. -- Be well! Jean Morozov
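For reference, a short Java sketch of the distinction that trips people up in this thread: DataFrame.select() takes column names/expressions, while a full SQL string goes through the SQLContext (reusing the df and sqlc names from the code above; column names are illustrative, 1.5.x-era API):

    // select() wants columns, not a SQL statement:
    df.select("x1", "x2").show(5);
    df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);

    // a SQL string needs a temp table plus SQLContext.sql():
    df.registerTempTable("tmptable");
    sqlc.sql("SELECT * FROM tmptable WHERE x1 = '3.0' LIMIT 5").show();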
Re: Can't submit job to stand alone cluster
Thanks Andrew for this awesome explanation.

On Tuesday, December 29, 2015 5:30 PM, Andrew Or <and...@databricks.com> wrote: Let me clarify a few things for everyone: There are three cluster managers: standalone, YARN, and Mesos. Each cluster manager can run in two deploy modes, client or cluster. In client mode, the driver runs on the machine that submitted the application (the client). In cluster mode, the driver runs on one of the worker machines in the cluster. When I say "standalone cluster mode" I am referring to the standalone cluster manager running in cluster deploy mode. Here's how the resources are distributed in each mode (omitting Mesos):

- Standalone / YARN client mode: The driver runs on the client machine (i.e. the machine that ran spark-submit), so it should already have access to the jars. The executors then pull the jars from an HTTP server started in the driver.

- Standalone cluster mode: spark-submit does not upload your jars to the cluster, so all the resources you need must already be on all of the worker machines. The executors, however, actually just pull the jars from the driver as in client mode instead of finding them in their own local file systems.

- YARN cluster mode: spark-submit does upload your jars to the cluster. In particular, it puts the jars in HDFS so your driver can just read from there. As in other deployments, the executors pull the jars from the driver.

When the docs say "If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes," they are actually saying that your executors get their jars from the driver. This is true whether you're running in client mode or cluster mode. If the docs are unclear (and they seem to be), then we should update them. I have filed SPARK-12565 to track this. Please let me know if there's anything else I can help clarify. Cheers, -Andrew

2015-12-29 13:07 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>: Andrew, Now I see where the confusion lies. Standalone cluster mode, your link, is nothing but a combination of client mode and standalone mode, my link, without YARN. But I'm confused by this paragraph in your link: "If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2)." That can't be true; this is only the case when Spark runs on top of YARN. Please correct me if I'm wrong. Thanks

On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> wrote: http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>: Greg, Can you please send me a doc describing the standalone cluster mode? Honestly, I never heard about it. The three different modes I've listed appear in the last paragraph of this doc: Running Spark Applications (www.cloudera.com)

On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> wrote: The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode but it can't be both. @Annabel That's not true.
There is a standalone cluster mode where the driver runs on one of the workers instead of on the client machine. What you're describing is standalone client mode.

2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>: Greg, The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode but it can't be both. With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a jar
2. YARN cluster mode: client and driver run on separate machines; additionally, the driver runs as a thread in the ApplicationMaster; use the --jars option with a globally visible path to said jar
3. YARN client mode: client and driver run on the same machine; the driver is NOT a thread in the ApplicationMaster; use --packages to submit a jar

On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> wrote: Hi Greg, It's actually intentional for standalone cluster mode to not upload jars. One of the reasons why YARN takes at least 10 seconds before running any simple application is because there's a lot of random overhead (e.g. putting jars in HDFS). If this missing functionality is not documented somewhere then we should add that.
Re: Can't submit job to stand alone cluster
Greg, The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode but it can't be both. With this in mind, here's how jars are uploaded: 1. Spark Stand-alone mode: client and driver run on the same machine; use --packages option to submit a jar 2. Yarn Cluster-mode: client and driver run on separate machines; additionally driver runs as a thread in ApplicationMaster; use --jars option with a globally visible path to said jar 3. Yarn Client-mode: client and driver run on the same machine. driver is NOT a thread in ApplicationMaster; use --packages to submit a jar On Tuesday, December 29, 2015 1:54 PM, Andrew Orwrote: Hi Greg, It's actually intentional for standalone cluster mode to not upload jars. One of the reasons why YARN takes at least 10 seconds before running any simple application is because there's a lot of random overhead (e.g. putting jars in HDFS). If this missing functionality is not documented somewhere then we should add that. Also, the packages problem seems legitimate. Thanks for reporting it. I have filed https://issues.apache.org/jira/browse/SPARK-12559. -Andrew 2015-12-29 4:18 GMT-08:00 Greg Hill : On 12/28/15, 5:16 PM, "Daniel Valdivia" wrote: >Hi, > >I'm trying to submit a job to a small spark cluster running in stand >alone mode, however it seems like the jar file I'm submitting to the >cluster is "not found" by the workers nodes. > >I might have understood wrong, but I though the Driver node would send >this jar file to the worker nodes, or should I manually send this file to >each worker node before I submit the job? Yes, you have misunderstood, but so did I. So the problem is that --deploy-mode cluster runs the Driver on the cluster as well, and you don't know which node it's going to run on, so every node needs access to the JAR. spark-submit does not pass the JAR along to the Driver, but the Driver will pass it to the executors. I ended up putting the JAR in HDFS and passing an hdfs:// path to spark-submit. This is a subtle difference from Spark on YARN which does pass the JAR along to the Driver automatically, and IMO should probably be fixed in spark-submit. It's really confusing for newcomers. Another problem I ran into that you also might is that --packages doesn't work with --deploy-mode cluster. It downloads the packages to a temporary location on the node running spark-submit, then passes those paths to the node that is running the Driver, but since that isn't the same machine, it can't find anything and fails. The driver process *should* be the one doing the downloading, but it isn't. I ended up having to create a fat JAR with all of the dependencies to get around that one. Greg - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
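To make the jar-distribution rules above concrete for the standalone manager, a sketch of the two submit variants (host, ports, class name and paths are placeholders; the hdfs:// jar path is the workaround Greg describes):

    # client mode: the jar only has to exist on the submitting machine;
    # executors fetch it from the HTTP server started by the driver
    spark-submit --master spark://master-host:7077 --deploy-mode client \
      --class com.example.App /local/path/app.jar

    # cluster mode: the driver may start on any worker and spark-submit will not
    # upload the jar, so point at a globally visible location such as HDFS
    spark-submit --master spark://master-host:6066 --deploy-mode cluster \
      --class com.example.App hdfs:///jars/app.jar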
Re: Can't submit job to stand alone cluster
Greg,

Can you please send me a doc describing the standalone cluster mode? Honestly, I never heard about it. The three different modes I've listed appear in the last paragraph of this doc: Running Spark Applications (www.cloudera.com)

On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> wrote: The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode but it can't be both. @Annabel That's not true. There is a standalone cluster mode where the driver runs on one of the workers instead of on the client machine. What you're describing is standalone client mode.

2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>: Greg, The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode but it can't be both. With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a jar
2. YARN cluster mode: client and driver run on separate machines; additionally, the driver runs as a thread in the ApplicationMaster; use the --jars option with a globally visible path to said jar
3. YARN client mode: client and driver run on the same machine; the driver is NOT a thread in the ApplicationMaster; use --packages to submit a jar

On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> wrote: Hi Greg, It's actually intentional for standalone cluster mode to not upload jars. One of the reasons why YARN takes at least 10 seconds before running any simple application is because there's a lot of random overhead (e.g. putting jars in HDFS). If this missing functionality is not documented somewhere then we should add that. Also, the packages problem seems legitimate. Thanks for reporting it. I have filed https://issues.apache.org/jira/browse/SPARK-12559. -Andrew

2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:

On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the worker nodes.
>
>I might have understood wrong, but I thought the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I. So the problem is that --deploy-mode cluster runs the Driver on the cluster as well, and you don't know which node it's going to run on, so every node needs access to the JAR. spark-submit does not pass the JAR along to the Driver, but the Driver will pass it to the executors. I ended up putting the JAR in HDFS and passing an hdfs:// path to spark-submit. This is a subtle difference from Spark on YARN, which does pass the JAR along to the Driver automatically, and IMO should probably be fixed in spark-submit. It's really confusing for newcomers. Another problem I ran into that you also might is that --packages doesn't work with --deploy-mode cluster. It downloads the packages to a temporary location on the node running spark-submit, then passes those paths to the node that is running the Driver, but since that isn't the same machine, it can't find anything and fails.
The driver process *should* be the one doing the downloading, but it isn't. I ended up having to create a fat JAR with all of the dependencies to get around that one. Greg - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Can't submit job to stand alone cluster
Andrew,

Now I see where the confusion lies. Standalone cluster mode, your link, is nothing but a combination of client mode and standalone mode, my link, without YARN. But I'm confused by this paragraph in your link: "If your application is launched through Spark submit, then the application jar is automatically distributed to all worker nodes. For any additional jars that your application depends on, you should specify them through the --jars flag using comma as a delimiter (e.g. --jars jar1,jar2)." That can't be true; this is only the case when Spark runs on top of YARN. Please correct me if I'm wrong. Thanks

On Tuesday, December 29, 2015 2:54 PM, Andrew Or <and...@databricks.com> wrote: http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications

2015-12-29 11:48 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>: Greg, Can you please send me a doc describing the standalone cluster mode? Honestly, I never heard about it. The three different modes I've listed appear in the last paragraph of this doc: Running Spark Applications (www.cloudera.com)

On Tuesday, December 29, 2015 2:42 PM, Andrew Or <and...@databricks.com> wrote: The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode but it can't be both. @Annabel That's not true. There is a standalone cluster mode where the driver runs on one of the workers instead of on the client machine. What you're describing is standalone client mode.

2015-12-29 11:32 GMT-08:00 Annabel Melongo <melongo_anna...@yahoo.com>: Greg, The confusion here is the expression "standalone cluster mode". Either it's stand-alone or it's cluster mode but it can't be both. With this in mind, here's how jars are uploaded:
1. Spark stand-alone mode: client and driver run on the same machine; use the --packages option to submit a jar
2. YARN cluster mode: client and driver run on separate machines; additionally, the driver runs as a thread in the ApplicationMaster; use the --jars option with a globally visible path to said jar
3. YARN client mode: client and driver run on the same machine; the driver is NOT a thread in the ApplicationMaster; use --packages to submit a jar

On Tuesday, December 29, 2015 1:54 PM, Andrew Or <and...@databricks.com> wrote: Hi Greg, It's actually intentional for standalone cluster mode to not upload jars. One of the reasons why YARN takes at least 10 seconds before running any simple application is because there's a lot of random overhead (e.g. putting jars in HDFS). If this missing functionality is not documented somewhere then we should add that. Also, the packages problem seems legitimate. Thanks for reporting it. I have filed https://issues.apache.org/jira/browse/SPARK-12559. -Andrew

2015-12-29 4:18 GMT-08:00 Greg Hill <greg.h...@rackspace.com>:

On 12/28/15, 5:16 PM, "Daniel Valdivia" <h...@danielvaldivia.com> wrote:
>Hi,
>
>I'm trying to submit a job to a small spark cluster running in stand
>alone mode, however it seems like the jar file I'm submitting to the
>cluster is "not found" by the worker nodes.
>
>I might have understood wrong, but I thought the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.
So the problem is that --deploy-mode cluster runs the Driver on the cluster as well, and you don't know which node it's going to run on, so every node needs access to the JAR. spark-submit does not pass the JAR along to the Driver, but the Driver will pass it to the executors. I ended up putting the JAR in HDFS and passing an hdfs:// path to spark-submit. This is a subtle difference from Spark on YARN which does pass the JAR along to the Driver automatically, and IMO should probably be fixed in spark-submit. It's really confusing for newcomers. Another problem I ran into that you also might is that --packages doesn't work with --deploy-mode cluster. It downloads the packages to a temporary location on the node running spark-submit, then passes those paths to the node that is running the Driver, but since that isn't the same machine, it can't find anything and fails. The driver process *should* be the one doing the downloading, but it isn't. I ended up having to create a fat JAR with all of the dependencies to get around that one. Greg - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: DataFrame Vs RDDs ... Which one to use When ?
Additionally, if you already have some legal sql statements to process said data, instead of reinventing the wheel using rdd's functions, you can speed up implementation by using dataframes along with these existing sql statements. On Monday, December 28, 2015 5:37 PM, Darren Govoniwrote: I'll throw a thought in here. Dataframes are nice if your data is uniform and clean with consistent schema. However in many big data problems this is seldom the case. Sent from my Verizon Wireless 4G LTE smartphone Original message From: Chris Fregly Date: 12/28/2015 5:22 PM (GMT-05:00) To: Richard Eggert Cc: Daniel Siegmann , Divya Gehlot , "user @spark" Subject: Re: DataFrame Vs RDDs ... Which one to use When ? here's a good article that sums it up, in my opinion: https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/ basically, building apps with RDDs is like building with apps with primitive JVM bytecode. haha. @richard: remember that even if you're currently writing RDDs in Java/Scala, you're not gaining the code gen/rewrite performance benefits of the Catalyst optimizer. i agree with @daniel who suggested that you start with DataFrames and revert to RDDs only when DataFrames don't give you what you need. the only time i use RDDs directly these days is when i'm dealing with a Spark library that has not yet moved to DataFrames - ie. GraphX - and it's kind of annoying switching back and forth. almost everything you need should be in the DataFrame API. Datasets are similar to RDDs, but give you strong compile-time typing, tabular structure, and Catalyst optimizations. hopefully Datasets is the last API we see from Spark SQL... i'm getting tired of re-writing slides and book chapters! :) On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert wrote: One advantage of RDD's over DataFrames is that RDD's allow you to use your own data types, whereas DataFrames are backed by RDD's of Record objects, which are pretty flexible but don't give you much in the way of compile-time type checking. If you have an RDD of case class elements or JSON, then Spark SQL can automatically figure out how to convert it into an RDD of Record objects (and therefore a DataFrame), but there's no way to automatically go the other way (from DataFrame/Record back to custom types). In general, you can ultimately do more with RDDs than DataFrames, but DataFrames give you a lot of niceties (automatic query optimization, table joins, SQL-like syntax, etc.) for free, and can avoid some of the runtime overhead associated with writing RDD code in a non-JVM language (such as Python or R), since the query optimizer is effectively creating the required JVM code under the hood. There's little to no performance benefit if you're already writing Java or Scala code, however (and RDD-based code may actually perform better in some cases, if you're willing to carefully tune your code). On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann wrote: DataFrames are a higher level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary. DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to thinking in terms of collections rather than tables, you may find RDDs more natural. Data frames can also be faster, since Spark will do some optimizations under the hood - if you are using PySpark, this will avoid the overhead. 
Data frames may also perform better if you're reading structured data, such as a Hive table or Parquet files. I recommend you prefer data frames, switching over to RDDs as necessary (when you need to perform an operation not supported by data frames / Spark SQL). HOWEVER (and this is a big one), Spark 1.6 will have yet another API - datasets. The release of Spark 1.6 is currently being finalized and I would expect it in the next few days. You will probably want to use the new API once it's available.

On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot wrote: Hi, I am a newbie to Spark and a bit confused about RDDs and DataFrames in Spark. Can somebody explain to me, with use cases, which one to use when? Would really appreciate the clarification. Thanks, Divya

--
Rich

--
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
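Since the thread keeps coming back to "you can convert between them as needed", here is a tiny Java sketch of the round trip (names are illustrative; Person is assumed to be a plain bean class with name/age getters, and sqlContext an existing SQLContext):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;

    // DataFrame -> RDD of generic Rows
    JavaRDD<Row> rows = df.javaRDD();

    // RDD of beans -> DataFrame (schema inferred from the Person getters)
    DataFrame people = sqlContext.createDataFrame(personRdd, Person.class);
    people.registerTempTable("people");
    DataFrame adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18");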
Re: Stuck with DataFrame df.select("select * from table");
Jean,

Try this: df.select("""select * from tmptable where x1 = '3.0'""").show(); Note: you have to use 3 double quotes as marked.

On Friday, December 25, 2015 11:30 AM, Eugene Morozov wrote: Thanks for the comments, although the issue is not in the limit() predicate. It's something with Spark being unable to resolve the expression. I can do something like this, and it works as it's supposed to:

df.select(df.col("*")).where(df.col("x1").equalTo(3.0)).show(5);

But I think the old-fashioned SQL style has to work also. I have df.registerTempTable("tmptable") and then df.select("select * from tmptable where x1 = '3.0'").show();

org.apache.spark.sql.AnalysisException: cannot resolve 'select * from tmp where x1 = '1.0'' given input columns x1, x4, x5, x3, x2;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:56)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.sca

From the first statement I conclude that my custom datasource is perfectly fine. Just wonder how to fix / work around that. -- Be well! Jean Morozov

On Fri, Dec 25, 2015 at 6:13 PM, Igor Berman wrote: sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 is supported) or use Dmitriy's solution. select() defines your projection when you've specified the entire query

On 25 December 2015 at 15:42, Василец Дмитрий wrote: hello, you can try to use df.limit(5).show() just a trick :)

On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov wrote: Hello, I'm basically stuck as I have no idea where to look. The following simple code, given that my Datasource is working, gives me an exception:

DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
df.cache();
df.printSchema();   <-- prints the schema perfectly fine!
df.show();          <-- works perfectly fine (shows table with 20 lines)!
df.registerTempTable("table");
df.select("select * from table limit 5").show();   <-- gives a weird exception

Exception is: AnalysisException: cannot resolve 'select * from table limit 5' given input columns VER, CREATED, SOC, SOCC, HLTC, HLGTC, STATUS

I can do a collect on a DataFrame, but cannot select any specific columns, either "select * from table" or "select VER, CREATED from table". I use Spark 1.5.2. The same code perfectly works through Zeppelin 0.5.5. Thanks. -- Be well! Jean Morozov
Re: Shared memory between C++ process and Spark
Robin,

Maybe you didn't read my post in which I stated that Spark works on top of HDFS. What Jia wants is to have Spark interact with a C++ process to read and write data. I've never heard about Jia's use case in Spark. If you know one, please share that with me. Thanks

On Monday, December 7, 2015 1:57 PM, Robin East <robin.e...@xense.co.uk> wrote: Annabel, Spark works very well with data stored in HDFS but is certainly not tied to it. Have a look at the wide variety of connectors to things like Cassandra, HBase, etc. Robin Sent from my iPhone

On 7 Dec 2015, at 18:50, Annabel Melongo <melongo_anna...@yahoo.com> wrote: Jia, I'm so confused on this. The architecture of Spark is to run on top of HDFS. What you're requesting, reading and writing to a C++ process, is not part of that requirement.

On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote: Thanks, Annabel, but I may need to clarify that I have no intention to write and run Spark UDFs in C++; I'm just wondering whether Spark can read and write data to a C++ process with zero copy. Best Regards, Jia

On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote: My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R. The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.

On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote: Thanks, Dewful! My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism. Best Regards, Jia

On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote: Maybe looking into something like Tachyon would help; I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi Robin, Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated! Best Regards, Jia

On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote: -dev, +user (this is not a question about development of Spark itself so you'll get more answers in the user mailing list) First up let me say that I don't really know how this could be done - I'm sure it would be possible with enough tinkering but it's not clear what you are trying to achieve. Spark is a distributed processing system; it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory mapped files doesn't make architectural sense.

---
Robin East
Spark GraphX in Action, Michael Malak and Robin East
Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote: Dears, for one project, I need to implement something so Spark can read data from a C++ process.
To provide high performance, I really hope to implement this through shared memory between the C++ process and the Java (JVM) process. It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there are any existing efforts or a more efficient approach to do this? Thank you very much! Best Regards, Jia

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
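On the named-memory-mapped-file idea specifically, a minimal Java-side sketch (the /dev/shm path and the int layout are pure assumptions; whatever the C++ side mmap()s and writes would have to match):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    try (RandomAccessFile raf = new RandomAccessFile("/dev/shm/spark_cxx_block", "r");
         FileChannel ch = raf.getChannel()) {
        // Map the region the C++ process wrote; no copy through sockets or pipes.
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        int recordCount = buf.getInt();  // first field of the agreed-upon layout
        // ... hand the buffer (or slices of it) to the code that builds RDD partitions ...
    } catch (IOException e) {
        // handle the error
    }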
Re: Shared memory between C++ process and Spark
Robin,

To prove my point, this is an unresolved issue still in the implementation stage.

On Monday, December 7, 2015 2:49 PM, Robin East <robin.e...@xense.co.uk> wrote: Hi Annabel, I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that storage layer. A very interesting use case that sounds very similar to Jia's (as mentioned by another poster) is contained in https://issues.apache.org/jira/browse/SPARK-10399. The comments section provides a specific example of processing very large images using a pre-existing C++ library. Robin Sent from my iPhone

On 7 Dec 2015, at 18:50, Annabel Melongo <melongo_anna...@yahoo.com.INVALID> wrote: Jia, I'm so confused on this. The architecture of Spark is to run on top of HDFS. What you're requesting, reading and writing to a C++ process, is not part of that requirement.

On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote: Thanks, Annabel, but I may need to clarify that I have no intention to write and run Spark UDFs in C++; I'm just wondering whether Spark can read and write data to a C++ process with zero copy. Best Regards, Jia

On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote: My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R. The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.

On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote: Thanks, Dewful! My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism. Best Regards, Jia

On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote: Maybe looking into something like Tachyon would help; I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi Robin, Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated! Best Regards, Jia

On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote: -dev, +user (this is not a question about development of Spark itself so you'll get more answers in the user mailing list) First up let me say that I don't really know how this could be done - I'm sure it would be possible with enough tinkering but it's not clear what you are trying to achieve. Spark is a distributed processing system; it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory mapped files doesn't make architectural sense.

---
Robin East
Spark GraphX in Action, Michael Malak and Robin East
Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote: Dears, for one project, I need to implement something so Spark can read data from a C++ process.
To provide high performance, I really hope to implement this through shared memory between the C++ process and the Java (JVM) process. It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there are any existing efforts or a more efficient approach to do this? Thank you very much! Best Regards, Jia

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Shared memory between C++ process and Spark
My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R. The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.

On Monday, December 7, 2015 1:15 PM, Jia wrote: Thanks, Dewful! My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism. Best Regards, Jia

On Dec 7, 2015, at 11:46 AM, Dewful wrote: Maybe looking into something like Tachyon would help; I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi Robin, Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated! Best Regards, Jia

On Dec 7, 2015, at 10:54 AM, Robin East wrote: -dev, +user (this is not a question about development of Spark itself so you'll get more answers in the user mailing list) First up let me say that I don't really know how this could be done - I'm sure it would be possible with enough tinkering but it's not clear what you are trying to achieve. Spark is a distributed processing system; it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory mapped files doesn't make architectural sense.

---
Robin East
Spark GraphX in Action, Michael Malak and Robin East
Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

On 6 Dec 2015, at 20:43, Jia wrote: Dears, for one project, I need to implement something so Spark can read data from a C++ process. To provide high performance, I really hope to implement this through shared memory between the C++ process and the Java (JVM) process. It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there are any existing efforts or a more efficient approach to do this? Thank you very much! Best Regards, Jia

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: Shared memory between C++ process and Spark
Jia,

I'm so confused on this. The architecture of Spark is to run on top of HDFS. What you're requesting, reading and writing to a C++ process, is not part of that requirement.

On Monday, December 7, 2015 1:42 PM, Jia <jacqueline...@gmail.com> wrote: Thanks, Annabel, but I may need to clarify that I have no intention to write and run Spark UDFs in C++; I'm just wondering whether Spark can read and write data to a C++ process with zero copy. Best Regards, Jia

On Dec 7, 2015, at 12:26 PM, Annabel Melongo <melongo_anna...@yahoo.com> wrote: My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R. The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.

On Monday, December 7, 2015 1:15 PM, Jia <jacqueline...@gmail.com> wrote: Thanks, Dewful! My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient in performance. But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism. Best Regards, Jia

On Dec 7, 2015, at 11:46 AM, Dewful <dew...@gmail.com> wrote: Maybe looking into something like Tachyon would help; I see some sample C++ bindings, not sure how much of the current functionality they support...

Hi Robin, Thanks for your reply and thanks for copying my question to the user mailing list. Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that's why we want shared memory. Suggestions will be highly appreciated! Best Regards, Jia

On Dec 7, 2015, at 10:54 AM, Robin East <robin.e...@xense.co.uk> wrote: -dev, +user (this is not a question about development of Spark itself so you'll get more answers in the user mailing list) First up let me say that I don't really know how this could be done - I'm sure it would be possible with enough tinkering but it's not clear what you are trying to achieve. Spark is a distributed processing system; it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory mapped files doesn't make architectural sense.

---
Robin East
Spark GraphX in Action, Michael Malak and Robin East
Manning Publications Co. http://www.manning.com/books/spark-graphx-in-action

On 6 Dec 2015, at 20:43, Jia <jacqueline...@gmail.com> wrote: Dears, for one project, I need to implement something so Spark can read data from a C++ process. To provide high performance, I really hope to implement this through shared memory between the C++ process and the Java (JVM) process. It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there are any existing efforts or a more efficient approach to do this? Thank you very much! Best Regards, Jia

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org