Re: SparkLauncher is blocked until main process is killed.
Could you please provide the jstack output? That would help the devs identify the blocking operation more easily. On Thu, Oct 29, 2015 at 6:54 PM, 陈宇航 wrote: I tried to use SparkLauncher (org.apache.spark.launcher.SparkLauncher) to submit a Spark Streaming job; however, in my test, the SparkSubmit process got stuck in the "addJar" procedure. Only when the main process (the caller of SparkLauncher) is killed does the submit procedure continue to run. I ran jstack for the process, and it seems Jetty was blocking it; I'm pretty sure there were no port conflicts. The environment is RHEL (Red Hat Enterprise Linux) 6u3 x64, and Spark runs in standalone mode. Did this happen to any of you?
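For anyone hitting something similar: one generic reason a child process launched from Java can appear to hang until its parent dies is that nobody drains its stdout/stderr pipes. Whether or not that is the cause here, a minimal Scala sketch of launching via SparkLauncher while draining the child's output (the Spark home, app jar, main class, and master URL are placeholders) looks like this:

import java.io.{BufferedReader, InputStreamReader}
import org.apache.spark.launcher.SparkLauncher

object LaunchExample {
  // Continuously read a stream so the child never blocks on a full pipe buffer.
  private def drain(stream: java.io.InputStream, prefix: String): Unit = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        val reader = new BufferedReader(new InputStreamReader(stream))
        var line = reader.readLine()
        while (line != null) {
          println(s"[$prefix] $line")
          line = reader.readLine()
        }
      }
    })
    t.setDaemon(true)
    t.start()
  }

  def main(args: Array[String]): Unit = {
    // Placeholder paths and names; substitute your own application jar and main class.
    val process = new SparkLauncher()
      .setSparkHome("/opt/spark")
      .setAppResource("/path/to/your-streaming-app.jar")
      .setMainClass("com.example.YourStreamingJob")
      .setMaster("spark://master:7077")
      .launch()

    drain(process.getInputStream, "stdout")
    drain(process.getErrorStream, "stderr")

    val exitCode = process.waitFor()
    println(s"spark-submit exited with code $exitCode")
  }
}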
Re: Running Spark 1.4.1 without Hadoop
Actually, Hadoop InputFormats can still be used to read and write from file://, s3n://, and similar schemes. You just won't be able to read/write to HDFS without installing Hadoop and setting up an HDFS cluster. To summarize: Sourav, you can use any of the prebuilt packages (i.e. anything other than source code). Hope that helps, -Jey On Mon, Jun 29, 2015 at 7:33 AM, ayan guha guha.a...@gmail.com wrote: Hi You really do not need a Hadoop installation. You can download a pre-built version with any Hadoop profile, unzip it, and you are good to go. Yes, it may complain while launching master and workers; safely ignore the complaints. The only problem is while writing to a directory. Of course you will not be able to use any Hadoop InputFormat etc. out of the box. ** I am assuming it's a learning question :) For production, I would suggest building it from source. If you are using Python and need some help, please drop me a note offline. Best Ayan On Tue, Jun 30, 2015 at 12:24 AM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi, I'm trying to run Spark without Hadoop, where the data would be read from and written to local disk. For this I have a few questions - 1. Which download do I need to use? In the download options I don't see any binary download which does not need Hadoop. Is the only way to do this to download the source code version and compile it myself? 2. Which installation/quick start guideline should I use for this? So far I haven't seen any documentation which specifically addresses the Spark-without-Hadoop installation/setup, unless I'm missing one. Regards, Sourav -- Best Regards, Ayan Guha
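To illustrate the local-filesystem workflow Jey and Ayan describe, here is a minimal spark-shell sketch against a prebuilt package; the input and output paths are placeholders:

// spark-shell session against any prebuilt Spark package, no HDFS required.
// The input path is a placeholder for any file on the local filesystem.
val lines = sc.textFile("file:///tmp/input.txt")
val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
// Writing back to the local filesystem works the same way; every worker must be
// able to see the output directory (trivially true in local mode).
counts.saveAsTextFile("file:///tmp/output")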
Re: Running Spark 1.4.1 without Hadoop
(SparkIMain.scala:871) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:611) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:611) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106) ... 83 more
Re: Running Spark 1.4.1 without Hadoop
The format is still com.databricks.spark.csv, but the parameter passed to spark-shell is --packages com.databricks:spark-csv_2.11:1.1.0. On Mon, Jun 29, 2015 at 2:59 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi Jey, Not much luck. If I use the class com.databricks:spark-csv_2.11:1.1.0 or com.databricks.spark.csv_2.11.1.1.0 I get a class-not-found error. With com.databricks.spark.csv I don't get the class-not-found error, but I still get the previous error even after using file:/// in the URI. Regards, Sourav On Mon, Jun 29, 2015 at 1:13 PM, Jey Kottalam j...@cs.berkeley.edu wrote: Hi Sourav, The error seems to be caused by the fact that your URL starts with file:// instead of file:///. Also, I believe the current version of the package for Spark 1.4 with Scala 2.11 should be com.databricks:spark-csv_2.11:1.1.0. -Jey On Mon, Jun 29, 2015 at 12:23 PM, Sourav Mazumder sourav.mazumde...@gmail.com wrote: Hi Jey, Thanks for your inputs. Probably I'm getting this error because I'm trying to read a CSV file from the local filesystem using the com.databricks.spark.csv package. Perhaps this package has a hard-coded dependency on Hadoop, as it is trying to read the input format from HadoopRDD. Can you please confirm? Here is what I did - Ran the spark-shell as bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3. Then in the shell I ran: val df = sqlContext.read.format("com.databricks.spark.csv").load("file://home/biadmin/DataScience/PlutoMN.csv") Regards, Sourav 15/06/29 15:14:59 INFO spark.SparkContext: Created broadcast 0 from textFile at CsvRelation.scala:114 java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.take(RDD.scala:1246) at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.first(RDD.scala:1285) at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:114) at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:112) at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:95) at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:53) at 
com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:89) at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:39) at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:27) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26) at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28) at $iwC$$iwC$$iwC$$iwC.<init>(<console>:30) at $iwC$$iwC$$iwC.<init>(<console>:32) at $iwC$$iwC.<init>(<console>:34) at $iwC.<init>(<console>:36) at <init>(<console>:38) at .<init>(<console>:42) at .<clinit>(<console>) at java.lang.J9VMInternals.initializeImpl(Native Method) at java.lang.J9VMInternals.initialize(J9VMInternals.java:200) at .<init>(<console>:7) at .<clinit>(<console>) at java.lang.J9VMInternals.initializeImpl(Native Method) at java.lang.J9VMInternals.initialize(J9VMInternals.java
Re: Running Spark 1.4.1 without Hadoop
) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:611) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:611) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106) ... 83 more Regards, Sourav
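Putting the two corrections from this thread together (the Scala-version-matched --packages coordinate and the three-slash file:/// URI), a spark-shell session would look roughly like the sketch below; the header option is an assumption about the CSV file:

// Launch: bin/spark-shell --packages com.databricks:spark-csv_2.11:1.1.0
// (use spark-csv_2.10 if your Spark build is compiled against Scala 2.10)

// Inside the shell: note the quoted format name and the three slashes in the URI.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // assumption: the first line of the file is a header row
  .load("file:///home/biadmin/DataScience/PlutoMN.csv")
df.show()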
Re: Get importerror when i run pyspark with ipython=1
Hi Sourabh, could you try it with the stable 2.4 version of IPython? On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha sourabh.g...@hotmail.com wrote: http://apache-spark-user-list.1001560.n3.nabble.com/file/n21843/pyspark_error.jpg I get the above error when I try to run pyspark with the ipython option. I do not get this error when I run it without the ipython option. I have Java 8, Scala 2.10.4 and Enthought Canopy Python on my box. OS Win 8.1 Desktop. Thanks, Sourabh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Get-importerror-when-i-run-pyspark-with-ipython-1-tp21843.html
Re: reduceByKey vs countByKey
Hi Sathish, The current implementation of countByKey uses reduceByKey: https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332 It seems that countByKey is mostly deprecated: https://issues.apache.org/jira/browse/SPARK-3994 -Jey On Tue, Feb 24, 2015 at 3:53 PM, Sathish Kumaran Vairavelu vsathishkuma...@gmail.com wrote: Hello, Quick question. I am trying to understand the difference between reduceByKey and countByKey. Which one gives better performance, reduceByKey or countByKey? Since we can perform the same count operation using reduceByKey, why do we need countByKey/countByValue? Thanks Sathish
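For illustration, a small spark-shell sketch of the relationship between the two (the input data is made up):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// countByKey returns a local Map[String, Long] to the driver, counting how many
// records share each key; fine when the number of distinct keys is small.
val counts1 = pairs.countByKey()              // Map(a -> 2, b -> 1)

// Equivalent using reduceByKey: the result stays distributed until you choose to
// collect it, so it scales to many distinct keys and can feed further transformations.
val counts2 = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
counts2.collect()                             // Array((a,2), (b,1))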
Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?
Hi Aris, A simple approach to gaining some of the benefits of an RBF kernel is to add synthetic features to your training set. For example, if your original data consists of 3-dimensional vectors [x, y, z], you could compute a new 9-dimensional feature vector containing [x, y, z, x^2, y^2, z^2, x*y, x*z, y*z]. This basic idea can be taken much further: 1. http://www.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf 2. http://arxiv.org/pdf/1109.4603.pdf Hope that helps, -Jey On Thu, Sep 18, 2014 at 11:10 AM, Aris arisofala...@gmail.com wrote: Sorry to bother you guys, but does anybody have any ideas about the status of MLlib with a Radial Basis Function kernel for SVM? Thank you! On Tue, Sep 16, 2014 at 3:27 PM, Aris wrote: Hello Spark Community - I am using the support vector machine / SVM implementation in MLlib with the standard linear kernel; however, I noticed that the Spark documentation for StandardScaler *specifically* mentions that SVMs which use the RBF kernel work really well when you have standardized data... which begs the question: is there some kind of support for RBF kernels rather than linear kernels? In small-data tests using R, the RBF kernel worked really well while the linear kernel never converged... so I would really like to use RBF. Thank you folks for any help! Aris
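As a concrete illustration of that expansion, a short Scala sketch for MLlib LabeledPoints; the three-feature layout mirrors the [x, y, z] example above, and the trainingData RDD is assumed to exist:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Expand [x, y, z] into [x, y, z, x^2, y^2, z^2, x*y, x*z, y*z].
def expand(p: LabeledPoint): LabeledPoint = {
  val Array(x, y, z) = p.features.toArray
  LabeledPoint(p.label,
    Vectors.dense(x, y, z, x * x, y * y, z * z, x * y, x * z, y * z))
}

// Map the expansion over the training set before handing it to SVMWithSGD:
// val expandedTrainingData = trainingData.map(expand)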
Re: EC2 instances missing SSD drives randomly?
I think you have to explicitly list the ephemeral disks in the device map when launching the EC2 instance. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html On Tue, Aug 19, 2014 at 11:54 AM, Andras Barjak andras.bar...@lynxanalytics.com wrote: Hi, Using the Spark 1.0.1 ec2 script I launched 35 m3.2xlarge instances. (I was using the Singapore region.) Some of the instances came without the ephemeral internal (non-EBS) SSD devices that are supposed to be attached to them. Some of them have these drives but not all, and there is no sign from the outside; one can only notice this by ssh-ing into the instances and typing `df -l`, so it seems to be a bug to me. I am not sure whether Amazon is not providing the drives or the Spark AMI configures something wrong. Do you have any idea what is going on? I never faced this issue before. It is not that the drives are not formatted/mounted (as was the case with the new r3 instances); they are not present physically. (Though mnt and mnt2 are configured properly in fstab.) I made several attempts and the result was the same: some of the instances launched with the drives, some without. Please let me know if you have any ideas about this inconsistent behaviour. András
Re: Anaconda Spark AMI
Hi Ben, Has the PYSPARK_PYTHON environment variable been set in spark/conf/spark-env.sh to the path of the new python binary? FYI, there's a /root/copy-dirs script that can be handy when updating files on an already-running cluster. You'll want to restart the spark cluster for the changes to take effect, as described at https://spark.apache.org/docs/latest/ec2-scripts.html Hope that helps, -Jey On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen quasi...@gmail.com wrote: Hi All, I'm a dev at Continuum and we are developing a fair amount of tooling around Spark. A few days ago someone expressed interest in numpy+pyspark, and Anaconda came up as a reasonable solution. I spent a number of hours yesterday trying to rework the base Spark AMI on EC2 but sadly was defeated by a number of errors. Aggregations seemed to choke -- whereas small takes executed as expected (errors are linked to the gist): sc.appName u'PySparkShell' sc._conf.getAll() [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'), (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''), (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath', u'/root/ephemeral-hdfs/conf'), (u'spark.master', u'spark://.compute-1.amazonaws.com:7077')] file = sc.textFile("hdfs:///user/root/chekhov.txt") file.take(2) [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov", u''] lines = file.filter(lambda x: len(x) > 0) lines.count() VARIOUS ERRORS DISCUSSED BELOW My first thought was that I could simply get away with including anaconda on the base AMI, point the path at /dir/anaconda/bin, and bake a new one. Doing so resulted in some strange py4j errors like the following: Py4JError: An error occurred while calling o17.partitions. Trace: py4j.Py4JException: Method partitions([]) does not exist At some point I also saw: SystemError: Objects/cellobject.c:24: bad argument to internal function which is really strange, possibly the result of a version mismatch? I had another thought of building spark from master on the AMI, leaving the spark directory in place, and removing the spark call from the modules list in the spark-ec2 launch script. Unfortunately, this resulted in the following errors: https://gist.github.com/quasiben/da0f4778fbc87d02c088 If a spark dev was willing to make some time in the near future, I'm sure she/he and I could sort out these issues and give the Spark community a python distro ready to go for numerical computing. For instance, I'm not sure how pyspark calls out to launching a python session on a slave? Is this done as root or as the hadoop user? (I believe I changed /etc/bashrc to point to my anaconda bin directory, so it shouldn't really matter.) Is there something special about the py4j zip included in the spark dir compared with the py4j in PyPI? Thoughts? --Ben
Re: Executors not utilized properly.
Hi Abhishek, "Where mapreduce is taking 2 mins, spark is taking 5 min to complete the job." Interesting. Could you tell us more about your program? A code skeleton would certainly be helpful. Thanks! -Jey On Tue, Jun 17, 2014 at 3:21 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: I did try creating more partitions by overriding the default number of partitions determined by HDFS splits. The problem is that in this case the program runs forever. I have the same set of inputs for MapReduce and Spark. Where MapReduce is taking 2 mins, Spark is taking 5 min to complete the job. I thought that because all of the executors are not being utilized properly, my Spark program is running slower than MapReduce. I can provide you my code skeleton for your reference. Please help me with this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7759.html
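For reference, two common ways to raise the parallelism from the Scala API (a sketch; the path and the partition count of 128 are placeholders to tune against the cluster's total cores):

// Ask for more input splits up front (the second argument is a minimum, not an exact count):
val data = sc.textFile("hdfs:///path/to/input", 128)

// Or reshuffle an existing RDD so every executor gets work:
val repartitioned = data.repartition(128)

// A common starting point is 2-3 tasks per CPU core across the cluster.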
Re: Local file being referenced in mapper function
Hi Rahul, Marcelo's explanation is correct. Here's a possible approach to your program, in pseudo-Python: # connect to Spark cluster sc = SparkContext(...) # load input data input_data = load_xls(file("input.xls")) input_rows = input_data['Sheet1'].rows # create RDD on cluster input_rdd = sc.parallelize(input_rows) # munge RDD result_rdd = input_rdd.map(munge_row) # collect result RDD to local process result_rows = result_rdd.collect() # write output file write_xls(file("output.xls", "w"), result_rows) Hope that helps, -Jey On Fri, May 30, 2014 at 9:44 AM, Marcelo Vanzin van...@cloudera.com wrote: Hello there, On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote: workbook = xlsxwriter.Workbook('output_excel.xlsx') worksheet = workbook.add_worksheet() data = sc.textFile("xyz.txt") # xyz.txt is a file in which each line contains strings delimited by spaces row = 0 def mapperFunc(x): for i in range(0, 4): worksheet.write(row, i, x.split(" ")[i]) row += 1 return len(x.split()) data2 = data.map(mapperFunc) Is using row in 'mapperFunc' like this the correct way? Will it increment row each time? No. mapperFunc will be executed somewhere else, not in the same process running this script. I'm not familiar with how serializing closures works in Spark/Python, but you'll most certainly be updating the local copy of row in the executor, and your driver's copy will remain at 0. In general, in a distributed execution environment like Spark you want to avoid using state as much as possible. row in your code is state, so to do what you want you'd have to use other means (like Spark's accumulators). But those are generally expensive in a distributed system, and to be avoided if possible. Is writing to the excel file using worksheet.write() inside the mapper function a correct way? No, for the same reasons. Your executor will have a copy of your workbook variable. So the write() will happen locally to the executor, and after the mapperFunc() returns, that will be discarded - so your driver won't see anything. As a rule of thumb, your closures should try to use only their arguments as input, or at most use local variables as read-only, and only produce output in the form of return values. There are cases where you might want to break these rules, of course, but in general that's the mindset you should be in. Also note that you're not actually executing anything here. data.map() is a transformation, so you're just building the execution graph for the computation. You need to execute an action (like collect() or take()) if you want the computation to actually occur. -- Marcelo
Re: help
Sorry, but I don't know where Cloudera puts the executor log files. Maybe their docs give the correct path? On Fri, Apr 25, 2014 at 12:32 PM, Joe L selme...@yahoo.com wrote: Hi, thank you for your reply, but I could not find it. It says that there is no such file or directory. http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841p4848.html