Re: SparkLauncher is blocked until main process is killed.

2015-10-29 Thread Jey Kottalam
Could you please provide the jstack output? That would help the devs
identify the blocking operation more easily.

On Thu, Oct 29, 2015 at 6:54 PM, 陈宇航  wrote:

> I tried to use SparkLauncher (org.apache.spark.launcher.SparkLauncher) to
> submit a Spark Streaming job; however, in my test, the SparkSubmit process
> got stuck in the "addJar" procedure. Only when the main process (the caller
> of SparkLauncher) is killed does the submit procedure continue to run. I ran
> jstack for the process, and it seems Jetty was blocking it; I'm pretty sure
> there were no port conflicts.
>
> The environment is RHEL (Red Hat Enterprise Linux) 6u3 x64, and Spark runs in
> standalone mode.
>
> Did this happen to any of you?
>
>
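For anyone hitting this later: one common cause of a stuck SparkLauncher child is that the parent never drains the launched process's stdout/stderr, so the child blocks once the pipe buffer fills. A minimal sketch of draining both streams (assumes Spark 1.4+'s SparkLauncher; the jar, main class, and master URL below are placeholders):

    import org.apache.spark.launcher.SparkLauncher
    import scala.io.Source

    // Placeholder application jar, main class, and master URL.
    val proc = new SparkLauncher()
      .setAppResource("/path/to/streaming-job.jar")
      .setMainClass("com.example.StreamingJob")
      .setMaster("spark://master:7077")
      .launch()

    // Drain stdout and stderr in background threads so the child process
    // cannot stall on a full pipe while the parent is still running.
    def drain(in: java.io.InputStream): Unit =
      new Thread(new Runnable {
        def run(): Unit = Source.fromInputStream(in).getLines().foreach(println)
      }).start()

    drain(proc.getInputStream)
    drain(proc.getErrorStream)

    proc.waitFor()

Whether that is what is happening here would be confirmed by the jstack output Jey asked for.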


Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Actually, Hadoop InputFormats can still be used to read and write from
file://, s3n://, and similar schemes. You just won't be able to
read/write to HDFS without installing Hadoop and setting up an HDFS cluster.

To summarize: Sourav, you can use any of the prebuilt packages (i.e.
anything other than source code).

Hope that helps,
-Jey
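A minimal sketch of what that looks like in spark-shell from a prebuilt package, with purely local input and output (the paths are just examples; note the three slashes in file:///):

    // Works without any Hadoop installation or HDFS cluster.
    val lines = sc.textFile("file:///tmp/input.txt")
    val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("file:///tmp/wordcount-output")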

On Mon, Jun 29, 2015 at 7:33 AM, ayan guha guha.a...@gmail.com wrote:

 Hi

 You really do not need a Hadoop installation. You can download a pre-built
 version with any Hadoop profile, unzip it, and you are good to go. Yes, it may
 complain while launching the master and workers; safely ignore that. The only
 problem is when writing to a directory. Of course, you will not be able to
 use any Hadoop InputFormat etc. out of the box.

 ** I am assuming it's a learning question :) For production, I would
 suggest building it from source.

 If you are using Python and need some help, please drop me a note offline.

 Best
 Ayan

 On Tue, Jun 30, 2015 at 12:24 AM, Sourav Mazumder 
 sourav.mazumde...@gmail.com wrote:

 Hi,

 I'm trying to run Spark without Hadoop where the data would be read and
 written to local disk.

 For this I have a few questions:

 1. Which download do I need to use? In the download options I don't see any
 binary download that does not need Hadoop. Is the only way to do this to
 download the source code version and compile it myself?

 2. Which installation/quick start guideline should I use for this? So
 far I haven't seen any documentation that specifically addresses the Spark
 without Hadoop installation/setup, unless I'm missing one.

 Regards,
 Sourav




 --
 Best Regards,
 Ayan Guha



Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
(SparkIMain.scala:871)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
 at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
 at org.apache.spark.repl.SparkILoop.org
 $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.org
 $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
 at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
 ... 83 more



 On Mon, Jun 29, 2015 at 10:02 AM, Jey Kottalam j...@cs.berkeley.edu
 wrote:

 Actually, Hadoop InputFormats can still be used to read and write from
 file://, s3n://, and similar schemes. You just won't be able to
 read/write to HDFS without installing Hadoop and setting up an HDFS cluster.

 To summarize: Sourav, you can use any of the prebuilt packages (i.e.
 anything other than source code).

 Hope that helps,
 -Jey

 On Mon, Jun 29, 2015 at 7:33 AM, ayan guha guha.a...@gmail.com wrote:

 Hi

 You really do not need a Hadoop installation. You can download a pre-built
 version with any Hadoop profile, unzip it, and you are good to go. Yes, it may
 complain while launching the master and workers; safely ignore that. The only
 problem is when writing to a directory. Of course, you will not be able to
 use any Hadoop InputFormat etc. out of the box.

 ** I am assuming it's a learning question :) For production, I would
 suggest building it from source.

 If you are using Python and need some help, please drop me a note offline.

 Best
 Ayan

 On Tue, Jun 30, 2015 at 12:24 AM, Sourav Mazumder 
 sourav.mazumde...@gmail.com wrote:

 Hi,

 I'm trying to run Spark without Hadoop where the data would be read and
 written to local disk.

 For this I have a few questions:

 1. Which download do I need to use? In the download options I don't see
 any binary download that does not need Hadoop. Is the only way to do this
 to download the source code version and compile it myself?

 2. Which installation/quick start guideline should I use for this?
 So far I haven't seen any documentation that specifically addresses the
 Spark without Hadoop installation/setup, unless I'm missing one.

 Regards,
 Sourav




 --
 Best Regards,
 Ayan Guha






Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
The format is still com.databricks.spark.csv, but the parameter passed to
spark-shell is --packages com.databricks:spark-csv_2.11:1.1.0.
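Putting the two fixes together (a three-slash file:/// URI, and a --packages coordinate matching the Scala version of your Spark build), a minimal sketch; the version number, option, and path below are illustrative:

    // Launch: bin/spark-shell --packages com.databricks:spark-csv_2.10:1.1.0
    //         (use spark-csv_2.11 if your Spark build uses Scala 2.11)
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")   // adjust to your file
      .load("file:///home/biadmin/DataScience/PlutoMN.csv")

    df.printSchema()
    df.show(5)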

On Mon, Jun 29, 2015 at 2:59 PM, Sourav Mazumder 
sourav.mazumde...@gmail.com wrote:

 Hi Jey,

 Not much luck.

 If I use the class com.databricks:spark-csv_2.11:1.1.0 or
 com.databricks.spark.csv_2.11.1.1.0 I get a class-not-found error. With
 com.databricks.spark.csv I don't get the class-not-found error, but I still
 get the previous error even after using file:/// in the URI.

 Regards,
 Sourav

 On Mon, Jun 29, 2015 at 1:13 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 Hi Sourav,

 The error seems to be caused by the fact that your URL starts with
 file:// instead of file:///.

 Also, I believe the current version of the package for Spark 1.4 with
 Scala 2.11 should be com.databricks:spark-csv_2.11:1.1.0.

 -Jey

 On Mon, Jun 29, 2015 at 12:23 PM, Sourav Mazumder 
 sourav.mazumde...@gmail.com wrote:

 Hi Jey,

 Thanks for your inputs.

 Probably I'm getting this error because I'm trying to read a CSV file from a
 local file using the com.databricks.spark.csv package. Perhaps this package
 has a hard-coded dependency on Hadoop, as it is trying to read the input
 format from HadoopRDD.

 Can you please confirm ?

 Here is what I did -

 Ran the spark-shell as

 bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3.

 Then in the shell I ran:
 val df = sqlContext.read.format("com.databricks.spark.csv").load("file://home/biadmin/DataScience/PlutoMN.csv")



 Regards,
 Sourav

 15/06/29 15:14:59 INFO spark.SparkContext: Created broadcast 0 from
 textFile at CsvRelation.scala:114
 java.lang.RuntimeException: Error in configuring object
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
 at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
 at org.apache.spark.rdd.RDD.take(RDD.scala:1246)
 at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1286)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
 at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
 at org.apache.spark.rdd.RDD.first(RDD.scala:1285)
 at
 com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:114)
 at
 com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:112)
 at
 com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:95)
 at com.databricks.spark.csv.CsvRelation.init(CsvRelation.scala:53)
 at
 com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:89)
 at
 com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:39)
 at
 com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:27)
 at
 org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
 at
 org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:19)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:26)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:28)
 at $iwC$$iwC$$iwC$$iwC.init(console:30)
 at $iwC$$iwC$$iwC.init(console:32)
 at $iwC$$iwC.init(console:34)
 at $iwC.init(console:36)
 at init(console:38)
 at .init(console:42)
 at .clinit(console)
 at java.lang.J9VMInternals.initializeImpl(Native Method)
 at java.lang.J9VMInternals.initialize(J9VMInternals.java:200)
 at .init(console:7)
 at .clinit(console)
 at java.lang.J9VMInternals.initializeImpl(Native Method)
 at java.lang.J9VMInternals.initialize(J9VMInternals.java

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.org
 $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
 at
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37)
 at java.lang.reflect.Method.invoke(Method.java:611)
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
 ... 83 more

 Regards,
 Sourav


 On Mon, Jun 29, 2015 at 6:53 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 The format is still com.databricks.spark.csv, but the parameter passed
 to spark-shell is --packages com.databricks:spark-csv_2.11:1.1.0.

 On Mon, Jun 29, 2015 at 2:59 PM, Sourav Mazumder 
 sourav.mazumde...@gmail.com wrote:

 Hi Jey,

 Not much luck.

 If I use the class com.databricks:spark-csv_2.11:1.1.0 or
 com.databricks.spark.csv_2.11.1.1.0 I get a class-not-found error. With
 com.databricks.spark.csv I don't get the class-not-found error, but I still
 get the previous error even after using file:/// in the URI.

 Regards,
 Sourav

 On Mon, Jun 29, 2015 at 1:13 PM, Jey Kottalam j...@cs.berkeley.edu
 wrote:

 Hi Sourav,

 The error seems to be caused by the fact that your URL starts with
 file:// instead of file:///.

 Also, I believe the current version of the package for Spark 1.4 with
 Scala 2.11 should be com.databricks:spark-csv_2.11:1.1.0.

 -Jey

 On Mon, Jun 29, 2015 at 12:23 PM, Sourav Mazumder 
 sourav.mazumde...@gmail.com wrote:

 Hi Jey,

 Thanks for your inputs.

 Probably I'm getting this error because I'm trying to read a CSV file from a
 local file using the com.databricks.spark.csv package. Perhaps this package
 has a hard-coded dependency on Hadoop, as it is trying to read the input
 format from HadoopRDD.

 Can you please confirm ?

 Here is what I did -

 Ran the spark-shell as

 bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3.

 Then in the shell I ran:
 val df = sqlContext.read.format("com.databricks.spark.csv").load("file://home/biadmin/DataScience/PlutoMN.csv")



 Regards,
 Sourav

 15/06/29 15:14:59 INFO spark.SparkContext: Created broadcast 0 from
 textFile at CsvRelation.scala:114
 java.lang.RuntimeException: Error in configuring object
 at
 org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
 at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
 at
 org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190)
 at
 org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1251)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109

Re: Get importerror when i run pyspark with ipython=1

2015-02-26 Thread Jey Kottalam
Hi Sourabh, could you try it with the stable 2.4 version of IPython?

On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha sourabh.g...@hotmail.com wrote:
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n21843/pyspark_error.jpg

 I get the above error when I try to run pyspark with the ipython option. I
 do not get this error when I run it without the ipython option.

 I have Java 8, Scala 2.10.4 and Enthought Canopy Python on my box. OS Win
 8.1 Desktop.

 Thanks,
 Sourabh



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Get-importerror-when-i-run-pyspark-with-ipython-1-tp21843.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: reduceByKey vs countByKey

2015-02-24 Thread Jey Kottalam
Hi Sathish,

The current implementation of countByKey uses reduceByKey:
https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332

It seems that countByKey is mostly deprecated:
https://issues.apache.org/jira/browse/SPARK-3994

-Jey
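A small sketch contrasting the two on the RDD API (the data is illustrative):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    // reduceByKey stays distributed and returns an RDD[(K, V)],
    // so it scales to any number of distinct keys.
    val countsRdd = pairs.mapValues(_ => 1L).reduceByKey(_ + _)

    // countByKey collects the result to the driver as a Map[K, Long],
    // so it is only appropriate when the number of distinct keys is small.
    val countsMap: scala.collection.Map[String, Long] = pairs.countByKey()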

On Tue, Feb 24, 2015 at 3:53 PM, Sathish Kumaran Vairavelu
vsathishkuma...@gmail.com wrote:
 Hello,

 Quick question: I am trying to understand the difference between reduceByKey
 and countByKey. Which one gives better performance, reduceByKey or countByKey?
 While we can perform the same count operation using reduceByKey, why do we
 need countByKey/countByValue?

 Thanks

 Sathish

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Jey Kottalam
Hi Aris,

A simple approach to gaining some of the benefits of an RBF kernel is
to add synthetic features to your training set. For example, if your
original data consists of 3-dimensional vectors [x, y, z], you could
compute a new 9-dimensional feature vector containing [x, y, z, x^2,
y^2, z^2, x*y, x*z, y*z].

This basic idea can be taken much further:
  1. http://www.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf
  2. http://arxiv.org/pdf/1109.4603.pdf

Hope that helps,
-Jey
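A minimal sketch of that quadratic expansion with MLlib vectors (the names are illustrative):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Expand a 3-dimensional vector [x, y, z] into the 9 features above.
    def expand(v: Vector): Vector = {
      val Array(x, y, z) = v.toArray
      Vectors.dense(x, y, z, x * x, y * y, z * z, x * y, x * z, y * z)
    }

    // e.g. trainingData.map(p => LabeledPoint(p.label, expand(p.features)))
    // before passing the RDD to SVMWithSGD.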

On Thu, Sep 18, 2014 at 11:10 AM, Aris arisofala...@gmail.com wrote:
 Sorry to bother you guys, but does anybody have any ideas about the status
 of MLlib with a Radial Basis Function kernel for SVM?

 Thank you!

 On Tue, Sep 16, 2014 at 3:27 PM, Aris  wrote:

 Hello Spark Community -

 I am using the support vector machine (SVM) implementation in MLlib with
 the standard linear kernel; however, I noticed that the Spark documentation
 for StandardScaler *specifically* mentions that SVMs which use the RBF
 kernel work really well when you have standardized data...

 which begs the question: is there some kind of support for RBF kernels
 rather than linear kernels? In small data tests using R, the RBF kernel
 worked really well and the linear kernel never converged... so I would
 really like to use RBF.

 Thank you folks for any help!

 Aris



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: EC2 instances missing SSD drives randomly?

2014-08-19 Thread Jey Kottalam
I think you have to explicitly list the ephemeral disks in the device
map when launching the EC2 instance.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html
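As an illustration only (this uses the AWS Java SDK rather than the boto-based spark-ec2 script, and the AMI id and device names are placeholders), the explicit mapping looks roughly like:

    import com.amazonaws.services.ec2.AmazonEC2Client
    import com.amazonaws.services.ec2.model.{BlockDeviceMapping, RunInstancesRequest}

    // Explicitly attach the instance-store (ephemeral) volumes at launch time;
    // they are not added automatically for every instance type/AMI combination.
    val request = new RunInstancesRequest("ami-12345678", 1, 1)
      .withInstanceType("m3.2xlarge")
      .withBlockDeviceMappings(
        new BlockDeviceMapping().withDeviceName("/dev/sdb").withVirtualName("ephemeral0"),
        new BlockDeviceMapping().withDeviceName("/dev/sdc").withVirtualName("ephemeral1"))

    new AmazonEC2Client().runInstances(request)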

On Tue, Aug 19, 2014 at 11:54 AM, Andras Barjak
andras.bar...@lynxanalytics.com wrote:
 Hi,

 Using the Spark 1.0.1 EC2 script I launched 35 m3.2xlarge instances. (I was
 using the Singapore region.) Some of the instances came without the ephemeral
 internal (non-EBS) SSD devices that are supposed to be attached to them.
 Some of them have these drives but not all, and there is no sign from the
 outside; one can only notice this by ssh-ing into the instances and typing
 `df -l`, so it seems to be a bug to me.
 I am not sure whether Amazon is not providing the drives or the Spark AMI
 is configuring something wrong. Do you have any idea what is going on? I never
 faced this issue before. It is not that the drives are not formatted/mounted
 (as was the case with the new r3 instances); they are not present
 physically. (Though mnt and mnt2 are configured properly in fstab.)

 I made several attempts and the result was the same: some of the instances
 launched with the drives, some without.

 Please let me know if you have any ideas about what to do with this
 inconsistent behaviour.

 András

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Anaconda Spark AMI

2014-07-03 Thread Jey Kottalam
Hi Ben,

Has the PYSPARK_PYTHON environment variable been set in
spark/conf/spark-env.sh to the path of the new python binary?

FYI, there's a /root/copy-dirs script that can be handy when updating
files on an already-running cluster. You'll want to restart the spark
cluster for the changes to take effect, as described at
https://spark.apache.org/docs/latest/ec2-scripts.html

Hope that helps,
-Jey

On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen quasi...@gmail.com wrote:
 Hi All,

 I'm a dev at Continuum and we are developing a fair amount of tooling around
 Spark.  A few days ago someone expressed interest in numpy+pyspark, and
 Anaconda came up as a reasonable solution.

 I spent a number of hours yesterday trying to rework the base Spark AMI on
 EC2 but sadly was defeated by a number of errors.

 Aggregations seemed to choke -- whereas small takes executed as expected
 (errors are linked to the gist):

 sc.appName
 u'PySparkShell'
 sc._conf.getAll()
 [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
 (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
 (u'spark.app.name', u'
 PySparkShell'), (u'spark.executor.extraClassPath',
 u'/root/ephemeral-hdfs/conf'), (u'spark.master',
 u'spark://.compute-1.amazonaws.com:7077')]
 file = sc.textFile("hdfs:///user/root/chekhov.txt")
 file.take(2)
 [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
 u'']

 lines = file.filter(lambda x: len(x) > 0)
 lines.count()
 VARIOUS ERRORS DISCUSSED BELOW

 My first thought was that I could simply get away with including anaconda on
 the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
 Doing so resulted in some strange py4j errors like the following:

 Py4JError: An error occurred while calling o17.partitions. Trace:
 py4j.Py4JException: Method partitions([]) does not exist

 At some point I also saw:
 SystemError: Objects/cellobject.c:24: bad argument to internal function

 which is really strange, possibly the result of a version mismatch?

 I had another thought of building spark from master on the AMI, leaving the
 spark directory in place, and removing the spark call from the modules list
 in spark-ec2 launch script. Unfortunately, this resulted in the following
 errors:

 https://gist.github.com/quasiben/da0f4778fbc87d02c088

 If a Spark dev were willing to make some time in the near future, I'm sure
 she/he and I could sort out these issues and give the Spark community a
 Python distro ready to go for numerical computing.  For instance, I'm not
 sure how pyspark launches a Python session on a slave -- is this done as root
 or as the hadoop user? (I believe I changed /etc/bashrc to point to my
 anaconda bin directory, so it shouldn't really matter.)  Is there something
 special about the py4j zip included in the spark dir compared with the py4j
 on PyPI?

 Thoughts?

 --Ben




Re: Executors not utilized properly.

2014-06-17 Thread Jey Kottalam
Hi Abhishek,

 Where MapReduce takes 2 minutes, Spark takes 5 minutes to complete the
job.

Interesting. Could you tell us more about your program? A code skeleton
would certainly be helpful.

Thanks!

-Jey


On Tue, Jun 17, 2014 at 3:21 PM, abhiguruvayya sharath.abhis...@gmail.com
wrote:

 I did try creating more partitions by overriding the default number of
 partitions determined by the HDFS splits. The problem is that in this case the
 program runs forever. I have the same set of inputs for MapReduce and Spark.
 Where MapReduce takes 2 minutes, Spark takes 5 minutes to complete the job. I
 thought my Spark program was running slower than MapReduce because not all of
 the executors were being utilized properly. I can provide my code skeleton for
 your reference. Please help me with this.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Executors-not-utilized-properly-tp7744p7759.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
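For reference, the partition-count knobs mentioned above look roughly like this (the path and numbers are illustrative):

    // Ask for more input splits up front instead of the one-per-HDFS-block default:
    val data = sc.textFile("hdfs:///data/input", 64)   // minPartitions = 64

    // Or reshuffle an existing RDD so work spreads across all executor cores:
    val rebalanced = data.repartition(sc.defaultParallelism * 2)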



Re: Local file being refrenced in mapper function

2014-05-30 Thread Jey Kottalam
Hi Rahul,

Marcelo's explanation is correct. Here's a possible approach to your
program, in pseudo-Python:


# connect to Spark cluster
sc = SparkContext(...)

# load input data
input_data = load_xls(file("input.xls"))
input_rows = input_data['Sheet1'].rows

# create RDD on cluster
input_rdd = sc.parallelize(input_rows)

# munge RDD
result_rdd = input_rdd.map(munge_row)

# collect result RDD to local process
result_rows = result_rdd.collect()

# write output file
write_xls(file("output.xls", "w"), result_rows)



Hope that helps,
-Jey

On Fri, May 30, 2014 at 9:44 AM, Marcelo Vanzin van...@cloudera.com wrote:
 Hello there,

 On Fri, May 30, 2014 at 9:36 AM, Marcelo Vanzin van...@cloudera.com wrote:
 workbook = xlsxwriter.Workbook('output_excel.xlsx')
 worksheet = workbook.add_worksheet()

 data = sc.textFile("xyz.txt")
 # xyz.txt is a file where each line contains strings delimited by SPACE

 row=0

 def mapperFunc(x):
     for i in range(0, 4):
         worksheet.write(row, i, x.split(" ")[i])
         row++
     return len(x.split())

 data2 = data.map(mapperFunc)

 Is using row in 'mapperFunc' like this a correct way? Will it
 increment row each time?

 No. mapperFunc will be executed somewhere else, not in the same
 process running this script. I'm not familiar with how serializing
 closures works in Spark/Python, but you'll most certainly be updating
 the local copy of row in the executor, and your driver's copy will
 remain at 0.

 In general, in a distributed execution environment like Spark you want
 to avoid as much as possible using state. row in your code is state,
 so to do what you want you'd have to use other means (like Spark's
 accumulators). But those are generally expensive in a distributed
 system, and to be avoided if possible.

 Is writing to the Excel file using worksheet.write() inside the
 mapper function a correct way?

 No, for the same reasons. Your executor will have a copy of your
 workbook variable. So the write() will happen locally to the
 executor, and after the mapperFunc() returns, that will be discarded -
 so your driver won't see anything.

 As a rule of thumb, your closures should try to use only their
 arguments as input, or at most use local variables as read-only, and
 only produce output in the form of return values. There are cases
 where you might want to break these rules, of course, but in general
 that's the mindset you should be in.

 Also note that you're not actually executing anything here.
 data.map() is a transformation, so you're just building the
 execution graph for the computation. You need to execute an action
 (like collect() or take()) if you want the computation to actually
 occur.

 --
 Marcelo


Re: help

2014-04-25 Thread Jey Kottalam
Sorry, but I don't know where Cloudera puts the executor log files.
Maybe their docs give the correct path?

On Fri, Apr 25, 2014 at 12:32 PM, Joe L selme...@yahoo.com wrote:
 Hi, thank you for your reply, but I could not find it. It says that there is
 no such file or directory.


 http://apache-spark-user-list.1001560.n3.nabble.com/file/n4848/Capture.png



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841p4848.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.