Re: SparkLauncher is blocked until main process is killed.

2015-10-29 Thread Jey Kottalam
Could you please provide the jstack output? That would help the devs identify the blocking operation more easily. On Thu, Oct 29, 2015 at 6:54 PM, 陈宇航 wrote: > I tried to use SparkLauncher (org.apache.spark.launcher.SparkLauncher) to submit a Spark Streaming job,

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
Actually, Hadoop InputFormats can still be used to read from and write to file://, s3n://, and similar schemes. You just won't be able to read/write to HDFS without installing Hadoop and setting up an HDFS cluster. To summarize: Sourav, you can use any of the prebuilt packages (i.e. anything other
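A minimal PySpark sketch of the point above, assuming a prebuilt Spark 1.4.1 package and a hypothetical local input file; both the read and the write go through the Hadoop input/output machinery, but no HDFS cluster is involved:

    from pyspark import SparkContext

    sc = SparkContext(appName="NoHdfsExample")

    # file:// URIs use the Hadoop InputFormat path but read the local filesystem
    lines = sc.textFile("file:///tmp/input.txt")   # hypothetical path
    counts = (lines.flatMap(lambda l: l.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))

    # writing to a file:// URI also works without an HDFS cluster
    counts.saveAsTextFile("file:///tmp/output")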

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
(ReflectionUtils.java:106) ... 83 more On Mon, Jun 29, 2015 at 10:02 AM, Jey Kottalam j...@cs.berkeley.edu wrote: Actually, Hadoop InputFormats can still be used to read and write from file://, s3n://, and similar schemes. You just won't be able to read/write to HDFS without installing Hadoop

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
-csv_2.11:1.1.0 or com.databricks.spark.csv_2.11.1.1.0 I get a class-not-found error. With com.databricks.spark.csv I don't get the class-not-found error but I still get the previous error even after using file:/// in the URI. Regards, Sourav On Mon, Jun 29, 2015 at 1:13 PM, Jey Kottalam j

Re: Running Spark 1.4.1 without Hadoop

2015-06-29 Thread Jey Kottalam
(ReflectionUtils.java:106) ... 83 more Regards, Sourav On Mon, Jun 29, 2015 at 6:53 PM, Jey Kottalam j...@cs.berkeley.edu wrote: The format is still com.databricks.spark.csv, but the parameter passed to spark-shell is --packages com.databricks:spark-csv_2.11:1.1.0. On Mon, Jun 29, 2015 at 2:59 PM
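A hedged PySpark sketch separating the two strings being discussed (the --packages Maven coordinate given at launch versus the format name given to the reader); the CSV path is hypothetical:

    # Launch the shell with the Maven coordinate, not the format name:
    #   bin/pyspark --packages com.databricks:spark-csv_2.11:1.1.0

    # Inside the shell, the reader takes the dotted format name:
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")          # assumption: the file has a header row
          .load("file:///tmp/data.csv"))     # hypothetical local file, note file://
    df.show()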

Re: Get ImportError when I run pyspark with IPYTHON=1

2015-02-26 Thread Jey Kottalam
Hi Sourabh, could you try it with the stable 2.4 version of IPython? On Thu, Feb 26, 2015 at 8:54 PM, sourabhguha sourabh.g...@hotmail.com wrote: http://apache-spark-user-list.1001560.n3.nabble.com/file/n21843/pyspark_error.jpg I get the above error when I try to run pyspark with the ipython

Re: reduceByKey vs countByKey

2015-02-24 Thread Jey Kottalam
Hi Sathish, The current implementation of countByKey uses reduceByKey: https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332 It seems that countByKey is mostly deprecated: https://issues.apache.org/jira/browse/SPARK-3994 -Jey On Tue,
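A small PySpark example of the relationship being pointed out; countByKey and an explicit reduceByKey produce the same per-key counts:

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="CountByKeyExample")
    pairs = sc.parallelize([("a", 10), ("b", 20), ("a", 30)])

    # built-in action
    counts_action = pairs.countByKey()                              # {'a': 2, 'b': 1}

    # equivalent formulation in terms of reduceByKey
    counts_manual = pairs.mapValues(lambda _: 1).reduceByKey(add).collectAsMap()

    assert dict(counts_action) == counts_manual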

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Jey Kottalam
Hi Aris, A simple approach to gaining some of the benefits of an RBF kernel is to add synthetic features to your training set. For example, if your original data consists of 3-dimensional vectors [x, y, z], you could compute a new 9-dimensional feature vector containing [x, y, z, x^2, y^2, z^2,
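A hedged sketch of that feature expansion; the message above is cut off after z^2, so the pairwise product terms used here to reach 9 dimensions are an assumption for illustration:

    from pyspark.mllib.regression import LabeledPoint

    def expand_features(p):
        # p is a LabeledPoint whose features are the 3-dimensional vector [x, y, z]
        x, y, z = p.features[0], p.features[1], p.features[2]
        # assumed expansion: original terms, squares, and pairwise products
        expanded = [x, y, z, x * x, y * y, z * z, x * y, y * z, x * z]
        return LabeledPoint(p.label, expanded)

    # training_data is a hypothetical RDD[LabeledPoint]; the expanded RDD can then be
    # fed to the existing linear SVM (e.g. SVMWithSGD.train)
    expanded_training_data = training_data.map(expand_features)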

Re: EC2 instances missing SSD drives randomly?

2014-08-19 Thread Jey Kottalam
I think you have to explicitly list the ephemeral disks in the device map when launching the EC2 instance. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/block-device-mapping-concepts.html On Tue, Aug 19, 2014 at 11:54 AM, Andras Barjak andras.bar...@lynxanalytics.com wrote: Hi, Using the
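A hedged boto3 sketch of listing the ephemeral disks explicitly in the block device mapping at launch time (the spark-ec2 script itself is not shown); the AMI ID, instance type, and device names are placeholders:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    ec2.run_instances(
        ImageId="ami-12345678",        # placeholder AMI
        InstanceType="m3.xlarge",      # an instance type with instance-store volumes
        MinCount=1,
        MaxCount=1,
        # the ephemeral (instance-store) SSDs only appear if mapped explicitly here
        BlockDeviceMappings=[
            {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
            {"DeviceName": "/dev/sdc", "VirtualName": "ephemeral1"},
        ],
    )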

Re: Anaconda Spark AMI

2014-07-03 Thread Jey Kottalam
Hi Ben, Has the PYSPARK_PYTHON environment variable been set in spark/conf/spark-env.sh to the path of the new python binary? FYI, there's a /root/copy-dirs script that can be handy when updating files on an already-running cluster. You'll want to restart the spark cluster for the changes to

Re: Executors not utilized properly.

2014-06-17 Thread Jey Kottalam
Hi Abhishek, "Where mapreduce is taking 2 mins, spark is taking 5 min to complete the job." Interesting. Could you tell us more about your program? A code skeleton would certainly be helpful. Thanks! -Jey On Tue, Jun 17, 2014 at 3:21 PM, abhiguruvayya sharath.abhis...@gmail.com wrote: I did

Re: Local file being refrenced in mapper function

2014-05-30 Thread Jey Kottalam
Hi Rahul, Marcelo's explanation is correct. Here's a possible approach to your program, in pseudo-Python: # connect to Spark cluster sc = SparkContext(...) # load input data input_data = load_xls(file(input.xls)) input_rows = input_data['Sheet1'].rows # create RDD on cluster input_rdd =
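A slightly fuller, still hedged, version of that sketch: load_xls stands in for whatever XLS-reading library is used on the driver, process_row is a hypothetical user function, and the parallelize call completing the cut-off last line is an assumption:

    from pyspark import SparkContext

    # connect to Spark cluster
    sc = SparkContext(appName="XlsExample")

    # load the input data locally on the driver (load_xls is a stand-in XLS reader)
    input_data = load_xls(open("input.xls", "rb"))
    input_rows = input_data["Sheet1"].rows

    # ship the already-parsed rows to the cluster as an RDD
    input_rdd = sc.parallelize(input_rows)

    # the mapper then works on plain row objects, not on the local file itself
    results = input_rdd.map(process_row).collect()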

Re: help

2014-04-25 Thread Jey Kottalam
Sorry, but I don't know where Cloudera puts the executor log files. Maybe their docs give the correct path? On Fri, Apr 25, 2014 at 12:32 PM, Joe L selme...@yahoo.com wrote: Hi, thank you for your reply, but I could not find it. It says no such file or directory