PySpark + virtualenv: Using a different python path on the driver and on the executors

2017-02-25 Thread Tomer Benyamini
Hello, I'm trying to run pyspark using the following setup: spark 1.6.1 standalone cluster on ec2; virtualenv installed on the master; the app is run using the following commands: export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python; export PYSPARK_PYTHON=/usr/bin/python

Driver zombie process (standalone cluster)

2016-06-29 Thread Tomer Benyamini
Hi, I'm trying to run spark applications on a standalone cluster running on top of AWS. Since my slaves are spot instances, they are sometimes killed and lost when the bid price is exceeded. When apps are running during such an event, the spark application sometimes dies - and the driver process just

question about resource allocation on the spark standalone cluster

2015-07-01 Thread Tomer Benyamini
Hello spark-users, I would like to use the spark standalone cluster for multi-tenancy, running multiple apps at the same time. The issue is that when submitting an app to the spark standalone cluster, you cannot pass --num-executors as on yarn, but only --total-executor-cores. This may cause
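
A minimal Java sketch of the usual workaround: in standalone mode an app takes all available cores unless capped, so each tenant caps its own app via spark.cores.max (the config equivalent of --total-executor-cores). The class name and numbers below are illustrative, not from the thread.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CappedApp { // illustrative name
        public static void main(String[] args) {
            // Cap this app's total cores so other tenants can still get
            // executors on the same standalone cluster; without
            // spark.cores.max a standalone app grabs all free cores.
            SparkConf conf = new SparkConf()
                    .setAppName("capped-app")
                    .set("spark.cores.max", "8")         // == --total-executor-cores 8
                    .set("spark.executor.memory", "2g"); // per-executor memory
            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... job code ...
            sc.stop();
        }
    }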

running 2 spark applications in parallel on yarn

2015-02-01 Thread Tomer Benyamini
Hi all, I'm running spark 1.2.0 on a 20-node Yarn emr cluster. I've noticed that whenever I run a heavy computation job in parallel with other jobs, I get exceptions like: [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager - Lost task 820.0 in

Re: custom spark app name in yarn-cluster mode

2014-12-15 Thread Tomer Benyamini
there. I believe passing it with the --name property to spark-submit should work. -Sandy On Thu, Dec 11, 2014 at 10:28 AM, Tomer Benyamini tomer@gmail.com wrote: On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to set a custom spark app name

custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode. SparkConf sparkConf = new SparkConf(); sparkConf.setMaster(System.getProperty("spark.master")); sparkConf.setAppName("myCustomName"); sparkConf.set("spark.logConf", "true"); JavaSparkContext sc =
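
A hedged Java sketch of why this typically fails in yarn-cluster mode: the YARN application is registered before the user's main() runs, so a name set in code arrives too late, and the name has to be supplied at submit time instead (as the reply in this thread suggests). The class name is illustrative.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MyApp { // illustrative name
        public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf();
            // In yarn-cluster mode the YARN application is registered
            // before this code runs, so setAppName() cannot rename it;
            // pass the name at submit time instead, e.g.
            //   spark-submit --name myCustomName ...
            sparkConf.setAppName("myCustomName"); // effective in client/local modes
            sparkConf.set("spark.logConf", "true");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);
            // ... job code ...
            sc.stop();
        }
    }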

Re: custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode. SparkConf sparkConf = new SparkConf(); sparkConf.setMaster(System.getProperty("spark.master

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-29 Thread Tomer Benyamini
, org.apache.hadoop.fs.s3native.NativeS3FileSystem) On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini tomer@gmail.com wrote: Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem
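
A minimal Java sketch of what the question asks for, assuming the standard Hadoop convention that the fs.<scheme>.impl key binds a URI scheme to a FileSystem class at runtime; the app name is illustrative.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class S3ImplSelection {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("s3-impl-selection"));
            // Bind the s3n:// scheme to a concrete FileSystem class;
            // any custom implementation on the classpath could be
            // substituted for the stock one named in the thread.
            sc.hadoopConfiguration().set("fs.s3n.impl",
                    "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        }
    }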

S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Hello, I'm building a spark app that needs to read large amounts of log files from s3. I do so in the code by constructing the file list and passing it to the context as follows: val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2, ... ,s3n://mybucket/fileN") When running
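
A hedged Java sketch of the same pattern: textFile() takes a single string, so multiple inputs are joined with commas before the call. The bucket and file names are the placeholders from the post.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MultiPathRead {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("multi-path-read")); // illustrative name
            // textFile() accepts one comma-separated string of paths,
            // so the file list is joined before the call.
            List<String> paths = Arrays.asList(
                    "s3n://mybucket/file1", "s3n://mybucket/file2");
            JavaRDD<String> myRDD = sc.textFile(String.join(",", paths));
            System.out.println(myRDD.count());
            sc.stop();
        }
    }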

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-26 Thread Tomer Benyamini
Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer On Wed, Nov 26, 2014 at 11:08 AM, lalit1303

Rdd of Rdds

2014-10-22 Thread Tomer Benyamini
Hello, I would like to parallelize my work on multiple RDDs I have. I wanted to know if spark can support a foreach on an RDD of RDDs. Here's a java example: public static void main(String[] args) { SparkConf sparkConf = new SparkConf().setAppName("testapp");
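
RDDs cannot contain other RDDs - transformations need the driver's SparkContext, which is not available inside tasks - so the usual workaround is a driver-side collection of RDDs iterated with an ordinary loop. A minimal sketch (the data values are illustrative):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddOfRdds {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(
                    new SparkConf().setAppName("testapp"));
            // A List<JavaRDD<...>> on the driver replaces the unsupported
            // "RDD of RDDs"; each action in the loop runs as its own job.
            List<JavaRDD<Integer>> rdds = new ArrayList<>();
            rdds.add(sc.parallelize(Arrays.asList(1, 2, 3)));
            rdds.add(sc.parallelize(Arrays.asList(4, 5, 6)));
            for (JavaRDD<Integer> rdd : rdds) {
                System.out.println(rdd.count());
            }
            sc.stop();
        }
    }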

Spark-jobserver for java apps

2014-10-20 Thread Tomer Benyamini
Hi, I'm working on the problem of remotely submitting apps to the spark master. I'm trying to use the spark-jobserver project (https://github.com/ooyala/spark-jobserver) for that purpose. For scala apps, things seem to work smoothly, but for java apps I have an issue with implementing

Cannot read from s3 using sc.textFile

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple spark java app: SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
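
A runnable Java sketch of the snippet, assuming (as the follow-up in this thread suggests) that the fix is setting the credential keys that match the URI scheme - fs.s3n.* for s3n:// paths, fs.s3.* for s3:// paths. The bucket and key values are placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class TestApp {
        public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("TestApp");
            sparkConf.setMaster("local");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);
            // The keys must match the URI scheme being read:
            // s3n:// paths read fs.s3n.*, s3:// paths read fs.s3.*.
            sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "XX");
            sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "XX");
            JavaRDD<String> lines = sc.textFile("s3n://mybucket/file1");
            System.out.println(lines.count());
            sc.stop();
        }
    }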

Fwd: Cannot read from s3 using sc.textFile

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple spark java app: SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");

MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat: outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class); but I'm getting this compilation error: Bound mismatch: The generic method
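
The likely cause of the bound mismatch: saveAsNewAPIHadoopFile expects an org.apache.hadoop.mapreduce.OutputFormat, while MultipleTextOutputFormat lives in the old org.apache.hadoop.mapred API. A hedged Java sketch of the common workaround - the old-API saveAsHadoopFile with a small subclass (the class name and naming scheme are illustrative):

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Old-API (org.apache.hadoop.mapred) output format that routes each
    // record to a directory named after its key.
    public class KeyBasedOutput extends MultipleTextOutputFormat<String, String> {
        @Override
        protected String generateFileNameForKeyValue(String key, String value, String name) {
            return key + "/" + name;
        }
    }

    // Usage - saveAsHadoopFile takes old-API output formats, unlike
    // saveAsNewAPIHadoopFile:
    //   outRdd.saveAsHadoopFile("/tmp", String.class, String.class, KeyBasedOutput.class);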

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
, 2014 at 10:53 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat: outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class); but I'm getting this compilation

Upgrading a standalone cluster on ec2 from 1.0.2 to 1.1.0

2014-09-15 Thread Tomer Benyamini
Hi, I would like to upgrade a standalone cluster to 1.1.0. What's the best way to do it? Should I just replace the existing /root/spark folder with the uncompressed folder from http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz ? What about hdfs and other installations? I have spark

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374) Any idea? Thanks! Tomer On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote: If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh. On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote: ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but I'm still getting the same error when trying to run distcp

Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
) running with datanode process -- Ye Xianjin Sent with Sparrow On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote: Still no luck, even when running stop-all.sh followed by start-all.sh. On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote

Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Hi, I would like to make sure I'm not exceeding the quota on the local cluster's hdfs. I have a couple of questions: 1. How do I know the quota? Here's the output of hadoop fs -count -q, which essentially does not tell me a lot: [root@ip-172-31-7-49 ~]$ hadoop fs -count -q / 2147483647
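
A hedged Java sketch of reading the same numbers programmatically, assuming the standard Hadoop FileSystem API; a value of -1 means no quota has been set on the path (quotas themselves are set by an admin, e.g. with hdfs dfsadmin -setSpaceQuota).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class QuotaCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Same figures that hadoop fs -count -q prints; -1 means
            // no quota is set on the path.
            ContentSummary cs = fs.getContentSummary(new Path("/"));
            System.out.println("name quota:     " + cs.getQuota());
            System.out.println("space quota:    " + cs.getSpaceQuota());
            System.out.println("space consumed: " + cs.getSpaceConsumed());
        }
    }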

Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Thanks! I found the hdfs ui via this port - http://[master-ip]:50070/. It shows a 1-node hdfs, though, although I have 4 slaves in my cluster. Any idea why? On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On 9/7/2014 7:27 AM, Tomer Benyamini wrote: 2. What should

distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer mapreduce.Cluster

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
you have a mapreduce cluster on your hdfs? And from the error message, it seems that you didn't specify your jobtracker address. -- Ye Xianjin Sent with Sparrow On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote: Hi, I would like to copy log files from s3 to the cluster's