PySpark + virtualenv: Using a different python path on the driver and on the executors
Hello, I'm trying to run pyspark with the following setup:
- spark 1.6.1 standalone cluster on ec2
- virtualenv installed on the master
- the app is submitted using the following commands:

export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python
export PYSPARK_PYTHON=/usr/bin/python
/root/spark/bin/spark-submit --py-files mypackage.tar.gz myapp.py

I'm getting the following error:

java.io.IOException: Cannot run program "/path_to_virtualenv/bin/python": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)

It looks like the executor process did not pick up the PYSPARK_PYTHON setting, but instead used the same python executable as the driver (the virtualenv python) rather than "/usr/bin/python". What am I doing wrong here? Thanks, Tomer
Driver zombie process (standalone cluster)
Hi, I'm trying to run spark applications on a standalone cluster running on top of AWS. Since my slaves are spot instances, they are sometimes killed and lost when the bid price is exceeded. When apps are running while this happens, the spark application sometimes dies - but the driver process just hangs and stays up forever (a zombie process), holding memory / cpu resources on the master machine. We then have to kill -9 it manually to free these resources. Has anyone seen this kind of problem before? Any suggested workaround? Thanks, Tomer
question about resource allocation on the spark standalone cluster
Hello spark-users, I would like to use the spark standalone cluster for multiple tenants, running multiple apps at the same time. The issue is that when submitting an app to the spark standalone cluster you cannot pass --num-executors as on yarn, only --total-executor-cores. *This may cause starvation when submitting multiple apps*. Here's an example. Say I have a cluster of 4 machines, each with 20GB RAM and 4 cores. If I submit with --total-executor-cores=4 and --executor-memory=20GB, I may get either of these two extreme resource allocations:
- 4 workers (on 4 machines) with 1 core and 20GB each, blocking the entire cluster
- 1 worker (on 1 machine) with 4 cores and 20GB on that machine, leaving 3 machines free to be used by other apps.
Is there a way to restrict / push the standalone cluster towards the 2nd strategy (use all cores of a given worker before using a second worker)? A workaround we tried is to set SPARK_WORKER_CORES to 1, SPARK_WORKER_MEMORY to 5GB and SPARK_WORKER_INSTANCES to 4, but this is suboptimal since it runs 4 worker instances on one machine, which adds JVM overhead and does not allow sharing memory across partitions on the same worker. Thanks, Tomer
running 2 spark applications in parallel on yarn
Hi all, I'm running spark 1.2.0 on a 20-node Yarn emr cluster. I've noticed that whenever I run a heavy computation job in parallel with other jobs, I get these kinds of exceptions:

* [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager - Lost task 820.0 in stage 175.0 (TID 11327) on executor xxx: java.io.IOException (Failed to connect to xx:35194) [duplicate 12]
* org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 12
* org.apache.spark.shuffle.FetchFailedException: Failed to connect to x:35194 Caused by: java.io.IOException: Failed to connect to x:35194

When running the heavy job alone on the cluster, I'm not getting any errors. My guess is that spark contexts from different apps do not share information about taken ports, and therefore collide on specific ports, causing the job/stage to fail. Is there a way to assign a specific set of executors to a specific spark job via spark-submit, or is there a way to define a range of ports to be used by the application? Thanks! Tomer
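If the collisions are indeed on ports, one option is to pin the ports each application uses and allow retries on conflicts. A minimal Java sketch using the standard Spark networking properties - the port values below are arbitrary placeholders and would have to differ per application:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PortPinnedApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("heavy-job");
        conf.set("spark.driver.port", "40000");        // driver RPC endpoint
        conf.set("spark.blockManager.port", "40010");  // block / shuffle transfers
        conf.set("spark.port.maxRetries", "32");       // try successive ports on conflict
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}

As far as I know there is no spark-submit flag that binds a job to a specific set of executors; on YARN the usual levers are --num-executors together with separate YARN queues.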
Re: custom spark app name in yarn-cluster mode
Thanks Sandy, passing --name works fine :) Tomer

On Fri, Dec 12, 2014 at 9:35 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Tomer, In yarn-cluster mode, the application has already been submitted to YARN by the time the SparkContext is created, so it's too late to set the app name there. I believe giving it with the --name property to spark-submit should work. -Sandy

On Thu, Dec 11, 2014 at 10:28 AM, Tomer Benyamini tomer@gmail.com wrote: On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode.

SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(System.getProperty("spark.master"));
sparkConf.setAppName("myCustomName");
sparkConf.set("spark.logConf", "true");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Apparently this only works when running in yarn-client mode; in yarn-cluster mode the app name is the class name, when viewing the app in the cluster manager UI. Any idea? Thanks, Tomer
custom spark app name in yarn-cluster mode
Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode.

SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(System.getProperty("spark.master"));
sparkConf.setAppName("myCustomName");
sparkConf.set("spark.logConf", "true");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Apparently this only works when running in yarn-client mode; in yarn-cluster mode the app name is the class name, when viewing the app in the cluster manager UI. Any idea? Thanks, Tomer
Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Thanks - this is very helpful!

On Thu, Nov 27, 2014 at 5:20 AM, Michael Armbrust mich...@databricks.com wrote: In the past I have worked around this problem by avoiding sc.textFile(). Instead I read the data directly inside of a Spark job. Basically, you start with an RDD where each entry is a file in S3 and then flatMap that with something that reads the files and returns the lines. Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe Using this class you can do something like:

sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... :: Nil).flatMap(new ReadLinesSafe(_))

You can also build up the list of files by running a Spark job: https://gist.github.com/marmbrus/15e72f7bc22337cf6653 Michael

On Wed, Nov 26, 2014 at 9:23 AM, Aaron Davidson ilike...@gmail.com wrote: Spark has a known problem where it will do a pass of metadata on a large number of small files serially, in order to find the partition information prior to starting the job. This will probably not be repaired by switching the FS impl. However, you can change the FS being used like so (prior to the first usage):

sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini tomer@gmail.com wrote: Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer

On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com wrote: you can try creating hadoop Configuration and set s3 configuration i.e. access keys etc. Now, for reading files from s3 use newAPIHadoopFile and pass the config object here along with key, value classes. - Lalit Yadav la...@sigmoidanalytics.com
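A rough Java equivalent of the pattern Michael describes (parallelize the list of paths and read each file inside a task) could look like the sketch below. It assumes the S3 credentials are available to the Hadoop configuration on the executors, and it omits the error handling that ReadLinesSafe adds. Note that in Spark 1.x FlatMapFunction.call returns an Iterable; newer releases changed it to an Iterator.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class ReadS3Files {
    // usage: linesOf(sc, Arrays.asList("s3n://mybucket/file1", "s3n://mybucket/file2")).count();
    public static JavaRDD<String> linesOf(JavaSparkContext sc, List<String> paths) {
        // Each task opens one file and returns its lines, so the reads run in parallel.
        return sc.parallelize(paths).flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String path) throws Exception {
                FileSystem fs = FileSystem.get(URI.create(path), new Configuration());
                List<String> lines = new ArrayList<String>();
                BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(new Path(path))));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
                reader.close();
                return lines;
            }
        });
    }
}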
S3NativeFileSystem inefficient implementation when calling sc.textFile
Hello, I'm building a spark app that needs to read large amounts of log files from s3. I'm doing so in the code by constructing the file list and passing it to the context as follows:

val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2, ... ,s3n://mybucket/fileN")

When running it locally there are no issues, but when running it on the yarn-cluster (running spark 1.1.0, hadoop 2.4), I'm seeing an inefficient linear piece of code running, which could probably be easily parallelized:

[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file1
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file2
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file3
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/fileN

I believe there are some differences between my local classpath and the cluster's classpath - locally I see that *org.apache.hadoop.fs.s3native.NativeS3FileSystem* is being used, whereas on the cluster *com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem* is being used. Any suggestions? Thanks, Tomer
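A Java version of the configuration switch suggested in the replies above (forcing the Hadoop-bundled s3n implementation instead of the EMR one) is a single setting before the first read; as noted in the thread, this does not by itself remove the serial listStatus pass. The snippet assumes an existing JavaSparkContext named sc:

// Must be set before the first s3n:// access, otherwise the default implementation is used.
sc.hadoopConfiguration().set("fs.s3n.impl",
    "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
JavaRDD<String> myRDD =
    sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2, ... ,s3n://mybucket/fileN");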
Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com wrote: you can try creating hadoop Configuration and set s3 configuration i.e. access keys etc. Now, for reading files from s3 use newAPIHadoopFile and pass the config object here along with key, value classes. - Lalit Yadav la...@sigmoidanalytics.com
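For reference, a Java sketch of Lalit's newAPIHadoopFile suggestion might look like this (assuming an existing JavaSparkContext named sc; the key/secret values are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

Configuration hadoopConf = new Configuration();
hadoopConf.set("fs.s3n.awsAccessKeyId", "XX");      // placeholder
hadoopConf.set("fs.s3n.awsSecretAccessKey", "XX");  // placeholder

// TextInputFormat yields (byte offset, line) pairs.
JavaPairRDD<LongWritable, Text> pairs = sc.newAPIHadoopFile(
    "s3n://mybucket/file1", TextInputFormat.class,
    LongWritable.class, Text.class, hadoopConf);

// Keep only the line contents.
JavaRDD<String> lines = pairs.map(new Function<Tuple2<LongWritable, Text>, String>() {
    @Override
    public String call(Tuple2<LongWritable, Text> kv) {
        return kv._2().toString();
    }
});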
RDD of RDDs
Hello, I would like to parallelize my work on multiple RDDs I have. I wanted to know if spark can support a foreach on an RDD of RDDs. Here's a java example:

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("testapp");
    sparkConf.setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
    JavaRDD<String> rdd = sc.parallelize(list);
    List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
    JavaRDD<String> rdd1 = sc.parallelize(list1);
    List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
    rddList.add(rdd);
    rddList.add(rdd1);
    JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
    System.out.println(rddOfRdds.count());
    rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {
        @Override
        public void call(JavaRDD<String> t) throws Exception {
            System.out.println(t.count());
        }
    });
}

From this code I'm getting a NullPointerException on the internal count method:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.NullPointerException org.apache.spark.rdd.RDD.count(RDD.scala:861) org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365) org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

Help will be appreciated. Thanks, Tomer
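Nested RDDs aren't supported - RDD operations can only be invoked from the driver, so the count() inside the foreach runs on an executor where the inner RDD is no longer backed by a live context, which is the usual cause of NPEs like this one. A common workaround is to keep an ordinary Java list of RDDs on the driver and loop over it (each action becomes a separate job), or to union the RDDs if a single combined job is acceptable. A minimal sketch:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PerRddCounts {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("testapp").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("1", "2", "3"));
        JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Keep the collection of RDDs on the driver; each count() is its own job.
        List<JavaRDD<String>> rdds = Arrays.asList(rdd, rdd1);
        for (JavaRDD<String> r : rdds) {
            System.out.println(r.count());
        }

        // Alternatively, combine them and run a single job:
        System.out.println(rdd.union(rdd1).count());

        sc.stop();
    }
}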
Spark-jobserver for java apps
Hi, I'm working on the problem of remotely submitting apps to the spark master. I'm trying to use the spark-jobserver project (https://github.com/ooyala/spark-jobserver) for that purpose. For scala apps things look like they are working smoothly, but for java apps I have an issue with implementing the scala trait SparkJob in java. Specifically, I'm trying to implement the validate method like this:

@Override
public SparkJobValidation validate(SparkContext sc, Config conf) {
    return new SparkJobValid();
}

I'm getting the following compilation error: Type mismatch: cannot convert from SparkJobValid to SparkJobValidation. Would love some advice / a working example. Thanks, Tomer
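If SparkJobValid is a Scala case object in the jobserver API (which the error suggests), the Java side has to reference the generated singleton instance rather than call new. A hedged sketch - the import paths reflect an assumption about the jobserver's package layout, so adjust them to the actual API:

import com.typesafe.config.Config;
import org.apache.spark.SparkContext;
import spark.jobserver.SparkJobValid$;
import spark.jobserver.SparkJobValidation;

// Inside the class that implements the SparkJob trait:
@Override
public SparkJobValidation validate(SparkContext sc, Config conf) {
    // Scala case objects expose their singleton as the static MODULE$ field when used from Java.
    return SparkJobValid$.MODULE$;
}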
Cannot read from s3 using sc.textFile
Hello, I'm trying to read from s3 using a simple spark java app:

SparkConf sparkConf = new SparkConf().setAppName("TestApp");
sparkConf.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XX");
String path = "s3://bucket/test/testdata";
JavaRDD<String> textFile = sc.textFile(path);
System.out.println(textFile.count());

But I'm getting this error:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/test/testdata at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097) at org.apache.spark.rdd.RDD.count(RDD.scala:861) at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365) at org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

Looking at the debug log I see that org.jets3t.service.impl.rest.httpclient.RestS3Service returned a 404 error trying to locate the file. Using a simple java program with com.amazonaws.services.s3.AmazonS3Client works just fine. Any idea? Thanks, Tomer
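One thing worth checking, shown below as a hedged sketch rather than a confirmed fix: outside of EMR, the stock Hadoop client typically exposes S3 objects through the s3n:// scheme, and the credential property names are scheme-specific, so an s3:// URI combined with fs.s3.* keys may simply fail to resolve the object. Bucket, path and keys are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3NRead {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // The s3n:// scheme reads its credentials from the fs.s3n.* property names.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "XX");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "XX");

        JavaRDD<String> textFile = sc.textFile("s3n://bucket/test/testdata");
        System.out.println(textFile.count());
        sc.stop();
    }
}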
MultipleTextOutputFormat with new hadoop API
Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat:

outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class);

but I'm getting this compilation error:

Bound mismatch: The generic method saveAsNewAPIHadoopFile(String, Class<?>, Class<?>, Class<F>) of type JavaPairRDD<K,V> is not applicable for the arguments (String, Class<String>, Class<String>, Class<MultipleTextOutputFormat>). The inferred type MultipleTextOutputFormat is not a valid substitute for the bounded parameter F extends OutputFormat<?,?>

I bumped into some discussions suggesting to use MultipleOutputs (http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html), but this also fails for the same reason. Would love some assistance :) Thanks, Tomer
Re: MultipleTextOutputFormat with new hadoop API
Yes exactly.. so I guess this is still an open request. Any workaround?

On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Are you trying to do something along the lines of what's described here? https://issues.apache.org/jira/browse/SPARK-3533

On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat:

outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class);

but I'm getting this compilation error:

Bound mismatch: The generic method saveAsNewAPIHadoopFile(String, Class<?>, Class<?>, Class<F>) of type JavaPairRDD<K,V> is not applicable for the arguments (String, Class<String>, Class<String>, Class<MultipleTextOutputFormat>). The inferred type MultipleTextOutputFormat is not a valid substitute for the bounded parameter F extends OutputFormat<?,?>

I bumped into some discussions suggesting to use MultipleOutputs (http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html), but this also fails for the same reason. Would love some assistance :) Thanks, Tomer
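One possible workaround, if the old mapred API is acceptable, is to stay on saveAsHadoopFile (which expects an org.apache.hadoop.mapred.OutputFormat) and subclass MultipleTextOutputFormat to choose the output file per key. A sketch, assuming a JavaPairRDD<String, String> named outRdd; the key-to-directory mapping is just an example:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;

public class KeyedOutput {

    // Routes each record to a file named after its key (old mapred API).
    public static class KeyBasedTextOutput extends MultipleTextOutputFormat<String, String> {
        @Override
        protected String generateFileNameForKeyValue(String key, String value, String name) {
            return key + "/" + name;
        }
    }

    public static void save(JavaPairRDD<String, String> outRdd) {
        outRdd.saveAsHadoopFile("/tmp", String.class, String.class, KeyBasedTextOutput.class);
    }
}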
Upgrading a standalone cluster on ec2 from 1.0.2 to 1.1.0
Hi, I would like to upgrade a standalone cluster to 1.1.0. What's the best way to do it? Should I just replace the existing /root/spark folder with the uncompressed folder from http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz ? What about hdfs and other installations? I have spark 1.0.2 with cdh4 hadoop 2.0 installed currently. Thanks, Tomer
Re: distcp on ec2 standalone spark cluster
~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea? Thanks! Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote: If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Re: distcp on ec2 standalone spark cluster
Still no luck, even when running stop-all.sh followed by start-all.sh.

On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Tomer, Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too. Nick

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote: ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea? Thanks! Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote: If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Re: distcp on ec2 standalone spark cluster
No tasktracker or nodemanager. This is what I see:

On the master:
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.namenode.NameNode

On the data node (slave):
org.apache.hadoop.hdfs.server.datanode.DataNode

On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote: what did you see in the log? was there anything related to mapreduce? can you log into your hdfs (data) node, use jps to list all java process and confirm whether there is a tasktracker process (or nodemanager) running with datanode process -- Ye Xianjin Sent with Sparrow

On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote: Still no luck, even when running stop-all.sh followed by start-all.sh.

On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Tomer, Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too. Nick

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote: ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea? Thanks! Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote: If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2
Hi, I would like to make sure I'm not exceeding the quota on the local cluster's hdfs. I have a couple of questions:

1. How do I know the quota? Here's the output of hadoop fs -count -q, which essentially does not tell me a lot:

[root@ip-172-31-7-49 ~]$ hadoop fs -count -q /
2147483647 2147482006 none inf 4 163725412205559 /

2. What should I do to increase the quota? Should I bring down the existing slaves and upgrade to ones with more storage? Is there a way to add disks to existing slaves? I'm using the default m1.large slaves set up using the spark-ec2 script. Thanks, Tomer
Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2
Thanks! I found the hdfs ui via this port - http://[master-ip]:50070/. It shows 1 node hdfs though, although I have 4 slaves on my cluster. Any idea why? On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On 9/7/2014 7:27 AM, Tomer Benyamini wrote: 2. What should I do to increase the quota? Should I bring down the existing slaves and upgrade to ones with more storage? Is there a way to add disks to existing slaves? I'm using the default m1.large slaves set up using the spark-ec2 script. Take a look at: http://www.ec2instances.info/ There you will find the available EC2 instances with their associated costs and how much ephemeral space they come with. Once you pick an instance you get only so much ephemeral space. You can always add drives but they will be EBS and not physically attached to the instance. Ognen
distcp on ec2 standalone spark cluster
Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Re: distcp on ec2 standalone spark cluster
I've installed a spark standalone cluster on ec2 as defined here - https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if mr1/2 is part of this installation.

On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin advance...@gmail.com wrote: Distcp requires a mr1 (or mr2) cluster to start. Do you have a mapreduce cluster on your hdfs? And from the error message, it seems that you didn't specify your jobtracker address. -- Ye Xianjin Sent with Sparrow

On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)