PySpark + virtualenv: Using a different python path on the driver and on the executors
Hello, I'm trying to run pyspark with the following setup:
- spark 1.6.1 standalone cluster on ec2
- virtualenv installed on the master
- the app is submitted using the following commands:

export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python
export PYSPARK_PYTHON=/usr/bin/python
/root/spark/bin/spark-submit --py-files mypackage.tar.gz myapp.py

I'm getting the following error:

java.io.IOException: Cannot run program "/path_to_virtualenv/bin/python": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)

It looks like the executor process did not pick up the PYSPARK_PYTHON setting, but instead used the same python executable as the driver (the virtualenv python) rather than "/usr/bin/python". What am I doing wrong here? Thanks, Tomer
Driver zombie process (standalone cluster)
Hi, I'm trying to run spark applications on a standalone cluster running on top of AWS. Since my slaves are spot instances, they are sometimes killed and lost when the bid price is exceeded. When apps are running while this happens, the spark application sometimes dies - but the driver process just hangs and stays up forever (a zombie process), holding memory / cpu resources on the master machine. We then have to kill -9 it manually to free these resources. Has anyone seen this kind of problem before? Any suggested workaround? Thanks, Tomer
question about resource allocation on the spark standalone cluster
Hello spark-users, I would like to use the spark standalone cluster for multiple tenants, running multiple apps at the same time. The issue is that when submitting an app to the spark standalone cluster you cannot pass --num-executors as on yarn, only --total-executor-cores. *This may cause starvation when submitting multiple apps*. Here's an example. Say I have a cluster of 4 machines, each with 20GB RAM and 4 cores. If I submit with --total-executor-cores=4 and --executor-memory=20GB, I may get either of these two extreme resource allocations:
- 4 workers (on 4 machines) with 1 core and 20GB each, blocking the entire cluster
- 1 worker (on 1 machine) with 4 cores and 20GB on that machine, leaving 3 machines free to be used by other apps.
Is there a way to restrict / push the standalone cluster towards the 2nd strategy (use all cores of a given worker before using a second worker)? A workaround we tried is to set SPARK_WORKER_CORES to 1, SPARK_WORKER_MEMORY to 5GB and SPARK_WORKER_INSTANCES to 4, but this is suboptimal since it runs 4 worker instances on one machine, which adds JVM overhead and does not allow sharing memory across partitions on the same worker. Thanks, Tomer
running 2 spark applications in parallel on yarn
Hi all, I'm running spark 1.2.0 on a 20-node Yarn emr cluster. I've noticed that whenever I run a heavy computation job in parallel with other jobs, I get these kinds of exceptions:

* [task-result-getter-2] INFO org.apache.spark.scheduler.TaskSetManager - Lost task 820.0 in stage 175.0 (TID 11327) on executor xxx: java.io.IOException (Failed to connect to xx:35194) [duplicate 12]
* org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 12
* org.apache.spark.shuffle.FetchFailedException: Failed to connect to x:35194 Caused by: java.io.IOException: Failed to connect to x:35194

When running the heavy job alone on the cluster, I'm not getting any errors. My guess is that spark contexts from different apps do not share information about taken ports, and therefore collide on specific ports, causing the job/stage to fail. Is there a way to assign a specific set of executors to a specific spark job via spark-submit, or is there a way to define a range of ports to be used by the application? Thanks! Tomer
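If the collisions are indeed on ports, one option is to pin the ports each application uses and allow retries on conflicts. A minimal Java sketch using the standard Spark networking properties - the port values below are arbitrary placeholders and would have to differ per application:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class PortPinnedApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("heavy-job");
        conf.set("spark.driver.port", "40000");        // driver RPC endpoint
        conf.set("spark.blockManager.port", "40010");  // block / shuffle transfers
        conf.set("spark.port.maxRetries", "32");       // try successive ports on conflict
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}

As far as I know there is no spark-submit flag that binds a job to a specific set of executors; on YARN the usual levers are --num-executors together with separate YARN queues.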
Re: custom spark app name in yarn-cluster mode
Thanks Sandy, passing --name works fine :) Tomer

On Fri, Dec 12, 2014 at 9:35 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Tomer, In yarn-cluster mode, the application has already been submitted to YARN by the time the SparkContext is created, so it's too late to set the app name there. I believe giving it with the --name property to spark-submit should work. -Sandy

On Thu, Dec 11, 2014 at 10:28 AM, Tomer Benyamini tomer@gmail.com wrote: On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode.

SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(System.getProperty("spark.master"));
sparkConf.setAppName("myCustomName");
sparkConf.set("spark.logConf", "true");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Apparently this only works when running in yarn-client mode; in yarn-cluster mode the app name is the class name, when viewing the app in the cluster manager UI. Any idea? Thanks, Tomer
custom spark app name in yarn-cluster mode
Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode.

SparkConf sparkConf = new SparkConf();
sparkConf.setMaster(System.getProperty("spark.master"));
sparkConf.setAppName("myCustomName");
sparkConf.set("spark.logConf", "true");
JavaSparkContext sc = new JavaSparkContext(sparkConf);

Apparently this only works when running in yarn-client mode; in yarn-cluster mode the app name is the class name, when viewing the app in the cluster manager UI. Any idea? Thanks, Tomer
Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Thanks - this is very helpful!

On Thu, Nov 27, 2014 at 5:20 AM, Michael Armbrust mich...@databricks.com wrote: In the past I have worked around this problem by avoiding sc.textFile(). Instead I read the data directly inside of a Spark job. Basically, you start with an RDD where each entry is a file in S3 and then flatMap that with something that reads the files and returns the lines. Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe Using this class you can do something like:

sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... :: Nil).flatMap(new ReadLinesSafe(_))

You can also build up the list of files by running a Spark job: https://gist.github.com/marmbrus/15e72f7bc22337cf6653 Michael

On Wed, Nov 26, 2014 at 9:23 AM, Aaron Davidson ilike...@gmail.com wrote: Spark has a known problem where it will do a pass of metadata on a large number of small files serially, in order to find the partition information prior to starting the job. This will probably not be repaired by switching the FS impl. However, you can change the FS being used like so (prior to the first usage):

sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini tomer@gmail.com wrote: Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer

On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com wrote: you can try creating hadoop Configuration and set s3 configuration i.e. access keys etc. Now, for reading files from s3 use newAPIHadoopFile and pass the config object here along with key, value classes. - Lalit Yadav la...@sigmoidanalytics.com
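A rough Java equivalent of the pattern Michael describes (parallelize the list of paths and read each file inside a task) could look like the sketch below. It assumes the S3 credentials are available to the Hadoop configuration on the executors, and it omits the error handling that ReadLinesSafe adds. Note that in Spark 1.x FlatMapFunction.call returns an Iterable; newer releases changed it to an Iterator.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;

public class ReadS3Files {
    // usage: linesOf(sc, Arrays.asList("s3n://mybucket/file1", "s3n://mybucket/file2")).count();
    public static JavaRDD<String> linesOf(JavaSparkContext sc, List<String> paths) {
        // Each task opens one file and returns its lines, so the reads run in parallel.
        return sc.parallelize(paths).flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String path) throws Exception {
                FileSystem fs = FileSystem.get(URI.create(path), new Configuration());
                List<String> lines = new ArrayList<String>();
                BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(new Path(path))));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
                reader.close();
                return lines;
            }
        });
    }
}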
S3NativeFileSystem inefficient implementation when calling sc.textFile
Hello, I'm building a spark app that needs to read large amounts of log files from s3. I'm doing so in the code by constructing the file list and passing it to the context as follows:

val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2, ... ,s3n://mybucket/fileN")

When running it locally there are no issues, but when running it on the yarn-cluster (running spark 1.1.0, hadoop 2.4), I'm seeing an inefficient linear piece of code running, which could probably be easily parallelized:

[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file1
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file2
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file3
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/fileN

I believe there are some differences between my local classpath and the cluster's classpath - locally I see that *org.apache.hadoop.fs.s3native.NativeS3FileSystem* is being used, whereas on the cluster *com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem* is being used. Any suggestions? Thanks, Tomer
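A Java version of the configuration switch suggested in the replies above (forcing the Hadoop-bundled s3n implementation instead of the EMR one) is a single setting before the first read; as noted in the thread, this does not by itself remove the serial listStatus pass. The snippet assumes an existing JavaSparkContext named sc:

// Must be set before the first s3n:// access, otherwise the default implementation is used.
sc.hadoopConfiguration().set("fs.s3n.impl",
    "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
JavaRDD<String> myRDD =
    sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2, ... ,s3n://mybucket/fileN");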
Re: S3NativeFileSystem inefficient implementation when calling sc.textFile
Thanks Lalit; Setting the access + secret keys in the configuration works even when calling sc.textFile. Is there a way to select which hadoop s3 native filesystem implementation would be used at runtime using the hadoop configuration? Thanks, Tomer On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 la...@sigmoidanalytics.com wrote: you can try creating hadoop Configuration and set s3 configuration i.e. access keys etc. Now, for reading files from s3 use newAPIHadoopFile and pass the config object here along with key, value classes. - Lalit Yadav la...@sigmoidanalytics.com
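For reference, a Java sketch of Lalit's newAPIHadoopFile suggestion might look like this (assuming an existing JavaSparkContext named sc; the key/secret values are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

Configuration hadoopConf = new Configuration();
hadoopConf.set("fs.s3n.awsAccessKeyId", "XX");      // placeholder
hadoopConf.set("fs.s3n.awsSecretAccessKey", "XX");  // placeholder

// TextInputFormat yields (byte offset, line) pairs.
JavaPairRDD<LongWritable, Text> pairs = sc.newAPIHadoopFile(
    "s3n://mybucket/file1", TextInputFormat.class,
    LongWritable.class, Text.class, hadoopConf);

// Keep only the line contents.
JavaRDD<String> lines = pairs.map(new Function<Tuple2<LongWritable, Text>, String>() {
    @Override
    public String call(Tuple2<LongWritable, Text> kv) {
        return kv._2().toString();
    }
});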
RDD of RDDs
Hello, I would like to parallelize my work on multiple RDDs I have. I wanted to know if spark can support a foreach on an RDD of RDDs. Here's a java example:

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("testapp");
    sparkConf.setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    List<String> list = Arrays.asList(new String[] {"1", "2", "3"});
    JavaRDD<String> rdd = sc.parallelize(list);
    List<String> list1 = Arrays.asList(new String[] {"a", "b", "c"});
    JavaRDD<String> rdd1 = sc.parallelize(list1);
    List<JavaRDD<String>> rddList = new ArrayList<JavaRDD<String>>();
    rddList.add(rdd);
    rddList.add(rdd1);
    JavaRDD<JavaRDD<String>> rddOfRdds = sc.parallelize(rddList);
    System.out.println(rddOfRdds.count());
    rddOfRdds.foreach(new VoidFunction<JavaRDD<String>>() {
        @Override
        public void call(JavaRDD<String> t) throws Exception {
            System.out.println(t.count());
        }
    });
}

From this code I'm getting a NullPointerException on the internal count method:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:0 failed 1 times, most recent failure: Exception failure in TID 1 on host localhost: java.lang.NullPointerException org.apache.spark.rdd.RDD.count(RDD.scala:861) org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365) org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

Help will be appreciated. Thanks, Tomer
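Nested RDDs aren't supported - RDD operations can only be invoked from the driver, so the count() inside the foreach runs on an executor where the inner RDD is no longer backed by a live context, which is the usual cause of NPEs like this one. A common workaround is to keep an ordinary Java list of RDDs on the driver and loop over it (each action becomes a separate job), or to union the RDDs if a single combined job is acceptable. A minimal sketch:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PerRddCounts {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("testapp").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("1", "2", "3"));
        JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Keep the collection of RDDs on the driver; each count() is its own job.
        List<JavaRDD<String>> rdds = Arrays.asList(rdd, rdd1);
        for (JavaRDD<String> r : rdds) {
            System.out.println(r.count());
        }

        // Alternatively, combine them and run a single job:
        System.out.println(rdd.union(rdd1).count());

        sc.stop();
    }
}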
Spark-jobserver for java apps
Hi, I'm working on the problem of remotely submitting apps to the spark master. I'm trying to use the spark-jobserver project (https://github.com/ooyala/spark-jobserver) for that purpose. For scala apps things look like they are working smoothly, but for java apps I have an issue with implementing the scala trait SparkJob in java. Specifically, I'm trying to implement the validate method like this:

@Override
public SparkJobValidation validate(SparkContext sc, Config conf) {
    return new SparkJobValid();
}

I'm getting the following compilation error: Type mismatch: cannot convert from SparkJobValid to SparkJobValidation. Would love some advice / a working example. Thanks, Tomer
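If SparkJobValid is a Scala case object in the jobserver API (which the error suggests), the Java side has to reference the generated singleton instance rather than call new. A hedged sketch - the import paths reflect an assumption about the jobserver's package layout, so adjust them to the actual API:

import com.typesafe.config.Config;
import org.apache.spark.SparkContext;
import spark.jobserver.SparkJobValid$;
import spark.jobserver.SparkJobValidation;

// Inside the class that implements the SparkJob trait:
@Override
public SparkJobValidation validate(SparkContext sc, Config conf) {
    // Scala case objects expose their singleton as the static MODULE$ field when used from Java.
    return SparkJobValid$.MODULE$;
}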
Cannot read from s3 using sc.textFile
Hello, I'm trying to read from s3 using a simple spark java app:

SparkConf sparkConf = new SparkConf().setAppName("TestApp");
sparkConf.setMaster("local");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");
sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "XX");
String path = "s3://bucket/test/testdata";
JavaRDD<String> textFile = sc.textFile(path);
System.out.println(textFile.count());

But I'm getting this error:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://bucket/test/testdata at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1097) at org.apache.spark.rdd.RDD.count(RDD.scala:861) at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:365) at org.apache.spark.api.java.JavaRDD.count(JavaRDD.scala:29)

Looking at the debug log I see that org.jets3t.service.impl.rest.httpclient.RestS3Service returned a 404 error trying to locate the file. Using a simple java program with com.amazonaws.services.s3.AmazonS3Client works just fine. Any idea? Thanks, Tomer
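One thing worth checking, shown below as a hedged sketch rather than a confirmed fix: outside of EMR, the stock Hadoop client typically exposes S3 objects through the s3n:// scheme, and the credential property names are scheme-specific, so an s3:// URI combined with fs.s3.* keys may simply fail to resolve the object. Bucket, path and keys are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class S3NRead {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("TestApp").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // The s3n:// scheme reads its credentials from the fs.s3n.* property names.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "XX");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "XX");

        JavaRDD<String> textFile = sc.textFile("s3n://bucket/test/testdata");
        System.out.println(textFile.count());
        sc.stop();
    }
}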
MultipleTextOutputFormat with new hadoop API
Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat:

outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class);

but I'm getting this compilation error:

Bound mismatch: The generic method saveAsNewAPIHadoopFile(String, Class<?>, Class<?>, Class<F>) of type JavaPairRDD<K,V> is not applicable for the arguments (String, Class<String>, Class<String>, Class<MultipleTextOutputFormat>). The inferred type MultipleTextOutputFormat is not a valid substitute for the bounded parameter F extends OutputFormat<?,?>

I bumped into some discussions suggesting to use MultipleOutputs (http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html), but this also fails for the same reason. Would love some assistance :) Thanks, Tomer
Re: MultipleTextOutputFormat with new hadoop API
Yes exactly.. so I guess this is still an open request. Any workaround?

On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Are you trying to do something along the lines of what's described here? https://issues.apache.org/jira/browse/SPARK-3533

On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat:

outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class);

but I'm getting this compilation error:

Bound mismatch: The generic method saveAsNewAPIHadoopFile(String, Class<?>, Class<?>, Class<F>) of type JavaPairRDD<K,V> is not applicable for the arguments (String, Class<String>, Class<String>, Class<MultipleTextOutputFormat>). The inferred type MultipleTextOutputFormat is not a valid substitute for the bounded parameter F extends OutputFormat<?,?>

I bumped into some discussions suggesting to use MultipleOutputs (http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html), but this also fails for the same reason. Would love some assistance :) Thanks, Tomer
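One possible workaround, if the old mapred API is acceptable, is to stay on saveAsHadoopFile (which expects an org.apache.hadoop.mapred.OutputFormat) and subclass MultipleTextOutputFormat to choose the output file per key. A sketch, assuming a JavaPairRDD<String, String> named outRdd; the key-to-directory mapping is just an example:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;

public class KeyedOutput {

    // Routes each record to a file named after its key (old mapred API).
    public static class KeyBasedTextOutput extends MultipleTextOutputFormat<String, String> {
        @Override
        protected String generateFileNameForKeyValue(String key, String value, String name) {
            return key + "/" + name;
        }
    }

    public static void save(JavaPairRDD<String, String> outRdd) {
        outRdd.saveAsHadoopFile("/tmp", String.class, String.class, KeyBasedTextOutput.class);
    }
}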
Upgrading a standalone cluster on ec2 from 1.0.2 to 1.1.0
Hi, I would like to upgrade a standalone cluster to 1.1.0. What's the best way to do it? Should I just replace the existing /root/spark folder with the uncompressed folder from http://d3kbcqa49mib13.cloudfront.net/spark-1.1.0-bin-cdh4.tgz ? What about hdfs and other installations? I have spark 1.0.2 with cdh4 hadoop 2.0 installed currently. Thanks, Tomer
Re: distcp on ec2 standalone spark cluster
~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea? Thanks! Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote: If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Re: distcp on ec2 standalone spark cluster
Still no luck, even when running stop-all.sh followed by start-all.sh.

On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Tomer, Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too. Nick

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote: ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea? Thanks! Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote: If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Re: distcp on ec2 standalone spark cluster
No tasktracker or nodemanager. This is what I see:

On the master:
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.namenode.NameNode

On the data node (slave):
org.apache.hadoop.hdfs.server.datanode.DataNode

On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote: what did you see in the log? was there anything related to mapreduce? can you log into your hdfs (data) node, use jps to list all java process and confirm whether there is a tasktracker process (or nodemanager) running with datanode process -- Ye Xianjin Sent with Sparrow

On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote: Still no luck, even when running stop-all.sh followed by start-all.sh.

On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Tomer, Did you try start-all.sh? It worked for me the last time I tried using distcp, and it worked for this guy too. Nick

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote: ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2; I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea? Thanks! Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote: If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh.

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2
Hi, I would like to make sure I'm not exceeding the quota on the local cluster's hdfs. I have a couple of questions:

1. How do I know the quota? Here's the output of hadoop fs -count -q, which essentially does not tell me a lot:

[root@ip-172-31-7-49 ~]$ hadoop fs -count -q /
2147483647 2147482006 none inf 4 163725412205559 /

2. What should I do to increase the quota? Should I bring down the existing slaves and upgrade to ones with more storage? Is there a way to add disks to existing slaves? I'm using the default m1.large slaves set up using the spark-ec2 script. Thanks, Tomer
Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2
Thanks! I found the hdfs ui via this port - http://[master-ip]:50070/. It shows 1 node hdfs though, although I have 4 slaves on my cluster. Any idea why? On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski ognen.duzlev...@gmail.com wrote: On 9/7/2014 7:27 AM, Tomer Benyamini wrote: 2. What should I do to increase the quota? Should I bring down the existing slaves and upgrade to ones with more storage? Is there a way to add disks to existing slaves? I'm using the default m1.large slaves set up using the spark-ec2 script. Take a look at: http://www.ec2instances.info/ There you will find the available EC2 instances with their associated costs and how much ephemeral space they come with. Once you pick an instance you get only so much ephemeral space. You can always add drives but they will be EBS and not physically attached to the instance. Ognen
distcp on ec2 standalone spark cluster
Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
Re: distcp on ec2 standalone spark cluster
I've installed a spark standalone cluster on ec2 as defined here - https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if mr1/2 is part of this installation.

On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin advance...@gmail.com wrote: Distcp requires a mr1 (or mr2) cluster to start. Do you have a mapreduce cluster on your hdfs? And from the error message, it seems that you didn't specify your jobtracker address. -- Ye Xianjin Sent with Sparrow

On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess mapred is not running on the cluster - I'm getting the exception below. Is there a way to activate it, or is there a spark alternative to distcp? Thanks, Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use org.apache.hadoop.mapred.LocalClientProtocolProvider due to error: Invalid mapreduce.jobtracker.address configuration value for LocalJobRunner : XXX:9001 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83) at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76) at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352) at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146) at org.apache.hadoop.tools.DistCp.run(DistCp.java:118) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)