Re: Invalid signature file digest for Manifest main attributes with spark job built using maven

2014-09-16 Thread Kevin Peng
Sean, Thanks. That worked. Kevin On Mon, Sep 15, 2014 at 3:37 PM, Sean Owen so...@cloudera.com wrote: This is more of a Java / Maven issue than Spark per se. I would use the shade plugin to remove signature files in your final META-INF/ dir. As Spark does, in its configuration: filters
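
For reference, the shade-plugin filter Sean refers to looks roughly like the excerpt below (a sketch modeled on Spark's own pom.xml; plugin version and surrounding configuration omitted). It strips the signature files that cause the "Invalid signature file digest" error when classes from signed jars are repackaged into an uber jar:

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <configuration>
        <filters>
          <!-- exclude signature files so the shaded (uber) jar is not treated as signed -->
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </plugin>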

Re: spark and mesos issue

2014-09-16 Thread Gurvinder Singh
It might not be related only to the memory issue. The memory issue is also there, as you mentioned; I have seen that one too. The fine-grained mode issue is mainly Spark considering that it got two different block managers for the same ID, whereas if I search for the ID in the Mesos slave, it exists only on the one

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread linkpatrickliu
Seems like the thriftServer cannot connect to Zookeeper, so it cannot get the lock. This is how the log looks when I run SparkSQL: load data inpath kv1.txt into table src; log: 14/09/16 14:40:47 INFO Driver: PERFLOG method=acquireReadWriteLocks 14/09/16 14:40:47 INFO ClientCnxn: Opening socket

SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
hi, I am trying to write some unit tests, following the spark programming guide http://spark.apache.org/docs/latest/programming-guide.html#unit-testing. But I observed the unit test runs very slowly (the code is just a SparkPi), so I turned the log level to trace and looked through the log output, and found

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread linkpatrickliu
Hi, Hao Cheng. I have done other tests. And the result shows the thriftServer can connect to Zookeeper. However, I found some more interesting things. And I think I have found a bug! Test procedure: Test1: (0) Use beeline to connect to thriftServer. (1) Switch database use dw_op1; (OK) The logs

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Thanks Christian! I tried compiling from source but am still getting the same hadoop client version error when reading from HDFS. Will have to poke deeper... perhaps I've got some classpath issues. FWIW I compiled using: $ MAVEN_OPTS=-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m

Re: SchemaRDD saveToCassandra

2014-09-16 Thread lmk
Hi Michael, Please correct me if I am wrong. The error seems to originate from spark only. Please have a look at the stack trace of the error which is as follows: [error] (run-main-0) java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class

Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
I connected my sample project to a hosted CI service, and it only takes 3 seconds to run there... while the same tests take 2 minutes on my MacBook Pro. So maybe this is a Mac OS specific problem? On Tue, Sep 16, 2014 at 3:06 PM, 诺铁 noty...@gmail.com wrote: hi, I am trying to write some unit test,

Re: Broadcast error

2014-09-16 Thread Chengi Liu
Cool.. While let me try that.. any other suggestion(s) on things I can try? On Mon, Sep 15, 2014 at 9:59 AM, Davies Liu dav...@databricks.com wrote: I think 1.1 will be really helpful for you; it's all compatible with 1.0, so it's not hard to upgrade to 1.1. On Mon, Sep 15, 2014 at

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Christian Chua
Is 1.0.8 working for you? You indicated your last known good version is 1.0.0. Maybe we can track down where it broke. On Sep 16, 2014, at 12:25 AM, Paul Wais pw...@yelp.com wrote: Thanks Christian! I tried compiling from source but am still getting the same hadoop client version

Re: Spark SQL Thrift JDBC server deployment for production

2014-09-16 Thread vasiliy
it works, thanks

Re: SparkContext creation slow down unit tests

2014-09-16 Thread 诺铁
Sorry for the disturbance, please ignore this mail. In the end, I found it slow because of a lack of memory on my machine.. sorry again. On Tue, Sep 16, 2014 at 3:26 PM, 诺铁 noty...@gmail.com wrote: I connect my sample project to a hosted CI service, it only takes 3 seconds to run there...while the same

Re: Serving data

2014-09-16 Thread Marius Soutier
Writing to Parquet and querying the result via SparkSQL works great (except for some strange SQL parser errors). However the problem remains: how do I get that data back to a dashboard? So I guess I’ll have to use a database after all. You can batch up data store into parquet partitions as

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
Thank you for pasting the steps, I will look at this, hopefully come out with a solution soon. -Original Message- From: linkpatrickliu [mailto:linkpatrick...@live.com] Sent: Tuesday, September 16, 2014 3:17 PM To: u...@spark.incubator.apache.org Subject: RE: SparkSQL 1.1 hang when DROP

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Yifan LI
Dear Ankur, Thanks! :) - From [1] and my understanding, the existing inactive feature in the GraphX Pregel API is “if there are no in-edges from an active vertex to this vertex, then we say this one is inactive”, right? For instance, there is a graph in which every vertex has at least one

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 10:55:37 +0200, Yifan LI iamyifa...@gmail.com wrote: - From [1] and my understanding, the existing inactive feature in the GraphX Pregel API is “if there are no in-edges from an active vertex to this vertex, then we say this one is inactive”, right? Well, that's true when

Re: vertex active/inactive feature in Pregel API ?

2014-09-16 Thread Ankur Dave
At 2014-09-16 12:23:10 +0200, Yifan LI iamyifa...@gmail.com wrote: but I am wondering if there is a message (none?) sent to the target vertex (when the rank change is less than the tolerance) in the dynamic PageRank implementation below, def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
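
For readers following along, the dynamic PageRank sendMessage referenced above looks roughly like the sketch below (tol is the convergence tolerance in scope; the vertex attribute is a (rank, delta) pair). When the accumulated rank change is within the tolerance, no message is sent at all (Iterator.empty), which is what allows the target vertex to become inactive:

    // sketch of GraphX's dynamic PageRank message logic
    def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = {
      if (edge.srcAttr._2 > tol) {
        // the source's rank change is still above tolerance: propagate it along the edge
        Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
      } else {
        // within tolerance: send nothing, rather than a "none" message
        Iterator.empty
      }
    }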

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Sean Owen
From the caller / application perspective, you don't care what version of Hadoop Spark is running on, on the cluster. The Spark API you compile against is the same. When you spark-submit the app, at runtime, Spark is using the Hadoop libraries from the cluster, which are the right version. So when

Re: How to set executor num on spark on yarn

2014-09-16 Thread Sean Owen
How many cores do your machines have? --executor-cores should be the number of cores each executor uses. Fewer cores means more executors in general. From your data, it sounds like, for example, there are 7 nodes with 4+ cores available to YARN, and 2 more nodes with 2-3 cores available. Hence
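
As an illustration (all values below are made up), requesting smaller executors makes it easier for YARN to place one or more of them on every node:

    # hypothetical submission: 2-core executors fit on every node in the example above,
    # whereas 4-core executors could only be placed on the larger nodes
    spark-submit --master yarn-cluster \
      --num-executors 9 \
      --executor-cores 2 \
      --executor-memory 4g \
      --class com.example.MyApp myapp.jar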

Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I have a standalone spark cluster and from within the same scala application I'm creating 2 different spark contexts to run two different spark streaming jobs, as SparkConf is different for each of them. I'm getting this error that... I don't really understand: 14/09/16 11:51:35 ERROR

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
It seems that, as I have a single scala application, the scheduler is the same and there is a collision between executors of both spark contexts. Is there a way to change how the executor ID is generated (maybe a UUID instead of a sequential number..?) 2014-09-16 13:07 GMT+01:00 Luis Ángel

Re: PySpark on Yarn - how group by data properly

2014-09-16 Thread Oleg Ruchovets
I expanded my data set and executed pyspark on yarn. I noticed that only 2 processes processed the data: 14210 yarn 20 0 2463m 2.0g 9708 R 100.0 4.3 8:22.63 python2.7 32467 yarn 20 0 2519m 2.1g 9720 R 99.3 4.4 7:16.97 python2.7 Question: how to configure

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
When I said scheduler I meant executor backend. 2014-09-16 13:26 GMT+01:00 Luis Ángel Vicente Sánchez langel.gro...@gmail.com: It seems that, as I have a single scala application, the scheduler is the same and there is a collision between executors of both spark context. Is there a way to

Re: Spark Streaming: CoarseGrainedExecutorBackend: Slave registration failed: Duplicate executor ID

2014-09-16 Thread Luis Ángel Vicente Sánchez
I dug a bit more and the executor ID is a number, so it seems there is no possible workaround. Looking at the code of CoarseGrainedSchedulerBackend.scala:

Reduce Tuple2<Integer, Integer> to Tuple2<Integer, List<Integer>>

2014-09-16 Thread Tom
From my map function I create Tuple2<Integer, Integer> pairs. Now I want to reduce them, and get something like Tuple2<Integer, List<Integer>>. The only way I found to do this was by treating all variables as String, and in the reduceByKey do return a._2 + "," + b._2 // in which both are numeric

Re: Reduce Tuple2<Integer, Integer> to Tuple2<Integer, List<Integer>>

2014-09-16 Thread Sean Owen
If you mean you have (key,value) pairs, and want pairs with key, and all values for that key, then you're looking for groupByKey On Tue, Sep 16, 2014 at 2:42 PM, Tom thubregt...@gmail.com wrote: From my map function I create Tuple2<Integer, Integer> pairs. Now I want to reduce them, and get
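
A minimal sketch of that in Scala (the original question is Java, where JavaPairRDD.groupByKey behaves the same way), assuming a SparkContext named sc:

    import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.x
    import org.apache.spark.rdd.RDD

    val pairs: RDD[(Int, Int)] = sc.parallelize(Seq((1, 10), (1, 11), (2, 20)))
    // (key, value) pairs -> (key, all values for that key); no manual string concatenation needed
    val grouped: RDD[(Int, Iterable[Int])] = pairs.groupByKey()
    // grouped now contains (1, [10, 11]) and (2, [20])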

Re: Serving data

2014-09-16 Thread Yana Kadiyska
If your dashboard is doing ajax/pull requests against say a REST API you can always create a Spark context in your rest service and use SparkSQL to query over the parquet files. The parquet files are already on disk so it seems silly to write both to parquet and to a DB...unless I'm missing
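
A minimal sketch of that approach (paths, table and column names below are made up), assuming a long-lived SparkContext named sc inside the REST service:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // point SparkSQL at the Parquet files the batch job already writes
    val metrics = sqlContext.parquetFile("hdfs:///warehouse/metrics.parquet") // hypothetical path
    metrics.registerTempTable("metrics")
    // answer the dashboard's REST call straight from the Parquet data
    val rows = sqlContext.sql("SELECT day, value FROM metrics WHERE value > 0").collect()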

java.util.NoSuchElementException: key not found

2014-09-16 Thread Brad Miller
Hi All, I suspect I am experiencing a bug. I've noticed that while running larger jobs, they occasionally die with the exception java.util.NoSuchElementException: key not found xyz, where xyz denotes the ID of some particular task. I've excerpted the log from one job that died in this way below

org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Hui Li
Hi, I am new to Spark. I just set up a small cluster and wanted to run some simple MLlib examples. By following the instructions at https://spark.apache.org/docs/0.9.0/mllib-guide.html#binary-classification-1, I could successfully run everything until the step of SVMWithSGD, where I got the error the

Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A
Hello, Suppose I want to use Spark from an application that I already submit to run in another container (e.g. Tomcat). Is this at all possible? Or do I have to split the app into two components, and submit one to Spark and one to the other container? In that case, what is the

collect on hadoopFile RDD returns wrong results

2014-09-16 Thread vasiliy
Hello. I have a hadoopFile RDD and I tried to collect items to the driver program, but it returns an array of identical records (all equal to the last record of my file). My code is like this: val rdd = sc.hadoopFile( hdfs:///data.avro,

HBase and non-existent TableInputFormat

2014-09-16 Thread Y. Dong
Hello, I’m currently using spark-core 1.1 and hbase 0.98.5 and I want to simply read from hbase. The Java code is attached. However the problem is that TableInputFormat does not even exist in the hbase-client API; is there any other way I can read from hbase? Thanks SparkConf sconf = new

Re: combineByKey throws ClassCastException

2014-09-16 Thread Tao Xiao
This problem was caused by the fact that I used a package jar with a Spark version (0.9.1) different from that of the cluster (0.9.0). When I used the correct package jar (spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar) instead the application can run as expected. 2014-09-15 14:57

Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
bq. TableInputFormat does not even exist in hbase-client API It is in hbase-server module. Take a look at http://hbase.apache.org/book.html#mapreduce.example.read On Tue, Sep 16, 2014 at 8:18 AM, Y. Dong tq00...@gmail.com wrote: Hello, I’m currently using spark-core 1.1 and hbase 0.98.5 and

Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Ted Yu
hbase-client module serves client facing APIs. hbase-server module is supposed to host classes used on server side. There is still some work to be done so that the above goal is achieved. On Tue, Sep 16, 2014 at 9:06 AM, Y. Dong tq00...@gmail.com wrote: Thanks Ted. It is indeed in

Re: scala 2.11?

2014-09-16 Thread Mohit Jaggi
Can I load that plugin in spark-shell? Or perhaps due to the 2-phase compilation quasiquotes won't work in shell? On Mon, Sep 15, 2014 at 7:15 PM, Mark Hamstra m...@clearstorydata.com wrote: Okay, that's consistent with what I was expecting. Thanks, Matei. On Mon, Sep 15, 2014 at 5:20 PM, Matei

RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Hi, I had a similar situation in which I needed to read data from HBase and work with the data inside of a spark context. After much googling, I finally got mine to work. There are a bunch of steps that you need to do to get this working - The problem is that the spark context does not know

Re: HBase and non-existent TableInputFormat

2014-09-16 Thread Nicholas Chammas
Btw, there are some examples in the Spark GitHub repo that you may find helpful. Here's one https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala related to HBase. On Tue, Sep 16, 2014 at 1:22 PM, abraham.ja...@thomsonreuters.com wrote:
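
For reference, the HBaseTest.scala example linked above boils down to roughly this (table name is illustrative; TableInputFormat lives in the hbase-server artifact, as Ted noted):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name

    // assuming a SparkContext named sc; each record is a (row key, Result) pair
    val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(hBaseRDD.count())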

Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local mode), it can probably run in a container. Otherwise you can create a SparkContext there and connect it to a cluster outside. Note that I haven't tried this though, so the security policies of the container might be too

Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for me. Let the container handle the HTTP request and then talk to Spark using another HTTP/REST interface. You can use the Spark Job Server for this. Embedding Spark inside the container is not a great long term solution IMO

RE: HBase and non-existent TableInputFormat

2014-09-16 Thread abraham.jacob
Yes that was very helpful… ☺ Here are a few more I found on my quest to get HBase working with Spark – This one gives details about HBase dependencies and spark classpaths http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html This one has a code overview –

Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote: Thank you for pasting the steps, I will look at this, hopefully come out with a solution soon. -Original Message- From: linkpatrickliu

Re: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Yin Huai
I meant it may be a Hive bug since we also call Hive's drop table internally. On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai huaiyin@gmail.com wrote: Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao hao.ch...@intel.com wrote: Thank

RDD projection and sorting

2014-09-16 Thread Sameer Tilak
Hi All, I have data in the following format: the 1st column is userid and the second column onward are class ids for various products. I want to save this in LibSVM format, and an intermediate step is to sort the class ids (in ascending order). For example: I/P uid1 12433580 2670122
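
A minimal sketch of one way to do the intermediate sorting step, assuming whitespace-separated input with the user id first and LibSVM-style "id:1" features as the target (paths and the exact output layout are assumptions):

    // assuming a SparkContext named sc and input lines like "uid1 12433580 2670122 ..."
    val lines = sc.textFile("hdfs:///input/user_classes.txt") // hypothetical path
    val libsvmStyle = lines.map { line =>
      val tokens = line.split("\\s+")
      val sortedClassIds = tokens.tail.map(_.toLong).sorted // class ids in ascending order
      tokens.head + " " + sortedClassIds.map(id => s"$id:1").mkString(" ")
    }
    libsvmStyle.saveAsTextFile("hdfs:///output/libsvm") // hypothetical path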

RE: Spark as a Library

2014-09-16 Thread Ruebenacker, Oliver A
Hello, Thanks for the response and great to hear it is possible. But how do I connect to Spark without using the submit script? I know how to start up a master and some workers and then connect to the master by packaging the app that contains the SparkContext and then submitting the

Re: NullWritable not serializable

2014-09-16 Thread Du Li
Hi, The test case is separated out as follows. The call to rdd2.first() breaks when spark version is changed to 1.1.0, reporting exception NullWritable not serializable. However, the same test passed with spark 1.0.2. The pom.xml file is attached. The test data README.md was copied from spark.

Re: Spark as a Library

2014-09-16 Thread Daniel Siegmann
You can create a new SparkContext inside your container pointed to your master. However, for your script to run you must call addJars to put the code on your workers' classpaths (except when running locally). Hopefully your webapp has some lib folder which you can point to as a source for the
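
A minimal sketch of that setup (master URL and jar path are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://spark-master:7077")   // hypothetical standalone master
      .setAppName("dashboard-webapp")
      // ship the webapp's own job code to the workers' classpaths
      .setJars(Seq("/opt/tomcat/webapps/app/WEB-INF/lib/my-jobs.jar")) // hypothetical path
    val sc = new SparkContext(conf)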

R: Spark as a Library

2014-09-16 Thread Paolo Platter
Hi, Spark Job Server by Ooyala is the right tool for the job. It exposes a REST API, so calling it from a web app is suitable. It is open source; you can find it on GitHub. Best Paolo Platter From: Ruebenacker, Oliver A mailto:oliver.ruebenac...@altisource.com Sent:

Re: Spark 1.1 / cdh4 stuck using old hadoop client?

2014-09-16 Thread Paul Wais
Hi Sean, Great catch! Yes I was including Spark as a dependency and it was making its way into my uber jar. Following the advice I just found at Stackoverflow[1], I marked Spark as a provided dependency and that appeared to fix my Hadoop client issue. Thanks for your help!!! Perhaps they
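
The change Paul describes amounts to marking the Spark dependency as provided, so it is compiled against but kept out of the uber jar, along these lines (version and artifact shown are illustrative):

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.1.0</version>
      <!-- provided: available at compile time, supplied by the cluster at runtime -->
      <scope>provided</scope>
    </dependency>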

Spark processing small files.

2014-09-16 Thread cem
Hi all, Spark is taking too much time to start the first stage with many small files in HDFS. I am reading a folder that contains RC files: sc.hadoopFile(hdfs://hostname:8020/test_data2gb/, classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]], classOf[LongWritable],

Indexed RDD

2014-09-16 Thread Akshat Aranya
Hi, I'm trying to implement a custom RDD that essentially works as a distributed hash table, i.e. the key space is split up into partitions and within a partition, an element can be looked up efficiently by the key. However, the RDD lookup() function (in PairRDDFunctions) is implemented in a way

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread st553
Does MLlib provide utility functions to do this kind of encoding?

Re: Categorical Features for K-Means Clustering

2014-09-16 Thread Sean Owen
I think it's on the table but not yet merged? https://issues.apache.org/jira/browse/SPARK-1216 On Tue, Sep 16, 2014 at 10:04 PM, st553 sthompson...@gmail.com wrote: Does MLlib provide utility functions to do this kind of encoding? -- View this message in context:

Memory under-utilization

2014-09-16 Thread francisco
Hi, I'm a Spark newbie. We had installed spark-1.0.2-bin-cdh4 on a 'super machine' with 256gb memory and 48 cores. Tried to allocate a task with 64gb memory but for whatever reason Spark is only using around 9gb max. Submitted spark job with the following command: /bin/spark-submit -class

Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
Perhaps your job does not use more than 9g. Even though the dashboard shows 64g, the process only uses what's needed and grows to 64g max. On Tue, Sep 16, 2014 at 5:40 PM, francisco ftanudj...@nextag.com wrote: Hi, I'm a Spark newbie. We had installed spark-1.0.2-bin-cdh4 on a 'super machine'

Questions about Spark speculation

2014-09-16 Thread Nicolas Mai
Hi, guys My current project is using Spark 0.9.1, and after increasing the level of parallelism and partitions in our RDDs, stages and tasks seem to complete much faster. However it also seems that our cluster becomes more unstable after some time: - stalled stages still showing under active

Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the reply. I doubt that's the case though ... the executor kept having to do a file dump because memory is full. ... 14/09/16 15:00:18 WARN ExternalAppendOnlyMap: Spilling in-memory map of 67 MB to disk (668 times so far) 14/09/16 15:00:21 WARN ExternalAppendOnlyMap: Spilling

How do I manipulate values outside of a GraphX loop?

2014-09-16 Thread crockpotveggies
Brand new to Apache Spark and I'm a little confused how to make updates to a value that sits outside of a .mapTriplets iteration in GraphX. I'm aware mapTriplets is really only for modifying values inside the graph. What about using it in conjunction with other computations? See below: def

Re: Memory under-utilization

2014-09-16 Thread Boromir Widas
I see, what does http://localhost:4040/executors/ show for memory usage? I personally find it easier to work with a standalone cluster with a single worker by using the sbin/start-master.sh and then connecting to the master. On Tue, Sep 16, 2014 at 6:04 PM, francisco ftanudj...@nextag.com wrote:

MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-16 Thread Aris
Hello Spark Community - I am using the support vector machine / SVM implementation in MLlib with the standard linear kernel; however, I noticed that the Spark documentation for StandardScaler *specifically* mentions that SVMs which use the RBF kernel work really well when you have standardized
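
For reference, standardizing features with MLlib's StandardScaler before training a (linear-kernel) SVM looks roughly like this; MLlib 1.1 does not ship an RBF kernel, so only the scaling step is shown (the file path is the example dataset bundled with Spark):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils

    // assuming a SparkContext named sc
    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    // default scaler: unit standard deviation; mean-centering is off, which keeps sparse vectors sparse
    val scaler = new StandardScaler().fit(data.map(_.features))
    val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()
    val model = SVMWithSGD.train(scaled, 100) // 100 iterations, linear kernel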

Re: org.apache.spark.SparkException: java.io.FileNotFoundException: does not exist)

2014-09-16 Thread Aris
This should be a really simple problem, but you haven't shared enough code to determine what's going on here. On Tue, Sep 16, 2014 at 8:08 AM, Hui Li littleleave...@gmail.com wrote: Hi, I am new to SPARK. I just set up a small cluster and wanted to run some simple MLLIB examples. By

Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.
Hello friends: Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN distribution. Everything went fine, and everything seems to work, but for the following. Following are two invocations of the 'pyspark' script, one with enclosing quotes around the options passed to

partitioned groupBy

2014-09-16 Thread Akshat Aranya
I have a use case where my RDD is set up like this: Partition 0: K1 -> [V1, V2] K2 -> [V2] Partition 1: K3 -> [V1] K4 -> [V3] I want to invert this RDD, but only within a partition, so that the operation does not require a shuffle. It doesn't matter if the partitions of the inverted RDD have non unique

Re: partitioned groupBy

2014-09-16 Thread Patrick Wendell
If each partition can fit in memory, you can do this using mapPartitions and then building an inverse mapping within each partition. You'd need to construct a hash map within each partition yourself. On Tue, Sep 16, 2014 at 4:27 PM, Akshat Aranya aara...@gmail.com wrote: I have a use case where
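
A minimal sketch of that inversion (toy data mirrors the layout in the question; assuming each partition's inverted map fits in memory):

    import scala.collection.mutable

    // assuming a SparkContext named sc
    val rdd = sc.parallelize(Seq(
      ("K1", Seq("V1", "V2")), ("K2", Seq("V2")),   // partition 0
      ("K3", Seq("V1")), ("K4", Seq("V3"))          // partition 1
    ), numSlices = 2)

    // invert key -> values into value -> keys inside each partition; no shuffle is triggered
    val inverted = rdd.mapPartitions { iter =>
      val m = mutable.Map.empty[String, mutable.ArrayBuffer[String]]
      for ((k, vs) <- iter; v <- vs) {
        m.getOrElseUpdate(v, mutable.ArrayBuffer.empty[String]) += k
      }
      m.iterator.map { case (v, ks) => (v, ks.toSeq) }
    }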

Re: Memory under-utilization

2014-09-16 Thread francisco
Thanks for the tip. http://localhost:4040/executors/ is showing Executors(1) Memory: 0.0 B used (294.9 MB Total) Disk: 0.0 B Used However, running as standalone cluster does resolve the problem. I can see a worker process running w/ the allocated memory. My conclusion (I may be wrong) is for

how to report documentation bug?

2014-09-16 Thread Andy Davidson
http://spark.apache.org/docs/latest/quick-start.html#standalone-applications Click on the java tab. There is a bug in the maven section: <version>1.1.0-SNAPSHOT</version> should be <version>1.1.0</version> Hope this helps Andy

RE: SparkSQL 1.1 hang when DROP or LOAD

2014-09-16 Thread Cheng, Hao
Thank you Yin Huai. This is probably true. I saw in the hive-site.xml that Liu has changed the entry, which by default should be false: <property> <name>hive.support.concurrency</name> <description>Enable Hive's Table Lock Manager Service</description> <value>true</value> </property> Someone is

Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread daijia
Is there some way to ship a text file just like shipping python libraries? Thanks in advance, Daijia

Re: how to report documentation bug?

2014-09-16 Thread Nicholas Chammas
You can send an email like you just did or open an issue in the Spark issue tracker http://issues.apache.org/jira/. This looks like a problem with how the version is generated in this file https://github.com/apache/spark/blob/branch-1.1/docs/quick-start.md. On Tue, Sep 16, 2014 at 8:55 PM, Andy

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-16 Thread Dimension Data, LLC.
Hi Sandy: Thank you. I have not tried that mechanism (I wasn't aware of it). I will try that instead. Is it possible to also represent '--driver-memory' and '--executor-memory' (and basically all properties) using the '--conf' directive? The Reason: I actually discovered the below issue

CPU RAM

2014-09-16 Thread VJ Shalish
Hi, I need to get the CPU utilisation, RAM usage, Network IO and other metrics using a Java program. Can anyone help me with this? Thanks Shalish.

Re: Unable to ship external Python libraries in PYSPARK

2014-09-16 Thread Davies Liu
Yes, sc.addFile() is what you want: | addFile(self, path) | Add a file to be downloaded with this Spark job on every node. | The C{path} passed can be either a local file, a file in HDFS | (or other Hadoop-supported filesystems), or an HTTP, HTTPS or | FTP URI. | |

Re: CPU RAM

2014-09-16 Thread Amit
Not particularly related to Spark, but you can check out the SIGAR API. It lets you get CPU, memory, network, filesystem and process-based metrics. Amit On Sep 16, 2014, at 20:14, VJ Shalish vjshal...@gmail.com wrote: Hi I need to get the CPU utilisation, RAM usage, Network IO and other

Re: CPU RAM

2014-09-16 Thread VJ Shalish
Thank you for the response, Amit. So is it that we cannot measure the CPU consumption and RAM usage of a Spark job through a Java program? On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote: Not particularly related to Spark, but you can check out the SIGAR API. It lets you get CPU,

Re: CPU RAM

2014-09-16 Thread VJ Shalish
Sorry for the confusion, team. My requirement is to measure the CPU utilisation, RAM usage, Network IO and other metrics of a Spark job using a Java program. Please help with the same. On Tue, Sep 16, 2014 at 11:23 PM, Amit kumarami...@gmail.com wrote: Not particularly related to Spark, but you can

YARN mode not available error

2014-09-16 Thread Barrington
Hi, I am running Spark in cluster mode with Hadoop YARN as the underlying cluster manager. I get this error when trying to initialize the SparkContext. Exception in thread main org.apache.spark.SparkException: YARN mode not available ? at

The difference between pyspark.rdd.PipelinedRDD and pyspark.rdd.RDD

2014-09-16 Thread edmond_huo
Hi, I am a freshman with Spark. I tried to run a job like the wordcount example in Python. But when I tried to get the top 10 popular words in the file, I got the message: AttributeError: 'PipelinedRDD' object has no attribute 'sortByKey'. So my question is, what is the difference between

permission denied on local dir

2014-09-16 Thread style95
I am running Spark on a shared YARN cluster. My user ID is 'online', but I found that when I run my Spark application, local directories are created by the 'yarn' user ID. So I am unable to delete the local directories, and finally the application failed. Please refer to my log below: 14/09/16 21:59:02 ERROR