Re: Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

2014-12-11 Thread Cheng Lian
There are several overloaded versions of both jsonFile and jsonRDD. Schema inference is somewhat expensive since it requires an extra Spark job. You can avoid it by storing the inferred schema and then using it with the following two methods: * def jsonFile(path:
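For reference, a minimal sketch of that pattern (the path below is a placeholder):
    val inferred = sqlContext.jsonFile("hdfs:///data/events.json")         // runs an extra job to infer the schema
    val schema = inferred.schema                                           // keep the inferred StructType around
    val noInfer = sqlContext.jsonFile("hdfs:///data/events.json", schema)  // reuses the schema, no inference job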

Re: Decision Tree with libsvmtools datasets

2014-12-11 Thread Sean Owen
The implementation assumes classes are 0-indexed, not 1-indexed. You should set numClasses = 3 and change your labels to 0, 1, 2. On Thu, Dec 11, 2014 at 3:40 AM, Ge, Yao (Y.) y...@ford.com wrote: I am testing decision tree using iris.scale data set
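A sketch of that fix against the MLlib decision tree API, assuming the libsvm labels come in as 1, 2, 3 (the impurity, depth and bin settings below are arbitrary placeholders):
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils
    val raw = MLUtils.loadLibSVMFile(sc, "iris.scale")                   // labels are 1.0, 2.0, 3.0 in the file
    val data = raw.map(lp => LabeledPoint(lp.label - 1, lp.features))    // shift them to 0.0, 1.0, 2.0
    val model = DecisionTree.trainClassifier(data, 3, Map[Int, Int](), "gini", 5, 32)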

Spark streaming: works with collect() but not without collect()

2014-12-11 Thread david
Hi, We use the following Spark Streaming code to collect and process Kafka events: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { process(event._1, event._2) }) }) This works fine. But without the collect() call, the following exception is

Re: Spark streaming: works with collect() but not without collect()

2014-12-11 Thread Gerard Maas
Have you tried kafkaStream.foreachRDD(rdd => { rdd.foreach(...) })? Would that make a difference? On Thu, Dec 11, 2014 at 10:24 AM, david david...@free.fr wrote: Hi, We use the following Spark Streaming code to collect and process Kafka events: kafkaStream.foreachRDD(rdd => {
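Spelled out, the suggestion is a sketch like the following; process() stands for the poster's own function, which (together with anything it references) must be serializable and initialized on the executors, since rdd.foreach runs there rather than on the driver:
    kafkaStream.foreachRDD { rdd =>
      rdd.foreach { case (key, value) =>   // executes on the executors, not the driver
        process(key, value)
      }
    }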

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Gerard Maas
If the timestamps in the logs are to be trusted, it looks like your driver is dying with that java.io.FileNotFoundException, and therefore the workers lose their connection and close down. -kr, Gerard. On Thu, Dec 11, 2014 at 7:39 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Try to add

Error on JavaSparkContext.stop()

2014-12-11 Thread Taeyun Kim
Hi, When my spark program calls JavaSparkContext.stop(), the following errors occur. 14/12/11 16:24:19 INFO Main: sc.stop { 14/12/11 16:24:20 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(cluster02,38918) not found

spark logging issue

2014-12-11 Thread Sourav Chandra
Hi, I am using Spark 1.1.0 and setting the properties below while creating the Spark context: spark.executor.logs.rolling.maxRetainedFiles = 10, spark.executor.logs.rolling.size.maxBytes = 104857600, spark.executor.logs.rolling.strategy = size. Even though I am setting it to roll over after 100 MB,
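For reference, the same settings expressed directly on the SparkConf (values taken from the post):
    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .set("spark.executor.logs.rolling.strategy", "size")
      .set("spark.executor.logs.rolling.size.maxBytes", "104857600")   // 100 MB
      .set("spark.executor.logs.rolling.maxRetainedFiles", "10")
    val sc = new SparkContext(conf)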

Can spark job have sideeffects (write files to FileSystem)

2014-12-11 Thread Paweł Szulc
Imagine a simple Spark job that will store each line of the RDD to a separate file: val lines = sc.parallelize(1 to 100).map(n => s"this is line $n") lines.foreach(line => writeToFile(line)) def writeToFile(line: String) = { def filePath = "file://..." val file = new File(new URI(path).getPath)

Re: SchemaRDD partition on specific column values?

2014-12-11 Thread nitin
Can we take this as a performance improvement task in Spark 1.2.1? I can help contribute to this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-partition-on-specific-column-values-tp20350p20623.html

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Tathagata Das
Following Gerard's thoughts, here are possible things that could be happening. 1. Is there another process in the background that is deleting files in the directory where you are trying to write? It seems like the temporary file generated by one of the tasks is getting deleted before it is renamed to

RE: Spark-SQL JDBC driver

2014-12-11 Thread Anas Mosaad
Actually, I came to the conclusion that RDDs have to be persisted in Hive in order to be accessible through Thrift. Hope I didn't end up with an incorrect conclusion. Please, someone correct me if I am wrong. On Dec 11, 2014 8:53 AM, Judy Nash judyn...@exchange.microsoft.com wrote: Looks like you

Re: Spark streaming: works with collect() but not without collect()

2014-12-11 Thread Tathagata Das
What does process do? Maybe when this process function is run in the Spark executor, it is causing some static initialization, which fails, causing this exception. Per the Oracle documentation, an ExceptionInInitializerError is thrown to indicate that an exception occurred during evaluation

Session for connections?

2014-12-11 Thread Ashic Mahtab
Hi, I was wondering if there's any way of having long running session type behaviour in spark. For example, let's say we're using Spark Streaming to listen to a stream of events. Upon receiving an event, we process it, and if certain conditions are met, we wish to send a message to rabbitmq.

Re: ERROR YarnClientClusterScheduler: Lost executor Akka client disassociated

2014-12-11 Thread Muhammad Ahsan
-- Code -- scala> import org.apache.spark.SparkContext._ import org.apache.spark.SparkContext._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.SchemaRDD

Spark streaming: missing classes when using kafka consumer classes

2014-12-11 Thread Mario Pastorelli
Hi, I'm trying to use spark-streaming with kafka but I get a strange error about classes that are missing. I would like to ask if my way of building the fat jar is correct or not. My program is val kafkaStream = KafkaUtils.createStream(ssc, zookeeperQuorum, kafkaGroupId, kafkaTopicsWithThreads)

Re: Session for connections?

2014-12-11 Thread Tathagata Das
You could create a lazily initialized singleton factory and connection pool. Whenever an executor starts running the first task that needs to push out data, it will create the connection pool as a singleton. Subsequent tasks running on the executor are going to use the connection pool. You will
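A minimal sketch of that pattern; stream stands for the DStream of events, and the "connection" below is just a printing stub standing in for a real RabbitMQ channel or pool:
    object MessageSink {
      // created once per executor JVM, the first time a task touches it;
      // replace the body with real connection/pool setup
      lazy val send: String => Unit = {
        println("initializing connection")      // stand-in for opening the connection
        msg => println(s"sent: $msg")           // stand-in for publishing a message
      }
    }
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        events.foreach(e => MessageSink.send(e.toString))   // all tasks on this executor reuse the same instance
      }
    }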

RE: Session for connections?

2014-12-11 Thread Ashic Mahtab
That makes sense. I'll try that. Thanks :) From: tathagata.das1...@gmail.com Date: Thu, 11 Dec 2014 04:53:01 -0800 Subject: Re: Session for connections? To: as...@live.com CC: user@spark.apache.org You could create a lazily initialized singleton factory and connection pool. Whenever an

Re: KafkaUtils explicit acks

2014-12-11 Thread Tathagata Das
I am updating the docs right now. Here is a staged copy that you can have a sneak peek of. This will be part of Spark 1.2. http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html The updated fault-tolerance section tries to simplify the explanation of when and what data

Spark SQL Vs CQL performance on Cassandra

2014-12-11 Thread Ajay
Hi, To test Spark SQL vs CQL performance on Cassandra, I did the following: 1) Cassandra standalone server (1 server in a cluster) 2) Spark Master and 1 Worker. Both running on a Thinkpad laptop with 4 cores and 8GB RAM. 3) Wrote Spark SQL code using the Cassandra-Spark driver from Cassandra

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-11 Thread Daniel Darabos
Yes, this is perfectly legal. This is what RDD.foreach() is for! You may be encountering an IO exception while writing, and maybe using() suppresses it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd expect there is less that can go wrong with that simple call. On Thu, Dec
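A sketch of that suggestion, reusing the lines RDD from the question; note the files land on the local filesystem of whichever executor runs each task, and the directory and naming scheme here are placeholders:
    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}
    lines.foreach { line =>
      val target = Paths.get("/tmp/lines", s"line-${line.hashCode}.txt")
      Files.createDirectories(target.getParent)                     // make sure the directory exists
      Files.write(target, line.getBytes(StandardCharsets.UTF_8))    // surfaces IO errors instead of hiding them
    }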

Standalone app: IOException due to broadcast.destroy()

2014-12-11 Thread Alberto Garcia
Hello. I'm pretty new to Spark. I am developing a Spark application and testing it locally prior to deploying it on a cluster. I have a problem with a broadcast variable. The application raises Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task

Re: Spark streaming: missing classes when using kafka consumer classes

2014-12-11 Thread Mario Pastorelli
It works this way, but it's not portable, and the idea of having a fat jar is to avoid exactly this. Is there any way to create a self-contained, portable fat jar? On 11.12.2014 13:57, Akhil Das wrote: Add these jars while creating the Context. val sc = new SparkContext(conf)

Re: Spark streaming: missing classes when using kafka consumer classes

2014-12-11 Thread Akhil Das
Yes. You can use sbt assembly and create a big fat jar with all dependencies bundled inside it. Thanks Best Regards On Thu, Dec 11, 2014 at 7:10 PM, Mario Pastorelli mario.pastore...@teralytics.ch wrote: It works this way, but it's not portable, and the idea of having a fat jar is to

Re: Spark streaming: missing classes when using kafka consumer classes

2014-12-11 Thread Mario Pastorelli
Thanks Akhil for the answer. I am using sbt assembly and the build.sbt is in the first email. Do you know why those classes are included in that way? Thanks, Mario On 11.12.2014 14:51, Akhil Das wrote: Yes. You can use sbt assembly and create a big fat jar with all dependencies

Re: Session for connections?

2014-12-11 Thread Tathagata Das
Also, this is covered in the streaming programming guide in bits and pieces. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote: That makes sense. I'll try that. Thanks :)

Re: Session for connections?

2014-12-11 Thread Gerard Maas
I'm doing the same thing for Cassandra. For Cassandra, use the Spark-Cassandra connector [1], which does the session management described by TD for you. [1] https://github.com/datastax/spark-cassandra-connector -kr, Gerard. On Thu, Dec 11, 2014 at 1:55 PM, Ashic Mahtab

Re: Trouble with cache() and parquet

2014-12-11 Thread Yana Kadiyska
I see -- they are the same in design but the difference comes from partitioned Hive tables: when the RDD is generated by querying an external Hive metastore, the partition is appended as part of the row, and shows up as part of the schema. Can you shed some light on why this is a problem:

Re: Locking for shared RDDs

2014-12-11 Thread Tathagata Das
Aditya, I think you have the mental model of Spark Streaming a little off the mark. Unlike traditional streaming systems, where any kind of state is mutable, Spark Streaming is designed on Spark's immutable RDDs. Streaming data is received and divided into immutable blocks, which then form immutable RDDs,

Re: Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-11 Thread Tathagata Das
First of all, how long do you want to keep doing this? The data is going to increase infinitely and without any bounds, so it's going to get too big for any cluster to handle. If all that is within bounds, then try the following. - Maintain a global variable holding the current RDD storing all the log

Re: Specifying number of executors in Mesos

2014-12-11 Thread Tathagata Das
Not that I am aware of. Spark will try to spread the tasks evenly across executors; it's not aware of the workers at all. So if the executor-to-worker allocation is uneven, I am not sure what can be done. Maybe others have some ideas. On Tue, Dec 9, 2014 at 6:20 AM, Gerard Maas

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Flávio Santos
Hello guys, Thank you for your prompt reply. I followed Akhil's suggestion with no success. Then, I tried again replacing S3 with HDFS and the job seems to work properly. TD, I'm not using speculative execution. I think I've just realized what is happening. Due to S3's eventual consistency, these

Re: Spark-SQL JDBC driver

2014-12-11 Thread Denny Lee
Yes, that is correct. A quick reference on this is the post https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1 with the pertinent section being: It is important to note that when you create Spark tables (for

Re: parquet file not loading (spark v 1.1.0)

2014-12-11 Thread Muhammad Ahsan
Hi, it worked for me like this: just define the case class outside of any class to write to Parquet format successfully. I am using Spark version 1.1.1. case class person(id: Int, name: String, fathername: String, officeid: Int) object Program { def main (args: Array[String]) { val
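A compact sketch of that layout against Spark 1.1 (names and the output path are placeholders):
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    // defined at the top level, not nested inside another class or method
    case class Person(id: Int, name: String, fatherName: String, officeId: Int)
    object Program {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("parquet-write"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD                    // implicit RDD[Person] -> SchemaRDD
        val people = sc.parallelize(Seq(Person(1, "a", "b", 10)))
        people.saveAsParquetFile("/tmp/people.parquet")
      }
    }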

Re: Error outputing to CSV file

2014-12-11 Thread Muhammad Ahsan
Hi, saveAsTextFile is a member of RDD, whereas fields.map(_.mkString("|")).mkString("\n") is a String. You have to transform it into an RDD using something like sc.parallelize(...) before calling saveAsTextFile. Thanks
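A minimal sketch of that transformation (the sample rows and output path are placeholders):
    val fields: Seq[Seq[String]] = Seq(Seq("a", "1"), Seq("b", "2"))   // whatever the rows actually are
    val lines = fields.map(_.mkString("|"))                            // one pipe-delimited line per row
    sc.parallelize(lines).saveAsTextFile("/tmp/csv-out")               // saveAsTextFile is defined on RDDs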

Newbie Question

2014-12-11 Thread Fernando O.
Hi guys, I'm planning to use Spark on a project and I'm facing a problem: I couldn't find a log that explains what's wrong with what I'm doing. I have 2 VMs that run a small Hadoop (2.6.0) cluster. I added a file that has 50 lines of JSON data. I compiled Spark, all tests passed, I ran some

Exception using amazonaws library

2014-12-11 Thread Albert Manyà
Hi, I've made a simple script in Scala that, after doing a Spark SQL query, sends the result to AWS's CloudWatch. I've tested both parts individually (the Spark SQL one and the CloudWatch one) and they worked fine. The trouble comes when I execute the script through spark-submit, which gives me

Different Vertex Ids in Graph and Edges

2014-12-11 Thread th0rsten
Hello all, I'm using GraphX (1.1.0) to process RDF data. I want to build a graph out of the data from the Berlin Benchmark ( BSBM http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/ ). The steps that I'm doing to load the data into a graph are: 1. Split the RDF triples

broadcast: OutOfMemoryError

2014-12-11 Thread ll
hi. i'm running into this OutOfMemory issue when i'm broadcasting a large array. what is the best way to handle this? should i split the array into smaller arrays before broadcasting, and then combine them locally at each node? thanks!

Re: RDD.aggregate?

2014-12-11 Thread ll
any explanation of how aggregate works would be much appreciated. i already looked at the spark example and am still confused about the seqOp and combOp... thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-tp20434p20634.html

Re: what is the best way to implement mini batches?

2014-12-11 Thread ll
any advice/comment on this would be much appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-best-way-to-implement-mini-batches-tp20264p20635.html

why is spark + scala code so slow, compared to python?

2014-12-11 Thread ll
hi.. i'm converting some of my machine learning python code into scala + spark. i haven't been able to run it on a large dataset yet, but on small datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code is much slower than my python code (5 to 10 times slower than python). i

custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
Hi, I'm trying to set a custom Spark app name when running a Java Spark app in yarn-cluster mode. SparkConf sparkConf = new SparkConf(); sparkConf.setMaster(System.getProperty("spark.master")); sparkConf.setAppName("myCustomName"); sparkConf.set("spark.logConf", "true"); JavaSparkContext sc =

Re: custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini tomer@gmail.com wrote: Hi, I'm trying to set a custom Spark app name when running a Java Spark app in yarn-cluster mode. SparkConf sparkConf = new SparkConf(); sparkConf.setMaster(System.getProperty("spark.master"));

Re: RDD.aggregate?

2014-12-11 Thread Gerard Maas
There's some explanation and an example here: http://stackoverflow.com/questions/26611471/spark-data-processing-with-grouping/26612246#26612246 -kr, Gerard. On Thu, Dec 11, 2014 at 7:15 PM, ll duy.huynh@gmail.com wrote: any explanation of how aggregate works would be much appreciated. i
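As a quick illustration of the two operators, here is a small average computation: seqOp folds elements into a per-partition accumulator, and combOp merges the accumulators:
    val nums = sc.parallelize(1 to 10)
    val (sum, count) = nums.aggregate((0, 0))(
      (acc, n) => (acc._1 + n, acc._2 + 1),    // seqOp: runs inside each partition
      (a, b) => (a._1 + b._1, a._2 + b._2)     // combOp: merges the per-partition results
    )
    println(sum.toDouble / count)              // 5.5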

Re: Spark streaming: missing classes when using kafka consumer classes

2014-12-11 Thread Flávio Santos
Hi Mario, Try to include this in your libraryDependencies (in your sbt file): "org.apache.kafka" % "kafka_2.10" % "0.8.0" exclude("javax.jms", "jms") exclude("com.sun.jdmk", "jmxtools") exclude("com.sun.jmx", "jmxri") exclude("org.slf4j", "slf4j-simple") Regards, --Flávio R. Santos Chaordic |
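For reference, a build.sbt sketch along those lines; the versions and the "provided" scoping are assumptions, the idea being that Spark itself is supplied by the cluster while the Kafka bits get bundled into the assembly:
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.1.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.1.1" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.1.1",
      "org.apache.kafka" %  "kafka_2.10"            % "0.8.0"
        exclude("javax.jms", "jms")
        exclude("com.sun.jdmk", "jmxtools")
        exclude("com.sun.jmx", "jmxri")
        exclude("org.slf4j", "slf4j-simple")
    )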

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Natu Lauchande
Are you using Scala in a distributed environment or in standalone mode? Natu On Thu, Dec 11, 2014 at 8:23 PM, ll duy.huynh@gmail.com wrote: hi.. i'm converting some of my machine learning python code into scala + spark. i haven't been able to run it on a large dataset yet, but on small

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
both. first, the distributed version is so much slower than python. i tried a few things like broadcasting variables, replacing Seq with Array, and a few other little things. they help to improve the performance, but it's still slower than the python code. so, i wrote a local version that's pretty

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
just to give some reference point: with the same algorithm running on the mnist dataset, 1. python implementation: ~10 milliseconds per iteration (can be faster if i switch to gpu) 2. local version (scala + breeze): ~2 seconds per iteration 3. distributed version (spark + scala + breeze): 15

Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then call sliding() on the iterator in each one to get a sliding window. One problem is that you will not be able to slide forward into the next partition at partition boundaries. If this matters to you, you need to do something more

Proper way to check SparkContext's status within code

2014-12-11 Thread Edwin
Hi, Is there a way to check the status of the SparkContext, i.e. whether it's alive or not, through code rather than through the UI or anything else? Thanks Edwin

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Sean Owen
In general, you would not expect a distributed computation framework to perform nearly as fast as a non-distributed one, when both are run on one machine. Spark has so much more overhead that doesn't go away just because it's on one machine. Of course, that's the very reason it scales past one

Re: what is the best way to implement mini batches?

2014-12-11 Thread Duy Huynh
the dataset i'm working on has about 100,000 records. the batch that we're training on has a size of around 10. can you repartition(10000) into 10,000 partitions? On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can just do mapPartitions on the whole RDD, and

Re: Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-11 Thread Rakesh Nair
TD, While looking at the API reference (version 1.1.0) for SchemaRDD I did find these two methods: def insertInto(tableName: String): Unit and def insertInto(tableName: String, overwrite: Boolean): Unit. Wouldn't these be a nicer way of appending RDDs to a table, or are these not recommended as of now?
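A sketch of what that could look like, assuming a table named all_logs is already known to the (Hive-backed) SQLContext and newLines is an RDD[String] of JSON records:
    val newBatch = sqlContext.jsonRDD(newLines)            // or any other SchemaRDD of new rows
    newBatch.insertInto("all_logs")                        // append to the table
    // newBatch.insertInto("all_logs", overwrite = true)   // replace instead of append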

Re: Does filter on an RDD scan every data item ?

2014-12-11 Thread dsiegel
Also, you may want to use .lookup() instead of .filter() def lookup(key: K): Seq[V] Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. You might want to partition your first
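A small sketch of the partitioner-plus-lookup idea (key space and partition count are placeholders):
    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._
    val pairs = sc.parallelize(1 to 1000000).map(n => (n % 1000, n))
    val byKey = pairs.partitionBy(new HashPartitioner(100)).cache()   // gives the RDD a known partitioner
    val hits = byKey.lookup(42)                                       // scans only the partition key 42 maps to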

Re: Error: Spark-streaming to Cassandra

2014-12-11 Thread Tathagata Das
These seem to be compilation errors. The second one seems to be that you are using CassandraJavaUtil.javaFunctions wrong. Look at the documentation and set the parameter list correctly. TD On Mon, Dec 8, 2014 at 9:47 AM, m.sar...@accenture.com wrote: Hi, I am intending to save the streaming

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Tathagata Das
Aah yes, that makes sense. You could write first to HDFS, and when that works, copy from HDFS to S3. That should work, as it won't depend on the temporary files being in S3. I am not sure how much you can customize just for S3 in Spark code. In Spark, since we just use the Hadoop API to write, there isn't

Re: broadcast: OutOfMemoryError

2014-12-11 Thread Sameer Farooqui
Is the OOM happening in the Driver JVM or one of the Executor JVMs? What memory size is each JVM? How large is the data you're trying to broadcast? If it's large enough, you may want to consider just persisting the data to distributed storage (like HDFS) and reading it in through the normal read RDD

Re: RDDs being cleaned too fast

2014-12-11 Thread Ranga
I was having similar issues with my persistent RDDs. After some digging around, I noticed that the partitions were not balanced evenly across the available nodes. After a repartition, the RDD was spread evenly across all available memory. Not sure if that is something that would help your use-case

Native library error when trying to use Spark with Snappy files

2014-12-11 Thread Rich Haase
I am running a Hadoop cluster with Spark on YARN. The cluster is running the CDH 5.2 distribution. When I try to run Spark jobs against Snappy-compressed files I receive the following error: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z

Running spark-submit from a remote machine using a YARN application

2014-12-11 Thread ryaminal
We are trying to submit a Spark application from a Tomcat application running our business logic. The Tomcat app lives in a separate non-Hadoop cluster. We first were doing this by using the spark-yarn package to directly call Client#runApp(), but found that the API we were using in Spark is being

Job status from Python

2014-12-11 Thread Michael Nazario
In PySpark, is there a way to get the status of a job which is currently running? My use case is that I have a long-running job, and users may not know whether or not the job is still running. It would be nice to have an idea of whether or not the job is progressing, even if it isn't very

Re: Specifying number of executors in Mesos

2014-12-11 Thread Andrew Ash
Gerard, Are you familiar with spark.deploy.spreadOut http://spark.apache.org/docs/latest/spark-standalone.html in Standalone mode? It sounds like you want the same thing in Mesos mode. On Thu, Dec 11, 2014 at 6:48 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Not that I am aware of.

Spark Streaming in Production

2014-12-11 Thread twizansk
Hi, I'm looking for resources and examples for the deployment of Spark Streaming in production. Specifically, I would like to know how high availability and fault tolerance of receivers are typically achieved. The workers are managed by the Spark framework and are therefore fault tolerant

Error on JavaSparkContext.stop()

2014-12-11 Thread Taeyun Kim
(Sorry if this mail is a duplicate, but it seems that my previous mail could not reach the mailing list.) Hi, When my Spark program calls JavaSparkContext.stop(), the following errors occur. 14/12/11 16:24:19 INFO Main: sc.stop { 14/12/11 16:24:20 ERROR

Spark Server - How to implement

2014-12-11 Thread Manoj Samel
Hi, If Spark-based services are to be exposed as a continuously available server, what are the options? * The API exposed to clients will be proprietary and fine-grained (RPC style), not a job-level API * The client API need not be SQL, so the Thrift JDBC server does not seem to be an option ..

Re: Spark Server - How to implement

2014-12-11 Thread Marcelo Vanzin
Oops, sorry, fat fingers. We've been playing with something like that inside Hive: https://github.com/apache/hive/tree/spark/spark-client That seems to have at least a few of the characteristics you're looking for; but it's a very young project, and at this moment we're not developing it as a

Re: Spark Server - How to implement

2014-12-11 Thread Marcelo Vanzin
Hi Manoj, I'm not aware of any public projects that do something like that, except for the Ooyala server which you say doesn't cover your needs. We've been playing with something like that inside Hive, though: On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi,

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Andy Wagner
This is showing a factor of 200 between python and scala and 1400 when distributed. Is this really accurate? If not, what is the real performance difference expected on average between the 3 cases? On Thu, Dec 11, 2014 at 11:33 AM, Duy Huynh duy.huynh@gmail.com wrote: just to give some

Re: KryoRegistrator exception and Kryo class not found while compiling

2014-12-11 Thread bonnahu
Is the class com.dataken.spark.examples.MyRegistrator public? If not, change it to public and give it a try. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/KryoRegistrator-exception-and-Kryo-class-not-found-while-compiling-tp10396p20646.html

Re: KryoSerializer exception in Spark Streaming JAVA

2014-12-11 Thread bonnahu
class MyRegistrator implements KryoRegistrator { public void registerClasses(Kryo kryo) { kryo.register(ImpressionFactsValue.class); } } Change this class to public and give it a try.

Re: what is the best way to implement mini batches?

2014-12-11 Thread Imran Rashid
Minor correction: I think you want iterator.grouped(10) for non-overlapping mini-batches. On Dec 11, 2014 1:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You can just do mapPartitions on the whole RDD, and then call sliding() on the iterator in each one to get a sliding window. One
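Put together, the non-overlapping variant looks roughly like this (batch size 10, sample data made up):
    val data = sc.parallelize(1 to 100000)
    val miniBatches = data.mapPartitions(_.grouped(10))   // RDD[Seq[Int]]; batches never cross partition boundaries
    miniBatches.take(3).foreach(println)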

Re: Spark Streaming in Production

2014-12-11 Thread Tathagata Das
Spark Streaming takes care of restarting receivers if they fail. Regarding the fault-tolerance properties and deployment options, we made some improvements in the upcoming Spark 1.2. Here is a staged version of the Spark Streaming programming guide that you can read for the up-to-date explanation

Re: what is the best way to implement mini batches?

2014-12-11 Thread Ilya Ganelin
Hi all. I've been working on a similar problem. One solution that is straightforward (if suboptimal) is to do the following: A.zipWithIndex().filter(x => x._2 >= range_start && x._2 < range_end). Lastly, just put that in a for loop. I've found that this approach scales very well. As Matei said, another
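With the comparison operators restored, that approach looks roughly like the following; rdd and the batch size are placeholders:
    val indexed = rdd.zipWithIndex().cache()              // index once, reuse for every batch
    val batchSize = 10000L
    val numBatches = (indexed.count() + batchSize - 1) / batchSize
    for (b <- 0L until numBatches) {
      val start = b * batchSize
      val end = start + batchSize
      val batch = indexed.filter { case (_, i) => i >= start && i < end }.map(_._1)
      // ... train on `batch` here ...
    }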

Error on JavaSparkContext.stop()

2014-12-11 Thread 김태연
(Sorry if this mail is a duplicate, but it seems that my previous mail could not reach the mailing list.) Hi, When my spark program calls JavaSparkContext.stop(), the following errors occur. 14/12/11 16:24:19 INFO Main: sc.stop { 14/12/11 16:24:20 ERROR

Re: monitoring for spark standalone

2014-12-11 Thread Otis Gospodnetic
Hi Judy, SPM monitors Spark. Here are some screenshots: http://blog.sematext.com/2014/10/07/apache-spark-monitoring/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Mon, Dec 8, 2014 at 2:35 AM, Judy Nash

GroupBy and nested Top on

2014-12-11 Thread sparkuser2014
I'm currently new to pyspark, thank you for your patience in advance - my current problem is the following: I have an RDD composed of the fields A, B, and count: result1 = rdd.map(lambda x: ((A, B), 1)).reduceByKey(lambda a, b: a + b) Then I wanted to group the results based on 'A', so I did

Using Spark at the U.S. Treasury

2014-12-11 Thread Max Funk
Kindly take a moment to look over this proposal to bring Spark into the U.S. Treasury: http://www.systemaccounting.org/sparking_the_data_driven_republic

Adding a column to a SchemaRDD

2014-12-11 Thread Nathan Kronenfeld
Hi there. I'm trying to understand how to augment data in a SchemaRDD. I can see how to do it if I can express the added values in SQL - just run SELECT *, valueCalculation AS newColumnName FROM table. I've been searching all over for how to do this if my added value is a Scala function, with no
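One possibility (not from this thread) is to register the Scala function as a SQL UDF via registerFunction, which exists in Spark 1.1, and keep using the SELECT form; a sketch, assuming a registered table people with an Int column age:
    sqlContext.registerFunction("valueCalculation", (age: Int) => age * 2)   // any Scala function works here
    val augmented = sqlContext.sql(
      "SELECT *, valueCalculation(age) AS newColumnName FROM people")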

Mllib Error

2014-12-11 Thread amin mohebbi
I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: Object Mllib is not a member of package org.apache.spark. Then, I realized that I have to add MLlib as a dependency as follows: libraryDependencies ++= Seq(

Re: Mllib Error

2014-12-11 Thread MEETHU MATHEW
Hi, try this: change spark-mllib to spark-mllib_2.10. libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" % "1.1.1", "org.apache.spark" % "spark-mllib_2.10" % "1.1.1" ) Thanks Regards, Meethu M On Friday, 12 December 2014 12:22 PM, amin mohebbi aminn_...@yahoo.com.INVALID wrote:

Re: custom spark app name in yarn-cluster mode

2014-12-11 Thread Sandy Ryza
Hi Tomer, In yarn-cluster mode, the application has already been submitted to YARN by the time the SparkContext is created, so it's too late to set the app name there. I believe passing it with the --name option to spark-submit should work. -Sandy On Thu, Dec 11, 2014 at 10:28 AM, Tomer
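For example, something along these lines (the class and jar names are placeholders):
    spark-submit --master yarn-cluster --name myCustomName --class com.example.MyApp my-app.jar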

Re: Remote jar file

2014-12-11 Thread rahulkumar-aws
Put the jar file in HDFS; the URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes. - Software Developer SigmoidAnalytics, Bangalore