RE: RDDs

2014-09-04 Thread Liu, Raymond
Actually, a replicated RDD and a parallel job on the same RDD are two unrelated concepts. A replicated RDD just stores its data on multiple nodes; it helps with HA and provides a better chance of data locality. It is still one RDD, not two separate RDDs. As for running two jobs on

Fwd: RDDs

2014-09-04 Thread rapelly kartheek
-- Forwarded message -- From: rapelly kartheek kartheek.m...@gmail.com Date: Thu, Sep 4, 2014 at 11:49 AM Subject: Re: RDDs To: Liu, Raymond raymond@intel.com Thank you Raymond. I am more clear now. So, if an rdd is replicated over multiple nodes (i.e. say two sets of nodes

Re: memory size for caching RDD

2014-09-04 Thread 牛兆捷
But is it possible to make it resizable? When we don't have many RDDs to cache, we can give some memory to others. 2014-09-04 13:45 GMT+08:00 Patrick Wendell pwend...@gmail.com: Changing this is not supported, it is immutable similar to other spark configuration settings. On Wed, Sep 3, 2014

Re: RDDs

2014-09-04 Thread Tathagata Das
Yes Raymond is right. You can always run two jobs on the same cached RDD, and they can run in parallel (assuming you launch the 2 jobs from two different threads). However, with one copy of each RDD partition, the tasks of two jobs will experience some slot contentions. So if you replicate it, you
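
A minimal sketch of what this describes, assuming a Spark 1.x spark-shell where sc is already defined; the RDD name and sizes are only illustrative:

    import org.apache.spark.storage.StorageLevel

    // Keep two replicas of each partition so concurrent jobs are less likely
    // to contend for the same cached block / executor slot.
    val data = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_2)

    // Two independent jobs on the same cached RDD, launched from two threads;
    // the scheduler can run their tasks in parallel if slots are free.
    val t1 = new Thread { override def run(): Unit = println(data.reduce(_ + _)) }
    val t2 = new Thread { override def run(): Unit = println(data.count()) }
    t1.start(); t2.start()
    t1.join(); t2.join()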

Re: memory size for caching RDD

2014-09-04 Thread 牛兆捷
Thanks raymond. I duplicated the question. Please see the reply here. 2014-09-04 14:27 GMT+08:00 牛兆捷 nzjem...@gmail.com: But is it possible to make it resizable? When we don't have many RDDs to cache, we can give some memory to others. 2014-09-04 13:45 GMT+08:00 Patrick Wendell

RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
You don't need to. The memory is not statically allocated to the RDD cache; it is just an upper limit. If the RDD cache doesn't use up the memory, it is always available for other usage, except for the parts also controlled by other memoryFraction settings, e.g. spark.shuffle.memoryFraction, for which you also set the up
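
For reference, a hedged sketch of the settings being discussed (Spark 1.x property names; the values are only illustrative, and both are caps rather than static reservations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.storage.memoryFraction", "0.6")  // cap on memory used for cached RDD blocks
      .set("spark.shuffle.memoryFraction", "0.2")  // cap on memory used for shuffle aggregation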

Programatically running of the Spark Jobs.

2014-09-04 Thread Vicky Kak
I have been able to submit the spark jobs using the submit script but I would like to do it via code. I am unable to find anything matching my need. I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe I have to write some utility that passes the parameters required for

Re: Programatically running of the Spark Jobs.

2014-09-04 Thread Matt Chu
https://github.com/spark-jobserver/spark-jobserver Ooyala's Spark jobserver is the current de facto standard, IIUC. I just added it to our prototype stack, and will begin trying it out soon. Note that you can only do standalone or Mesos; YARN isn't quite there yet. (The repo just moved from

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-04 Thread DB Tsai
For saving memory, I recommend you compress the cached RDD; it will be a couple of times smaller than the original data set. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Sep 3,
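
One way to follow this advice in Spark 1.x, sketched with a hypothetical trainingData RDD; spark.rdd.compress must be set before the SparkContext is created:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    // Compress serialized cached blocks (takes effect for *_SER storage levels).
    val conf = new SparkConf().set("spark.rdd.compress", "true")

    // trainingData is an assumed RDD; store its partitions as serialized
    // (and therefore compressible) byte arrays instead of Java objects.
    val cached = trainingData.persist(StorageLevel.MEMORY_ONLY_SER)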

error: type mismatch while assigning RDD to RDD val object

2014-09-04 Thread Dhimant
I am receiving the following error in Spark-Shell while executing the following code. class LogRecrod(logLine: String) extends Serializable { val splitvals = logLine.split(","); val strIp: String = splitvals(0) val hostname: String = splitvals(1) val server_name: String = splitvals(2)

Re: memory size for caching RDD

2014-09-04 Thread 牛兆捷
Oh I see. I want to implement something like this: sometimes I need to release some memory for other usage, even when it is occupied by some RDDs (which can be recomputed with the help of lineage when they are needed). Does Spark provide interfaces to force it to release some memory? 2014-09-04

RE: memory size for caching RDD

2014-09-04 Thread Liu, Raymond
I think there is no public API available to do this. In this case, the best you can do might be to unpersist some RDDs manually. The problem is that this is done per RDD, not per block. Also, if the storage level includes disk, the data on the disk will be removed too. Best
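
The manual route looks like this; someRdd is a placeholder for whichever cached RDD you choose to evict, and, as noted above, disk copies are removed as well if the storage level included disk:

    // Remove all cached copies of this RDD (memory, and disk if the storage
    // level included it). blocking = true waits until the blocks are gone.
    someRdd.unpersist(blocking = true)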

Re: Web UI

2014-09-04 Thread Akhil Das
Hi, you can see this doc https://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security for all the available web UI ports. Yes, there are ways to get the metrics in JSON format; one of them is below: http://webUI:8080/json/ Or

Re: memory size for caching RDD

2014-09-04 Thread 牛兆捷
OK. So can I use logic similar to what the block manager does when space fills up? 2014-09-04 15:05 GMT+08:00 Liu, Raymond raymond@intel.com: I think there is no public API available to do this. In this case, the best you can do might be to unpersist some RDDs manually. The problem is that

Iterate over ArrayBuffer

2014-09-04 Thread Deep Pradhan
Hi, I have the following ArrayBuffer: ArrayBuffer(5,3,1,4). Now, I want to iterate over the ArrayBuffer. What is the way to do it? Thank You

Re: .sparkrc for Spark shell?

2014-09-04 Thread Jianshi Huang
I see. Thanks Prashant! Jianshi On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma scrapco...@gmail.com wrote: Hey, You can use spark-shell -i sparkrc to do this. Prashant Sharma On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: To make my shell experience

Re: error: type mismatch while assigning RDD to RDD val object

2014-09-04 Thread Sean Owen
I think this is a known problem with the shell and case classes. Have a look at JIRA. https://issues.apache.org/jira/browse/SPARK-1199 On Thu, Sep 4, 2014 at 7:56 AM, Dhimant dhimant84.jays...@gmail.com wrote: I am receiving following error in Spark-Shell while executing following code.

Re: Iterate over ArrayBuffer

2014-09-04 Thread Madabhattula Rajesh Kumar
Hi Deep, If your requirement is to read the values from the ArrayBuffer, use the code below: scala> import scala.collection.mutable.ArrayBuffer import scala.collection.mutable.ArrayBuffer scala> var a = ArrayBuffer(5,3,1,4) a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4) scala>
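
A small self-contained version of the iteration example discussed in this thread, in plain Scala:

    import scala.collection.mutable.ArrayBuffer

    val a = ArrayBuffer(5, 3, 1, 4)
    a.foreach(println)              // visit each element
    for (x <- a) println(x)         // equivalent for-comprehension
    val doubled = a.map(_ * 2)      // build a new collection from it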

Re: Iterate over ArrayBuffer

2014-09-04 Thread Ngoc Dao
I want to iterate over the ArrayBuffer. You should get yourself familiar with methods related to the Scala collection library: https://twitter.github.io/scala_school/collections.html Almost all of the methods take a function as their parameter. This is a very convenient feature of Scala (unlike

Multiple spark shell sessions

2014-09-04 Thread Dhimant
Hi, I am receiving the following error while connecting to the Spark server via the shell if one shell is already open. How can I open multiple sessions? Does anyone know about a workflow engine/job server like Apache Oozie for Spark? / Welcome to

Spark streaming saveAsHadoopFiles API question

2014-09-04 Thread Hemanth Yamijala
Hi, I extended the Spark streaming wordcount example to save files to Hadoop file system - just to test how that interface works. In doing so, I ran into an API problem that I hope folks here can help clarify. My goal was to see how I could save the final word counts generated in each

Spark processes not dying on killing corresponding YARN application

2014-09-04 Thread Hemanth Yamijala
Hi, I launched a spark streaming job under YARN using default configuration for Spark, using spark-submit with the master as yarn-cluster. It launched an ApplicationMaster, and 2 CoarseGrainedExecutorBackend processes. Everything ran fine, then I killed the application using yarn application

Re: RDDs

2014-09-04 Thread Kartheek.R
Thank you yuanbosoft. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-tp13343p13444.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe,

spark sql results maintain order (in python)

2014-09-04 Thread jamborta
hi, I ran into a problem with spark sql when running a query like this: select count(*), city, industry from table group by hour. When I would like to take the results from the schemaRDD: 1, I have to parse each line to get the values out of the dict (e.g. in order to convert it to a csv) 2, The order is

RE: Programatically running of the Spark Jobs.

2014-09-04 Thread Ruebenacker, Oliver A
Hello, Can this be used as a library from within another application? Thanks! Best, Oliver From: Matt Chu [mailto:m...@kabam.com] Sent: Thursday, September 04, 2014 2:46 AM To: Vicky Kak Cc: user Subject: Re: Programatically running of the Spark Jobs.

RE: Web UI

2014-09-04 Thread Ruebenacker, Oliver A
Hello, Thanks for the link – this is for standalone, though, and most URLs don’t work for local. I will look into deploying as standalone on a single node for testing and development. Best, Oliver From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Thursday, September 04,

Is cluster manager same as master?

2014-09-04 Thread Ruebenacker, Oliver A
Hello, Is cluster manager mentioned here (https://spark.apache.org/docs/latest/cluster-overview.html) the same thing as master mentioned here (https://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually)? Thanks! Best, Oliver Oliver Ruebenacker | Solutions

Re: Programatically running of the Spark Jobs.

2014-09-04 Thread Vicky Kak
I don't think so. On Thu, Sep 4, 2014 at 5:36 PM, Ruebenacker, Oliver A oliver.ruebenac...@altisource.com wrote: Hello, Can this be used as a library from within another application? Thanks! Best, Oliver *From:* Matt Chu [mailto:m...@kabam.com] *Sent:* Thursday,

Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
Hello all, I've been trying to figure out how to add data to an existing Parquet file without having a schema. Spark has allowed me to load JSON and save it as a Parquet file but I was wondering if anyone knows how to ADD/INSERT more data. I tried using sql insert and that doesn't work. All of

Object serialisation inside closures

2014-09-04 Thread Andrianasolo Fanilo
Hello Spark fellows :) I'm a new user of Spark and Scala and have been using both for 6 months without too many problems. Here I'm looking for best practices for using non-serializable classes inside closure. I'm using Spark-0.9.0-incubating here with Hadoop 2.2. Suppose I am using OpenCSV

[Spark Streaming] Tracking/solving 'block input not found'

2014-09-04 Thread Gerard Maas
Hello Sparkers, I'm currently running load tests on a Spark Streaming job. When the task duration increases beyond the batchDuration the job become unstable. In the logs I see tasks failed with the following message: Job aborted due to stage failure: Task 266.0:1 failed 4 times, most recent

Re: Object serialisation inside closures

2014-09-04 Thread Sean Owen
In your original version, the object is referenced by the function but it's on the driver, and so has to be serialized. This leads to an error since it's not serializable. Instead, you want to recreate the object locally on each of the remote machines. In your third version you are holding the
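
A sketch of the "recreate it remotely" version, assuming the pre-2.4 OpenCSV package in use at the time (au.com.bytecode.opencsv) and an assumed RDD[String] called lines:

    // The parser is constructed inside the task closure, i.e. on the executor,
    // so no non-serializable object is captured from the driver.
    val parsed = lines.map { line =>
      val parser = new au.com.bytecode.opencsv.CSVParser(',')
      parser.parseLine(line)   // returns Array[String]
    }
    // mapPartitions (see the follow-up below) avoids building one parser per record.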

Re: Object serialisation inside closures

2014-09-04 Thread Yana Kadiyska
In the third case the object does not get shipped around. Each executor will create its own instance. I got bitten by this here: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-object-access-from-mapper-simple-question-tt8125.html On Thu, Sep 4, 2014 at 9:29 AM, Andrianasolo

Re: Spark processes not dying on killing corresponding YARN application

2014-09-04 Thread didata
Thanks for asking this. I have this issue with pyspark too on YARN, 100% of the time: I quit out of pyspark and, while my Unix shell prompt returns, a 'yarn application -list' always shows (as does the UI) that the application is still running (or at least not totally dead). When I then log onto

Re: Multiple spark shell sessions

2014-09-04 Thread Yana Kadiyska
These are just warnings from the web server. Normally your application will have a UI page on port 4040. In your case, a little after the warning it should bind just fine to another port (mine picked 4041). I'm running on 0.9.1. Do you actually see the application failing? The main thing when

RE: Object serialisation inside closures

2014-09-04 Thread Andrianasolo Fanilo
Thank you for the quick answer, looks good to me. Though that brings me to another question. Suppose we want to open a connection to a database, an ElasticSearch, etc. I now have two options: 1/ use .mapPartitions and set up the connection at the start of each partition, so I get a
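
A sketch of option 1 using plain JDBC as the stand-in connection; records is an assumed RDD[Int] of ids, and the URL and query are hypothetical:

    import java.sql.DriverManager

    val results = records.mapPartitions { iter =>
      // One connection per partition, reused for every element in it.
      val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb")  // hypothetical URL
      val stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")  // hypothetical query
      val out = iter.map { id =>
        stmt.setInt(1, id)
        val rs = stmt.executeQuery()
        if (rs.next()) rs.getString(1) else ""
      }.toList        // materialize before closing, since iter.map is lazy
      stmt.close()
      conn.close()
      out.iterator
    }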

Re: Multiple spark shell sessions

2014-09-04 Thread Dhimant
Thanks Yana, I am able to execute the application and commands via another session; I also received another port for the UI application. Thanks, Dhimant -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-spark-shell-sessions-tp13441p13459.html Sent from the

re: advice on spark input development - python or scala?

2014-09-04 Thread Johnny Kelsey
Hi guys, We're testing out a spark/cassandra cluster, we're very impressed with what we've seen so far. However, I'd very much like some advice from the shiny brains on the mailing list. We have a large collection of python code that we're in the process of adapting to move into

advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Johnny Kelsey
Hi guys, We're testing out a spark/cassandra cluster, we're very impressed with what we've seen so far. However, I'd very much like some advice from the shiny brains on the mailing list. We have a large collection of python code that we're in the process of adapting to move into

Setting Java properties for Standalone on Windows 7?

2014-09-04 Thread Ruebenacker, Oliver A
Hello, I'm running Spark on Windows 7 as standalone, with everything on the same machine. No Hadoop installed. My app throws an exception and the worker reports: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. I had the same problem earlier when deploying locally. I

2 python installations cause PySpark on Yarn problem

2014-09-04 Thread Oleg Ruchovets
Hi, I am evaluating PySpark. I have HDP (Hortonworks) installed with python 2.6.6. (I can't remove it since it is used by Hortonworks). I can successfully execute PySpark on Yarn. We need to use Anaconda packages, so I installed Anaconda. Anaconda is installed with python 2.7.7 and it is

pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
Folks, I have been working on a pandas-like dataframe DSL on top of spark. It is written in Scala and can be used from spark-shell. The APIs have the look and feel of pandas which is a wildly popular piece of software data scientists use. The goal is to let people familiar with pandas scale their

efficient zipping of lots of RDDs

2014-09-04 Thread Mohit Jaggi
Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL sugar), as column based operations seem to be most common. But row operations are not uncommon. To get rows out of columns right now I zip the column RDDs

Re: advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Mohit Jaggi
Johnny, Without knowing the domain of the problem it is hard to choose a programming language. I would suggest you ask yourself the following questions: - What if your project depends on a lot of python libraries that don't have Scala/Java counterparts? It is unlikely but possible. - What if

Re: spark sql results maintain order (in python)

2014-09-04 Thread Davies Liu
On Thu, Sep 4, 2014 at 3:42 AM, jamborta jambo...@gmail.com wrote: hi, I ran into a problem with spark sql, when run a query like this select count(*), city, industry from table group by hour and I would like to take the results from the shemaRDD 1, I have to parse each line to get the

Re: EC2 - JNI crashes JVM with multi core instances

2014-09-04 Thread maxpar
Hi, What is the setup of your native library? Probably it is not thread safe? Thanks, Max -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/EC2-JNI-crashes-JVM-with-multi-core-instances-tp13463p13470.html Sent from the Apache Spark User List mailing list

Re: Object serialisation inside closures

2014-09-04 Thread Mohit Jaggi
I faced the same problem and ended up using the same approach that Sean suggested https://github.com/AyasdiOpenSource/df/blob/master/src/main/scala/com/ayasdi/df/DF.scala#L313 Option 3 also seems reasonable. It should create a CSVParser per executor. On Thu, Sep 4, 2014 at 6:58 AM, Andrianasolo

spark streaming - saving DStream into HBASE doesn't work

2014-09-04 Thread salemi
Hi, I am using the following code to write data to hbase. I see the jobs are sent off but I never get anything in my hbase database, and Spark doesn't throw any error. How can such a problem be debugged? Is the code below correct for writing data to hbase? val conf =

Re: 2 python installations cause PySpark on Yarn problem

2014-09-04 Thread Davies Liu
Hey Oleg, In pyspark, you MUST have the same version of Python in all the machines of the cluster, which means when you run `python` on these machines, all of them should be the same version ( 2.6 or 2.7). With PYSPARK_PYTHON, you can run pyspark with a specified version of Python. Also, you

SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread DanteSama
It seems that running insertInto on an SchemaRDD with a ParquetRelation creates an individual file for each item in the RDD. Sometimes, it has multiple rows in one file, and sometimes it only writes the column headers. My question is, is it possible to have it write the entire RDD as 1 file, but

Re: Starting Thriftserver via hostname on Spark 1.1 RC4?

2014-09-04 Thread Denny Lee
Ahh got it - I knew I was missing something - appreciate the clarification! :) On September 4, 2014 at 10:27:44, Cheng Lian (lian.cs@gmail.com) wrote: You may configure listening host and port in the same way as HiveServer2 of Hive, namely: via environment variables

Re: advice sought on spark/cassandra input development - scala or python?

2014-09-04 Thread Gerard Maas
Johnny, Currently, probably the easiest (and most performant way) to integrate Spark and Cassandra is using the spark-cassandra-connector [1] Given an rdd, saving it to cassandra is as easy as: rdd.saveToCassandra(keyspace, table, Seq(columns)) We tried many 'hand crafted' options to interact

Re: Is cluster manager same as master?

2014-09-04 Thread Andrew Or
Correct. For standalone mode, Master is your cluster manager. Spark also supports other cluster managers such as Yarn and Mesos. -Andrew 2014-09-04 5:52 GMT-07:00 Ruebenacker, Oliver A oliver.ruebenac...@altisource.com: Hello, Is “cluster manager” mentioned here

Re: Web UI

2014-09-04 Thread Andrew Or
Hi all, The JSON version of the web UI is not officially supported; I don't believe this is documented anywhere. The alternative is to set `spark.eventLog.enabled` to true before running your application. This will create JSON SparkListenerEvents with details about each task and stage as a log
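
The relevant settings, sketched with an illustrative log directory:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-events")  // hypothetical path; must exist and be writable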

Re: Viewing web UI after fact

2014-09-04 Thread Andrew Or
Hi Grzegorz, Sorry for the late response. Unfortunately, if the Master UI doesn't know about your applications (they are completed with respect to a different Master), then it can't regenerate the UIs even if the logs exist. You will have to use the history server for that. How did you start the

Re: pandas-like dataframe in spark

2014-09-04 Thread Matei Zaharia
Hi Mohit, This looks pretty interesting, but just a note on the implementation -- it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The reason is that SchemaRDDs already have an efficient in-memory representation (columnar storage), and can be read from a variety of data

Re: API to add/remove containers inside an application

2014-09-04 Thread Praveen Seluka
+user On Thu, Sep 4, 2014 at 10:53 PM, Praveen Seluka psel...@qubole.com wrote: Spark on Yarn has static allocation of resources. https://issues.apache.org/jira/browse/SPARK-3174 - This JIRA by Sandy is about adding and removing executors dynamically based on load. Even before doing this,

Re: pandas-like dataframe in spark

2014-09-04 Thread Mohit Jaggi
Thanks Matei. I will take a look at SchemaRDDs. On Thu, Sep 4, 2014 at 11:24 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi Mohit, This looks pretty interesting, but just a note on the implementation -- it might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The

Reduce truncates RDD in standalone, but fine when local.

2014-09-04 Thread Ruebenacker, Oliver A
Hello, In the app below, when I run it with local[1] or local[3], I get the expected result - a list of the square roots of the numbers from 1 to 20. When I try the same app as standalone with one or two workers on the same machine, it will only print 1.0. Adding print statements

Getting the type of an RDD in spark AND pyspark

2014-09-04 Thread esamanas
Hi, I'm new to spark and scala, so apologies if this is obvious. Every RDD appears to be typed, which I can see in the output in the spark-shell when I execute 'take': scala> val t = sc.parallelize(Array(1,2,3)) t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at

Re: Using Spark to add data to an existing Parquet file without a schema

2014-09-04 Thread Jim Carroll
Okay, Obviously I don't care about adding more files to the system so is there a way to point to an existing parquet file (directory) and seed the individual part-r-***.parquet (the value of partition + offset) while preventing I mean, I can hack it by copying files into the same parquet

Re: Spark Streaming with Kafka, building project with 'sbt assembly' is extremely slow

2014-09-04 Thread Aris
Thanks for answering Daniil - I have SBT version 0.13.5, is that an old version? Seems pretty up-to-date. It turns out I figured out a way around this entire problem: just use 'sbt package', and when using bin/spark-submit, pass it the --jars option and GIVE IT ALL THE JARS from the local ivy2

Re: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread DanteSama
Yep, that worked out. Does this solution have any performance implications past all the work being done on (probably) 1 node? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-Parquet-insertInto-makes-many-files-tp13480p13501.html Sent from the

RE: Reduce truncates RDD in standalone, but fine when local.

2014-09-04 Thread Ruebenacker, Oliver A
Hello, I tracked it down to the field nIters being uninitialized when passed to the reduce job while running standalone, but initialized when running local. Must be some strange interaction between Spark and scala.App. If I move the reduce job into a method and make nIters a local
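
A minimal sketch of the fix being described: put the work in an explicit main() (or keep the value in a local val) so it is initialized before the closure that captures it is serialized, rather than relying on scala.App's delayed initialization. The object and app name are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object SqrtJob {
      def main(args: Array[String]): Unit = {
        val nIters = 20   // a local val is initialized before the closure is shipped
        val sc = new SparkContext(new SparkConf().setAppName("SqrtJob"))
        sc.parallelize(1 to nIters).map(i => math.sqrt(i)).collect().foreach(println)
        sc.stop()
      }
    }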

TimeStamp selection with SparkSQL

2014-09-04 Thread Benjamin Zaitlen
I may have missed this, but is it possible to select on a datetime in a SparkSQL query? jan1 = sqlContext.sql("SELECT * FROM Stocks WHERE datetime = '2014-01-01'") Additionally, is there a guide as to what SQL is valid? The guide says, Note that Spark SQL currently uses a very basic SQL parser. It

spark RDD join Error

2014-09-04 Thread Veeranagouda Mukkanagoudar
I am planning to use the RDD join operation; to test it out I was trying to compile some test code, but am getting the following compilation error: value join is not a member of org.apache.spark.rdd.RDD[(String, Int)] [error] rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) } Code: import

Re: spark RDD join Error

2014-09-04 Thread Zhan Zhang
Try this: import org.apache.spark.SparkContext._ Thanks. Zhan Zhang On Sep 4, 2014, at 4:36 PM, Veeranagouda Mukkanagoudar veera...@gmail.com wrote: I am planning to use RDD join operation, to test out i was trying to compile some test code, but am getting following compilation error

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Steve Lewis
Assume I define a partitioner like /** * partition on the first letter */ public class PartitionByStart extends Partitioner { @Override public int numPartitions() { return 26; } @Override public int getPartition(final Object key) {

Re: spark RDD join Error

2014-09-04 Thread Veeranagouda Mukkanagoudar
Thanks a lot, that fixed the issue :) On Thu, Sep 4, 2014 at 4:51 PM, Zhan Zhang zzh...@hortonworks.com wrote: Try this: Import org.apache.spark.SparkContext._ Thanks. Zhan Zhang On Sep 4, 2014, at 4:36 PM, Veeranagouda Mukkanagoudar veera...@gmail.com wrote: I am planning to use

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
Partitioners also work in local mode, the only question is how to see which data fell into each partition, since most RDD operations hide the fact that it's partitioned. You can do rdd.glom().collect() -- the glom() operation turns an RDD of elements of type T into an RDD of Array[T], with a

Re: Mapping Hadoop Reduce to Spark

2014-09-04 Thread Matei Zaharia
BTW you can also use rdd.partitions() to get a list of Partition objects and see how many there are. On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote: Partitioners also work in local mode, the only question is how to see which data fell into each partition,
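
Putting both suggestions together in a spark-shell sketch (small data only, since glom().collect() pulls everything to the driver):

    val rdd = sc.parallelize(1 to 10, 3)
    println(rdd.partitions.length)   // number of partitions
    // glom() wraps each partition's elements in an array, so you can see
    // exactly which elements ended up in which partition.
    rdd.glom().collect().foreach(p => println(p.mkString(", ")))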

Re: spark RDD join Error

2014-09-04 Thread Chris Fregly
specifically, you're picking up the following implicit: import org.apache.spark.SparkContext.rddToPairRDDFunctions (in case you're a wildcard-phobe like me) On Thu, Sep 4, 2014 at 5:15 PM, Veeranagouda Mukkanagoudar veera...@gmail.com wrote: Thanks a lot, that fixed the issue :) On Thu,
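For completeness, the fixed version of the snippet from this thread, assuming sc is a SparkContext already in scope:

    import org.apache.spark.SparkContext._   // brings rddToPairRDDFunctions into scope

    val rddA = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val rddB = sc.parallelize(Seq(("a", 10), ("b", 20)))
    // join lives on PairRDDFunctions, reachable only through the implicit above
    // in the Spark versions discussed here.
    val summed = rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }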

Re: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread Michael Armbrust
It depends on the RDD in question exactly where the work will be done. I believe that if you do a repartition(1) instead of a coalesce it will force a shuffle so the work will be done distributed and then a single node will read that shuffled data and write it out. If you want to write to a
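
The distinction in miniature, with a hypothetical bigRdd and output path; the same idea applies to the SchemaRDD/Parquet case in this thread:

    // repartition(1) forces a shuffle: upstream stages stay distributed, and only
    // the final single task reads the shuffled data and writes one output file.
    // coalesce(1) without a shuffle would instead pull the whole pipeline onto one task.
    bigRdd.repartition(1).saveAsTextFile("hdfs:///out/single-file")  // hypothetical path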

Re: advice on spark input development - python or scala?

2014-09-04 Thread Tobias Pfeiffer
Hi, On Thu, Sep 4, 2014 at 11:49 PM, Johnny Kelsey jkkel...@semblent.com wrote: As a concrete example, we have a python class (part of a fairly large class library) which, as part of its constructor, also creates a record of itself in the cassandra key space. So we get an initialised class a

RE: SchemaRDD - Parquet - insertInto makes many files

2014-09-04 Thread Cheng, Hao
Hive can launch another job with a strategy to merge the small files; probably we can also do that in a future release. From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, September 05, 2014 8:59 AM To: DanteSama Cc: u...@spark.incubator.apache.org Subject: Re: SchemaRDD -

How spark parallelize maps Slices to tasks/executors/workers

2014-09-04 Thread Mozumder, Monir
I have this 2-node cluster setup, where each node has 4 cores. MASTER (Worker-on-master) (Worker-on-node1) (slaves(master,node1)) SPARK_WORKER_INSTANCES=1 I am trying to understand Spark's parallelize behavior. The sparkPi example has this code: val slices =
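
For context, the SparkPi code being referred to boils down to roughly the following; each slice becomes one partition and hence one task, so with 2 workers x 4 cores, 8 slices can all run concurrently:

    val slices = 8
    val n = 100000 * slices
    val count = sc.parallelize(1 to n, slices).map { _ =>
      // sample a point in the unit square and test whether it falls in the circle
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)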

Re: Programatically running of the Spark Jobs.

2014-09-04 Thread ericacm
Ahh - that probably explains an issue I am seeing. I am a brand new user and I tried running the SimpleApp class that is on the Quick Start page (http://spark.apache.org/docs/latest/quick-start.html). When I use conf.setMaster("local") then I can run the class directly from my IDE. But when I try

Serialize input path

2014-09-04 Thread jerryye
Hi, I have a quick serialization issue. I'm trying to read a date range of input files, and I'm getting a serialization issue when using an input path built by an object that generates a date range. Specifically, my code uses DateTimeFormat in the Joda time package, which is not serializable. How do I get
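
One common workaround, sketched under the assumption that the formatter ends up referenced from a closure: keep it inside an object, so each JVM (driver and every executor) constructs its own instance and nothing non-serializable is captured. paths is an assumed RDD[String] of date-stamped strings:

    import org.joda.time.format.DateTimeFormat

    object Formats {
      // Initialized lazily and locally in whichever JVM first touches it; never serialized.
      val yyyyMMdd = DateTimeFormat.forPattern("yyyy/MM/dd")
    }

    val dates = paths.map(p => Formats.yyyyMMdd.parseDateTime(p))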

Re: Programatically running of the Spark Jobs.

2014-09-04 Thread Guru Medasani
I am able to run Spark jobs and Spark Streaming jobs successfully via YARN on a CDH cluster. When you say YARN isn't quite there yet, do you mean for submitting the jobs programmatically, or just in general? On Sep 4, 2014, at 1:45 AM, Matt Chu m...@kabam.com wrote:

Re: Programatically running of the Spark Jobs.

2014-09-04 Thread Vicky Kak
I don't want to use YARN or Mesos, just trying the standalone spark cluster. We need a way to do seamless submission with the API, which I don't see. To my surprise I was hit by this issue when I tried running the submit from another machine; it is crazy that I have to submit the job from the

Re: Programatically running of the Spark Jobs.

2014-09-04 Thread Vicky Kak
I get this error when I run it from the IDE *** Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED at