Actually, a replicated RDD and running parallel jobs on the same RDD are two
unrelated concepts.
A replicated RDD just stores its data on multiple nodes; it helps with HA and provides
a better chance of data locality. It is still one RDD, not two separate RDDs.
As for running two jobs on
-- Forwarded message --
From: rapelly kartheek kartheek.m...@gmail.com
Date: Thu, Sep 4, 2014 at 11:49 AM
Subject: Re: RDDs
To: Liu, Raymond raymond@intel.com
Thank you Raymond.
I am clearer on this now. So, if an RDD is replicated over multiple nodes (i.e.
say two sets of nodes
But is it possible to make it resizable? When we don't have many RDDs to
cache, we could give some of that memory to other uses.
2014-09-04 13:45 GMT+08:00 Patrick Wendell pwend...@gmail.com:
Changing this is not supported; it is immutable, similar to other Spark
configuration settings.
On Wed, Sep 3, 2014
Yes, Raymond is right. You can always run two jobs on the same cached RDD,
and they can run in parallel (assuming you launch the two jobs from two
different threads). However, with one copy of each RDD partition, the tasks
of the two jobs will experience some slot contention. So if you replicate it,
you
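For illustration, a minimal sketch of launching two jobs on one cached RDD from separate threads (the input path and filter are hypothetical; assumes a spark-shell session where sc is the SparkContext):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val cached = sc.textFile("hdfs:///some/input").cache()   // hypothetical path

// each Future body triggers a separate Spark job on its own thread
val jobA = Future { cached.count() }
val jobB = Future { cached.filter(_.contains("ERROR")).count() }

val total  = Await.result(jobA, Duration.Inf)
val errors = Await.result(jobB, Duration.Inf)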
Thanks Raymond.
I duplicated the question. Please see the reply here.
2014-09-04 14:27 GMT+08:00 牛兆捷 nzjem...@gmail.com:
But is it possible to make it resizable? When we don't have many RDDs to
cache, we could give some of that memory to other uses.
2014-09-04 13:45 GMT+08:00 Patrick Wendell
You don’t need to. It is not statically allocated to the RDD cache; it is just an upper
limit.
If you don’t use up the memory with the RDD cache, it is always available for other
usage, except the portions also controlled by some memoryFraction conf, e.g.
spark.shuffle.memoryFraction, for which you also set the up
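As a sketch of the 1.x-era settings being discussed (values are illustrative; both are caps, not static reservations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-fraction-example")
  .set("spark.storage.memoryFraction", "0.5")   // upper limit for cached RDD storage
  .set("spark.shuffle.memoryFraction", "0.2")   // upper limit for shuffle aggregation buffers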
I have been able to submit Spark jobs using the submit script, but I
would like to do it via code.
I am unable to find anything matching my need.
I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; maybe I
have to write some utility that passes the parameters required for
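One possible sketch of that idea, assuming you call the same entry point the spark-submit script uses and pass it the usual flags (the class, master URL, and jar path below are hypothetical placeholders):

import org.apache.spark.deploy.SparkSubmit

object ProgrammaticSubmit {
  def main(args: Array[String]): Unit = {
    // same flags the spark-submit script would receive
    SparkSubmit.main(Array(
      "--class", "com.example.MyJob",
      "--master", "spark://master-host:7077",
      "/path/to/my-job-assembly.jar"
    ))
  }
}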
https://github.com/spark-jobserver/spark-jobserver
Ooyala's Spark jobserver is the current de facto standard, IIUC. I just
added it to our prototype stack, and will begin trying it out soon. Note
that you can only do standalone or Mesos; YARN isn't quite there yet.
(The repo just moved from
To save memory, I recommend compressing the cached RDD; it will
be a couple of times smaller than the original data set.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Wed, Sep 3,
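A minimal sketch of that suggestion as I read it, caching the RDD in serialized form with RDD compression turned on (the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

// enable RDD compression before the SparkContext is created, e.g.
//   conf.set("spark.rdd.compress", "true")
val logs = sc.textFile("hdfs:///logs/2014-09-04")   // hypothetical input
logs.persist(StorageLevel.MEMORY_ONLY_SER)          // serialized (and, with the flag above, compressed) cache
logs.count()                                        // materialize the cache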
I am receiving the following error in spark-shell while executing the following code.
class LogRecrod(logLine: String) extends Serializable {
  val splitvals = logLine.split(",")
  val strIp: String = splitvals(0)
  val hostname: String = splitvals(1)
  val server_name: String = splitvals(2)
Oh, I see.
I want to implement something like this: sometimes I need to release some
memory for other usage even when it is occupied by some RDDs (they can be
recomputed with the help of lineage when they are needed). Does Spark
provide interfaces to force it to release some memory?
2014-09-04
I think there is no public API available to do this. In this case, the best you
can do might be to unpersist some RDDs manually. The problem is that this is done
per RDD, not per block. Also, if the storage level includes disk,
the data on disk will be removed too.
Best
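A minimal sketch of that manual route (input path is hypothetical): unpersist drops the whole RDD's cached blocks, including any on disk.

val big = sc.textFile("hdfs:///big/dataset").cache()   // hypothetical input
big.count()        // materialize the cached blocks
// ... later, when the memory is needed for something else:
big.unpersist()    // removes the whole RDD's blocks, including any on disk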
Hi
You can see this doc
https://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security
for all the available webUI ports.
Yes, there are ways to get the data metrics in JSON format; one of them is
below:
http://webUI:8080/json/ Or
OK. So can I use similar logic to what the block manager does when space
fills up?
2014-09-04 15:05 GMT+08:00 Liu, Raymond raymond@intel.com:
I think there is no public API available to do this. In this case, the
best you can do might be unpersist some RDDs manually. The problem is that
Hi,
I have the following ArrayBuffer
*ArrayBuffer(5,3,1,4)*
Now, I want to iterate over the ArrayBuffer.
What is the way to do it?
Thank You
I see. Thanks Prashant!
Jianshi
On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma scrapco...@gmail.com
wrote:
Hey,
You can use spark-shell -i sparkrc to do this.
Prashant Sharma
On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
To make my shell experience
I think this is a known problem with the shell and case classes. Have
a look at JIRA.
https://issues.apache.org/jira/browse/SPARK-1199
On Thu, Sep 4, 2014 at 7:56 AM, Dhimant dhimant84.jays...@gmail.com wrote:
I am receiving following error in Spark-Shell while executing following code.
Hi Deep,
If your requirement is to read the values from the ArrayBuffer, use the code below:
scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer
scala> var a = ArrayBuffer(5,3,1,4)
a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4)
scala>
I want to iterate over the ArrayBuffer.
You should familiarize yourself with the methods of the Scala
collection library:
https://twitter.github.io/scala_school/collections.html
Almost all of the methods take a function as their parameter. This is
a very convenient feature of Scala (unlike
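For the ArrayBuffer from the earlier question, a couple of common ways to iterate, just as a sketch:

import scala.collection.mutable.ArrayBuffer

val a = ArrayBuffer(5, 3, 1, 4)
a.foreach(println)           // visit each element
val doubled = a.map(_ * 2)   // build a new collection from the elements
for (x <- a) println(x)      // for-comprehension form of the same traversal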
Hi,
I am receiving the following error while connecting to the Spark server via the shell
if one shell is already open.
How can I open multiple sessions?
Does anyone know about a workflow engine/job server like Apache Oozie for Spark?
Welcome to
[spark-shell welcome banner truncated]
Hi,
I extended the Spark Streaming wordcount example to save files to the Hadoop
file system - just to test how that interface works. In doing so, I ran
into an API problem that I hope folks here can help clarify.
My goal was to see how I could save the final word counts generated in each
Hi,
I launched a Spark Streaming job under YARN using the default configuration for
Spark, using spark-submit with the master as yarn-cluster. It launched an
ApplicationMaster, and 2 CoarseGrainedExecutorBackend processes.
Everything ran fine, then I killed the application using yarn application
Thank you yuanbosoft.
Hi,
I ran into a problem with Spark SQL. When I run a query like this: select
count(*), city, industry from table group by hour, I would like to take
the results from the SchemaRDD:
1. I have to parse each line to get the values out of the dict (e.g. in order
to convert it to a CSV).
2. The order is
Hello,
Can this be used as a library from within another application?
Thanks!
Best, Oliver
From: Matt Chu [mailto:m...@kabam.com]
Sent: Thursday, September 04, 2014 2:46 AM
To: Vicky Kak
Cc: user
Subject: Re: Programatically running of the Spark Jobs.
Hello,
Thanks for the link – this is for standalone, though, and most URLs don’t
work for local.
I will look into deploying as standalone on a single node for testing and
development.
Best, Oliver
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Thursday, September 04,
Hello,
Is the cluster manager mentioned here
(https://spark.apache.org/docs/latest/cluster-overview.html) the same thing
as the master mentioned here
(https://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually)?
Thanks!
Best, Oliver
Oliver Ruebenacker | Solutions
I don't think so.
On Thu, Sep 4, 2014 at 5:36 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
Hello,
Can this be used as a library from within another application?
Thanks!
Best, Oliver
*From:* Matt Chu [mailto:m...@kabam.com]
*Sent:* Thursday,
Hello all,
I've been trying to figure out how to add data to an existing Parquet file
without having a schema. Spark has allowed me to load JSON and save it as a
Parquet file but I was wondering if anyone knows how to ADD/INSERT more
data.
I tried using sql insert and that doesn't work. All of
Hello Spark fellows :)
I'm a new user of Spark and Scala and have been using both for 6 months without
too many problems.
Here I'm looking for best practices for using non-serializable classes inside
closures. I'm using Spark 0.9.0-incubating here with Hadoop 2.2.
Suppose I am using OpenCSV
Hello Sparkers,
I'm currently running load tests on a Spark Streaming job. When the task
duration increases beyond the batchDuration, the job becomes unstable. In the
logs I see tasks failed with the following message:
Job aborted due to stage failure: Task 266.0:1 failed 4 times, most recent
In your original version, the object is referenced by the function but
it's on the driver, and so has to be serialized. This leads to an
error since it's not serializable. Instead, you want to recreate the
object locally on each of the remote machines.
In your third version you are holding the
In the third case the object does not get shipped around. Each executor
will create its own instance. I got bitten by this here:
http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-object-access-from-mapper-simple-question-tt8125.html
On Thu, Sep 4, 2014 at 9:29 AM, Andrianasolo
Thanks for asking this.
I've had this issue with pyspark too on YARN, 100% of the time: I quit out
of pyspark and, while my Unix shell prompt returns, a 'yarn application
-list' always shows (as does the UI) that the application is still running (or
at least not totally dead). When I then log onto
These are just warnings from the web server. Normally your application will
have a UI page on port 4040. In your case, a little after the warning it
should bind just fine to another port (mine picked 4041). I'm running on
0.9.1. Do you actually see the application failing? The main thing when
Thank you for the quick answer; this looks good to me.
Though that brings me to another question. Suppose we want to open a connection
to a database, an ElasticSearch cluster, etc.
I now have two options:
1/ use .mapPartitions and set up the connection at the start of each partition,
so I get a
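A sketch of that first option, with a JDBC connection as the stand-in (the JDBC URL, table, and data are hypothetical): one connection is opened per partition on the executor, so it never needs to be serialized.

import java.sql.DriverManager

val ids = sc.parallelize(1 to 100)   // hypothetical keys to look up
val names = ids.mapPartitions { iter =>
  // one connection per partition, opened on the executor
  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/test")   // hypothetical URL
  val stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")   // hypothetical table
  val out = iter.map { id =>
    stmt.setInt(1, id)
    val rs = stmt.executeQuery()
    if (rs.next()) rs.getString(1) else ""
  }.toList                           // materialize before closing the connection
  conn.close()
  out.iterator
}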
Thanks Yana,
I am able to execute the application and commands via another session; I also
received another port for the UI application.
Thanks,
Dhimant
Hi guys,
We're testing out a spark/cassandra cluster, we're very impressed with
what we've seen so far. However, I'd very much like some advice from the
shiny brains on the mailing list.
We have a large collection of python code that we're in the process of
adapting to move into
Hello,
I'm running Spark on Windows 7 as standalone, with everything on the same
machine. No Hadoop installed. My app throws an exception and the worker reports:
Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
I had the same problem earlier when deploying locally. I
Hi ,
I am evaluating PySpark.
I have HDP (Hortonworks) installed with Python 2.6.6 (I can't remove it
since it is used by Hortonworks). I can successfully execute PySpark on
YARN.
We need to use Anaconda packages, so I installed Anaconda. Anaconda is
installed with Python 2.7.7 and it is
Folks,
I have been working on a pandas-like dataframe DSL on top of spark. It is
written in Scala and can be used from spark-shell. The APIs have the look
and feel of pandas, which is a wildly popular piece of software among data
scientists. The goal is to let people familiar with pandas scale their
Folks,
I sent an email announcing
https://github.com/AyasdiOpenSource/df
This dataframe is basically a map of RDDs of columns (along with DSL sugar),
as column-based operations seem to be most common. But row operations are
not uncommon. To get rows out of columns, right now I zip the column RDDs
Johnny,
Without knowing the domain of the problem it is hard to choose a
programming language. I would suggest you ask yourself the following
questions:
- What if your project depends on a lot of python libraries that don't have
Scala/Java counterparts? It is unlikely but possible.
- What if
On Thu, Sep 4, 2014 at 3:42 AM, jamborta jambo...@gmail.com wrote:
hi,
I ran into a problem with spark sql, when run a query like this select
count(*), city, industry from table group by hour and I would like to take
the results from the shemaRDD
1, I have to parse each line to get the
Hi,
What is the setup of your native library? Probably it is not thread safe?
Thanks,
Max
I faced the same problem and ended up using the same approach that Sean
suggested
https://github.com/AyasdiOpenSource/df/blob/master/src/main/scala/com/ayasdi/df/DF.scala#L313
Option 3 also seems reasonable. It should create a CSVParser per executor.
On Thu, Sep 4, 2014 at 6:58 AM, Andrianasolo
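A sketch of that per-executor idea using the OpenCSV parser the thread mentions (input path is hypothetical): the object is initialized lazily on each executor JVM, so the parser itself is never serialized from the driver.

object Csv {
  lazy val parser = new au.com.bytecode.opencsv.CSVParser(',')
}

val lines  = sc.textFile("hdfs:///data/input.csv")          // hypothetical input
val fields = lines.map(line => Csv.parser.parseLine(line))  // Array[String] per line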
Hi,
I am using the following code to write data to HBase. I see the jobs are
sent off, but I never get anything in my HBase database, and Spark doesn't throw
any error. How can such a problem be debugged? Is the code below correct for
writing data to HBase?
val conf =
Hey Oleg,
In pyspark, you MUST have the same version of Python in all the
machines of the cluster,
which means when you run `python` on these machines, all of them
should be the same
version ( 2.6 or 2.7).
With PYSPARK_PYTHON, you can run pyspark with a specified version of
Python. Also,
you
It seems that running insertInto on a SchemaRDD with a ParquetRelation
creates an individual file for each item in the RDD. Sometimes, it has
multiple rows in one file, and sometimes it only writes the column headers.
My question is, is it possible to have it write the entire RDD as 1 file,
but
Ahh got it - I knew I was missing something - appreciate the clarification! :)
On September 4, 2014 at 10:27:44, Cheng Lian (lian.cs@gmail.com) wrote:
You may configure listening host and port in the same way as HiveServer2 of
Hive, namely:
via environment variables
Johnny,
Currently, probably the easiest (and most performant) way to integrate
Spark and Cassandra is using the spark-cassandra-connector [1].
Given an RDD, saving it to Cassandra is as easy as:
rdd.saveToCassandra(keyspace, table, Seq(columns))
We tried many 'hand crafted' options to interact
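A hedged sketch of that call (keyspace, table, and columns are hypothetical; the exact column-selection type depends on the connector version, with newer versions using SomeColumns rather than a plain Seq):

import com.datastax.spark.connector._   // brings saveToCassandra into scope

val visits = sc.parallelize(Seq(("alice", 1), ("bob", 2)))
// keyspace, table and columns below are hypothetical
visits.saveToCassandra("test_ks", "visits", SomeColumns("name", "count"))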
Correct. For standalone mode, the Master is your cluster manager. Spark also
supports other cluster managers such as YARN and Mesos.
-Andrew
2014-09-04 5:52 GMT-07:00 Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com:
Hello,
Is “cluster manager” mentioned here
Hi all,
The JSON version of the web UI is not officially supported; I don't believe
this is documented anywhere.
The alternative is to set `spark.eventLog.enabled` to true before running
your application. This will create JSON SparkListenerEvents with details
about each task and stage as a log
Hi Grzegorz,
Sorry for the late response. Unfortunately, if the Master UI doesn't know
about your applications (they are completed with respect to a different
Master), then it can't regenerate the UIs even if the logs exist. You will
have to use the history server for that.
How did you start the
Hi Mohit,
This looks pretty interesting, but just a note on the implementation -- it
might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The
reason is that SchemaRDDs already have an efficient in-memory representation
(columnar storage), and can be read from a variety of data
+user
On Thu, Sep 4, 2014 at 10:53 PM, Praveen Seluka psel...@qubole.com wrote:
Spark on Yarn has static allocation of resources.
https://issues.apache.org/jira/browse/SPARK-3174 - This JIRA by Sandy is
about adding and removing executors dynamically based on load. Even before
doing this,
Thanks Matei. I will take a look at SchemaRDDs.
On Thu, Sep 4, 2014 at 11:24 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Hi Mohit,
This looks pretty interesting, but just a note on the implementation -- it
might be worthwhile to try doing this on top of Spark SQL SchemaRDDs. The
Hello,
In the app below, when I run it with local[1] or local[3], I get the
expected result - a list of the square roots of the numbers from 1 to 20.
When I try the same app as standalone with one or two workers on the same
machine, it will only print 1.0.
Adding print statements
Hi,
I'm new to spark and scala, so apologies if this is obvious.
Every RDD appears to be typed, which I can see from the output in the
spark-shell when I execute 'take':
scala> val t = sc.parallelize(Array(1,2,3))
t: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize
at
Okay,
Obviously I don't care about adding more files to the system so is there a
way to point to an existing parquet file (directory) and seed the individual
part-r-***.parquet (the value of partition + offset) while preventing
I mean, I can hack it by copying files into the same parquet
Thanks for answering Daniil -
I have SBT version 0.13.5; is that an old version? It seems pretty up to date.
It turns out I figured out a way around this entire problem: just use 'sbt
package', and when using bin/spark-submit, pass it the --jars option and
GIVE IT ALL THE JARS from the local ivy2
Yep, that worked out. Does this solution have any performance implications
past all the work being done on (probably) 1 node?
Hello,
I tracked it down to the field nIters being uninitialized when passed to the
reduce job while running standalone, but initialized when running locally. It must
be some strange interaction between Spark and scala.App. If I move the reduce
job into a method and make nIters a local
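A sketch of that workaround as described (names like SqrtApp are made up): use an explicit main method instead of extends App, so fields such as nIters are initialized before any closure captures them.

import org.apache.spark.{SparkConf, SparkContext}

object SqrtApp {
  def main(args: Array[String]): Unit = {
    val nIters = 20   // initialized before any closure captures it
    val sc = new SparkContext(new SparkConf().setAppName("SqrtApp"))
    val roots = sc.parallelize(1 to nIters).map(i => math.sqrt(i)).collect()
    roots.foreach(println)
    sc.stop()
  }
}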
I may have missed this, but is it possible to select on datetime in a
Spark SQL query?
jan1 = sqlContext.sql("SELECT * FROM Stocks WHERE datetime = '2014-01-01'")
Additionally, is there a guide as to what SQL is valid? The guide says,
Note that Spark SQL currently uses a very basic SQL parser It
I am planning to use the RDD join operation; to test it out I was trying to
compile some test code, but am getting the following compilation error:
*value join is not a member of org.apache.spark.rdd.RDD[(String, Int)]*
*[error] rddA.join(rddB).map { case (k, (a, b)) => (k, a+b) }*
Code:
import
Try this:
import org.apache.spark.SparkContext._
Thanks.
Zhan Zhang
On Sep 4, 2014, at 4:36 PM, Veeranagouda Mukkanagoudar veera...@gmail.com
wrote:
I am planning to use RDD join operation, to test out i was trying to compile
some test code, but am getting following compilation error
Assume I define a partitioner like
/**
 * partition on the first letter
 */
public class PartitionByStart extends Partitioner {
    @Override
    public int numPartitions() {
        return 26;
    }

    @Override
    public int getPartition(final Object key) {
Thanks a lot, that fixed the issue :)
On Thu, Sep 4, 2014 at 4:51 PM, Zhan Zhang zzh...@hortonworks.com wrote:
Try this:
Import org.apache.spark.SparkContext._
Thanks.
Zhan Zhang
On Sep 4, 2014, at 4:36 PM, Veeranagouda Mukkanagoudar veera...@gmail.com
wrote:
I am planning to use
Partitioners also work in local mode; the only question is how to see which
data fell into each partition, since most RDD operations hide the fact that
it's partitioned. You can do rdd.glom().collect() -- the glom() operation turns
an RDD of elements of type T into an RDD of List[T], with a
BTW you can also use rdd.partitions() to get a list of Partition objects and
see how many there are.
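A minimal sketch of both inspection tricks together (the data and partition count are made up):

val rdd = sc.parallelize(1 to 10, 3)   // 3 partitions, made-up data
// glom() groups each partition's elements into an array, so collect() shows
// exactly which elements landed in which partition
rdd.glom().collect().foreach(part => println(part.mkString(",")))
println(rdd.partitions.length)         // number of partitions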
On September 4, 2014 at 5:18:30 PM, Matei Zaharia (matei.zaha...@gmail.com)
wrote:
Partitioners also work in local mode, the only question is how to see which
data fell into each partition,
specifically, you're picking up the following implicit:
import org.apache.spark.SparkContext.rddToPairRDDFunctions
(in case you're a wildcard-phobe like me)
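Putting the pieces together, a small sketch with made-up data showing that join compiles once the pair-RDD implicit is in scope:

import org.apache.spark.SparkContext._   // provides rddToPairRDDFunctions

val rddA = sc.parallelize(Seq(("a", 1), ("b", 2)))
val rddB = sc.parallelize(Seq(("a", 10), ("b", 20)))
// with the implicit in scope, join compiles on RDD[(String, Int)]
val summed = rddA.join(rddB).map { case (k, (a, b)) => (k, a + b) }
summed.collect().foreach(println)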
On Thu, Sep 4, 2014 at 5:15 PM, Veeranagouda Mukkanagoudar
veera...@gmail.com wrote:
Thanks a lot, that fixed the issue :)
On Thu,
It depends on the RDD in question exactly where the work will be done. I
believe that if you do a repartition(1) instead of a coalesce, it will force
a shuffle, so the work will be done distributed, and then a single node will
read that shuffled data and write it out.
If you want to write to a
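A minimal sketch of that single-output approach (input and output paths are hypothetical):

val data = sc.textFile("hdfs:///input/logs")                 // hypothetical input
// repartition(1) forces a shuffle; a single task then writes one part file
data.repartition(1).saveAsTextFile("hdfs:///output/merged")  // hypothetical output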
Hi,
On Thu, Sep 4, 2014 at 11:49 PM, Johnny Kelsey jkkel...@semblent.com
wrote:
As a concrete example, we have a Python class (part of a fairly large
class library) which, as part of its constructor, also creates a record of
itself in the Cassandra keyspace. So we get an initialised class a
Hive can launch another job with a strategy to merge the small files; probably
we can also do that in a future release.
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, September 05, 2014 8:59 AM
To: DanteSama
Cc: u...@spark.incubator.apache.org
Subject: Re: SchemaRDD -
I have this 2-node cluster setup, where each node has 4 cores.
MASTER
(Worker-on-master) (Worker-on-node1)
(slaves(master,node1))
SPARK_WORKER_INSTANCES=1
I am trying to understand Spark's parallelize behavior. The sparkPi example has
this code:
val slices =
Ahh - that probably explains an issue I am seeing. I am a brand new user and
I tried running the SimpleApp class that is on the Quick Start page
(http://spark.apache.org/docs/latest/quick-start.html).
When I use conf.setMaster("local") then I can run the class directly from my
IDE. But when I try
Hi,
I have a quick serialization issue. I'm trying to read a date range of input
files, and I'm getting a serialization issue when using an input path that
has an object generate a date range. Specifically, my code uses
DateTimeFormat from the Joda-Time package, which is not serializable. How do I
get
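One common workaround, sketched here with made-up data (not necessarily what this poster ended up doing): keep the non-serializable formatter in an object, so each executor builds its own instance instead of receiving one serialized from the driver.

import org.joda.time.format.DateTimeFormat

object Formats {
  lazy val day = DateTimeFormat.forPattern("yyyy-MM-dd")
}

val dates  = sc.parallelize(Seq("2014-09-04", "2014-09-05"))   // made-up data
val millis = dates.map(s => Formats.day.parseDateTime(s).getMillis)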
I am able to run Spark jobs and Spark Streaming jobs successfully via YARN on a
CDH cluster.
When you say YARN isn’t quite there yet, do you mean for submitting jobs
programmatically, or just in general?
On Sep 4, 2014, at 1:45 AM, Matt Chu m...@kabam.com wrote:
I don't want to use YARN or Mesos; I'm just trying the standalone Spark cluster.
We need a way to do seamless submission with the API, which I don't see.
To my surprise I was hit by this issue when I tried running the submit from
another machine; it is crazy that I have to submit the job from the
I get this error when I run it from the IDE
***
Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Master removed our application: FAILED
at