I am running Spark Standalone mode with Spark 1.1
I started SparkSQL thrift server as follows:
./sbin/start-thriftserver.sh
Then I use beeline to connect to it.
Now I can CREATE, SELECT, and SHOW the databases and tables;
but when I DROP, or LOAD DATA INPATH 'kv1.txt' INTO TABLE src, the
Beeline
Hi,
I compiled Spark with the command bash -x make-distribution.sh -Pyarn -Phive
--skip-java-test --with-tachyon --tgz -Pyarn.version=2.3.0
-Phadoop.version=2.3.0, and it errors out.
How do I use it correctly?
message:
+ set -o pipefail
+ set -e
+++ dirname make-distribution.sh
++ cd .
++ pwd
Hi Jim,
This approach will not work right out of the box. You need to understand a
few things. The driver program and the master will be communicating with each
other; for that you need to open up certain ports on your public IP (read
about port forwarding: http://portforward.com/). Also on the
This is a bug; I have created an issue to track it:
https://issues.apache.org/jira/browse/SPARK-3500
Also, there is a PR to fix it: https://github.com/apache/spark/pull/2369
Before the next bugfix release, you can work around it like this:
srdd = sqlCtx.jsonRDD(rdd)
srdd2 =
Hi all
I am seeing a strange issue in Spark on YARN (stable). Let me know if it is known,
or if I am missing something, as it looks very fundamental.
I launch a Spark job with 2 containers. addContainerRequest is called twice and
then allocate is called on the AMRMClient. This gets 2 containers allocated.
Fine as of
Another aspect to keep in mind is that a JVM above 8-10 GB starts to misbehave.
It is typically better to split things up at roughly 15 GB intervals.
If you are choosing machines, 10 GB per core is an approximate ratio to maintain.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Is it possible to use DoubleRDDFunctions
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
for calculating the mean and standard deviation for paired RDDs of (key, value)?
Currently I'm using an approach with reduceByKey, but I want to make my code more
concise and readable.
I generally call values.stats, e.g.:
val stats = myPairRdd.values.stats
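For reference, stats here is an org.apache.spark.util.StatCounter, so (assuming the pair RDD's values are Doubles) you can read the aggregates off it directly; a minimal sketch:
val stats = myPairRdd.values.stats()
println(s"count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")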
On Fri, Sep 12, 2014 at 4:46 PM, rzykov rzy...@gmail.com wrote:
Is it possible to use DoubleRDDFunctions
https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html
for calculating mean
resolved:
./make-distribution.sh --name spark-hadoop-2.3.0 --tgz --with-tachyon -Pyarn
-Phadoop-2.3 -Dhadoop.version=2.3.0 -Phive -DskipTests
This code is a bit misleading
Zhanfeng Huo
From: Zhanfeng Huo
Date: 2014-09-12 14:13
To: user
Subject: spark-1.1.0 with make-distribution.sh
These functions operate on an RDD of Double, which is not what you have, so
no, this is not a way to use DoubleRDDFunctions. See earlier in the thread
for the canonical solutions.
On Sep 12, 2014 8:06 AM, rzykov rzy...@gmail.com wrote:
Tried this:
ordersRDD.join(ordersRDD).map{case((partnerid,
Oh I see, I think you're trying to do something like (in SQL):
SELECT order, mean(price) FROM orders GROUP BY order
In this case, I'm not aware of a way to use the DoubleRDDFunctions, since
you have a single RDD of pairs where each pair is of type (KeyType,
Iterable[Double]).
It seems to me
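For the per-key case, a hedged sketch (assuming an RDD[(String, Double)] of (orderId, price) pairs; the names are hypothetical) is to aggregate a StatCounter per key:
import org.apache.spark.util.StatCounter

val statsByKey = orders.aggregateByKey(new StatCounter())(
  (acc, price) => acc.merge(price),   // fold one value into the per-key accumulator
  (acc1, acc2) => acc1.merge(acc2))   // combine accumulators across partitions
val meanAndStdev = statsByKey.mapValues(s => (s.mean, s.stdev))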
Thank you very much for your help :)
Hi,
I'm using --use-existing-master to launch a previously stopped EC2 cluster
with spark-ec2. However, my configuration files are overwritten once the
cluster is set up. What's the best way of preserving existing configuration
files in spark/conf?
Alternatively, what I'm trying to do is set
Hi,
I am also facing the same problem. Has anyone found a solution yet?
It just returns a vague set of characters.
Please help.
Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Exception while deserializing and fetching task:
Hi all,
I am currently working with PySpark for NLP processing. I am using the TextBlob
Python library. Normally, in standalone mode, it is easy to install external
Python libraries. In cluster mode I am facing problems installing
these libraries on the worker nodes remotely. I cannot access each
Dear all,
I am sorry; this was a false alarm.
There was an issue in my RDD processing logic which led to a large
backlog. Once I fixed the issues in my processing logic, I could see all
messages being pulled nicely without any Block Removed errors. I need to
tune certain configurations in my
Hi there,
I’m pretty new to Spark, and so far I’ve written my jobs the same way I wrote
Scalding jobs - one-off, read data from HDFS, count words, write counts back to
HDFS.
Now I want to display these counts in a dashboard. Since Spark allows caching
RDDs in memory and you have to
Our issue could be related to this problem as described in:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-td14027.html
in which
the DStream is processed with a 1-hour batch duration.
I have implemented IO throttling in the
I agree.
Even the Low Level Kafka Consumer which I have written has tunable IO
throttling, which helped me solve this issue... But the question remains: even
if there is a large backlog, why does Spark drop the unprocessed memory blocks?
Dib
On Fri, Sep 12, 2014 at 5:47 PM, Jeoffrey Lim
Hi Reinis
Try whether the exclude suggestion from me and Sean works for you. If not, can
you turn on verbose class loading to see from where
javax.servlet.ServletRegistration is loaded? The class should load
from org.mortbay.jetty
% servlet-api % jettyVersion. If it loads from some other jar, you
Hi all
Sorry, but this was totally my mistake. In my persistence logic, I was
creating an async HTTP client instance in RDD foreach but was never closing it,
leading to memory leaks.
Apologies for wasting everyone's time.
Thanks,
Aniket
On 12 September 2014 02:20, Tathagata Das
I got this error from the executor's stderr:
[akka.tcp://sparkDriver@saturn00:49464] disassociated! Shutting down. What is
the reason for the Actor not found error?
// Yoonmin Nam
Hi Akhil,
Thanks! I guess, in short, that means the master (or slaves?) connects back to
the driver. This seems like a really odd way to work, given the driver needs
to already connect to the master on port 7077. I would have thought that if
the driver could initiate a connection to the master, that
The driver needs a consistent connection to the master in standalone mode, as a
whole bunch of client work happens on the driver. So calls like parallelize send
data from the driver out to the cluster, and collect sends data back to the driver.
If you are looking to avoid the connection, you can look into
Hi,
I do not see any pre-built binaries on the site currently. I am using
make-distribution.sh to create a binary package. After that is done, the
file it generates will allow you to execute the scripts in the bin
folder.
HTH,
Andrew
What is your system setup? Can you paste the spark-env.sh? Looks like you
have some issues with your configuration.
Thanks
Best Regards
On Fri, Sep 12, 2014 at 6:31 PM, 남윤민 rony...@dgist.ac.kr wrote:
I got this error from the executor's stderr:
Using Spark's default log4j profile:
Thanks. I will try it that way.
I would like to define the names of my output files in Spark. I have a process
which writes many files and I would like to name them; is that possible? I
guess that it's not possible with the saveAsTextFile method.
It would be something similar to MultipleOutputs in Hadoop.
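One approach is to drop down to saveAsHadoopFile with a custom MultipleTextOutputFormat that derives the file name from each record's key; a sketch under the assumption that the RDD is keyed by the desired file name:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Use the key as the output file name instead of part-00000, part-00001, ...
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  // Drop the key from the written record.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

// rdd: RDD[(String, String)] where the key is the desired file name
rdd.saveAsHadoopFile("/output/path", classOf[String], classOf[String], classOf[KeyBasedOutput])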
The Spark Streaming programming guide
https://spark.apache.org/docs/latest/streaming-programming-guide.html
specifically states how to shut down a streaming context:
The existing application is shut down gracefully (see
StreamingContext.stop(...) or JavaStreamingContext.stop(...)
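In Scala a graceful stop looks roughly like this (treat it as a sketch of the 1.x API):
// Finish processing the data already received, then stop the streaming
// context and the underlying SparkContext as well.
ssc.stop(stopSparkContext = true, stopGracefully = true)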
Dear all,
I'm facing the following problem and I can't figure out how to solve it.
I need to join 2 RDDs in order to find their intersection. The first RDD
represents an image encoded as a base64 string associated with an image id. The
second RDD represents a set of geometric primitives (rectangles)
Hi Davies,
Thanks for the quick fix. I'm sorry to send out a bug report on release day
- 1.1.0 really is a great release. I've been running the 1.1 branch for a
while and there's definitely lots of good stuff.
For the workaround, I think you may have meant:
srdd2 =
[moving to user@]
This would typically be accomplished with a union() operation. You
can't mutate an RDD in place, but you can create a new RDD with a
union(), which is an inexpensive operation.
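A minimal sketch of that pattern (variable names hypothetical):
val updated = existingRDD.union(newRDD)  // a new RDD; only lineage is recorded, no data is copied eagerly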
On Fri, Sep 12, 2014 at 5:28 AM, Archit Thakur
archit279tha...@gmail.com wrote:
Hi,
We have a use
I think this is a popular issue, but I need help figuring out a way around it if
the issue is unresolved. I have a dataset that has more than 70 columns. To have
all the columns fit into my RDD, I am experimenting with the following. (I intend
to use the InputData to parse the file and have 3 or 4 column sets
This is only a problem in the shell; it works fine in batch mode. I am
also interested in how others are solving the problem of the case class
limitation on the number of fields.
Regards
Ram
On Fri, Sep 12, 2014 at 12:12 PM, iramaraju iramar...@gmail.com wrote:
I think this is a popular issue,
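For reference, one common workaround for the Scala 2.10 limit of 22 case class fields is to build the schema programmatically with Spark SQL 1.1's applySchema; a rough sketch (column names and input path are hypothetical):
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)

// Describe the ~70 columns as a StructType instead of a case class.
val schema = StructType((1 to 70).map(i => StructField(s"col$i", StringType, nullable = true)))

// Parse each input line into a Row with the same arity as the schema.
val rowRDD = sc.textFile("data.csv").map(_.split(",")).map(fields => Row(fields: _*))

val wideTable = sqlContext.applySchema(rowRDD, schema)
wideTable.registerTempTable("wide_table")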
http://engineering.ayasdi.com/2014/09/11/df-dataframes-on-spark/
What is your Spark version? This was fixed, I suppose. Can you try it with the
latest release?
Prashant Sharma
On Fri, Sep 12, 2014 at 9:47 PM, Ramaraju Indukuri iramar...@gmail.com
wrote:
This is only a problem in shell, but works fine in batch mode though. I am
also interested in how others
So, after toying around a bit, here's what I ended up with. First off,
there's no registerTempTable function -- registerTable seems to be
enough (it's the same whether called directly on a SchemaRDD or on a
SQLContext being passed an RDD). The problem I encountered after that was
reloading a table in
On Fri, Sep 12, 2014 at 8:55 AM, Brad Miller bmill...@eecs.berkeley.edu wrote:
Hi Davies,
Thanks for the quick fix. I'm sorry to send out a bug report on release day
- 1.1.0 really is a great release. I've been running the 1.1 branch for a
while and there's definitely lots of good stuff.
There is one thing that I am confused about.
Spark's code is implemented in Scala. Now, can we run any
Scala code on the Spark framework? What will be the difference in the
execution of Scala code on normal systems versus on Spark?
The reason for my question is the following:
I had
unpersist is a method on RDDs. RDDs are abstractions introduced by Spark.
An Int is just a Scala Int. You can't call unpersist on Int in Scala, and
that doesn't change in Spark.
On Fri, Sep 12, 2014 at 12:33 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
There is one thing that I am
Hi,
Does anyone have a stable streaming app running in production? Can you
share an overview of the app and setup, like number of nodes, events
per second, broad stream-processing workflow, config highlights, etc.?
Thanks,
Tim
You can use mapPartitions to achieve this.
// Split each partition into 10 roughly equal parts, with each part tagged by an id.
import scala.collection.mutable.ArrayBuffer
val splittedRDD = self.mapPartitions { itr =>
  // Iterate over this iterator and break it into 10 parts.
  val parts = Array.fill(10)(ArrayBuffer.empty[T])
  var i = 0
  for (tuple <- itr) {
    parts(i % 10) += tuple
    i += 1
  }
  parts.iterator.zipWithIndex.map { case (part, id) => (id, part.toSeq) }
}
Hi Patrick,
What if all the data has to be kept in cache all the time? If applying union
results in a new RDD, then caching it would keep the older data as well
as the new data in memory, hence duplicating data.
Below is what I understood from your comment:
sqlContext.cacheTable(existingRDD) // caches
Similar issue (Spark 1.0.0). The streaming app runs for a few seconds
before these errors start popping up all over the driver logs:
14/09/12 17:30:23 WARN TaskSetManager: Loss was due to java.lang.Exception
java.lang.Exception: Could not compute split, block
input-4-1410542878200 not found
at
With SparkContext.addPyFile(xx.zip), the xx.zip will be copied to all
the workers
and stored in a temporary directory; the path to xx.zip will be on sys.path on the
worker machines, so you can import xx in your jobs. It does not need to be
installed on the worker machines.
PS: the package or module
You can always use sqlContext.uncacheTable to uncache the old table.
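That is, roughly (using the table names from this thread):
sqlContext.cacheTable("new_existingRDDTableName")   // cache the unioned table
sqlContext.uncacheTable("existingRDDTableName")     // then release the old cached copy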
On Fri, Sep 12, 2014 at 10:33 AM, pankaj.arora pankajarora.n...@gmail.com
wrote:
Hi Patrick,
What if all the data has to be keep in cache all time. If applying union
result in new RDD then caching this would result into
Turns out it was Spray with a bad route -- the results weren't updating
despite the table running. This thread can be ignored.
Can anyone help me with this? I have been stuck on this for a few days and
don't know what to try anymore.
Hi Praveen,
I believe you are correct. I noticed this a little while ago and had a fix
for it as part of SPARK-1714, but that's been delayed. I'll look into this
a little deeper and file a JIRA.
-Sandy
On Thu, Sep 11, 2014 at 11:44 PM, praveen seluka praveen.sel...@gmail.com
wrote:
Hi all
I just started playing with Spark. I ran the SimpleApp program from the
tutorial (https://spark.apache.org/docs/1.0.0/quick-start.html), which works
fine.
However, if I change the file location from local to HDFS, then I get an
EOFException.
I did some searching online, which suggests this error is
Ping...
On Thu, Sep 11, 2014 at 5:44 PM, Victor Tso-Guillen v...@paxata.com wrote:
So I have a bunch of hardware with different core and memory setups. Is
there a way to do one of the following:
1. Express a ratio of cores to memory to retain. The spark worker config
would represent all of
Andrew,
This email was pretty helpful. I feel like this stuff should be summarized
in the docs somewhere, or perhaps in a blog post.
Do you know if it is?
Nick
On Thu, Jun 5, 2014 at 6:36 PM, Andrew Ash and...@andrewash.com wrote:
The locality is how close the data is to the code that's
A little code snippet:
line1: cacheTable(existingRDDTableName)
line2: //some operations which will materialize existingRDD dataset.
line3: existingRDD.union(newRDD).registerAsTable(new_existingRDDTableName)
line4: cacheTable(new_existingRDDTableName)
line5: //some operation that will materialize new
Hi,
I am using the Spark 1.1.0 version that was released yesterday. I recompiled
my program to use the latest version using sbt assembly after modifying
project.sbt to use the 1.1.0 version. The compilation goes through and the
jar is built. When I run the jar using spark-submit, I get an error:
This issue is resolved. Looks like in the new spark-submit, the jar path has
to be at the end of the options. Earlier I could specify this path in any
order on the command line.
thanks
What is in your hive-site.xml?
On Thu, Sep 11, 2014 at 11:04 PM, linkpatrickliu linkpatrick...@live.com
wrote:
I am running Spark Standalone mode with Spark 1.1
I started SparkSQL thrift server as follows:
./sbin/start-thriftserver.sh
Then I use beeline to connect to it.
Now, I can
Ah, I see. So basically what you need is something like cache write-through
support, which exists in Shark but is not implemented in Spark SQL yet. In
Shark, when inserting data into a table that has already been cached, the
newly inserted data will be automatically cached and “union”-ed with the
Spark 1.0.0
I write logs out from my app using this object:
object LogService extends Logging {

  /** Set reasonable logging levels for streaming if the user has not
    * configured log4j. */
  def setStreamingLogLevels() {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
I'm a newbie with Java, so please be specific on how to resolve this.
The command I was running is
$ ./spark-submit --driver-class-path
/home/cloudera/Downloads/spark-1.1.0-bin-hadoop2.3/lib/spark-examples-1.1.0-hadoop2.3.0.jar
Something like the following should let you launch the thrift server on
yarn.
HADOOP_CONF_DIR=/etc/hadoop/conf HIVE_SERVER2_THRIFT_PORT=12345 \
  MASTER=yarn-client ./sbin/start-thriftserver.sh
On Thu, Sep 11, 2014 at 8:30 PM, Denny Lee denny.g@gmail.com wrote:
Could you provide some
Hi,
Is anyone setting explicit GC options for the executor JVM? If so,
which ones, and how did you arrive at them?
Thanks,
- Tim
The same command passed in another quick-start VM (v4.7), which has HBase 0.96
installed. Maybe there are some conflicts between the newer HBase version and
Spark 1.1.0? Just my guess.
Thanks.
Hi,
I was trying the following in spark-shell (built from Apache master with Hadoop
2.4.0). Both calling rdd2.collect and calling rdd3.collect threw
java.io.NotSerializableException: org.apache.hadoop.io.NullWritable.
I got the same problem in similar code in my app, which uses the newly
Another observation I had was when reading over the local filesystem with "file://": it
was reported as PROCESS_LOCAL, which was confusing.
Regards,
Liming
On 13 Sep, 2014, at 3:12 am, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Andrew,
This email was pretty helpful. I feel like this stuff
We know that some of Spark's memory is used for computation (e.g., the shuffle buffer)
and some is used for caching RDDs for future use.
Is there any existing workload which utilizes both of them? I want to do
a performance study by adjusting the ratio between them.
I know that unpersist is a method on RDDs.
But my confusion is this: when we port our Scala programs to Spark, doesn't
everything change to RDDs?
On Fri, Sep 12, 2014 at 10:16 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
unpersist is a method on RDDs. RDDs are abstractions introduced
No, Scala primitives remain primitives. Unless you create an RDD using one
of the many methods, you would not be able to access any of the RDD
methods. There is no automatic porting. Spark is an application as far as
Scala is concerned - there is no special compilation (except, of course, the Scala,
JIT
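For example, a minimal sketch: the RDD methods only become available once you explicitly create an RDD from the collection.
val queue = scala.collection.mutable.Queue(1, 2, 3)   // plain Scala: no RDD methods here
val rdd = sc.parallelize(queue.toSeq)                 // a snapshot of the queue, as an RDD
rdd.cache()
rdd.unpersist()                                       // RDD methods such as unpersist now apply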
Take this for example:
I have declared a queue, *val queue = Queue.empty[Int]*, which is a pure
Scala line in the program. I actually want the queue to be an RDD, but there
are no direct methods to create an RDD which is a queue, right? What do you
say about this?
Does there exist something like:
Hi Du,
I don't think NullWritable has ever been serializable, so you must be doing
something differently from your previous program. In this case though, just use
a map() to turn your Writables to serializable types (e.g. null and String).
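A rough sketch of that mapping (input types assumed from the description):
import org.apache.hadoop.io.{NullWritable, Text}

// e.g. rdd: RDD[(NullWritable, Text)] read from a sequence file
val serializable = rdd.map { case (_, v) => (null, v.toString) }  // plain null and String serialize fine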
Matei
On September 12, 2014 at 8:48:36 PM, Du Li
An RDD is a fault-tolerant, distributed data structure. It is the primary
abstraction in Spark.
I would strongly suggest that you have a look at the following to get a
basic idea.
http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
Hey SK,
Yeah, the documented format is the same (we expect users to add the
jar at the end) but the old spark-submit had a bug where it would
actually accept inputs that did not match the documented format. Sorry
if this was difficult to find!
- Patrick
On Fri, Sep 12, 2014 at 1:50 PM, SK
I wrote an input format for Redshift tables unloaded with UNLOAD and the
ESCAPE option: https://github.com/mengxr/redshift-input-format , which
can recognize multi-line records.
Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
the delimiter character. You can apply the same escaping
Assume I have a large book with many chapters and many lines of text.
Assume I have a function that tells me the similarity of two lines of
text. The objective is to find, for each line, the most similar line in the
same chapter within 200 lines of it.
The real problem involves biology and is beyond
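A hedged sketch of the book version, assuming lines: RDD[(String, (Int, String))] keyed by chapter and a given similarity(a, b): Double; note it pulls each chapter into memory, so it only works if chapters are reasonably small:
val bestMatch = lines.groupByKey().flatMapValues { chapterLines =>
  val sorted = chapterLines.toArray.sortBy(_._1)   // order lines by line number
  sorted.indices.flatMap { i =>
    val (lineNo, text) = sorted(i)
    val window = sorted.slice(i + 1, math.min(i + 1 + 200, sorted.length))  // the next 200 lines
    if (window.isEmpty) None
    else {
      val (otherNo, otherText) = window.maxBy { case (_, t) => similarity(text, t) }
      Some((lineNo, otherNo, similarity(text, otherText)))
    }
  }
}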
Follow the instructions here:
http://spark.apache.org/docs/latest/building-with-maven.html