Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0?
Thanks,
Ron
Sent from my iPad
On May 20, 2014, at 6:27 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
Hi Sandy,
Is there a programmatic way? We're building a platform as a service and
need to assign it to
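(For illustration only, a minimal sketch of setting the queue through configuration; whether this is honored depends on the Spark-on-YARN version, and the property name "spark.yarn.queue" and the queue name are assumptions, not confirmed API:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MyPlatformApp")                // illustrative name
  .set("spark.yarn.queue", "platform-queue")  // assumed property and queue name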
Hello,
I am currently using MLlib ALS to process a large volume of data, about 1.2
billion Rating(userId, productId, rate) triples. The dataset was separated
into 4000 partitions for parallelized computation on our YARN cluster.
I encountered this error: Errors communicating with
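(A rough sketch of the setup described above, for reference; the input path, rank, iteration count, and lambda are placeholders, not recommendations:)

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.textFile("hdfs:///ratings.csv").map { line =>
  val Array(user, product, rate) = line.split(',')
  Rating(user.toInt, product.toInt, rate.toDouble)
}.repartition(4000)  // the 4000 partitions mentioned above

val model = ALS.train(ratings, 50, 10, 0.01)  // rank, iterations, lambda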
Unfortunately, there is no API support for this right now. You could
implement it yourself by writing your own receiver and controlling the
rate at which objects are received. If you are using any of the standard
receivers (Flume, Kafka, etc.), I recommend looking at the source code of
the
Apologies for the premature send.
Unfortunately, there is no API support for this right now. You could
implement it yourself by writing your own receiver and controlling the
rate at which objects are received. If you are using any of the standard
receivers (Flume, Kafka, etc.), I
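(A hedged sketch of that idea, assuming the Spark 1.0 receiver API; fetchNext() is a hypothetical stand-in for whatever source you poll:)

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ThrottledReceiver(maxPerSecond: Int)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart() {
    new Thread("throttled-receiver") {
      override def run() {
        while (!isStopped()) {
          var sent = 0
          val start = System.currentTimeMillis()
          // push at most maxPerSecond records, then sleep out the rest of the second
          while (sent < maxPerSecond && !isStopped()) {
            store(fetchNext())
            sent += 1
          }
          val elapsed = System.currentTimeMillis() - start
          if (elapsed < 1000) Thread.sleep(1000 - elapsed)
        }
      }
    }.start()
  }

  def onStop() { }

  private def fetchNext(): String = "record"  // replace with a real source read
}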
If you saw an exception message like the one mentioned in the JIRA
https://issues.apache.org/jira/browse/SPARK-1886 in the worker's log
file, you are welcome to try https://github.com/apache/spark/pull/827
On Wed, May 21, 2014 at 11:21 AM, Josh Marcus jmar...@meetup.com wrote:
Aaron:
I see this
Hi,
I have set up a cluster with Mesos (backed by Zookeeper) with three
master and three slave instances. I set up Spark (git HEAD) for use
with Mesos according to this manual:
http://people.apache.org/~pwendell/catalyst-docs/running-on-mesos.html
Using the spark-shell, I can connect to this
No one has any idea? It's really troublesome; it seems like I have no way to
catch errors while an action is being processed and just ignore them. Here are
a few more details on what I'm doing:
JavaRDD<String> a = sc.textFile("s3n://" + missingFilenamePattern); JavaRDD<String> b =
Hi,
I want to do a union of all RDDs in each window of a DStream. I found DStream.union
but haven't seen anything like DStream.windowRDDUnion.
Is there any way around it?
I want to find the mean and SD of all values that come under each sliding window,
for which I need to union all the RDDs in each
I am new to spark and we are developing a data science pipeline based on
spark on ec2. So far we have been using python on spark standalone cluster.
However, being a newbie I would like to know more about how I can do
debugging (program level) from the Spark logs (is it stderr?). I find it a bit
Can you identify a specific file that fails?
There might be a real bug here, but I have found gzip to be reliable.
Every time I have run into a bad header error with gzip, I had a non-gzip
file with the wrong extension for whatever reason.
-
Madhu
https://www.linkedin.com/in/msiddalingaiah
Hi,
I'm noticing a difference between two installations of Spark. I'm pretty
sure both are version 0.9.1. One is able to import
pyspark.rdd.ResultIterable and the other isn't. Is this an environment
problem or do we actually have two different versions of Spark? To be
clear, on one box, one can
Hi Tobias,
Regarding my comment on closure serialization:
I was discussing it with my fellow Sparkers here and I totally overlooked
the fact that you need the class files to de-serialize the closures (or
whatever) on the workers, so you always need the jar file delivered to the
workers in order
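(A minimal sketch of delivering the jar, assuming a standalone master URL; the paths and names are illustrative:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("MyApp")
  .setJars(Seq("target/scala-2.10/my-app-assembly.jar"))  // shipped to workers
val sc = new SparkContext(conf)

// or add it after the context is up:
sc.addJar("target/scala-2.10/my-app-assembly.jar")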
Thanks Nick and Matei. I'll take a look at the patch and keep you updated.
Tommer
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Python-Spark-and-HBase-tp6142p6176.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
After the several fixes that we have made to exception handling in Spark
1.0.0, I expect that this behavior will be quite different from 0.9.1.
Executors should be far more likely to shutdown cleanly in the event of
errors, allowing easier restarts. But I expect that there will be more
bugs to
Hi Nick,
Which version of Hadoop are you using with Spark? I spotted an issue with
the built-in GzipDecompressor while doing something similar with Hadoop
1.0.4, all my Gzip files were valid and tested yet certain files blew up
from Hadoop/Spark.
The following JIRA ticket goes into more detail
Thanks, this really helps.
As long as I stick to HDFS paths and files, I'm good. I do know that code a bit
but have never used it to say take input from one cluster via
“hdfs://server:port/path” and output to another via
“hdfs://another-server:another-port/path”. This seems to be supported by
Does anyone know if:
./bin/spark-shell --master yarn
is running yarn-cluster or yarn-client by default?
Based on the source code:
./core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
if (args.deployMode == "cluster" && args.master.startsWith("yarn")) {
  args.master = "yarn-cluster"
I ran the PageRank example processing a large data set, 5 GB in size, using 48
machines. The job got stuck at the time point 14/05/20 21:32:17, as the
attached log shows. It was stuck there for more than 10 hours before I finally
killed it. But I did not find any information explaining why it
Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui
On Wed, May 21, 2014 at 11:23 AM, yxzhao yxz...@ualr.edu wrote:
I ran the PageRank example processing a large data set, 5 GB in size, using 48
machines. The job got stuck at the time point 14/05/20 21:32:17, as the
Hi Tobias,
On Wed, May 21, 2014 at 5:45 PM, Tobias Pfeiffer t...@preferred.jp wrote:
first, thanks for your explanations regarding the jar files!
No prob :-)
On Thu, May 22, 2014 at 12:32 AM, Gerard Maas gerard.m...@gmail.com
wrote:
I was discussing it with my fellow Sparkers here and I
Thanks Xiangrui. How can I check and make sure the data is distributed
evenly? Thanks again.
On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [via Apache Spark User
List] ml-node+s1001560n6187...@n3.nabble.com wrote:
Many OutOfMemoryErrors in the log. Is your data distributed evenly?
-Xiangrui
On
If the RDD is cached, you can check its storage information in the
Storage tab of the Web UI.
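(Another option, without the UI: count records per partition and look for skew; rdd here is a stand-in for your own RDD:)

val perPartition = rdd.mapPartitionsWithIndex { (i, iter) =>
  Iterator((i, iter.size))
}.collect()
perPartition.sortBy(-_._2).take(10).foreach(println)  // heaviest partitions first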
On Wed, May 21, 2014 at 12:31 PM, yxzhao yxz...@ualr.edu wrote:
Thanks Xiangrui. How can I check and make sure the data is distributed
evenly? Thanks again.
On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng
Here's the 1.0.0rc9 version of the docs:
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/running-on-mesos.html
I refreshed them with the goal of steering users more towards prebuilt
packages rather than relying on compiling from source, plus improving overall
formatting and clarity, but not
Does anyone know how to configure the digest mailing list? For example, I
want to receive a daily digest, not one every 10 messages.
Thanks!
On Mon, May 19, 2014 at 4:29 PM, Shangyu Luo lsy...@gmail.com wrote:
Hi Andrew and Madhu,
Thank you for your help here! Will unsubscribe through another
Hi Andrew,
Thanks for the current doc.
I'd almost gotten to the point where I thought that my custom code needed
to be included in the SPARK_EXECUTOR_URI but that can't possibly be
correct. The Spark workers that are launched on Mesos slaves should start
with the Spark core jars and then
Looking forward to that update!
Given a table of JSON objects like this one:
{
  "name": "Nick",
  "location": {
    "x": 241.6,
    "y": -22.5
  },
  "likes": ["ice cream", "dogs", "Vanilla Ice"]
}
It would be SUPER COOL if we could query that table in a way that is as
natural as follows:
SELECT
I have a few test cases for Spark which extend TestSuiteBase from
org.apache.spark.streaming.
The tests run fine on my machine, but when I commit to the repo and the tests
run automatically under Bamboo, the test cases fail with these errors.
How to fix?
21-May-2014 16:33:09
[info]
I have a graph and am trying to take a random sample of vertices without
replacement, using the RDD.sample() method
verts are the vertices in the graph
val verts = graph.vertices
and executing this multiple times in a row
verts.sample(false, 1.toDouble/v1.count.toDouble,
Just found this at the top of the log:
17:14:41.124 [pool-7-thread-3-ScalaTest-running-StreamingSpikeSpec] WARN
o.e.j.u.component.AbstractLifeCycle - FAILED
SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in
use
build 21-May-2014 17:14:41
It doesn't guarantee the exact sample size. If you fix the random
seed, it would return the same result every time. -Xiangrui
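(A small illustration of that behavior; fraction mirrors the expression in the question:)

val fraction = 1.0 / verts.count()
val s1 = verts.sample(false, fraction, 42).count()
val s2 = verts.sample(false, fraction, 42).count()  // same seed, so s1 == s2
val s3 = verts.sample(false, fraction, scala.util.Random.nextInt()).count()  // varies run to run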
On Wed, May 21, 2014 at 2:05 PM, glxc r.ryan.mcc...@gmail.com wrote:
I have a graph and am trying to take a random sample of vertices without
replacement, using the
The problem was solved after the hadoop-core dependency was added. But I think
there is a misunderstanding about local files. I found this one:
Note that if you've connected to a Spark master, it's possible that it will
attempt to load the file on one of the different machines in the cluster, so
make sure
Ah, forgot the -verbose option. Thanks Andrew. That is very helpful.
Date: Wed, 21 May 2014 11:07:55 -0700
Subject: Re: Is spark 1.0.0 spark-shell --master=yarn running in yarn-cluster
mode or yarn-client mode?
From: and...@databricks.com
To: user@spark.apache.org
The answer is actually
This does happen sometimes, but it is only a warning because Spark is designed
to try successive ports until it succeeds. So unless a crazy number of
successive ports are blocked (runaway processes?? insufficient clearing of
ports by the OS??), these errors should not be a problem for tests passing.
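(If the collisions come from test suites running in parallel, one hedged workaround is to pin each suite to its own UI port via spark.ui.port:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("StreamingSpikeSpec")
  .set("spark.ui.port", "4141")  // any free port; vary it per concurrently-running suite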
On
I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster
mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or
2.2. The application successfully ran to conclusion but it
ultimately failed.
There were 2 anomalies...
1. ASM reported only
You could do
records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] }
On Wed, May 21, 2014 at 3:28 PM, Ian Holsman i...@holsman.com.au wrote:
Hi.
Firstly, I'm a newb (to both Scala and Spark).
I have a stream, that contains multiple types of records, and I would like
to
Hi,
I would like to set up the Apache platform on a mini cluster. Is there any
recommendation for the hardware I should buy to set it up? I am thinking
about processing a significant amount of data, in the range of a few
terabytes.
Thanks
Upender
Suggestion: try to get an idea of your hardware requirements by running a
sample on Amazon's EC2 or Google Compute Engine. It's relatively easy (and
cheap) to get started on the cloud before you invest in your own hardware,
IMO.
On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar
Great, thanks for that tip. I will update the documents!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-tp5529p6210.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] }
I think a Scala-ish way would be
records.flatMap(_ match {
  case i: Int => Some(i)
  case _ => None
})
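(For what it's worth, RDD.collect also takes a partial function, which does the filter and cast in one pass:)

val oranges = records.collect { case o: Orange => o }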
Thanks Tobias and Tathagata,
these are great.
On Wed, May 21, 2014 at 8:02 PM, Tobias Pfeiffer t...@preferred.jp wrote:
On Thu, May 22, 2014 at 8:07 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] }
I think a
In yarn-client mode, will Spark be deployed on the YARN nodes? If it is
deployed only on the client, can Spark still submit the job to YARN?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi Sophia,
In yarn-client mode, the node that submits the application can either be
inside or outside of the cluster. This node also hosts the driver
(SparkContext) of the application. All the executors, however, will be
launched on nodes inside the YARN cluster.
Andrew
2014-05-21 18:17
But I don't understand this point: is it necessary to deploy Spark slave
nodes on the YARN nodes?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6216.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
I want to do union of all RDDs in each window of DStream.
A window *is* a union of all RDDs in the respective time interval.
The documentation says a DStream is represented as a sequence of
RDDs. However, data from a
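(A sketch of the mean/SD computation over each window, assuming a DStream[Double] named values; the window durations are placeholders:)

import org.apache.spark.SparkContext._   // for the Double RDD functions
import org.apache.spark.streaming.Seconds

val windowed = values.window(Seconds(30), Seconds(10))
windowed.foreachRDD { rdd =>
  if (rdd.count() > 0) {
    val st = rdd.stats()  // StatCounter over everything in the window
    println("mean=" + st.mean + " stdev=" + st.stdev)
  }
}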
Hi,
as far as I understand, if you create an RDD with a relational
structure from your JSON, you should be able to do much of that
already today. For example, take lift-json's deserializer and do
something like
val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
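(Filling that sketch out a bit, hedged: this assumes lift-json on the classpath and json_data: RDD[String]; the case classes are illustrative and mirror the JSON example earlier in the digest:)

import net.liftweb.json._
import org.apache.spark.rdd.RDD

case class Location(x: Double, y: Double)
case class Person(name: String, location: Location, likes: List[String])

val json_table: RDD[Person] = json_data.mapPartitions { lines =>
  implicit val formats = DefaultFormats  // defined per partition to avoid serializing it
  lines.flatMap { line =>
    try Some(parse(line).extract[Person])
    catch { case _: Exception => None }  // drop malformed records
  }
}

// ordinary RDD operations then give "relational" access:
json_table.filter(_.location.x > 200).map(_.name)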
It seems you are asking whether the Spark-related jars need to be deployed to
the YARN cluster manually before you launch the application? No, you don't,
just like any other YARN application. And it doesn't matter whether it is
yarn-client or yarn-cluster mode.
Best Regards,
Raymond Liu
-Original
It sounds like something is closing the hdfs filesystem before everyone is
really done with it. The filesystem gets cached and is shared so if someone
closes it while other threads are still using it you run into this error. Is
your application closing the filesystem? Are you using the
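(Background, as a hedged sketch: Hadoop caches FileSystem instances per scheme/authority/user, so fs.close() in one place closes the instance for every other user of the cache. One common workaround is to disable the cache for hdfs://:)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val hadoopConf = new Configuration()
hadoopConf.setBoolean("fs.hdfs.impl.disable.cache", true)  // each get() returns a fresh instance
val fs = FileSystem.get(hadoopConf)
// ... use fs ...
fs.close()  // now safe: no one else shares this instance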
Hi Mohit,
The log line about the ExternalAppendOnlyMap is more of a symptom of
slowness than causing slowness itself. The ExternalAppendOnlyMap is used
when a shuffle is causing too much data to be held in memory. Rather than
OOM'ing, Spark writes the data out to disk in a sorted order and
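(The relevant knobs, hedged for the Spark 0.9/1.0 property names; raise the fraction only if you have memory headroom:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")          // the default; spills instead of OOM'ing
  .set("spark.shuffle.memoryFraction", "0.4")  // default is lower; more memory, less spilling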
One thing you can try is to pull each file out of S3 and decompress with
gzip -d to see if it works. I'm guessing there's a corrupted .gz file
somewhere in your path glob.
Andrew
On Wed, May 21, 2014 at 12:40 PM, Michael Cutler mich...@tumra.com wrote:
Hi Nick,
Which version of Hadoop are
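(A sketch of that check against local copies of the files; a corrupt or mislabeled file will throw "Not in GZIP format" or similar:)

import java.io.{FileInputStream, IOException}
import java.util.zip.GZIPInputStream

def isValidGzip(path: String): Boolean = {
  try {
    val in = new GZIPInputStream(new FileInputStream(path))
    val buf = new Array[Byte](65536)
    while (in.read(buf) != -1) {}  // decompress to the end
    in.close()
    true
  } catch {
    case e: IOException => println(path + ": " + e.getMessage); false
  }
}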
That's a good idea. So you're saying create a SchemaRDD by applying a
function that deserializes the JSON and transforms it into a relational
structure, right?
The end goal for my team would be to expose some JDBC endpoint for analysts
to query from, so once Shark is updated to use Spark SQL that
Thank you
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/yarn-client-mode-question-tp6213p6224.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Thanks for the suggestions, people. I will try to home in on which specific
gzipped files, if any, are actually corrupt.
Michael,
I’m using Hadoop 1.0.4, which I believe is the default version that gets
deployed by spark-ec2. The JIRA issue I linked to earlier,
It depends on what stack you want to run. A quick cut:
- Worker machines (DataNode, HBase RegionServers, Spark workers)
  - Dual 6-core CPUs
  - 64 to 128 GB RAM
  - 3 x 3 TB disks (JBOD)
- Master node (NameNode, HBase Master, Spark Master)
  - Dual 6-core CPUs
  - 64
Hi,
I'm quite new and recently started to try Spark. I've set up a single-node
Spark cluster and followed the tutorials in the Quick Start guide. But I've
come across some issues.
The thing I was trying to do is to try the Java API and run it on the
single-node cluster. I followed the Quick Start/A
Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 /
HDP(?) ? We may be able to reproduce this in that case.
TD
On Wed, May 21, 2014 at 8:35 PM, Tom Graves tgraves...@yahoo.com wrote:
It sounds like something is closing the hdfs filesystem before everyone is
really done