I'm not sure exactly how your cluster is configured, but as far as I can
tell Cloudera's MR1 CDH5 dependencies are built against Hadoop 2.3. I'd just
find the exact CDH version you have and link against the `mr1` version of
their published dependencies for that version.
So I think you want
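For example, a sketch of what the sbt dependency might look like (the
version string here is an assumption; substitute the exact CDH release you
are running):

  // build.sbt: hypothetical coordinates, check your CDH release notes
  resolvers += "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
  libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0"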
Hi,
For our use cases we are looking into 20 x 1M matrices, which fall in a
similar range to the ones outlined by the paper here:
http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html
Does the exponential runtime growth in Spark ALS, as outlined by the blog,
still exist in
I am getting strange behavior with RDDs.
All I want is to persist the RDD contents in a single file.
saveAsTextFile() writes a separate text file for each partition, so
I tried rdd.coalesce(1, true).saveAsTextFile(). This fails with the
exception:
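For reference, a minimal sketch of the single-file pattern being attempted,
assuming the data is small enough to fit in one partition:

  // shuffle = true forces a repartition down to a single partition,
  // so saveAsTextFile produces one part-00000 file
  val rdd = sc.parallelize(1 to 1000)
  rdd.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///tmp/single-file-output")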
Egor, I encountered the same problem you asked about in this thread:
http://mail-archives.apache.org/mod_mbox/spark-user/201402.mbox/%3CCAMrx5DwJVJS0g_FE7_2qwMu4Xf0y5VfV=tlyauv2kh5v4k6...@mail.gmail.com%3E
Have you fixed this problem?
I am using Shark to read a table which I have created on
Hi Praveen, thanks for replying.
I am using hive-0.11, which comes from AMPLab. At the beginning, the
AMPLab hive-site.xml was empty, so I copied a hive-site.xml from my
cluster, removed some attributes, and also added some attributes.
I don't think that is the reason for my problem.
I think
Oh OK. You must be running the Shark server on bigdata001 to use it from
other machines.
./bin/shark --service sharkserver # runs the Shark server on port 1
You could connect to the Shark server as ./bin/shark -h bigdata001; this
should work unless there is a firewall blocking it. You might use telnet
Hi Praveen, I can start the server on bigdata001 using ./bin/shark --service
sharkserver, and I can also connect to this server using ./bin/shark -h
bigdata001.
But the problem is still there:
run select count(*) from b on bigdata001: no result, no error.
run select count(*) from b on bigdata002: no
OK, it was working.
I printed System.getenv(..) for both env variables and they gave correct
values.
However it did not give me the intended result. My intention was to load a
native library from LD_LIBRARY_PATH, but it looks like the library is loaded
from the value of -Djava.library.path.
Value
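A quick way to check both locations from inside the shell or a job (these
are the standard JVM lookups, nothing specific to this setup):

  // System.loadLibrary searches java.library.path, not LD_LIBRARY_PATH
  println("LD_LIBRARY_PATH = " + System.getenv("LD_LIBRARY_PATH"))
  println("java.library.path = " + System.getProperty("java.library.path"))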
I have found this log on bigdata003:
14/03/25 17:08:49 INFO network.ConnectionManager: Accepted connection
from [bigdata001/192.168.1.101]
14/03/25 17:08:49 INFO network.ConnectionManager: Accepted connection
from [bigdata002/192.168.1.102]
14/03/25 17:08:49 INFO network.ConnectionManager:
Thanks for the response Mayur. I was looking at the web UI of Spark 0.9.0. I
see lots of detailed statistics in the newer 1.0.0-snapshot version. The only
thing
I found missing was the actual code that I had typed in at the spark-shell
prompt, but I can always get it from the shell history.
On
When you say launch long-running tasks, do you mean long-running Spark
jobs/tasks, or long-running tasks in another system?
If the rate of requests from Kafka is not low (in terms of records per
second), you could collect the records in the driver, and maintain the
shared bag in the driver. A
Is it possible to run across a cluster using the Spark interactive shell?
To be more explicit, is the procedure similar to running a standalone
master-slave Spark setup?
I want to execute my code in the interactive shell on the master node, and
it should run across the cluster [say 5 nodes]. Is the procedure
What do you mean by run across the cluster?
Do you want to start the spark-shell across the cluster, or do you want to
distribute tasks to multiple machines?
If the former, yes, as long as you indicate the right master URL.
If the latter, also yes; you can observe the distributed tasks in the
Nan Zhu, it's the latter, I want to distribute the tasks to the cluster
[machines available].
If I set SPARK_MASTER_IP on the other machines and set the slave IPs in
conf/slaves on the master node, will the interactive shell code run on
the master be distributed across multiple machines?
All you need to do is ensure your Spark cluster is running well (you can
check by accessing the Spark UI to see whether all workers are displayed).
Then you have to set the correct SPARK_MASTER_IP on the machine where you
run spark-shell.
In more detail:
when you run bin/spark-shell, it will
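A minimal sketch of what the shell effectively sets up, assuming a
standalone master on bigdata001 at the default port 7077:

  // spark-shell builds a SparkContext against the master URL; anything you
  // type at the prompt is then split into tasks on the cluster's workers
  import org.apache.spark.SparkContext
  val sc = new SparkContext("spark://bigdata001:7077", "Spark shell")
  sc.parallelize(1 to 100000).map(_ * 2).count()  // runs on the workers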
Nan (or anyone who feels they understand the cluster architecture well),
can you clarify something for me?
From reading this user group and your explanation above, it appears that the
cluster master is only involved during application startup -- to
allocate executors (from what you wrote
The master does more work than that, actually; I just explained why he
should set SPARK_MASTER_IP correctly.
A simplified list:
1. maintain the worker status
2. maintain the in-cluster driver status
3. maintain the executor status (the worker tells the master what happened
on the executor,
--
Nan Zhu
On
And yes, I think that picture is a bit misleading, though the following
paragraph does mention that:
“
Because the driver schedules tasks on the cluster, it should be run close to
the worker nodes, preferably on the same local area network. If you’d like to
send requests to the
Have you looked at the individual nodes' logs? Can you post a bit more of
the exception's output?
On 3/26/14, 8:42 AM, Jaonary Rabarisoa wrote:
Hi all,
I get java.lang.ClassNotFoundException even with addJar called. The
jar file is present on each node.
I use the version of Spark from
Hi,
I have a similar issue to the user below:
I’m running Spark 0.8.1 (standalone). When I test the streaming
NetworkWordCount example as in the docs with local[2] it works fine. As soon as
I want to connect to my cluster using [NetworkWordCount master …] it says:
---
Failed to load native
I'm passing a moving-average function during the map phase like this:
val average = new Sma(window = 3)
stream.map(x => average.addNumber(x))
where
class Sma extends Serializable { .. }
I also tried putting the creation of the average object inside an object,
like I saw in another post:
object Average {
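For what it's worth, a minimal self-contained sketch of what such a
serializable Sma might look like (the windowing details are assumptions,
since the original code is cut off):

  class Sma(window: Int) extends Serializable {
    private val values = scala.collection.mutable.Queue.empty[Double]
    def addNumber(x: Double): Double = {
      values.enqueue(x)
      if (values.size > window) values.dequeue()
      values.sum / values.size  // simple moving average over the window
    }
  }

Note that Spark serializes a separate copy of such an object into each
task, so mutable state like this is not shared across partitions or
batches; that is usually the surprise with this pattern.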
Have you looked through the logs fully? I have seen this (in my limited
experience) pop up as a result of previous exceptions/errors, and also as a
result of being unable to serialize objects, etc.
Ognen
On 3/26/14, 10:39 AM, Jaonary Rabarisoa wrote:
I noticed that I get this error when I'm trying
I tried reduce, and it's giving me pretty weird results that make no sense,
i.e. one number for an entire RDD:
val smaStream = inputStream.reduce { case (t1, t2) =>
  val sma = average.addDataPoint(t1)
  sma
}
Tried with transform and that worked correctly, but unfortunately, it
Hi,
I want to do something like this:
rdd3 = rdd1.coalesce(N).partitions.zip(rdd2.coalesce(N).partitions)
I realize the above will get me something like Array[(Partition, Partition)].
I hope you see what I'm going for here -- any tips on how to accomplish
this?
Thanks
I'm trying to understand Spark streaming, hoping someone can help.
I've kinda-sorta got a version of Word Count running, and it looks like
this:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
object StreamingWordCount {
def
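For context, a minimal self-contained sketch of a typical streaming word
count (the socket source and port are assumptions):

  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.StreamingContext._

  object StreamingWordCount {
    def main(args: Array[String]) {
      // local[2]: one core for the receiver, at least one for processing
      val ssc = new StreamingContext("local[2]", "StreamingWordCount", Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      counts.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }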
*Answer 1:* Make sure you are using master as local[n] with n > 1
(assuming you are running it in local mode). The way Spark Streaming
works is that it assigns a core to the data receiver, and so if you
run the program with only one core (i.e., with local or local[1]),
then it won't have resources to
Hi Diana,
I'll answer Q3:
You can check if an RDD is empty in several ways.
Someone here mentioned that using an iterator was safer:
val isEmpty = rdd.mapPartitions(iter => Iterator(!iter.hasNext)).reduce(_ && _)
You can also check with a fold or rdd.count
rdd.reduce(_ + _) // can't handle
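A minimal sketch contrasting the approaches, assuming an RDD[Int] (the
names are illustrative):

  val rdd = sc.parallelize(Seq.empty[Int])
  // safe on empty RDDs: each partition reports whether it has elements
  val isEmpty = rdd.mapPartitions(iter => Iterator(!iter.hasNext)).reduce(_ && _)
  // also safe, but scans the whole dataset
  val isEmpty2 = rdd.count() == 0
  // NOT safe: reduce throws UnsupportedOperationException on an empty RDD
  // rdd.reduce(_ + _)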
Answering my own question here. This may not be efficient, but this is
what I came up with:
rdd1.coalesce(N).glom.zip(rdd2.coalesce(N).glom).map { case (x, y) => x ++ y }
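Spelled out a bit, assuming both RDDs hold the same element type:

  // glom collapses each partition into one Array, so after coalesce(N) both
  // RDDs have exactly N single-element partitions and zip can pair them up
  val rdd1 = sc.parallelize(1 to 100)
  val rdd2 = sc.parallelize(101 to 200)
  val N = 4
  val rdd3 = rdd1.coalesce(N).glom().zip(rdd2.coalesce(N).glom())
    .map { case (x, y) => x ++ y }  // one concatenated array per partition pair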
On Wed, Mar 26, 2014 at 11:11 AM, Walrus theCat walrusthe...@gmail.com wrote:
Hi,
I want to do something like this:
rdd3 =
Thanks, Tathagata, very helpful. A follow-up question below...
On Wed, Mar 26, 2014 at 2:27 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
*Answer 3:* You can do something like
wordCounts.foreachRDD((rdd: RDD[...], time: Time) => {
if (rdd.take(1).size == 1) {
// There exists
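Completing the pattern for illustration; the element type and the body of
the if are assumptions:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.Time

  wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
    // take(1) pulls at most one element, so this is a cheap non-empty check
    if (rdd.take(1).size == 1) {
      println("batch at " + time + " has data")
    }
  })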
Hello, (this is Yarn related)
I'm able to load an external jar and use its classes within
ApplicationMaster. I wish to use this jar within worker nodes, so I added
sc.addJar(pathToJar) and ran.
I get the following exception:
org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 4
Can you give us a more detailed exception and stack trace from the log? It
should be in the driver log. If not, please take a look at the executor
logs, through the web UI, to find the stack trace.
TD
On Tue, Mar 25, 2014 at 10:43 PM, gaganbm gagan.mis...@gmail.com wrote:
Hi Folks,
Is this
It seems to be an old problem:
http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3c7f6aa9e820f55d4a96946a87e086ef4a4bcdf...@eagh-erfpmbx41.erf.thomson.com%3E
https://groups.google.com/forum/#!topic/spark-users/Q66UOeA2u-I
Has anyone found a solution?
On Wed, Mar 26, 2014 at
Hi Sung,
Are you using yarn-standalone mode? Have you specified the --addJars
option with your external jars?
-Sandy
On Wed, Mar 26, 2014 at 1:17 PM, Sung Hwan Chung
coded...@cs.stanford.edu wrote:
Hello, (this is Yarn related)
I'm able to load an external jar and use its classes within
context.objectFile[ReIdDataSetEntry](data) - not sure how this is compiled
in Scala. But if it uses some sort of ObjectInputStream, you need to be
careful - ObjectInputStream uses the root classloader to load classes and
does not work with jars that are added to the TCCL. Apache Commons has
Hey Everyone,
This already went out to the dev list, but I wanted to put a pointer here
as well to a new feature we are pretty excited about for Spark 1.0.
http://databricks.com/blog/2014/03/26/Spark-SQL-manipulating-structured-data-using-Spark.html
Michael
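For a taste of the API described in the blog post, a minimal sketch
assuming the Spark 1.0 SQLContext (the case class, data, and query are
illustrative):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext._  // brings in createSchemaRDD and sql()

  case class Person(name: String, age: Int)
  val people = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 12)))
  people.registerAsTable("people")

  val adults = sql("SELECT name FROM people WHERE age >= 18")
  adults.collect().foreach(println)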
No idea how feasible this is. Has anyone done it?
This is so, so COOL. YES. I'm excited about using this once I'm a bit more
comfortable with Spark.
Nice work, people!
On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote:
Hey Everyone,
This already went out to the dev list, but I wanted to put a pointer here
as
Fantastic! Although, I think they missed an obvious name choice: SparkQL
(pronounced sparkle) :)
Skyler
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, March 26, 2014 3:58 PM
To: user@spark.apache.org
Subject: Announcing Spark SQL
Hey Everyone,
This already went out
To clarify: I don't need the actual paths, just the distances.
On Wed, Mar 26, 2014 at 3:04 PM, Ryan Compton compton.r...@gmail.com wrote:
No idea how feasible this is. Has anyone done it?
For the record, I tried this, and it worked.
On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat walrusthe...@gmail.com wrote:
Oh, so if I had something more reasonable, like RDDs full of tuples of,
say, (Int, Set, Set), I could expect a more uniform distribution?
Thanks
On Mon, Mar 24, 2014
Congrats Michael & co for putting this together — this is probably the neatest
piece of technology added to Spark in the past few months, and it will greatly
change what users can do as more data sources are added.
Matei
On Mar 26, 2014, at 3:22 PM, Ognen Duzlevski og...@plainvanillagames.com
+1 Michael, Reynold et al. This is key to some of the things we're doing.
--
Christopher T. Nguyen
Co-founder & CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen
On Wed, Mar 26, 2014 at 2:58 PM, Michael Armbrust mich...@databricks.com wrote:
Hey Everyone,
This already went out to the
Yeah, if you’re just worried about statistics, maybe you can do sampling (do
single-pair paths from 100 random nodes and you get an idea of what percentage
of nodes have what distribution of neighbors in a given distance).
Matei
On Mar 26, 2014, at 5:55 PM, Ryan Compton compton.r...@gmail.com
The web UI shows 3 executors: the driver and one Spark task on each worker.
I do see that there were 8 successful tasks, and the ninth failed like so...
java.lang.Exception (java.lang.Exception: Could not compute split, block
input-0-1395860790200 not found)
Very nice.
Any plans to make the SQL typesafe using something like Slick
(http://slick.typesafe.com/)?
Thanks!
On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote:
Hey Everyone,
This already went out to the dev list, but I wanted to put a pointer here
as well to
Hi,
What's the splittable compression format that works with Spark right now?
We are looking into bzip2 / lzo / gzip... gzip is not splittable, so not a
good option. Within bzip2/lzo I am confused.
Thanks.
Deb
Any plans to make the SQL typesafe using something like Slick
(http://slick.typesafe.com/)?
I would really like to do something like that, and maybe we will in a
couple of months. However, in the near term, I think the top priorities are
going to be performance and stability.
Michael
Hi all,
I've got something which I think should be straightforward, but it isn't,
so I'm not getting it.
I have an 8-node Spark 0.9.0 cluster also running HDFS. Workers have 16g of
memory and use 8 cores.
In HDFS I have a CSV file of 110M lines of 9 columns (e.g., [key,a,b,c...]).
I have another