Sorry, but the accumulator is still going to require you to walk through the
RDD to get an accurate count, right?
It's not being persisted?
On Jan 14, 2015, at 5:17 AM, Ganelin, Ilya ilya.gane...@capitalone.com wrote:
An alternative to doing a naive toArray is to declare an accumulator per
The reason is fairly simple actually - we don't want to commit to
maintaining the specific APIs exposed. If we expose OpenHashSet, we will
have to always keep that in Spark and not change the API.
On Tue, Jan 13, 2015 at 12:39 PM, Tae-Hyuk Ahn ahn@gmail.com wrote:
Thanks, Josh and Reynold.
Yes, we are close to having more than 2 billion users. In this case, what is
the best way to handle this?
Thanks,
Nishanth
On Fri, Jan 9, 2015 at 9:50 PM, Xiangrui Meng men...@gmail.com wrote:
Do you have more than 2 billion users/products? If not, you can pair
each user/product id with an integer
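A minimal sketch of that pairing, assuming the IDs start as an RDD[Long]; the sample data is illustrative:

import org.apache.spark.SparkContext

val userIds = sc.parallelize(Seq(3000000000L, 3000000001L, 3000000000L))
// zipWithIndex assigns consecutive indices 0..count-1, which fit in an Int
// as long as there are fewer than ~2 billion distinct users
val idToIndex = userIds.distinct().zipWithIndex().mapValues(_.toInt)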
conversion_action_id is an int. We also tried a string column predicate with a
single-quoted string value and hit the same error stack.
On Wed, Jan 14, 2015 at 7:11 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Just a guess, but what is the type of conversion_action_id? I do queries
over an
Thanks a lot!
I just realized that Spark is not really an in-memory version of MapReduce :)
From: Akhil Das [mailto:ak...@sigmoidanalytics.com]
Sent: Tuesday, January 13, 2015 3:53 PM
To: Shuai Zheng
Cc: user@spark.apache.org
Subject: Re: Why always spilling to disk and how to improve it?
My assumption would be that your master process is getting killed for some
reason; it could be OOM kills by the kernel. (I'm assuming you
are running your driver program on the master node itself.)
Thanks
Best Regards
On Wed, Jan 14, 2015 at 11:25 PM, TJ Klein tjkl...@gmail.com wrote:
Hi,
I am using the sbt tool to build and run the Scala tests related to Spark.
In my /src/test/scala directory, there are two test classes (TestA, TestB),
both of which use the class in Spark for creating a SparkContext, something
like
trait LocalTestSparkContext extends BeforeAndAfterAll { self:
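The truncated trait presumably looks something like the following sketch, modeled on Spark's own test helpers; the details are assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}

trait LocalTestSparkContext extends BeforeAndAfterAll { self: Suite =>
  @transient var sc: SparkContext = _

  override def beforeAll() {
    super.beforeAll()
    // one local SparkContext shared by all tests in the suite
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
  }

  override def afterAll() {
    if (sc != null) sc.stop()
    sc = null
    super.afterAll()
  }
}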
Thanks Josh and Marcelo! It now works!
BTW, just wondering, is there any performance difference between running Spark
in standalone mode and under YARN? The only reason I created this cluster is to
run Spark jobs, so I can set up Spark in standalone mode if it runs slowly
under YARN.
best.
From:
All,
I'm still facing this issue. Any thoughts on how I can fix this?
NR
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ExceptionInInitializerError-Unable-to-load-YARN-support-tp20775p21143.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
A CDH 5.3.0 cluster with Spark is set up. Just wondering how to run a Python
application on it.
I used 'spark-submit --master yarn-cluster ./loadsessions.py' but got the
error,
Error: Cluster deploy mode is currently not supported for python
applications.
Run with --help for usage help or --verbose for
I quickly went through the code,
In ExecutorBackend, we build the actor system with
// Create SparkEnv using properties we fetched from the driver.
val driverConf = new SparkConf().setAll(props)
val env = SparkEnv.createExecutorEnv(
driverConf, executorId, hostname, port, cores, isLocal =
Hi Nishanth,
Just found out where you work :) We had some discussion in
https://issues.apache.org/jira/browse/SPARK-2465 . Having long IDs
will increase the communication cost, which may not be worth the benefit.
Not many companies have more than 1 billion users. If they do, maybe
they can mirror the
Sorry for the mistake.
I found that those Akka-related messages are from a Spark Akka-related
component (ActorLogReceive), instead of Akka itself, though it has been enough
for debugging purposes (in my case).
The question in this thread is still open….
Best,
--
Nan Zhu
Sean,
thanks for your message.
On Wed, Jan 14, 2015 at 8:36 PM, Sean Owen so...@cloudera.com wrote:
On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer t...@preferred.jp wrote:
OK, it seems like even on a local machine (with no network overhead), the
groupByKey version is about 5 times slower
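For reference, a minimal sketch of the two variants being compared; the sample data is illustrative:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
// groupByKey ships every value across the shuffle before summing:
val slow = pairs.groupByKey().mapValues(_.sum)
// reduceByKey combines values map-side first, moving far less data:
val fast = pairs.reduceByKey(_ + _)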
I assume you have looked at:
http://doc.akka.io/docs/akka/2.0/scala/logging.html
http://doc.akka.io/docs/akka/current/additional/faq.html (Debugging, last
question)
Cheers
On Wed, Jan 14, 2015 at 2:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, all
though
That's not quite what that error means. Spark is not out of memory. It
means that Spark is using more memory than it asked YARN for. That in
turn is because the default amount of cushion established between the
YARN allowed container size and the JVM heap size is too small. See
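Presumably the pointer is to the YARN memory overhead setting; a minimal sketch, assuming Spark 1.1+, where the value is in megabytes:

spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...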
for others who have the same question:
you can simply set the logging level in log4j.properties to DEBUG to achieve this
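For example, assuming the stock conf/log4j.properties.template as a starting point:

# raise the root logger, or scope it to Spark only:
log4j.rootCategory=DEBUG, console
log4j.logger.org.apache.spark=DEBUG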
Best,
--
Nan Zhu
http://codingcat.me
On Wednesday, January 14, 2015 at 6:28 PM, Nan Zhu wrote:
I quickly went through the code,
In ExecutorBackend, we build the
I'm working with RDD[Map[String,Any]] objects all over my codebase. These
objects were all originally parsed from JSON. The processing I do on RDDs
consists of parsing JSON -> grouping/transforming the dataset into a feasible
report -> outputting data to a file.
I've been wanting to infer the schemas
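A minimal sketch of schema inference, assuming Spark 1.2's SQLContext.jsonRDD and that the original JSON strings are still at hand; the path and table name are illustrative:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val jsonLines = sc.textFile("hdfs:///data/events.json")
val schemaRDD = sqlContext.jsonRDD(jsonLines) // samples the data to infer a schema
schemaRDD.printSchema()
schemaRDD.registerTempTable("events")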
Got help from Marcelo and Josh. Now it is running smoothly. In case you need
this info: just use yarn-client instead of yarn-cluster.
Thanks folks!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-run-python-app-in-yarn-tp21141p21142.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I solved the issue. In case anyone else is looking for an answer, by
default, scalatest executes all the tests in parallel. To disable this,
just put the following line in your build.sbt
parallelExecution in Test := false
Thanks
On Wed, Jan 14, 2015 at 2:30 PM, Jianguo Li
There's an open PR for supporting yarn-cluster mode in PySpark:
https://github.com/apache/spark/pull/3976 (currently blocked on reviewer
attention / time)
On Wed, Jan 14, 2015 at 3:16 PM, Marcelo Vanzin van...@cloudera.com wrote:
As the error message says...
On Wed, Jan 14, 2015 at 3:14 PM,
Hi, Ted,
Thanks
I know how to set it in Akka's context; my question is just how to pass this
akka.loglevel=DEBUG to Spark's actor system.
Best,
--
Nan Zhu
http://codingcat.me
On Wednesday, January 14, 2015 at 6:09 PM, Ted Yu wrote:
I assume you have looked at:
The log showed it failed in parsing, so the typo stuff shouldn't be the root
cause. BUT I couldn't reproduce that with the master branch.
I did the test as follows:
sbt/sbt -Phadoop-2.3.0 -Phadoop-2.3 -Phive -Phive-0.13.1 hive/console
scala> sql("SELECT user_id FROM actions where
Hi, all
though https://issues.apache.org/jira/browse/SPARK-609 was closed, I'm still
unclear about how to enable debug-level log output in Spark's actor system.
Can anyone give a suggestion?
BTW, I think we need to document this somewhere, as the user who writes an
actor-based receiver of
I got it working. It was a bug in Spark 1.1. After upgrading to 1.2 it
worked.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Issues-running-spark-on-cluster-tp21138p21140.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I am working with a RowMatrix and I noticed in the multiply() method that the
local matrix with which it is being multiplied is being distributed to all of
the rows of the RowMatrix. If this is the case, then is it impossible to
multiply a row matrix within a map operation? Because this would
As the error message says...
On Wed, Jan 14, 2015 at 3:14 PM, freedafeng freedaf...@yahoo.com wrote:
Error: Cluster deploy mode is currently not supported for python
applications.
Use yarn-client instead of yarn-cluster for pyspark apps.
--
Marcelo
Yes, you can only use RowMatrix.multiply() within the driver. We are
working on distributed block matrices and linear algebra operations on
top of it, which would fit your use cases well. It may take several
PRs to finish. You can find the first one here:
https://github.com/apache/spark/pull/3200
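A minimal sketch of the driver-side usage, with illustrative dimensions:

import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)                     // 2 x 2, distributed by rows
val local = Matrices.dense(2, 1, Array(1.0, 1.0)) // 2 x 1 local matrix
val product = mat.multiply(local)                 // must be called on the driver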
Yeah, that makes sense. Pala, are you on a prebuilt version of Spark? -- I
just tried the CDH4 prebuilt... Here is what I get for the = token:
[image: Inline image 1]
The literal type shows as 290, not 291, and 290 is numeric. According to
this
Take a look at the implementation linked from here
https://issues.apache.org/jira/browse/SPARK-4964
and see if that would meet your needs.
On Wed, Jan 14, 2015 at 9:58 PM, mykidong mykid...@gmail.com wrote:
Hi,
My Spark Streaming job is doing something like Kafka ETL to HDFS.
For instance, every 10 min.
Look at the method pipe
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
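A minimal sketch of pipe, with a hypothetical path to the existing script; each element is written to the external process's stdin as one line, and each line the process prints becomes an element of the result:

val input = sc.textFile("hdfs:///streams/raw")     // illustrative input path
val enriched = input.pipe("/opt/enrich/enrich.py") // hypothetical C++/Python enricher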
On Wed, Jan 14, 2015 at 11:16 PM, umanga bistauma...@gmail.com wrote:
This is a question I originally asked on Quora: http://qr.ae/6qjoI
We have some code written in C++ and Python that
This is a question I originally asked on Quora: http://qr.ae/6qjoI
We have some code written in C++ and Python that does data enrichment on our
data streams. If I used Storm, I could use that code with some small
modifications using ShellBolt and IRichBolt. Since the
I think there are two solutions:
1. Enable the write ahead log in Spark Streaming if you're using Spark 1.2.
2. Use a third-party Kafka consumer
(https://github.com/dibbhatt/kafka-spark-consumer).
Thanks
Saisai
-Original Message-
From: mykidong [mailto:mykid...@gmail.com]
Sent: Thursday,
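For option 1 above, a minimal sketch of enabling the receiver write ahead log, assuming Spark 1.2; the WAL also needs a checkpoint directory, and the paths here are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-etl")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Minutes(10))
ssc.checkpoint("hdfs:///checkpoints/kafka-etl") // WAL data lives alongside checkpoints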
Hi all,
I'd like to leverage some of the fast Spark collection implementations in
my own code.
Particularly for doing things like distinct counts in a mapPartitions
loop.
Are there any plans to make the org.apache.spark.util.collection
implementations public? Is there any other library out
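Absent a public OpenHashSet, a plain mutable HashSet gives the same shape of computation; a minimal sketch, assuming an RDD[String] with illustrative data:

import scala.collection.mutable

val rdd = sc.parallelize(Seq("a", "b", "a"))
// one distinct count per partition, using an ordinary mutable set
val distinctPerPartition = rdd.mapPartitions { iter =>
  val seen = mutable.HashSet[String]()
  iter.foreach(seen += _)
  Iterator(seen.size)
}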
This is a question I originally asked on Quora: http://qr.ae/6qjoI
We have some code written in C++ and Python that does data enrichment on our
data streams. If I used Storm, I could use that code with some small
modifications using ShellBolt and IRichBolt. Since the functionality is
all about
Hi,
On Thu, Jan 15, 2015 at 12:23 AM, Ted Yu yuzhih...@gmail.com wrote:
On Wed, Jan 14, 2015 at 6:58 AM, Jianguo Li flyingfromch...@gmail.com
wrote:
I am using Spark-1.1.1. When I used sbt test, I ran into the following
exceptions. Any idea how to solve it? Thanks! I think somebody posted
Hi,
My Spark Streaming job is doing something like Kafka ETL to HDFS.
For instance, every 10 min. my streaming job retrieves messages from
Kafka and saves them as Avro files onto HDFS.
My question is: if a worker sometimes fails to write Avro to HDFS, I want to
replay consuming messages from the last
Hello,
The accumulator documentation says that if the accumulator is named,
it will be displayed in the WebUI. However, I cannot find it anywhere.
Do I need to specify anything in the spark ui config?
Thanks.
Justin
Thanks Sean.
I guess Cloudera Manager has parameters executor_total_max_heapsize
and worker_max_heapsize
which point to the parameters you mentioned above.
How much should the cushion between the JVM heap size and the YARN memory limit
be? I tried setting the JVM memory to 20g and YARN to 24g, but it
Just a guess, but what is the type of conversion_action_id? I do queries
over an epoch all the time with no issues (where epoch's type is bigint).
You can see the source here
https://github.com/apache/spark/blob/v1.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
--
not sure what
I am using Spark-1.1.1. When I used sbt test, I ran into the
following exceptions. Any idea how to solve it? Thanks! I think
somebody posted this question before, but no one seemed to have
answered it. Could it be the version of io.netty I put in my
build.sbt? I included a dependency
If the wildcard path you have doesn't work, you should probably open a bug
-- I had a similar problem with Parquet and it was a bug which recently got
closed. Not sure if sqlContext.avroFile shares a codepath with
.parquetFile...you
can try running with bits that have the fix for .parquetFile or
From pom.xml (master branch):
<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty-all</artifactId>
  <version>4.0.23.Final</version>
</dependency>
Please check the version of netty Spark 1.1.1 depends on.
Cheers
On Wed, Jan 14, 2015 at 6:58 AM, Jianguo Li
Found it. Thanks Patrick.
Justin
On Wed, Jan 14, 2015 at 10:38 PM, Patrick Wendell pwend...@gmail.com
wrote:
It should appear in the page for any stage in which accumulators are
updated.
On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io
wrote:
Hello,
From
Hi,
sorry, I don't like questions about serializability myself, but still...
Can anyone give me a hint why
for (i <- 0 to (maxId - 1)) { ... }
throws a NotSerializableException in the loop body while
var i = 0
while (i < maxId) {
// same code as in the for loop
i += 1
}
works
Hi Xiangrui,
Thanks! for the reply, I will explore the suggested solutions.
-Nishanth
Hi Nishanth,
Just found out where you work :) We had some discussion in
https://issues.apache.org/jira/browse/SPARK-2465 . Having long IDs
will increase the communication cost, which may not be worth the benefit.
It should appear in the page for any stage in which accumulators are updated.
On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote:
Hello,
From accumulator documentation, it says that if the accumulator is named, it
will be displayed in the WebUI. However, I cannot find
Just noticed an error in my wording.
It should be: I'm assuming it's not immediately aggregating on the driver
each time I call += on the Accumulator.
On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet cjno...@gmail.com wrote:
What are the limitations of using Accumulators to get a union of a bunch
What are the limitations of using Accumulators to get a union of a bunch of
small sets?
Let's say I have an RDD[Map[String, Any]] and I want to do:
rdd.map(m => accumulator += Set(m.get("entityType").get))
What implication does this have on performance? I'm assuming it's not
immediately aggregating
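One accumulator-free alternative is aggregate, which unions per-partition sets and merges them on the driver; a minimal sketch, assuming each record is a Map[String, Any] with an "entityType" entry (sample data is illustrative):

val rdd = sc.parallelize(Seq(
  Map[String, Any]("entityType" -> "user"),
  Map[String, Any]("entityType" -> "order")))
val entityTypes: Set[Any] = rdd.aggregate(Set.empty[Any])(
  (s, m) => s + m("entityType"), // fold each record into the partition's set
  (s1, s2) => s1 ++ s2)          // merge the per-partition sets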
You can view the logs for the particular containers on the YARN UI if you go
to the page for a specific node, and then from the Tools menu on the left,
select Local Logs. There should be a userlogs directory which will contain
the specific application ids for each job that you run. Inside the
Should I be able to pass multiple paths separated by commas? I haven't
tried but didn't think it'd work. I'd expected a function that accepted a
list of strings.
On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
If the wildcard path you have doesn't work you should
I can't speak to Mesos solutions, but for YARN you can define queues in
which to run your jobs, and you can customize the amount of resources the
queue consumes. When deploying your Spark job, you can specify the --queue
queue_name option to schedule the job to a particular queue. Here are
some
The problem is that it gives an error message saying something to the
effect that:
URI is not hierarchical
This is consistent with your explanation.
Thanks,
arun
On Wed, Jan 14, 2015 at 1:14 AM, Sean Owen so...@cloudera.com wrote:
My hunch is that it is because the URI of a resource in a
Hi
I have a question regarding design trade offs and best practices. I'm working
on a real time analytics system. For simplicity, I have data with timestamps
(the key) and counts (the value). I use DStreams for the real time aspect.
Tuples w the same timestamp can be across various RDDs and I
Hi all,
A small number of the files being moved into my landing directory are not
being seen by my fileStream receiver. After looking at the code it seems
that, in the case of long batches (> 1 minute), if files are created before
a batch finishes, but only become visible after that batch finished
Hi,
I am running PySpark on a cluster. Generally it runs. However, frequently I
get the warning message (and consequently, the task not being executed):
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
Can I save an RDD to the local file system and then read it back on a Spark
cluster with multiple nodes?
rdd.saveAsObjectFile("file:///home/data/rdd1")
val rdd2 = sc.objectFile("file:///home/data/rdd1")
This works if the cluster has only one node.
We have dockerized Spark Master and worker(s) separately and are using it
in
our dev environment.
Is this setup available on github or dockerhub?
On Tue, Dec 9, 2014 at 3:50 PM, Venkat Subramanian vsubr...@gmail.com
wrote:
We have dockerized Spark Master and worker(s) separately and are
I'm new to Spark. When I use the MovieLens 100k dataset
(http://grouplens.org/datasets/movielens/), Spark crashes when I run the
following code. The first call to movieData.first() gives the correct result.
Depending on which machine I use, it crashed the second or third time. Does
anybody
Hello,
I'm learning to use Spark with MongoDB, but I've encountered a problem that
I think is related to the way I use Spark, because it doesn't make any
sense to me.
My concept test is that I want to filter a collection containing about 800K
documents by a certain field.
My code is very
Can we run Spark Streaming reduceByWindow with the same window length and
slide interval?
I am using Spark 1.1.
On Wed, Jan 14, 2015 at 2:24 PM, Aaron Davidson ilike...@gmail.com wrote:
What version are you running? I think spark.shuffle.use.netty was a
valid option only in Spark 1.1, where the Netty stuff was strictly
experimental. Spark 1.2 contains an officially supported and
My hunch is that it is because the URI of a resource in a JAR file will
necessarily be specific to where the JAR is on the local filesystem and
that is not portable or the right way to read a resource. But you didn't
specify the problem here.
On Jan 14, 2015 5:15 AM, Arun Lists
Why would you want to? That would be equivalent to setting your batch interval
to the same duration you’re suggesting to specify for both window length and
slide interval, IIUC.
Here’s a nice explanation of windowing by TD :
https://groups.google.com/forum/#!topic/spark-users/GQoxJHAAtX4
Hi Akhil,
Config:
scala-spark-jar-location=""
spark-master-location=spark://172.22.193.138:7007
Error log msg:
14:46:52.443 [sparkDriver-akka.actor.default-dispatcher-2] DEBUG
o.a.s.s.c.SparkDeploySchedulerBackend - [actor] received message ReviveOffers
I would like to count values in non-overlapping windows, so I think I can do
it with the same value for window length and slide interval (note that this
value is a multiple of the batch interval). However, nothing is calculated.
Thanks!
On Wed Jan 14 2015 at 4:16:27 PM francois.garil...@typesafe.com
Yeah it is actually serializing elements after chunking them into arrays of
10 elements each. It's not actually a key-value pair in the SequenceFile
for each element. That is how objectFile() reads it and flatMaps it, and
the docs say that the intent is that this is an opaque, not-guaranteed
Hi,
I have a program that loads a single avro file using spark SQL, queries it,
transforms it and then outputs the data. The file is loaded with:
val records = sqlContext.avroFile(filePath)
records.registerTempTable("data")
...
Now I want to run it over tens of thousands of Avro files
You're just describing the normal operation of Spark Streaming, then.
Windowing is for when you want overlapping intervals. Here you simply
do not window, and ask for intervals of whatever time you want. You
get non-overlapping RDDs.
On Wed, Jan 14, 2015 at 9:26 AM, Hoai-Thu Vuong
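A minimal sketch: with a 60-second batch interval, plain operations already produce one RDD per non-overlapping 60-second interval, with no window needed; the source here is illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("intervals"), Seconds(60))
val lines = ssc.socketTextStream("localhost", 9999) // illustrative source
lines.count().print() // one count per non-overlapping 60-second interval
ssc.start()
ssc.awaitTermination()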
Hi Akhil
Thanks for the response
Our use case is object detection in multiple videos. It's kind of like
searching for an image in a video by matching the image against all the frames
of the video. I am able to do it in normal Java code using the OpenCV lib now,
but I don't think it is scalable to
Hi all, I sadly found that in YARN mode I cannot view executor logs on the
YARN web UI nor on the Spark history web UI. On the YARN web UI I can only
view AppMaster logs, and on the Spark history web UI I can only view
application metric information. If I want to see whether an executor is doing
full GCs, I can only use
Hi all,
I'm able to submit Spark jobs through spark-jobserver. But this allows using
Spark only in yarn-client mode. I want to use Spark also in yarn-cluster
mode, but jobserver does not allow it, as stated in the README
file https://github.com/spark-jobserver/spark-jobserver.
Could you tell
Here's an example of detecting faces
https://github.com/openreserach/bin2seq/blob/master/src/test/java/com/openresearchinc/hadoop/test/SparkTest.java#L190
you might find it useful.
Thanks
Best Regards
On Wed, Jan 14, 2015 at 3:06 PM, jishnu.prat...@wipro.com wrote:
Hi Akhil
Thanks for
Thanks Jorn, and sorry for the special character your name needs; I don't
know how to type it. I was thinking the same. Do you know of somebody that
has tried this approach?
Alonso Isidoro Roman.
My favorite quotes (of today):
If debugging is the process of removing software bugs, then
I want to know how GraphX traverses a graph internally. Is it a
vertex-and-edge-based traversal or a sequential traversal of RDDs? For
example, given a vertex of the graph, I want to fetch only its neighbors, not
the neighbors of all the vertices. How will GraphX traverse the graph in this
case?
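GraphX stores the graph as vertex and edge RDDs, so even "fetching one vertex's neighbors" runs as an RDD operation over the edges; a minimal sketch using GraphOps.collectNeighborIds, with a hypothetical vertex ID:

import org.apache.spark.graphx.{EdgeDirection, Graph, VertexId}

def neighborsOf(graph: Graph[Int, Int], v: VertexId): Seq[Array[VertexId]] = {
  // computed over the whole edge RDD, then narrowed to the one vertex
  graph.collectNeighborIds(EdgeDirection.Either).lookup(v)
}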
Oh I see, thank you all guys.
On Wed Jan 14 2015 at 4:35:16 PM Sean Owen so...@cloudera.com wrote:
You're just describing the normal operation of Spark Streaming, then.
Windowing is for when you want overlapping intervals. Here you simply
do not window, and ask for intervals of whatever time
Hi,
running Spark 1.1.0 in yarn-client mode (CDH 5.2.1) on a XEN-based cloud and
randomly getting my executors failing on errors like below. I suspect it is
some cloud networking issue (a XEN driver bug?) but am wondering if there is
any Spark/YARN workaround that I could use to mitigate it.
Basically, you have to think about how to split the data (for pictures this
can be for instance 8x8 matrices) and use spark to distribute it to
different workers which themselves call opencv with the data. Afterwards
you need to combine all results again. It really depends on your image /
video
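A minimal sketch of that split/distribute/combine shape; matchesTemplate is a hypothetical stand-in for the local OpenCV call, not a real OpenCV API:

// hypothetical stand-in for the per-tile OpenCV matching call
def matchesTemplate(tile: Array[Byte]): Boolean = tile.nonEmpty // placeholder logic

val tiles: Seq[Array[Byte]] = Seq(Array[Byte](1, 2), Array[Byte](3, 4)) // from the splitting step
val hits = sc.parallelize(tiles)
  .filter(matchesTemplate) // each worker runs the (real) OpenCV call on its tiles
  .count()                 // combine step: here, just counting matching tiles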
On Wed, Jan 14, 2015 at 4:53 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Now I don't know (yet) if all of the functions I want to compute can be
expressed in this way and I was wondering about *how much* more expensive we
are talking about.
OK, it seems like even on a local machine (with no
I was thinking about whether it would be possible to use Apache Spark and
OpenCV in order to recognize stellar objects from the data sets provided by
NASA, ESA, and the other space agencies. I am asking for advice or
feelings about it, not example code or anything else, but if there were
any,
Interesting, sounds plausible. Another way to avoid the problem has been to
cache intermediate output for large jobs (i.e., split large jobs into
smaller ones and then union them together). It's unfortunate that this type
of tweaking should be necessary, though; hopefully it will be better in 1.2.1.
On Tue, Jan 13, 2015 at