StorageLevel.OFF_HEAP requires running Tachyon:
http://spark.apache.org/docs/latest/programming-guide.html
If you don't know whether you have Tachyon or not, you probably don't :)
http://tachyon-project.org/
For local testing, you can use other persist() solutions without running
Tachyon.
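For example, a minimal sketch of an alternative (assuming an existing RDD
named rdd):
import org.apache.spark.storage.StorageLevel
// serialize to save memory, spill to disk if the data doesn't fit
val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
cached.count() // forces materialization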
Best,
Haoyu
I believe this has something to do with how the Kafka High Level API manages
consumers within a consumer group and how it re-balances during failure. You
can find some mention of it in this Kafka wiki:
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
Due to various issues in Kafka
Thanks Sean.
Adding user@spark.apache.org again.
On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen wrote:
> On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta
> wrote:
> > Is the MapReduce API "simpler" or the implementation? Almost every Spark
> > presentation has a slide that shows 100+ lines of Hado
Concerning your second question, I believe you are trying to set the number
of partitions with something like this:
rdd = sc.textFile(..., 8)
but things like `textFile()` don't actually take a fixed number of
partitions. Instead, they expect a *minimum* number of partitions. Since in
your file you have 21 b
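To see the distinction, a minimal sketch (assuming a SparkContext named sc
and a hypothetical file data.txt):
// the second argument is a *minimum*, not an exact partition count
val rdd = sc.textFile("data.txt", 8)
println(rdd.partitions.length) // may print a number larger than 8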
Perhaps the closure ends up including the "main" object, which is not
defined as serializable... try making it a "case object" or "object Main
extends Serializable".
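For instance, a minimal sketch of the second suggestion (the app name and
accumulator logic are illustrative):
import org.apache.spark.{SparkConf, SparkContext}
object Main extends Serializable {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("accum-example")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0)
    // the closure below no longer drags in a non-serializable outer reference
    sc.parallelize(1 to 10).foreach(i => accum += i)
    println(accum.value) // 55
    sc.stop()
  }
}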
On Sat, Nov 22, 2014 at 4:16 PM, lordjoe wrote:
> I posted several examples in java at http://lordjoesoftware.blogspot.com/
>
> Gene
Adding to already interesting answers:
- "Is there any case where MR is better than Spark? I don't know what cases
I should be used Spark by MR. When is MR faster than Spark?"
- Many. MR would be better (I am not saying faster ;o)) for:
- Very large datasets,
- Multistage ma
I posted several examples in java at http://lordjoesoftware.blogspot.com/
Generally code like this works and I show how to accumulate more complex
values.
// Make two accumulators using Statistics
final Accumulator<Long> totalLetters = ctx.accumulator(0L, "ttl");
JavaRDD<String> lines = ..
Some more details: Adding a println to the function reveals that it is indeed
called only once. Furthermore, running:
rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max // returns true
...reveals that all 1000 elements do indeed point to the same object,
and so the data structure ess
I could not iterate through the set, but I changed the code to get what I
was looking for (not elegant, but it gets me going):
package org.medicalsidefx.common.utils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import scala.collection.mutable.ArrayBuffer
/**
* C
Hi Sowen,
You're right, that example works, but look at an example that does not work
for me:
object Main {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("name")
    val sc = new SparkContext(conf)
    val accum = sc.accumulator(0)
    for (i <- 1 to 10) {
      va
First, the --conf error: what version of Spark? I don't think some of
these existed before 1.1, so that may be the issue. This is all on one
line, I assume. Quoting is not an issue here.
The real issue is that auto.offset.reset is indeed a Kafka option.
It's not a system property; if it were, you co
That seems to work fine. Add to your example
def foo(i: Int, a: Accumulator[Int]) = a += i
and add an action at the end to get the expression to evaluate:
sc.parallelize(Array(1, 2, 3, 4)).map(x => foo(x,accum)).foreach(println)
and it works, and you have accum with value 10 at the end.
The si
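Assembling the pieces into one runnable snippet (assuming an existing
SparkContext named sc; foo is the helper defined above):
import org.apache.spark.Accumulator
val accum = sc.accumulator(0)
def foo(i: Int, a: Accumulator[Int]) = a += i
// the foreach is the action that forces the map (and the accumulation) to run
sc.parallelize(Array(1, 2, 3, 4)).map(x => foo(x, accum)).foreach(println)
println(accum.value) // 10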
Hi,
I'm ingesting a lot of small JSON files and converting them into unified
Parquet files, but even the unified files are fairly small (~10MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000 parquet
Thanks Jey
regards
sanjay
From: Jey Kottalam
To: Sanjay Subramanian
Cc: Arun Ahuja; Andrew Ash; user
Sent: Friday, November 21, 2014 10:07 PM
Subject: Extracting values from a Collecion
Hi Sanjay,
These are instances of the standard Scala collection type "Set", and its
document
Hello all
I'm very pleased to announce the launch of http://www.SparkBigData.com: The
Apache Spark Knowledge Base.
It is your one-stop information resource dedicated to Apache Spark.
SparkBigData.com provides free, easy and fast access to hundreds of Apache
Spark resources organized in several categ
Hi all,
I am using Spark to consume from Kafka. However, after the job has run for
several hours, I saw the following failure of an executor:
kafka.common.ConsumerRebalanceFailedException:
group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31
can't rebalance after 4 retries
Thanks.
After writing it as a static inner class, that exception no longer occurs. But
now I'm getting a Snappy-related exception. I can see the corresponding
dependency in the Spark assembly jar, yet I'm still getting the exception. Any
quick suggestions on this?
Here is the stack trace.
java.lang.UnsatisfiedLinkErr
What makes you think that each executor is reading the whole file? If that
were the case, then the count value returned to the driver would be the
actual count multiplied by the number of executors. Is that the case when
compared with the actual lines in the input file? If the count returned is
the same as the actual count then you probably don't
hav
One month later, the same problem. I think that someone (e.g. the inventors
of Spark) should show us a big example of how to use accumulators. I can
start by saying that we need to see an example of the following form:
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).map(x => foo(x,accum)
Not that I'm a professional user of Amazon services, but I have a guess about
your performance issues. From [1], there are two different filesystems over
S3:
- native, which behaves just like regular files (scheme: s3n)
- block-based, which looks more like HDFS (scheme: s3)
Since you use "s3n" in you
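A minimal sketch of the native variant (the bucket and path are
illustrative; AWS credentials are assumed to be configured):
// s3n:// uses the native S3 filesystem; each S3 object appears as a file
val logs = sc.textFile("s3n://my-bucket/logs/part-*")
println(logs.count())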
You are declaring an anonymous inner class here. It has a reference to the
containing class even if you don't use it. If the closure cleaner can't
determine it isn't used, this reference will cause everything in the outer
class to serialize. Try rewriting this as a named static inner class.
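In Scala, the analogous fix is to move the function into a standalone
top-level object, so the closure has no outer instance to capture; a hedged
sketch with illustrative names (assuming an existing RDD[String] named rdd):
object LineFunctions {
  // no enclosing instance, so nothing extra gets pulled into the closure
  def lineLength(line: String): Int = line.length
}
val lengths = rdd.map(LineFunctions.lineLength)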
On Nov
Thanks for your prompt response.
I'm not using anything in my map function. Please see the code below. For
sample purposes, I would like to use 'select * from
'.
This code worked for me in standalone mode. But when I integrated with my
web application, it is throwing the specified exception.
You could be referencing/sending the StandardSessionFacade inside your map
function. You could bring the class StandardSessionFacade in locally and
serialize it to get it fixed quickly.
Thanks
Best Regards
On Sat, Nov 22, 2014 at 10:02 PM, vdiwakar.malladi <
vdiwakar.mall...@gmail.com> wrote:
> Hi
>
Thanks Akhil for your input.
I have already tried with 3 executors and it still results in the same
problem. So, as Sean mentioned, the problem does not seem to be related to
that.
On Sat, Nov 22, 2014 at 11:00 AM, Sean Owen wrote:
> That doesn't seem to be the problem though. It processes bu
This message appears in normal operation. I do not think it refers to
anything in your code.
On Nov 21, 2014 11:57 PM, "YaoPau" wrote:
> When I submit a Spark Streaming job, I see these INFO logs printing
> frequently:
>
> 14/11/21 18:53:17 INFO DAGScheduler: waiting: Set(Stage 216)
> 14/11/21 18
That doesn't seem to be the problem though. It processes but then stops.
Presumably there are many executors.
On Nov 22, 2014 9:40 AM, "Akhil Das" wrote:
> For Spark streaming, you must always set *--executor-cores* to a value
> which is >= 2. Or else it will not do any processing.
>
> Thanks
> B
Hi all,
I am trying to persist a Spark RDD in which the elements of each partition
all share access to a single, large object. However, this object seems to
get stored in memory several times. Reducing my problem down to the toy case
of just a single partition with only 200 elements:
val nElements
Hi
I'm trying to load a parquet file for querying purposes from my web
application. I am able to load it as a JavaSchemaRDD. But when using the
map function on the JavaSchemaRDD, I'm getting the following
exception.
The class in which I'm using this code implements Serializable. Co
MapReduce is simpler and narrower, which also means it is generally lighter
weight, with less to know and configure, and runs more predictably. If you
have a job that is truly just a few maps, with maybe one reduce, MR will
likely be more efficient. Until recently its shuffle has been more
develope
Thanks Alex! I'm actually working with views from HBase because I will
never edit the HBase table from Phoenix and I'd hate to accidentally drop
it. I'll have to work out how to create the view with the additional ID
column.
Regards,
Alaa Ali
On Fri, Nov 21, 2014 at 5:26 PM, Alex Kamil wrote:
>
Just to add some more stuff - there are various scenarios where traditional
Hadoop makes more sense than Spark. For example, if you have a long running
processing job in which you do not want to utilize too many resources of
the cluster. Another example could be that you want to run a distributed
e
Thanks for your answer Akhil,
I have already tried that, and the query doesn't fail, but it doesn't
return anything either, even though it should.
With single quotes I think it reads the value as a string, not as a timestamp.
I don't know how to solve this. Any other hint, by any chance?
Thanks,
Ale
Spark can do MapReduce and more, and do it faster.
One area where using MR would make sense is if you're using something (maybe
like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible
now...just pulled that name out of thin air!).
You *can* use Spark from Java, but you'd have a
Hello,
I'm a newbie with Spark but I've been working with Hadoop for a while.
I have two questions.
Is there any case where MR is better than Spark? I don't know in which
cases I should use Spark instead of MR. When is MR faster than Spark?
The other question is: I know Java; is it worth it to learn Sca
Err I meant #1 :)
- Nitay
Founder & CTO
On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe wrote:
> Anyone have any thoughts on this? Trying to understand, especially for #2,
> whether it's a legit bug or something I'm doing wrong.
>
> - Nitay
> Founder & CTO
>
>
> On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joff
Anyone have any thoughts on this? Trying to understand, especially for #2,
whether it's a legit bug or something I'm doing wrong.
- Nitay
Founder & CTO
On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe wrote:
> I have a simple S3 job to read a text file and do a line count.
> Specifically I'm doing *sc.textF
This thread might be helpful:
http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html
On 11/20/14 4:11 AM, Mohammed Guller wrote:
Hi – I was curious if anyone is using the Spark SQL Thrift JDBC server
with Cassandra. It would be great if you could share
You may try `EXPLAIN EXTENDED` to see the logical plan, analyzed
logical plan, optimized logical plan and physical plan. Also,
`SchemaRDD.toDebugString` shows storage-related debugging information.
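For example (assuming a SQLContext named sqlContext and a registered table
named logs; both names are illustrative):
// prints the parsed, analyzed, optimized, and physical plans
sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM logs").collect().foreach(println)
// storage-related debugging information for the underlying RDD
println(sqlContext.sql("SELECT * FROM logs").toDebugString)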
On 11/21/14 4:11 AM, Gordon Benjamin wrote:
hey,
Can anyone tell me how to debug a sql executi
You're probably hitting this issue
https://issues.apache.org/jira/browse/SPARK-4532
Patrick made a fix for this https://github.com/apache/spark/pull/3398
On 11/22/14 10:39 AM, tridib wrote:
After taking today's build from the master branch I started getting this
error when running spark-sql:
Class or
For Spark streaming, you must always set *--executor-cores* to a value
which is >= 2. Or else it will not do any processing.
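If you do want to set it, a sketch of the submit command (class, master and
jar names are illustrative):
spark-submit --class com.example.StreamingApp \
  --master spark://master:7077 \
  --executor-cores 2 \
  streaming-app.jar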
Thanks
Best Regards
On Sat, Nov 22, 2014 at 8:39 AM, pankaj channe wrote:
> I have seen similar posts on this issue but could not find solution.
> Apologies if this has b
What about
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp=*'2012-10-08
16:10:36.0'*").collect
You might need to quote the timestamp it looks like.
Thanks
Best Regards
On Sat, Nov 22, 2014 at 12:09 AM, whitebread wrote:
> Hi all,
>
> I put some log files into sql tables through Spar
What is your cluster setup? Are you running a worker on the master node
also?
1. Spark usually assigns the task to the worker that has the data locally
available. If one worker has enough tasks, then I believe it will start
assigning to others as well. You could control it with the level of
paralle
If it is not picking it up from the command line, you could try adding that
entry inside the conf/spark-defaults.conf file and then try submitting the
job. (Of course, you might want to restart the cluster.)
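A sketch of the file format (the key and value shown are illustrative):
# conf/spark-defaults.conf — one key and value per line, whitespace-separated
spark.executor.memory   2g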
Isn't auto.offset.reset="largest" a Kafka conf? You might want to set it
inside the Kafka conf. (Ka
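A sketch of passing it through the Kafka consumer properties (assuming an
existing StreamingContext named ssc; host names, group id and topic are
illustrative):
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils
val kafkaParams = Map(
  "zookeeper.connect" -> "zk-host:2181",
  "group.id" -> "my-consumer-group",
  "auto.offset.reset" -> "largest") // start from the newest offset
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("my-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)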
Yes, it is possible to have both Spark and HDFS running on the same
cluster. We have a lot of clusters running without any issues. And yes, it
is possible to hook Spark up with a remote HDFS. You might notice a bit of
lag if they are on different networks.
Thanks
Best Regards
On Fri, Nov 21, 2014 at 8:4