Yes, it is possible to have both Spark and HDFS running on the same
cluster. We have a lot of clusters running without any issues. And yes, it
is possible to hook Spark up with a remote HDFS. You might see a bit of lag if
they are on different networks.
Thanks
Best Regards
On Fri, Nov 21, 2014 at
If it is not picking it up from command line, you could try adding that
entry inside conf/spark-defaults.conf file and then try submitting the job.
(Of course, you might want to restart the cluster)
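As a sketch of the format (the properties shown are only illustrative; use whichever entry was being ignored on the command line):

```
# conf/spark-defaults.conf -- whitespace-separated key/value pairs
spark.executor.memory   2g
spark.eventLog.enabled  true
```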
Isn't auto.offset.reset=largest a Kafka conf? You might want to set it
inside the Kafka configuration.
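As a sketch, it would go alongside the other consumer properties handed to Kafka (the group and ZooKeeper values below are placeholders):

```
# Kafka 0.8 high-level consumer properties
group.id=my-consumer-group
zookeeper.connect=localhost:2181
auto.offset.reset=largest
```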
What is your cluster setup? Are you running a worker on the master node
as well?
1. Spark usually assigns the task to the worker that has the data locally
available. If one worker has enough tasks, then I believe it will start
assigning to others as well. You could control it with the level of
What about
sqlContext.sql("SELECT * FROM Logs as l where l.timestamp = '2012-10-08
16:10:36.0'").collect
You might need to quote the timestamp it looks like.
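If quoting alone doesn't change anything, an explicit cast might be worth a try (a sketch, assuming the timestamp column was registered with a timestamp type rather than a string):

```sql
SELECT * FROM Logs l
WHERE l.timestamp = CAST('2012-10-08 16:10:36.0' AS TIMESTAMP)
```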
Thanks
Best Regards
On Sat, Nov 22, 2014 at 12:09 AM, whitebread ale.panebia...@me.com wrote:
Hi all,
I put some log files into sql
For Spark streaming, you must always set *--executor-cores* to a value
which is >= 2. Or else it will not do any processing.
Thanks
Best Regards
On Sat, Nov 22, 2014 at 8:39 AM, pankaj channe pankajc...@gmail.com wrote:
I have seen similar posts on this issue but could not find solution.
You're probably hitting this issue
https://issues.apache.org/jira/browse/SPARK-4532
Patrick made a fix for this https://github.com/apache/spark/pull/3398
On 11/22/14 10:39 AM, tridib wrote:
After taking today's build from the master branch, I started getting this error
when running spark-sql:
Class
You may try |EXPLAIN EXTENDED sql| to see the logical plan, analyzed
logical plan, optimized logical plan, and physical plan. Also
|SchemaRDD.toDebugString| shows storage related debugging information.
On 11/21/14 4:11 AM, Gordon Benjamin wrote:
hey,
Can anyone tell me how to debug a sql
This thread might be helpful
http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html
On 11/20/14 4:11 AM, Mohammed Guller wrote:
Hi – I was curious if anyone is using the Spark SQL Thrift JDBC server
with Cassandra. It would be great if you could share
Anyone have any thoughts on this? Trying to understand especially #2 if
it's a legit bug or something I'm doing wrong.
- Nitay
Founder CTO
On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote:
I have a simple S3 job to read a text file and do a line count.
Specifically I'm
Err I meant #1 :)
- Nitay
Founder CTO
On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote:
Anyone have any thoughts on this? Trying to understand especially #2 if
it's a legit bug or something I'm doing wrong.
- Nitay
Founder CTO
On Thu, Nov 20, 2014 at 11:54 AM,
Hello,
I'm a newbie with Spark but I've been working with Hadoop for a while.
I have two questions.
Is there any case where MR is better than Spark? I don't know in which
cases I should use Spark rather than MR. When is MR faster than Spark?
The other question is, I know Java, is it worth it to learn
Spark can do Map Reduce and more, and faster.
One area where using MR would make sense is if you're using something (maybe
like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible
now...just pulled that name out of thin air!).
You *can* use Spark from Java, but you'd have a
Thanks for your answer Akhil,
I have already tried that, and the query doesn't fail, but it doesn't
return anything either, even though it should.
Using single quotes I think it reads it as a string and not as a timestamp.
I don't know how to solve this. Any other hint by any chance?
Thanks,
Just to add some more stuff - there are various scenarios where traditional
Hadoop makes more sense than Spark. For example, if you have a long-running
processing job in which you do not want to utilize too many resources of
the cluster. Another example could be that you want to run a distributed
Thanks Alex! I'm actually working with views from HBase because I will
never edit the HBase table from Phoenix and I'd hate to accidentally drop
it. I'll have to work out how to create the view with the additional ID
column.
Regards,
Alaa Ali
On Fri, Nov 21, 2014 at 5:26 PM, Alex Kamil
MapReduce is simpler and narrower, which also means it is generally lighter
weight, with less to know and configure, and runs more predictably. If you
have a job that is truly just a few maps, with maybe one reduce, MR will
likely be more efficient. Until recently its shuffle has been more
Hi
I'm trying to load a parquet file for querying purposes from my web
application. I was able to load it as a JavaSchemaRDD, but when using a map
function on the JavaSchemaRDD, I'm getting the following
exception.
The class in which I'm using this code implements the Serializable interface.
Hi all,
I am trying to persist a spark RDD in which the elements of each partition
all share access to a single, large object. However, this object seems to get
stored in memory several times. Reducing my problem down to the toy case of
just a single partition with only 200 elements:
*val*
That doesn't seem to be the problem though. It processes but then stops.
Presumably there are many executors.
On Nov 22, 2014 9:40 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
For Spark streaming, you must always set *--executor-cores* to a value
which is >= 2. Or else it will not do any
This message appears in normal operation. I do not think it refers to
anything in your code.
On Nov 21, 2014 11:57 PM, YaoPau jonrgr...@gmail.com wrote:
When I submit a Spark Streaming job, I see these INFO logs printing
frequently:
14/11/21 18:53:17 INFO DAGScheduler: waiting: Set(Stage 216)
Thanks Akhil for your input.
I have already tried with 3 executors and it still results into the same
problem. So as Sean mentioned, the problem does not seem to be related to
that.
On Sat, Nov 22, 2014 at 11:00 AM, Sean Owen so...@cloudera.com wrote:
That doesn't seem to be the problem
You could be referencing/sending the StandardSessionFacade inside your map
function. You could bring the class StandardSessionFacade locally and make it
Serializable to get it fixed quickly.
Thanks
Best Regards
On Sat, Nov 22, 2014 at 10:02 PM, vdiwakar.malladi
vdiwakar.mall...@gmail.com wrote:
Hi
Thanks for your prompt response.
I'm not using anything in my map function; please see the code below. For
sample purposes, I'm just using 'select * from
'.
This code worked for me in standalone mode. But when I integrated with my
web application, it is throwing the specified exception.
You are declaring an anonymous inner class here. It has a reference to the
containing class even if you don't use it. If the closure cleaner can't
determine it isn't used, this reference will cause everything in the outer
class to serialize. Try rewriting this as a named static inner class.
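The effect is easy to reproduce in plain Java, without Spark. The class names below are hypothetical stand-ins: the anonymous inner class declared in an instance method captures a hidden reference to its enclosing instance, so serializing it drags the non-serializable outer object along; the named static nested class has no such reference.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch: why an anonymous inner class fails to serialize while a named
// static nested class succeeds.
public class ClosureDemo {

    interface SerializableFunction extends Serializable {
        Integer apply(Integer x);
    }

    // Stand-in for a non-serializable driver/outer class.
    static class Outer {
        SerializableFunction anonymous() {
            // Captures Outer.this implicitly, even though it never uses it.
            return new SerializableFunction() {
                public Integer apply(Integer x) { return x + 1; }
            };
        }
    }

    // Named static nested class: no hidden outer reference.
    static class AddOne implements SerializableFunction {
        public Integer apply(Integer x) { return x + 1; }
    }

    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // the hidden reference to Outer is not serializable
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new Outer().anonymous())); // false
        System.out.println(serializes(new AddOne()));            // true
    }
}
```

The same mechanics apply when Spark serializes a closure for shipping to executors, which is why the named static inner class fixes the exception.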
On
Not that I'm a professional user of Amazon services, but I have a guess about
your performance issues. From [1], there are two different filesystems over
S3:
- native that behaves just like regular files (schema: s3n)
- block-based that looks more like HDFS (schema: s3)
Since you use s3n in your
One month later, the same problem. I think that someone (e.g. inventors of
Spark) should show us a big example of how to use accumulators. I can start
telling that we need to see an example of the following form:
val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).map(x =>
Thanks.
After writing it as a static inner class, that exception is no longer thrown.
But now I'm getting a snappy-related exception. I can see the corresponding
dependency in the Spark assembly jar, yet I'm still getting the exception. Any
quick suggestions on this?
Here is the stack trace.
Hi all,
I am using Spark to consume from Kafka. However, after the job has run for
several hours, I saw the following failure of an executor:
kafka.common.ConsumerRebalanceFailedException:
group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31
can't rebalance after 4 retries
Hello all
I'm very pleased to announce the launch of http://www.SparkBigData.com: The
Apache Spark Knowledge Base.
As your one-stop information resource dedicated to Apache Spark,
SparkBigData.com provides free, easy, and fast access to hundreds of Apache
Spark resources organized in several
Thanks Jey
regards
sanjay
From: Jey Kottalam j...@cs.berkeley.edu
To: Sanjay Subramanian sanjaysubraman...@yahoo.com
Cc: Arun Ahuja aahuj...@gmail.com; Andrew Ash and...@andrewash.com; user
user@spark.apache.org
Sent: Friday, November 21, 2014 10:07 PM
Subject: Extracting values from a
Hi,
I'm ingesting a lot of small JSON files and converting them into unified
parquet files, but even the unified files are fairly small (~10MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000
That seems to work fine. Add to your example
def foo(i: Int, a: Accumulator[Int]) = a += i
and add an action at the end to get the expression to evaluate:
sc.parallelize(Array(1, 2, 3, 4)).map(x => foo(x, accum)).foreach(println)
and it works, and you have accum with value 10 at the end.
The
First, the --conf error: What version of Spark? I don't think some of
these existed before 1.1 so that may be the issue. This is all on one
line I assume. Quoting is not an issue here.
The real issue is that auto.offset.reset is indeed a Kafka option.
It's not a system property; if it were, you
Hi Sowen,
You're right, that example works, but look what example does not work for
me:
object Main {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName(name)
val sc = new SparkContext(conf)
val accum = sc.accumulator(0)
for (i <- 1 to 10) {
val
I could not iterate through the set, but I changed the code to get what I was
looking for (not elegant, but it gets me going).
package org.medicalsidefx.common.utils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import scala.collection.mutable.ArrayBuffer
/**
*
Some more details: Adding a println to the function reveals that it is indeed
called only once. Furthermore, running:
rdd.map(_.s.hashCode).min == rdd.map(_.s.hashCode).max // returns true
...reveals that all 1000 elements do indeed point to the same object,
and so the data structure
I posted several examples in java at http://lordjoesoftware.blogspot.com/
Generally code like this works and I show how to accumulate more complex
values.
// Make two accumulators using Statistics
final Accumulator<Integer> totalLetters = ctx.accumulator(0L,
"ttl");
Adding to already interesting answers:
- Is there any case where MR is better than Spark? I don't know in which
cases I should use Spark rather than MR. When is MR faster than Spark?
- Many. MR would be better (am not saying faster ;o)) for
- Very large dataset,
- Multistage
perhaps the closure ends up including the main object, which is not
defined as serializable... try making it a case object, or declare object Main
extends Serializable.
On Sat, Nov 22, 2014 at 4:16 PM, lordjoe lordjoe2...@gmail.com wrote:
I posted several examples in java at
Concerning your second question, I believe you tried to set the number of
partitions with something like this:
rdd = sc.textFile(..., 8)
but things like `textFile()` don't actually take a fixed number of
partitions. Instead, they expect a *minimal* number of partitions. Since in
your file you have 21
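A rough sketch of why, simplified from the split arithmetic in Hadoop's old-API FileInputFormat (the block size and file size below are hypothetical): the split size is capped at the block size, so a file spanning 21 blocks yields 21 partitions even when you ask for 8.

```java
// Sketch of org.apache.hadoop.mapred.FileInputFormat's computeSplitSize,
// simplified: splitSize = max(minSize, min(goalSize, blockSize)).
public class SplitMath {

    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L << 20;            // 64 MB HDFS block (hypothetical)
        long totalSize = 21 * blockSize;       // a file spanning 21 blocks
        int requested = 8;                     // as in sc.textFile(path, 8)

        long goalSize = totalSize / requested; // 168 MB per split, ideally
        long split = splitSize(goalSize, 1L, blockSize); // capped at 64 MB
        long numSplits = (totalSize + split - 1) / split;

        System.out.println(numSplits);         // 21 -- more than the 8 requested
    }
}
```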
Thanks Sean.
adding user@spark.apache.org again.
On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen so...@cloudera.com wrote:
On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta
soumya.sima...@gmail.com wrote:
Is the MapReduce API simpler or the implementation? Almost every Spark
presentation has a
I believe this has something to do with how the Kafka high-level API manages
consumers within a consumer group and how it re-balances during failures. You
can find some mention of it in this Kafka wiki.
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design
Due to various issues in Kafka
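If it is the high-level consumer's rebalance that gives up (the "4 retries" in the error matches the default), the usual knobs are the retry count and backoff in the Kafka consumer properties (values below are illustrative, not recommendations):

```
# Kafka 0.8 consumer properties
rebalance.max.retries=10
rebalance.backoff.ms=5000
```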
StorageLevel.OFF_HEAP requires running Tachyon:
http://spark.apache.org/docs/latest/programming-guide.html
If you don't know whether you have Tachyon or not, you probably don't :)
http://tachyon-project.org/
For local testing, you can use other persist() solutions without running
Tachyon.
Best,