You have an issue with your cluster setup. Can you paste your
conf/spark-env.sh and the conf/slaves files here?
Your job runs fine because you set the master inside the job to local[*],
which runs it in local mode (not in standalone cluster mode).
Thanks
Best Regards
On
What is called a Bolt in Storm is essentially a combination of
[Transformation/Action and DStream RDD] in Spark – so to achieve higher
parallelism for a specific Transformation/Action on a specific DStream RDD, simply
repartition it to the required number of partitions, which directly relates to
the
Just wanted to check if somebody has seen similar behaviour or knows what
we might be doing wrong. We have a relatively complex spark application
which processes half a terabyte of data at various stages. We have profiled
it in several ways and everything seems to point to one place where 90% of
Hi,
I agree with Evo, Spark works at a different abstraction level than Storm,
and there is not a direct translation from Storm topologies to Spark
Streaming jobs. I think something remotely close is the notion of lineage
of DStreams or RDDs, which is similar to a logical plan of an engine like
Hi, I just saw this question. I posted my solution to this stack overflow
question.
https://stackoverflow.com/questions/29796928/whats-the-most-efficient-way-to-filter-a-dataframe
Scala reflection can take a classloader when creating a mirror (
universe.runtimeMirror(loader)). I can have a look,
Hi,
You can use the method repartition from DStream (for the Scala API) or
JavaDStream (for the Java API)
def repartition(numPartitions: Int): DStream[T]
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/dstream/DStream.html
Return a new DStream with an increased or
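For example, a rough, untested sketch (the socket source, batch interval and
partition count are all illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RepartitionExample")
val ssc = new StreamingContext(conf, Seconds(10))

// hypothetical source: lines of text arriving on a socket
val lines = ssc.socketTextStream("localhost", 9999)

// spread each batch over 32 partitions so the downstream filter/map
// run with 32 parallel tasks instead of the source's default
val repartitioned = lines.repartition(32)
repartitioned.filter(_.nonEmpty).map(_.toUpperCase).print()

ssc.start()
ssc.awaitTermination()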
Hi all,
After doing some tests, I finally solved it. I'm writing here for other people
who meet this question. Here's an example of the data format error I faced:
0 0:0 1:0 2:1
1 1:1 3:2
The data entries 0:0 and 1:0/1:1 are the reason for the
ArrayIndexOutOfBoundsException. If someone faces the same question, just
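For reference, a hedged sketch of how LIBSVM-style data like the above is
usually loaded; the key point is that the format expects one-based, ascending
feature indices, so an index of 0 (as in the 0:0 above) can trigger exactly
this kind of exception. The file path here is just a placeholder:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("LibSVMCheck"))
// LIBSVM format: <label> <index1>:<value1> <index2>:<value2> ...
// feature indices must be one-based and in ascending order
val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
println(data.first())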
Has this issue re-appeared?
I posted this on SO before I knew about this list...
http://stackoverflow.com/questions/30057702/sparkr-filterrdd-and-flatmap-not-working
Also I don't have access to the issues on
Hi all,
After reading the example code using LDA in Spark, I found that the input text
in the code is a matrix. The format of the text is as follows:
1 2 6 0 2 3 1 1 0 0 3
1 3 0 1 3 0 0 2 0 0 1
1 4 1 0 0 4 9 0 1 2 0
2 1 0 3 0 0 5 0 2 3 9
3 1 1 9 3 0 2 0 0 1 3
4 2 0 3 4 5 1 1 1 4 0
2 1 0 3 0 0 5 0 2 2
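For context, a sketch along the lines of the MLlib LDA example, assuming an
existing SparkContext sc (the path and k are illustrative): each row of such a
matrix is one document's word-count vector, which gets zipped with an index to
form the (documentId, vector) corpus.

import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// each line is one document: a row of word counts over the vocabulary
val data = sc.textFile("data/mllib/sample_lda_data.txt")
val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))

// LDA expects an RDD of (documentId, wordCountVector)
val corpus = parsedData.zipWithIndex.map(_.swap).cache()

val ldaModel = new LDA().setK(3).run(corpus)
println(ldaModel.topicsMatrix)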
Thanks a lot Juan,
That was a great post. One more thing, if you can: is there any demo/blog
explaining how to configure or create a topology of different types? I
mean, how we can decide the pipelining model in Spark as done in Storm, for
RDD1 = RDD.filter()
RDD2 = RDD.filter()
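For what it's worth, that kind of branching is just two references to the same
parent RDD; caching the parent avoids recomputing it for each branch. A minimal
sketch, assuming an existing SparkContext sc and made-up predicates:

val records = sc.textFile("hdfs:///input/events").cache()

// two independent branches off the same parent, analogous to two downstream bolts
val errors  = records.filter(_.contains("ERROR"))
val metrics = records.filter(_.startsWith("metric"))

println(s"errors=${errors.count()}, metrics=${metrics.count()}")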
From: Bill Q [mailto:bill.q@gmail.com]
Sent: Tuesday, May 5, 2015 10:42 PM
To: user@spark.apache.org
Subject: Map one RDD into two RDD
Hi all,
I have a large RDD that I map a function over. Based on the nature of each
record in the input
Hi,
If I have an RDD[MyClass] and I want to partition it by the hash code of
MyClass for performance reasons, is there any way to do this without
converting it into a pair RDD, RDD[(K,V)], and calling partitionBy?
Mapping it to a Tuple2 seems like a waste of space/computation.
It looks like the
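In case it helps, a sketch of the usual workaround, assuming you do go through
a pair RDD; keyBy keeps the object itself as the value, so the extra tuple is
mostly bookkeeping. MyClass, the rdd variable and the partition count are
placeholders:

import org.apache.spark.HashPartitioner

case class MyClass(id: Int, payload: String)   // placeholder for the real class

// rdd is an existing RDD[MyClass]
val partitioned = rdd
  .keyBy(x => x.hashCode)                  // key by the object's hash code
  .partitionBy(new HashPartitioner(64))    // co-locate equal hash codes
  .values                                  // back to RDD[MyClass]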
But the main problem is how to increase the level of parallelism for any
particular bolt's logic.
Suppose I want this type of topology:
https://storm.apache.org/documentation/images/topology.png
How can we manage it?
On Wed, May 6, 2015 at 1:36 PM, ayan guha guha.a...@gmail.com wrote:
Every
What does your MyClass look like? I was experimenting with the Row class in
Python, and apparently partitionBy automatically takes the first column as the key.
However, I am not sure how you can access a part of an object without
deserializing it (either explicitly or Spark doing it for you).
On Wed, May
I have set up an AWS EMR-based cluster, wherein I am able to run my
Spark queries quite OK.
The next part of my work is to run the queries coming in from a web client
and show the results there.
The Spark queries, as far as I know, I can only run from my EMR cluster, and they don't
return instantly with any
My code:
Fun01, Fun02, and Fun03 all have transformations and output operations
(foreachRDD).
1. Fun01 and Fun03 both executed as expected, which proves messages is not
null or empty.
2. On the Spark application UI, I found Fun02's output stage in the Spark stages,
which proves it executed.
3. The first line of
An update on status after I did some tests. I modified some other parameters and found 2
parameters that may be relevant: spark_worker_instance and spark.sql.shuffle.partitions.
Before today I used the default settings of spark_worker_instance and
spark.sql.shuffle.partitions, whose values are 1 and 200. At that time,
I don't see a spark-streaming dependency at com.datastax.spark
http://mvnrepository.com/artifact/com.datastax.spark, but it does have a
kafka-streaming dependency.
Thanks
Best Regards
On Tue, May 5, 2015 at 12:42 AM, Eric Ho eric...@intel.com wrote:
Can I specify this in my build file ?
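For the quoted question: yes, you can add it explicitly. A hedged sketch of
what that might look like in an sbt build (the versions shown are only
illustrative):

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-streaming"           % "1.3.1" % "provided",
  "org.apache.spark"   %% "spark-streaming-kafka"     % "1.3.1",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.0"  // version illustrative
)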
Every transformation on a DStream will create another DStream. You may want
to take a look at foreachRDD. Also, kindly share your code so people can
help better.
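A small sketch of the foreachRDD pattern, assuming an existing DStream[String]
named lines (the sink call is a placeholder):

lines.foreachRDD { rdd =>
  // this closure runs on the driver once per batch; the work on the RDD runs on executors
  rdd.foreachPartition { partition =>
    // placeholder sink: open one connection per partition and write the records
    partition.foreach(record => println(record))
  }
}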
On 6 May 2015 17:54, anshu shukla anshushuk...@gmail.com wrote:
Please help guys, Even After going through all the examples given i
Also Kafka has a Hadoop consumer API for doing such things, please refer to
http://kafka.apache.org/081/documentation.html#kafkahadoopconsumerapi
2015-05-06 12:22 GMT+08:00 MrAsanjar . afsan...@gmail.com:
Why not try https://github.com/linkedin/camus - Camus is a Kafka to HDFS
pipeline.
On
Did you set `--driver-memory` with spark-submit? -Xiangrui
On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni vmuttin...@ebay.com wrote:
Hi, I am training a GMM with 10 Gaussians on a 4 GB dataset (720,000 * 760).
The Spark (1.3.1) job is allocated 120 executors with 6 GB each, and the
driver also
Here's a complete example
https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
Thanks
Best Regards
On Mon, May 4, 2015 at 12:57 PM, Yasemin Kaya godo...@gmail.com wrote:
Hi!
I am new to Spark and I want to begin with a simple wordCount example
in Java. But I want to give
Hi Imran,
I had tried setting a really huge kryo buffer size (GB), but it didn’t make
any difference.
In my data sets, objects are no more than 1KB each, and don’t form a graph,
so I don’t think the buffer size should need to be larger than a few MB,
except perhaps for reasons of efficiency?
Please help, guys. Even after going through all the examples given, I have
not understood how to pass the DStreams from one bolt/logic to another
(without writing them to HDFS etc.), just like the emit function in Storm.
Suppose I have a topology with 3 bolts (say):
*BOLT1 (parse the tweets and emit tweet
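For what it's worth, a rough sketch of how a three-bolt pipeline is usually
expressed in Spark Streaming: there is no emit; each "bolt" is just another
transformation on the previous DStream, and only the final output operation
writes anywhere. The source, parsing and filter logic below are all made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("BoltStylePipeline")
val ssc = new StreamingContext(conf, Seconds(5))
val tweets = ssc.socketTextStream("localhost", 9999)   // stand-in for the tweet source

// "BOLT1": parse each raw tweet into (user, text)
val parsed = tweets.map { line =>
  val fields = line.split("\t", 2)
  (fields(0), if (fields.length > 1) fields(1) else "")
}

// "BOLT2": keep only tweets that mention spark
val sparkTweets = parsed.filter { case (_, text) => text.toLowerCase.contains("spark") }

// "BOLT3": count tweets per user in each batch and print them
sparkTweets.map { case (user, _) => (user, 1) }.reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()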
Because using Spark Streaming looks a lot simpler. What's the
difference between Camus and Kafka Streaming for this case? Why does Camus excel?
Rendy
On Wed, May 6, 2015 at 2:15 PM, Saisai Shao sai.sai.s...@gmail.com wrote:
Also Kafka has a Hadoop consumer API for doing such things, please
The “abstraction level” of Storm, or shall we call it architecture, is
effectively pipelines of nodes/agents – Pipelines are one of the standard
Parallel Programming Patterns which you can use on multicore CPUs as well as
Distributed Systems – the chaps from Storm simply implemented it as a
Hi,
Can you please share your compression etc settings, which you are using.
Thanks,
Twinkle
On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I'm facing this error in Spark 1.3.1
https://issues.apache.org/jira/browse/SPARK-4105
Anyone knows what's the
Many thanks all, your responses have been very helpful. Cheers
On Wed, May 6, 2015 at 2:14 PM, ayan guha guha.a...@gmail.com wrote:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
On Wed, May 6, 2015 at 10:09 PM, James King
Hi,
I am running my GraphX application on Spark, but it failed since there is an
error on one executor node (on which the available HDFS space is small): “no
space left on device”.
I can understand why it happened, because my vertex(-attribute) RDD was
becoming bigger and bigger during
Hi,
We use Spark to build graphs of events after querying Cassandra. We use
mapPartitions for both aggregating events and building two graphs per
partition. Graphs are returned as a Tuple2 as follows:
val nodes = events.mapPartitions(part => {
  var nodeLeft: Node = null
  var
In the O'Reilly book Learning Spark, Chapter 10, section "24/7 Operation",
it talks about 'Receiver Fault Tolerance'.
I'm unsure of what a Receiver is here; from reading, it sounds like when you
submit an application to the cluster in cluster mode, i.e. *--deploy-mode
cluster*, the driver program will
Incorrect. The receiver runs in an executor just like any other task. In
cluster mode, the driver runs in a worker; however, it launches
executors in OTHER workers in the cluster. It's those executors running in
other workers that run the tasks, and also the receivers.
On Wed, May 6, 2015 at
This is about the Kafka Receiver, IF you are using Spark Streaming.
PS: that book is now behind the curve in quite a few areas since the release
of 1.3.1 – read the documentation and forums.
From: James King [mailto:jakwebin...@gmail.com]
Sent: Wednesday, May 6, 2015 1:09 PM
To: user
I think for executor distribution, normally in YARN mode, the RM tries its best
to evenly distribute containers if you don't explicitly specify the preferred
host. For standalone mode, one node normally only has one executor, so
executor distribution is not a big problem.
The problem of data skew
The pseudo code:
object myApp {
  var myStaticRDD: RDD[Int] = _
  def main() {
    ... // init the streaming context, and get two DStreams (streamA and streamB)
        // from two HDFS paths
    // complex transformation using the two DStreams
    val new_stream = streamA.transformWith(streamB, (a, b, t) => {
Thanks, Shao. :-)
I am wondering if Spark will rebalance the storage overhead at
runtime… since there is still some available space on other nodes.
Best,
Yifan LI
On 06 May 2015, at 14:57, Saisai Shao sai.sai.s...@gmail.com wrote:
I think you could configure multiple disks through
Imran, Gerard,
Indeed your suggestions were correct and it helped me. Thank you for your
replies.
--
Emre
On Tue, May 5, 2015 at 4:24 PM, Imran Rashid iras...@cloudera.com wrote:
Gerard is totally correct -- to expand a little more, I think what you
want to do is a
oh yeah, I think I remember we discussed this a while back ... sorry I
forgot the details. If you know you don't have a graph, did you try
setting spark.kryo.referenceTracking to false? I'm also confused on how
you could hit this with a few million objects. Are you serializing them
one at a
I have to count RDDs in a Spark Streaming app. When the data gets large, count()
becomes expensive. Does anybody have experience using countApprox()? How
accurate/reliable is it?
The documentation is pretty modest. Suppose the timeout parameter is in
milliseconds. Can I retrieve the count value by
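In case it helps, a sketch of how countApprox is typically called (the timeout
is in milliseconds and the confidence defaults to 0.95; the values here are
arbitrary, and rdd is an existing RDD):

val approx = rdd.countApprox(timeout = 2000L, confidence = 0.95)

// countApprox returns after at most ~2 s with whatever estimate is ready by then
val estimate = approx.initialValue            // a BoundedDouble
println(s"count ≈ ${estimate.mean} in [${estimate.low}, ${estimate.high}]")

// if you can afford to wait, getFinalValue() blocks until the exact count is done
// val exact = approx.getFinalValue()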
I think you could configure multiple disks through spark.local.dir; the default
is /tmp. Anyway, if your intermediate data is larger than the available disk
space, you will still meet this issue.
spark.local.dir (default: /tmp) – Directory to use for scratch space in Spark, including
map output files and RDDs that get
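A small sketch of pointing spark.local.dir at several disks (a comma-separated
list; the paths are placeholders). Note that on YARN the NodeManager's local
dirs take precedence over this setting:

import org.apache.spark.SparkConf

// scratch space is spread across all listed directories
val conf = new SparkConf()
  .setAppName("MultiDiskScratch")
  .set("spark.local.dir", "/mnt/disk1/spark-tmp,/mnt/disk2/spark-tmp")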
Yes, you are right. For now I have to say the workload/executors are distributed
evenly… so, like you said, it is difficult to improve the situation.
However, do you have any idea how to make a *skewed* data/executor distribution?
Best,
Yifan LI
On 06 May 2015, at 15:13, Saisai Shao
https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
On Wed, May 6, 2015 at 10:09 PM, James King jakwebin...@gmail.com wrote:
In the O'reilly book Learning Spark Chapter 10 section 24/7 Operation
It talks about 'Receiver Fault Tolerance'
I'm
I think it depends on your workload and executor distribution. If your
workload is evenly distributed without any big data skew, and executors are
evenly distributed on each node, the storage usage of each node is nearly
the same. Spark itself cannot rebalance the storage overhead as you
Thank you for your response, however, I'm afraid I still can't get it to
work, this is my code:
jar_path = '/home/mj/apps/spark_jars/spark-csv_2.11-1.0.3.jar'
spark_config = SparkConf().setMaster('local').setAppName('data_frame_test').set("spark.jars", jar_path)
sc =
Exception with sample testing in the IntelliJ IDE:
Exception in thread "main" java.lang.NoClassDefFoundError:
scala/collection/GenTraversableOnce$class
at akka.util.Collections$EmptyImmutableSeq$.<init>(Collections.scala:15)
at akka.util.Collections$EmptyImmutableSeq$.<clinit>(Collections.scala)
at
+1
I had to browse spark-catalyst sources to find what is supported:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala
Alexander
From: Gerard Maas [mailto:gerard.m...@gmail.com]
Sent: Wednesday, May 06, 2015 11:42 AM
To: spark
With binaryRecords it loads the file as fixed-length records into an RDD. With binaryFiles it
provides an input stream, so it is up to you whether you want to load everything
into memory, though the documentation does not suggest using this function for
large files.
From: Vijayasarathy Kannan [mailto:kvi...@vt.edu]
I don't think that works:
https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration
On Tue, May 5, 2015 at 6:25 PM, nitinkak001 nitinkak...@gmail.com wrote:
I am running hive queries from HiveContext, for which we need a
hive-site.xml.
Is it possible to replace it with
You are using Scala 2.11 with 2.10 libraries. You can change
"org.apache.spark" % "spark-streaming_2.10" % "1.3.1"
to
"org.apache.spark" %% "spark-streaming" % "1.3.1"
And sbt will use the corresponding libraries according to your Scala
version.
Best Regards,
Shixiong Zhu
2015-05-06 16:21 GMT-07:00
Hi Iulian,
The relevant code is in ScalaReflection
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala,
and it would be awesome if you could suggest how to fix this more
generally. Specifically, this code is also broken when
Hi,
How can I specify the Worker and Master LOG folders? If I set SPARK_WORKER_DIR in
spark-env, it only affects the executor logs and the shuffle folder, but the Worker and
Master logs still go to some default location:
starting org.apache.spark.deploy.master.Master, logging to
Can you check your local and remote logs?
2015-05-06 16:24 GMT-04:00 Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com:
This problem happens in Spark 1.3.1. It happens when two jobs are running
simultaneously, each in its own SparkContext.
I don't remember seeing this bug in Spark
Thanks.
In both cases, does the driver need to have enough memory to contain the
entire file? How do both these functions work when, for example, the binary
file is 4 GB and the available driver memory is less?
On Wed, May 6, 2015 at 1:54 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:
Hi, All,
When I try to follow the document about tfidf from:
http://spark.apache.org/docs/latest/mllib-feature-extraction.html
val conf = new SparkConf().setAppName("TFIDF")
val sc = new SparkContext(conf)
val
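For comparison, a sketch along the lines of that documentation page, assuming
an existing SparkContext sc (the input path is a placeholder):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// each document is a sequence of terms
val documents: RDD[Seq[String]] = sc.textFile("data/documents.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)

// IDF needs two passes: fit on the term frequencies, then scale them
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)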
This problem happens in Spark 1.3.1. It happens when two jobs are running
simultaneously, each in its own SparkContext.
I don't remember seeing this bug in Spark 1.2.0. Is it a new bug introduced in
Spark 1.3.1?
Ningjun
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Wednesday, May 06, 2015
Thanks for all your feedback!
I'm a little new to Scala/Spark, so hopefully you'll bear with me while
I try to explain how I plan to go about this; please give me advice as to
why this may or may not work. My terminology may be a little incorrect
as well.
Any feedback would be greatly appreciated.
Whoops I just saw this thread, it got caught in my spam filter. Thanks for
looking into this Xiangrui and Sean.
The implicit situation does seem fairly complicated to me. The cost
function (not including the regularization term) is affected both by the
number of ratings and by the number of
Hello all,
I built Spark 1.3.1 (for CDH 5.3 with YARN) twice: for Scala 2.10 and Scala
2.11. I am running on a secure cluster. The deployment configs are
identical.
I can launch jobs just fine on both the Scala 2.10 and Scala 2.11 versions.
spark-shell works on the Scala 2.10 version, but not on
I've worked around this by dropping the jars into a directory (spark_jars)
and then creating a spark-defaults.conf file in conf containing this:
spark.driver.extraClassPath /home/mj/apps/spark_jars/*
Hi,
Is there a way to read a large file in a parallel/distributed way? I have a
single large binary file which I currently read on the driver program and
then distribute to executors (using groupBy(), etc.). I want to know if
there's a way to make the executors each read a specific/unique
SparkContext has two methods for reading binary files: binaryFiles (reads
multiple binary files into an RDD) and binaryRecords (reads fixed-length records of a
single binary file into an RDD). For example, I have a big binary file split into
logical parts, so I can use “binaryFiles”. The possible problem
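If the file really is split into fixed-size records, a sketch of the
binaryRecords route, assuming an existing SparkContext sc (the path and record
length are made up); each task then reads only its own split of the file:

// every record in the file is exactly 512 bytes in this sketch
val records = sc.binaryRecords("hdfs:///data/big-file.bin", recordLength = 512)
println(s"number of records: ${records.count()}")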
It looks like you have data in these 24 partitions, or more. How many unique
names are in your data set?
Enlarging the shuffle partitions only makes sense if you have large partition
groups in your data. What you described looks like either your dataset has
data in these 24 partitions, or you have
I have a counter column family in Cassandra. I want to update these counters
with an application in Spark Streaming. How can I update Cassandra counters
with Spark?
Thanks.
Hi,
I'm running a spark application with YARN-client or YARN-cluster mode.
But it seems to take too long to start up.
It takes 10+ seconds to initialize the SparkContext.
Is this normal? Or can it be optimized?
The environment is as follows:
- Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
I'm using the default settings.
Jianshi
On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva twinkle.sachd...@gmail.com
wrote:
Hi,
Can you please share your compression etc settings, which you are using.
Thanks,
Twinkle
On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang jianshi.hu...@gmail.com
This may help.
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio drarse.a...@gmail.com
wrote:
I have a Counter family colums in Cassandra. I want update this counters
Spark 1.3.1 -
I have a Parquet file on HDFS partitioned by some string, looking like this:
/dataset/city=London/data.parquet
/dataset/city=NewYork/data.parquet
/dataset/city=Paris/data.parquet
….
I am trying to load it using sqlContext.parquetFile(
hdfs://some
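For reference, a sketch of what loading the partitioned layout above usually
looks like in 1.3, where partition discovery turns the city=... directories
into a city column (the root path is a placeholder):

// point parquetFile at the dataset root, not at an individual city=... directory
val df = sqlContext.parquetFile("hdfs:///dataset")

// the partition directories show up as a regular column
df.printSchema()
df.filter(df("city") === "London").show()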
Apologies for the repeat. The first was rejected by the submission
process
I created a simple Spark streaming program using updateStateByKey.
The domain is represented by case classes for clarity, type safety, etc.
Spark job continuously loads new classes, which are removed by GC to
maintain
a
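For context, a minimal updateStateByKey sketch with a case class as the state,
roughly the shape of program being described; the types and update logic are
made up, and events is assumed to be an existing DStream[(String, String)]:

import org.apache.spark.streaming.dstream.DStream

case class SessionState(count: Long)   // hypothetical domain class

def updateSession(newEvents: Seq[String], state: Option[SessionState]): Option[SessionState] = {
  val previous = state.getOrElse(SessionState(0L))
  Some(SessionState(previous.count + newEvents.size))
}

// note: updateStateByKey requires a checkpoint directory, e.g. ssc.checkpoint("hdfs:///checkpoints")
val sessions: DStream[(String, SessionState)] = events.updateStateByKey(updateSession _)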
Which release of Spark are you using ?
Thanks
On May 6, 2015, at 8:03 AM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I ran a job on a Spark standalone cluster and got the exception below.
Here is the line of code that causes the problem:
val myRdd: RDD[(String, String,
I ran a job on a Spark standalone cluster and got the exception below.
Here is the line of code that causes the problem:
val myRdd: RDD[(String, String, String)] = ... // RDD of (docid, category,
path)
myRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
val cats: Array[String] = myRdd.map(t =
I am performing a job where I perform a number of steps in succession.
One step is a map on a JavaRDD which generates objects taking up
significant memory.
This is followed by a join and an aggregateByKey.
The problem is that the system is getting OutOfMemoryErrors -
most tasks work
Hello,
This is how i read Avro data.
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.Schema
import org.apache.hadoop.io.NullWritable
import org.apache.avro.mapreduce.AvroKeyInputFormat
-- Read
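Given those imports, the read itself presumably looks something like the
following untested sketch (the path is a placeholder):

// load Avro GenericRecords with the new Hadoop API
val avroRdd = sc.newAPIHadoopFile(
  "hdfs:///data/events.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

val firstRecord: GenericRecord = avroRdd.map(_._1.datum()).first()
println(firstRecord)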