I think your *sparkUrl* points to an invalid cluster URL. Just make sure
you are giving the correct URL (the one you see at the top left in the
master:8080 web UI).
Thanks
Best Regards
On Tue, Aug 26, 2014 at 11:07 AM, Forest D dev24a...@gmail.com wrote:
Hi Jonathan,
Thanks for the reply. I ran
With a lower number of partitions, I keep losing executors during
collect at KMeans.scala:283
The error message is ExecutorLostFailure (executor lost).
The program recovers by automatically repartitioning the whole dataset
(126G), which takes very long and seems to only delay the inevitable
Have a look at the history server; it looks like you have enabled the history
server on your local machine and not on the remote server.
http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html
Thanks
Best Regards
On Tue, Aug 26, 2014 at 7:01 AM, SK skrishna...@gmail.com wrote:
Hi,
I am
Answering my own question, it seems that the warnings are expected as
explained by TD @
http://apache-spark-user-list.1001560.n3.nabble.com/streaming-questions-td3281.html
.
Here is what he wrote:
Spark Streaming is designed to replicate the received data within the
machines in a Spark cluster
You need to run your app in local mode (aka master=local[2]) to debug it
locally. If you are running it on a cluster, then you can use the remote
debugging feature.
http://stackoverflow.com/questions/19128264/how-to-remote-debug-in-intellij-12-1-4
For remote debugging, you need to pass the
Hi
Not sure this is the right way of doing it, but if you can create a
PairRDDFunctions from that RDD then you can use the following piece of code
to access the filenames from the RDD.
PairRDDFunctions<K, V> ds = ...;
//getting the
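Alternatively, here is a minimal Scala sketch that keeps the file name with the
data by using sc.wholeTextFiles (the app name and input directory below are
placeholders, not your actual paths):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("FileNames"))

// wholeTextFiles yields (fileName, fileContents) pairs, so each record
// carries the path of the file it came from.
val filesWithNames = sc.wholeTextFiles("hdfs:///some/input/dir")
filesWithNames
  .map { case (fileName, contents) => (fileName, contents.length) }
  .collect()
  .foreach(println)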
I wrote a post on this forum but it shows the message "This post has NOT
been accepted by the mailing list yet." above my post. How long will it
take to get it posted?
Regards,
Sandeep Vaid
+91 - 09881710301
Hello,
it's me again.
Now I've got an explanation for the behaviour. It seems that the driver
memory is not large enough to hold the whole result set of saveAsTextFile
in memory, and then an OOM occurs. I tested it with a filter step that removes
KV pairs with a word count smaller than 100,000. So now the job
val node = textFile.map(line => {
  val fields = line.split("\\s+")
  (fields(1), fields(2))
})
then you can manipulate the node RDD with the PairRDD functions.
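For example, a small sketch assuming node ends up as an RDD[(String, String)]
as above (outside the shell you also need import org.apache.spark.SparkContext._
for the pair operations):

// Group the second field by the first, or count occurrences per key.
val neighbours = node.groupByKey()
val outDegree = node.mapValues(_ => 1).reduceByKey(_ + _)
outDegree.collect().foreach(println)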
2014-08-26 12:55 GMT+08:00 Deep Pradhan pradhandeep1...@gmail.com:
Hi,
I have an input file of a graph in the format source_node
Thanks for the reply.
Ya, it doesn't seem doable straight away. Someone suggested this:
For each of your streams, first create an empty RDD that you register as a
table, obtaining an empty table. For your example, let's say you call it
allTeenagers.
Then, for each of your queries, use SchemaRDD's
I actually tried without unpersisting, but given the performance I tried to
add these in order to free the memory. After your answer I tried to remove
them again, but without any change in the execution time...
Looking at the web interface, I can see that the mapPartitions at
GraphImpl.scala:184
Hi,
Please give it a try by changing the worker memory such that worker memory >
executor memory.
Thanks & Regards,
Meethu M
On Friday, 22 August 2014 5:18 PM, Yadid Ayzenberg ya...@media.mit.edu wrote:
Hi all,
I have a spark cluster of 30 machines, 16GB / 8 cores on each running in
standalone
println(parts(0)) does not solve the problem. It does not work
On Mon, Aug 25, 2014 at 1:30 PM, Sean Owen so...@cloudera.com wrote:
On Mon, Aug 25, 2014 at 7:18 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
When I add
parts(0).collect().foreach(println)
I have the following code
val nodes = lines.map(s => {
  val fields = s.split("\\s+")
  (fields(0), fields(1))
}).distinct().groupByKey().cache()
and when I print out the nodes RDD I get the following:
(4,ArrayBuffer(1)) (2,ArrayBuffer(1)) (3,ArrayBuffer(1)) (1,ArrayBuffer(3, 2,
I'd suggest first reading the scaladoc for RDD and PairRDDFunctions to
familiarize yourself with all the operations available:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
It seems he means to query an RDBMS or Cassandra using Spark SQL, i.e. multiple
data sources for Spark SQL.
I looked through the link he posted
https://docs.wso2.com/display/BAM241/Creating+Hive+Queries+to+Analyze+Data#CreatingHiveQueriestoAnalyzeData-CreatingHivetablesforvariousdatasources
using their
Hi,
consider the following code:
import org.apache.spark.{SparkContext, SparkConf}
object ParallelismBug extends App {
  var sConf = new SparkConf()
    .setMaster("spark://hostName:7077") // .setMaster("local[4]")
    .set("spark.default.parallelism", "7") // or without it
  val sc = new SparkContext(sConf)
Hello People,
I'm using Java Spark Streaming. I'm just wondering, can I make a simple JDBC
connection in the JavaDStream map() method?
Or
Do I need to create a JDBC connection for each JavaPairDStream, after the map
task?
Kindly give your thoughts.
Cheers,
Ravi Sharma
Hi,
I have many union operations in my application. But union increases the number
of partitions of the resulting RDDs, and performance with more partitions is
sometimes very slow. Is there any cleaner way to prevent the number of
partitions from increasing than adding
coalesce(numPartitions) after each union?
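To illustrate what I mean, a minimal sketch (rddA, rddB and numPartitions are
placeholders):

// union sums the partition counts of its inputs, so coalesce is added
// afterwards to bring the combined RDD back down to a chosen size.
val combined = rddA.union(rddB).coalesce(numPartitions)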
Yes, you can open a JDBC connection at the beginning of the map method, then
close this connection at the end of map(), and in between you can use this
connection.
Thanks
Best Regards
On Tue, Aug 26, 2014 at 6:12 PM, Ravi Sharma raviprincesha...@gmail.com
wrote:
Hello People,
I'm using java
Hi,
As I understand, your problem is similar to this JIRA.
https://issues.apache.org/jira/browse/SPARK-1647
The issue in this case is that Kafka cannot replay the messages as the offsets
are already committed. Also, I think the existing KafkaUtils (the default
high-level Kafka consumer) has this issue as well.
For the record, I'm using Chrome 36.0.1985.143 on OS X 10.9.4 as well. Maybe
it's a Chrome add-on I'm running?
Anyway, as Matei pointed out, if I change the https to http, it works fine.
On Tue, Aug 26, 2014 at 1:46 AM, Michael Hausenblas
michael.hausenb...@gmail.com wrote:
Hi Bharat,
Thanks for your email. If the Kafka Reader worker process dies, it will
be replaced by a different machine, and it will start consuming from the
offset where it left off (for each partition). The same thing can happen even
if I try to have an individual receiver for every partition.
On Tue, Aug 26, 2014 at 10:28 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Maybe it's a Chrome add-on I'm running?
Hmm, scratch that. Trying in incognito mode (which disables add-ons, I
believe) also yields the same behavior.
Nick
I have not tried it. But I guess you need to add your credentials in the S3
path. Or, can you copy the jar to your driver node and try again?
On Sun, Aug 24, 2014 at 9:35 AM, S Malligarjunan smalligarju...@yahoo.com
wrote:
Hello Yin,
Additional note:
In ./bin/spark-shell --jars
I'm curious not only about what they do, but what their relationship is to
the rest of the system. I find that I get listener events for n block
managers added where n is also the number of workers I have available to
the application. Is this a stable constant?
Also, are there ways to determine
I have already tried setting up the history server and accessing it at
master-url:18080 as per the link. But the page does not list any completed
applications. As I mentioned in my previous mail, I am running Spark in
standalone mode on the cluster (as well as on my local machine). According
to the
Hello Du,
Can you check if there is a metastore directory in the place from which you are
launching your program? If so, can you delete it and try again?
Also, can you try HiveContext? LocalHiveContext is deprecated.
Thanks,
Yin
On Mon, Aug 25, 2014 at 6:33 PM, Du Li l...@yahoo-inc.com.invalid wrote:
Hi,
I
Hi All,
I want to invite users to submit to the Spark Powered By page. This page
is a great way for people to learn about Spark use cases. Since Spark
activity has increased a lot in the higher level libraries and people often
ask who uses each one, we'll include information about which
Hi People,
I'm using Java Kafka Spark Streaming and saving the result files into HDFS.
As per my understanding, Spark Streaming writes every processed message or
event to an HDFS file. The reason for creating one file per message or event
could be to ensure fault tolerance. Is there any way Spark handles
Hello Michel,
I have executed git pull now. As per the pom, the version entry is 1.1.0-SNAPSHOT.
Thanks and Regards,
Sankar S.
On Tuesday, 26 August 2014, 1:00, Michael Armbrust mich...@databricks.com
wrote:
Which version of Spark SQL are you using? Several issues with custom Hive UDFs
How many partitions now? Btw, which Spark version are you using? I
checked your code and I don't understand why you want to broadcast
vectors2, which is an RDD.
var vectors2 =
vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector =
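If that data genuinely needs to be available on every node and is small enough
to collect, the usual pattern is to collect it to the driver first and broadcast
the local copy, e.g. (a sketch, not your exact code):

// Collect the (small) RDD to the driver, then broadcast the resulting array
// instead of the RDD itself.
val localVectors = vectors2.collect()
val broadcastVector = sc.broadcast(localVectors)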
Right now, I have issues even at a far earlier point.
I'm fetching data from a registered table via
var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT
2000").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
//persisted because it's used again
We are exploring using Kinesis and Spark Streaming together. I took a
look at the Kinesis receiver code in 1.1.0. I have a question regarding
Kinesis partitions vs. Spark Streaming partitions. It seems to be pretty
difficult to align these partitions.
Kinesis partitions a stream of data into
It should be fixed now. Maybe you have a cached version of the page in your
browser. Open DevTools (cmd-shift-I), press the gear icon, check "Disable
cache (while DevTools is open)", then refresh the page to reload without the cache.
Matei
On August 26, 2014 at 7:31:18 AM, Nicholas Chammas
Confirmed. Works now. Thanks Matei.
(BTW, on OS X Command + Shift + R also refreshes the page without cache.)
On Tue, Aug 26, 2014 at 3:06 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
It should be fixed now. Maybe you have a cached version of the page in
your browser. Open DevTools
Hi,
I have the following piece of code that I am running on a cluster with 10
nodes with 2GB memory per node. The tasks seem to complete, but at the point
where it is generating output (saveAsTextFile), the program freezes after
some time and reports an out of memory error (error transcript
Hello all,
I have just checked out branch-1.1
and executed the commands below:
./bin/spark-shell --driver-memory 1G
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("CREATE TABLE IF NOT EXISTS src (key INT,
If someone doesn't have the access to do that, is there any easy way to specify
a different properties file to be used?
Patrick Wendell wrote:
If you want to customize the logging behavior - the simplest way is to
copy
conf/log4j.properties.template to conf/log4j.properties. Then you can go
and
Hi,
The error doesn't occur during saveAsTextFile but rather during the groupByKey
as far as I can tell. We strongly urge users to not use groupByKey
if they don't have to. What I would suggest is the following work-around:
sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
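For example, if the end result is a count (or another associative aggregate)
per key, something like the following avoids groupByKey entirely (the field
index and outputPath are placeholders, not your actual values):

val counts = sc.textFile(baseFile)
  .map { line =>
    val fields = line.split("\t")
    (fields(0), 1L)          // key on the first field, count occurrences
  }
  .reduceByKey(_ + _)        // aggregates map-side, no per-key value lists
counts.saveAsTextFile(outputPath)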
Hi David,
Your job is probably hanging on the groupByKey process. Probably GC is kicking
in and the process starts to hang or the data is unbalanced and you end up with
stragglers (Once GC kicks in you'll start to get the connection errors you
shared). If you don't care about the list of
Hi Grega,
Did you ever get this figured out? I'm observing the same issue in Spark
1.0.2.
For me it was after 1.5hr of a large .distinct call, followed by a
.saveAsTextFile()
14/08/26 20:57:43 INFO executor.CoarseGrainedExecutorBackend: Got assigned
task 18500
14/08/26 20:57:43 INFO
Hi Silvio,
I re-downloaded hive-0.12-bin and reset the related path in spark-env.sh.
However, I still got some errors. Do you happen to know which step I did wrong?
Thank you!
My detailed steps are as follows:
# enter spark-shell (successful)
/bin/spark-shell --master spark://S4:7077 --jars
Thanks. I'm just confused about the syntax; I'm not sure which variables or
where the value of the count is stored so that I can save it. Any examples
or tips?
On Mon, Aug 25, 2014 at 9:49 PM, Daniil Osipov daniil.osi...@shazam.com
wrote:
You could try to use foreachRDD on the result of
I've seen this done using mapPartitions(), where each partition represents a
single, multi-line JSON file. You can rip through each partition (JSON
file) and parse the JSON doc as a whole.
This assumes you use sc.textFile("path/*.json") or equivalent to load in
multiple files at once. Each JSON
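A rough Scala sketch of that approach (parseJson stands in for whatever JSON
library you use, and it assumes each small file lands in its own partition):

val docs = sc.textFile("path/*.json")
  .mapPartitions { lines =>
    // Glue the partition's lines back into one string and parse it as a
    // single JSON document.
    val whole = lines.mkString("\n")
    if (whole.trim.isEmpty) Iterator.empty else Iterator(parseJson(whole))
  }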
Good suggestion, TD.
And I believe the optimization that jon.burns is referring to - from the
big data mini course - is a step earlier: the sorting mechanism that
produces sortedCounts.
You can use mapPartitions() to get a top k locally on each partition, then
shuffle only (k * # of partitions)
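A small sketch of that local top-k idea (wordCounts is a hypothetical
RDD[(String, Long)] of (word, count) pairs):

val k = 10
val topK = wordCounts
  .mapPartitions { iter =>
    // Keep only the k largest counts within each partition...
    iter.toSeq.sortBy(-_._2).take(k).iterator
  }
  // ...so only k * numPartitions records reach the driver for the final sort.
  .collect()
  .sortBy(-_._2)
  .take(k)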
Hello,
I'm using the following version of Spark - 1.0.0+cdh5.1.0+41
(1.cdh5.1.0.p0.27).
I've tried to specify the libraries Spark uses in the following ways:
1) Adding it to the Spark context
2) Specifying the jar path in
a) spark.executor.extraClassPath
b) spark.executor.extraLibraryPath
I wanted to make sure that there's full compatibility between minor
releases. I have a project that has a dependency on spark-core so that it
can be a driver program and that I can test locally. However, when
connecting to a cluster you don't necessarily know what version you're
connecting to. Is
At 2014-08-26 01:20:09 -0700, BertrandR bertrand.rondepierre...@gmail.com
wrote:
I actually tried without unpersisting, but given the performance I tried to
add these in order to free the memory. After your answer I tried to remove
them again, but without any change in the execution time...
Is this a standalone mode cluster? We don't currently make this guarantee,
though it will likely work in 1.0.0 to 1.0.2. The problem though is that the
standalone mode grabs the executors' version of Spark code from what's
installed on the cluster, while your driver might be built against
You can use sc.wholeTextFiles to read each file as a complete String, though it
requires each file to be small enough for one task to process.
On August 26, 2014 at 4:01:45 PM, Chris Fregly (ch...@fregly.com) wrote:
i've seen this done using mapPartitions() where each partition represents a
Yes, we are standalone right now. Do you have literature on why one would want
to consider Mesos or YARN for Spark deployments?
Sounds like I should try upgrading my project and seeing if everything
compiles without modification. Then I can connect to an existing 1.0.0
cluster and see what
Hi, I am trying to find a CUDA library in Scala, to see if some matrix
manipulation in MLlib can be sped up.
I googled a bit but found no active projects on Scala+CUDA. Python is
supported by CUDA though. Any suggestions on whether this idea makes any
sense?
Best regards,
Wei
Basically, a Block Manager manages the storage for most of the data in Spark,
to name a few: blocks that represent a cached RDD partition, intermediate shuffle
data, broadcast data, etc. It is per executor, while in standalone mode,
normally, you have one executor per worker.
You don't control how
You should try to find a Java-based library, then you can call it from Scala.
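For example, MLlib already uses the jblas Java library, and any such Java
library can be called from Scala directly (a sketch with jblas standing in; a
CUDA-backed Java binding would be called the same way):

import org.jblas.DoubleMatrix

// Plain Java API, used from Scala without any wrapper.
val a = new DoubleMatrix(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
val b = new DoubleMatrix(Array(Array(5.0, 6.0), Array(7.0, 8.0)))
val c = a.mmul(b) // matrix multiply
println(c)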
Matei
On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote:
Hi I am trying to find a CUDA library in Scala, to see if some matrix
manipulation in MLlib can be sped up.
I googled a few but found no
Things will definitely compile, and apps compiled on 1.0.0 should even be able
to link against 1.0.2 without recompiling. The only problem is if you run your
driver with 1.0.0 on its classpath, but the cluster has 1.0.2 in executors.
For Mesos and YARN vs standalone, the difference is that they
Hi dear,
Now I am working on a project in the scenario below.
We will use Spark Streaming to receive data from IBM MQ. I checked the API
documentation of Streaming; it only supports ZeroMQ, Kafka, etc. I have some
questions:
1. We can use the MQTT protocol to get data in this scenario, right? Any other
Hi, Victor,
the issue with having different versions in the driver and the cluster is that
the master will shut down your application due to an inconsistent
serialVersionUID in ExecutorState
Best,
--
Nan Zhu
On Tuesday, August 26, 2014 at 10:10 PM, Matei Zaharia wrote:
Things will
Hi all,
I tried to use Spark SQL in spark-shell, as in the Spark examples.
When I execute:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
it then reports an error like the one below:
scala> hiveContext.hql("CREATE
Great work, Dibyendu. Looks like this would be a popular contribution.
Expanding on Bharat's question a bit:
what happens if you submit multiple receivers to the cluster by creating
and unioning multiple DStreams, as in the Kinesis example here:
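For reference, a bare-bones sketch of the create-and-union pattern I mean (ssc
is an existing StreamingContext; socketTextStream and the host/port stand in
for the actual receiver):

val numReceivers = 3
// One receiver-based DStream per receiver, then a single unioned stream
// for the rest of the processing.
val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("host", 9999))
val unioned = ssc.union(streams)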
I would suggest you use the JDBC connector in mapPartitions instead of map,
as JDBC connections are costly and can really impact your performance.
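A Scala sketch of that pattern (records is a hypothetical RDD[String] of ids;
the JDBC URL, credentials and query are placeholders; the same structure
applies to JavaDStream/JavaRDD mapPartitions in Java):

import java.sql.DriverManager

val enriched = records.mapPartitions { iter =>
  // One connection per partition instead of one per record.
  val conn = DriverManager.getConnection("jdbc:mysql://dbhost/db", "user", "pass")
  val stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")
  val out = iter.map { id =>
    stmt.setString(1, id)
    val rs = stmt.executeQuery()
    val name = if (rs.next()) rs.getString(1) else ""
    rs.close()
    (id, name)
  }.toList // materialise before closing the connection
  stmt.close()
  conn.close()
  out.iterator
}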
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Aug 26, 2014 at 6:45 PM,
We're a single-app deployment so we want to launch as many executors as the
system has workers. We accomplish this by not configuring the max for the
application. However, is there really no way to inspect what
machines/executor IDs/number of workers/etc. are available in the context? I'd
imagine that