Set the port using
spconf.set("spark.ui.port", <port>)
where <port> is any free port, and
spconf is your Spark configuration object.
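For illustration, a minimal sketch of that (the application name and the port value 4041 are just assumed examples):
    import org.apache.spark.{SparkConf, SparkContext}

    val spconf = new SparkConf()
      .setAppName("my-app")            // name assumed for illustration
      .set("spark.ui.port", "4041")    // any free port; 4041 is an arbitrary choice
    val sc = new SparkContext(spconf)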
On Sun, Jan 11, 2015 at 2:08 PM, YaoPau [via Apache Spark User List]
ml-node+s1001560n21083...@n3.nabble.com wrote:
I have multiple Spark Streaming jobs running all day, and
Hi,
This is what I am trying to do:
karthik@s4:~/spark-1.2.0$ SPARK_HADOOP_VERSION=2.3.0 sbt/sbt clean
Using /usr/lib/jvm/java-7-oracle as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
[info] Loading project definition from
/home/karthik/spark-1.2.0/project/project
Hi Team,
Has anyone done a comparison (pros and cons) study between GraphX and GraphLab?
Could you please let me know of any links for this comparison.
Regards,
Rajesh
Have you tried simply giving the path where you want to save the file?
For instance, in your case just do
r.saveAsTextFile("home/cloudera/tmp/out1")
Don't use "file:".
This will create a folder with the name out1. saveAsTextFile always writes by
making a directory; it does not write data into a
Hi all: I am using Spark SQL to read and write Hive tables. But there is an issue:
how do I select the first row in each group? In Hive, we could write
HQL like this: SELECT imei FROM (SELECT imei,
row_number() OVER (PARTITION BY imei ORDER BY login_time
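For illustration only (not part of the original question): if window functions are not available in your Spark SQL version, one RDD-level sketch for keeping the earliest row per imei is reduceByKey. The row shape and all names below are assumed.
    import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

    // rows: an RDD of (imei, loginTime, restOfRow) tuples -- assumed shape
    val firstPerImei = rows
      .map { case (imei, loginTime, rest) => (imei, (loginTime, rest)) }
      .reduceByKey { (a, b) => if (a._1 <= b._1) a else b }   // keep the row with the smallest login_time per imei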
Hi all
Is there an efficient way to trigger RDD transformations? I'm now using the count
action to achieve this.
Best regards
Kevin
Actually this code is producing a leader-not-found exception. I am
unable to find the reason.
On Mon, Jan 12, 2015 at 4:03 PM, kevinkim [via Apache Spark User List]
ml-node+s1001560n21098...@n3.nabble.com wrote:
Well, you can use coalesce() to decrease the number of partitions to 1.
(It will
On Sun, Jan 11, 2015 at 9:46 PM, Christopher Thom
christopher.t...@quantium.com.au wrote:
Is there any plan to extend the data types that would be accepted by the Tree
models in Spark? e.g. Many models that we build contain a large number of
string-based categorical factors. Currently the
I have a rowMatrix on which I want to perform two multiplications. The first
is a right multiplication with a local matrix which is fine. But after that I
also wish to right multiply the transpose of my rowMatrix with a different
local matrix. I understand that there is no functionality to
If you don't care about the value that your map produced (because you're
not already collecting or saving it), then is foreach more appropriate to
what you're doing?
On Mon, Jan 12, 2015 at 4:08 AM, kevinkim kevin...@apache.org wrote:
Hi, answer from another Kevin.
I think you may already
You should take a look at
https://issues.apache.org/jira/browse/SPARK-4122
which is implementing writing to kafka in a pretty similar way (make a new
producer inside foreachPartition)
On Mon, Jan 12, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote:
Leader-not-found suggests a problem with
just an FYI: you can configure that using spark.jobserver.filedao.rootdir
On Mon, Jan 12, 2015 at 1:52 AM, abhishek reachabhishe...@gmail.com wrote:
Nice! Good to know
On 11 Jan 2015 21:10, Sasi [via Apache Spark User List] [hidden email]
Hi experts!
I have a SchemaRDD of messages to be pushed to Kafka, so I am using the
following piece of code to do that:
rdd.foreachPartition(itr => {
val props = new Properties()
props.put("metadata.broker.list", brokersList)
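For context, a rough sketch of the full pattern suggested earlier in the thread (creating the producer inside foreachPartition, as in SPARK-4122). It uses the old Kafka 0.8 producer API; the topic name and message handling are assumed:
    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    rdd.foreachPartition { itr =>
      val props = new Properties()
      props.put("metadata.broker.list", brokersList)
      props.put("serializer.class", "kafka.serializer.StringEncoder")
      val producer = new Producer[String, String](new ProducerConfig(props))
      itr.foreach(msg => producer.send(new KeyedMessage[String, String](topic, msg.toString)))
      producer.close()   // one producer per partition, closed when the partition is done
    }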
The EC2 version is 1.1.0 and this is my build.sbt:
libraryDependencies ++= Seq(
  jdbc,
  anorm,
  cache,
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "com.typesafe.akka" %% "akka-actor" % "2.2.3",
  "com.typesafe.akka" %% "akka-slf4j" % "2.2.3",
  "org.apache.spark" %%
This was without using Kryo -- if I use kryo, I got errors about buffer
overflows (see above):
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
required: 8
Just calling colStats doesn't actually compute those statistics, does it?
It looks like the computation is only
The problem is there in the logs. When it went to clone some code,
something went wrong with the proxy:
Received HTTP code 407 from proxy after CONNECT
Probably you have an HTTP proxy and you have not authenticated. It's
specific to your environment.
Although it's unrelated, I'm curious how
Yes, I am also facing the same problem. Can anyone please help to get faster
execution?
Thanks.
Leader-not-found suggests a problem with zookeeper config. It depends
a lot on the specifics of your error. But this is really a Kafka
question, better for the Kafka list.
On Mon, Jan 12, 2015 at 11:10 AM, Hafiz Mujadid
hafizmujadi...@gmail.com wrote:
Actually this code is producing error leader
Hi, answer from another Kevin.
I think you may already know it:
'transformations' in Spark
(http://spark.apache.org/docs/latest/programming-guide.html#transformations)
are evaluated in a 'lazy' way, only when you trigger 'actions'.
(http://spark.apache.org/docs/latest/programming-guide.html#actions)
So
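A tiny illustration of that laziness, with assumed names:
    val mapped = rdd.map(expensiveFn)   // a transformation: lazy, nothing runs yet
    mapped.count()                      // an action: this is what actually triggers the computation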
Meanwhile, I have submitted a pull request (
https://github.com/awslabs/emr-bootstrap-actions/pull/37) that allows users
to place their jars ahead of all other jars in spark classpath. This should
serve as a temporary workaround for all class conflicts.
Thanks,
Aniket
On Mon Jan 05 2015 at
Hi,
I am observing some weird behavior with Spark. It might be my
misinterpretation of some fundamental concepts, but I have looked at it for 3
days and have not been able to solve it.
The source code is pretty long and complex so instead of posting it, I will
try to articulate the problem.
I am
I think you're confusing HDFS paths and local paths. You are cd'ing to
a directory and seem to want to write output there, but your path has
no scheme and defaults to being an HDFS path. When you use file: you
seem to have a permission error (perhaps).
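For illustration only (not the original poster's code), making the scheme explicit removes the ambiguity:
    r.saveAsTextFile("hdfs:///user/cloudera/out1")      // explicitly HDFS
    r.saveAsTextFile("file:///home/cloudera/tmp/out1")  // explicitly the local filesystem (needs write permission on every node)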
On Mon, Jan 12, 2015 at 4:21 PM, NingjunWang
I am trying to use it, but without success. Any sample code that works with
Spark would be highly appreciated. :)
Hello all,
I have a naive question regarding how spark uses the executors in a cluster
of machines. Imagine the scenario in which I do not know the input size of
my data in execution A, so I set Spark to use 20 (out of my 25 nodes, for
instance). At the same time, I also launch a second execution
Hi Ganelin, sorry if it wasn't clear from my previous email, but that is
how I am creating a spark context. I just didn't write out the lines
where I create the new SparkConf and SparkContext. I am also upping the
driver memory when running.
Thanks,
David
On 01/12/2015 11:12 AM, Ganelin,
At a quick glance, I think you're misunderstanding some basic features.
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
Map is a transformation, it is lazy. You're not calling any action on the
result of map.
Also, closing over a mutable variable (like idx or
Isn't the syntax --conf property=value?
http://spark.apache.org/docs/latest/configuration.html
Yes, I think setting it after the driver is running is of course too late.
On Mon, Jan 12, 2015 at 4:01 PM, David McWhorter mcwhor...@ccri.com wrote:
Hi all,
I'm trying to figure out how to set
That's not quite what I'm looking for. Let me provide an example. I have a
rowmatrix A that is nxm and I have two local matrices b and c. b is mx1 and c
is nx1. In my spark job I wish to perform the following two computations
A*b
and
A^T*c
I don't think this is possible without being
Hello Everyone,
Quick followup: is there any way I can append output to one file rather
than create a new directory/file every X milliseconds?
Thanks!
Suhas Shekar
University of California, Los Angeles
B.A. Economics, Specialization in Computing 2014
On Thu, Jan 8, 2015 at 11:41 PM, Su She
Hi Alessandro,
You can look for a log line like this in your driver's output:
15/01/12 10:51:01 INFO storage.DiskBlockManager: Created local
directory at
/data/yarn/nm/usercache/systest/appcache/application_1421081007635_0002/spark-local-20150112105101-4f3d
If you're deploying your application
Hi all,
I am running a Spark streaming application with ReliableKafkaReceiver (Spark
1.2.0). I was constantly getting the following exception:
15/01/12 19:07:06 ERROR receiver.BlockGenerator: Error in block pushing thread
java.util.concurrent.TimeoutException: Futures timed out after [30
Good idea! Join each element of c with the corresponding row of A, multiply
through, then reduce. I'll give this a try.
Thanks,
Alex
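For illustration, a rough sketch of that idea; rowsOfA and cEntries are assumed names for the RDD of rows backing A and an RDD of the entries of c zipped in matching row order:
    val scaled = rowsOfA.zip(cEntries).map { case (row, ci) =>
      row.toArray.map(_ * ci)                      // scale row i of A by c_i
    }
    val aTc = scaled.reduce { (x, y) =>
      x.zip(y).map { case (a, b) => a + b }        // element-wise sum of the scaled rows = A^T * c
    }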
From: Reza Zadeh r...@databricks.com
Sent: Monday, January 12, 2015 3:05 PM
To: Alex Minnaar
Cc:
Is there anybody who can help me with this? Thank you very much!
Hi,
My Spark job failed with "no snappyjava in java.library.path":
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in
java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1857)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at
I ran into this recently. Turned out we had an old
org-xerial-snappy.properties file in one of our conf directories that
had the setting:
# Disables loading Snappy-Java native library bundled in the
# snappy-java-*.jar file forcing to load the Snappy-Java native
# library from the
If you would like to work with an API, you can use the Spark Kernel found
here: https://github.com/ibm-et/spark-kernel
The kernel provides an API following the IPython message protocol as well
as a client library that can be used with Scala applications.
The kernel can also be plugged into the
We are running Spark and Spark Streaming on Mesos (with multiple masters for
HA).
At launch, our Spark jobs successfully look up the current Mesos master from
zookeeper and spawn tasks.
However, when the Mesos master changes while the spark job is executing, the
spark driver seems to interact
Hi all,
For some reason, I restarted the Spark master node.
Before I restarted it, there were some application running records at the
bottom of the master web page, but they are gone after I restarted the master
node. The records include application name, running time, status, and so
on. I am sure
There is no direct way of doing that. If you need a single file for every
batch duration, then you can repartition the data to 1 partition before saving.
Another way would be to use Hadoop's copyMerge command/API (available from
the 2.0 versions).
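A minimal sketch of the first option, assuming a DStream named dstream and an output path of your choosing:
    dstream.foreachRDD { (rdd, time) =>
      rdd.repartition(1).saveAsTextFile("hdfs:///output/batch-" + time.milliseconds)
    }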
On 13 Jan 2015 01:08, Su She suhsheka...@gmail.com wrote:
Hello
Okay, thanks Akhil!
Suhas Shekar
University of California, Los Angeles
B.A. Economics, Specialization in Computing 2014
On Mon, Jan 12, 2015 at 1:24 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
There is no direct way of doing that. If you need a Single file for every
batch duration, then
http://spark.apache.org/docs/latest/monitoring.html
http://spark.apache.org/docs/latest/configuration.html#spark-ui
spark.eventLog.enabled
On Mon, Jan 12, 2015 at 3:00 PM, ChongTang ct...@virginia.edu wrote:
Is there anybody who can help me with this? Thank you very much!
Hi Ethan,
How are you specifying the master to spark?
Being able to recover from master failover is already handled by the underlying
Mesos scheduler, but you have to use ZooKeeper instead of directly passing
in the master URIs.
Tim
On Mon, Jan 12, 2015 at 12:44 PM, Ethan Wolf
Thank you, Cody! Actually, I have enabled this option, and I saved logs
into Hadoop file system. The problem is, how can I get the duration of an
application? The attached file is the log I copied from HDFS.
On Mon, Jan 12, 2015 at 4:36 PM, Cody Koeninger c...@koeninger.org wrote:
Could you compare V directly and tell us more about the difference you
saw? The column of V should be the same subject to signs. For example,
the first column of V could be either [0.8, -0.6, 0.0] or [-0.8, 0.6,
0.0]. -Xiangrui
On Sat, Jan 10, 2015 at 8:08 PM, Upul Bandara upulband...@gmail.com
Greetings!
My executors apparently are being terminated because they are running beyond
physical memory limits according to the yarn-hadoop-nodemanager logs on the
worker nodes (/mnt/var/log/hadoop on AWS EMR). I'm setting the driver-memory
to 8G. However, looking at stdout in userlogs, I can
Short answer: yes.
Take a look at: http://spark.apache.org/docs/latest/running-on-yarn.html
Look for memoryOverhead.
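For example, a sketch of setting it the same way the driver overhead is set elsewhere in this digest; the value 1024 (MB) is just an assumed starting point:
    val conf = new SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")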
On Mon, Jan 12, 2015 at 2:06 PM, Michael Albert
m_albert...@yahoo.com.invalid wrote:
Greetings!
My executors apparently are being terminated because they are
running beyond
Sorry, slightly misunderstood the question. I'm not sure if there's a way
to make the master UI read old log files after a restart, but the log files
themselves are human readable text.
If you just want application duration, the start and stop are timestamped,
look for lines like this in
Hi,
I am trying to build my own Scala project using sbt. The project is
dependent on both spark-core and spark-mllib. I included the following two
dependencies in my build.sbt file:
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"
libraryDependencies += "org.apache.spark" %%
Hi,
When I am running a job that is loading data from Cassandra, Spark has
created almost 9 million partitions. How does Spark decide the partition count? I
have read in one of the presentations that it is good to have 1,000 to
10,000 partitions.
Regards
Raj
No, colStats() computes all summary statistics in one pass and stores
the values. It is not lazy.
On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar rokros...@gmail.com wrote:
This was without using Kryo -- if I use kryo, I got errors about buffer
overflows (see above):
I don't know the root cause. Could you try including only
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.1.1"
It should be sufficient because mllib depends on core.
-Xiangrui
On Mon, Jan 12, 2015 at 2:27 PM, Jianguo Li flyingfromch...@gmail.com wrote:
Hi,
I am trying to build my
Does anybody have insight on this? Thanks.
On Fri, Jan 9, 2015 at 6:30 PM, Pala M Muthaia mchett...@rocketfuelinc.com
wrote:
Hi,
I am using Spark 1.0.1. I am trying to debug an OOM exception I saw during
a join step.
Basically, I have an RDD of rows that I am joining with another RDD of
First, you should collect().toMap() the small RDD, then you should use
broadcast followed by a map to do a map-side join
http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
(slide
10 has an example).
Spark SQL also does it by default for tables
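A minimal sketch of that map-side join, with assumed names and pair-RDD shapes:
    val smallMap = sc.broadcast(smallRdd.collect().toMap)     // small RDD -> local Map, shipped once to each executor
    val joined = largeRdd.flatMap { case (k, v) =>
      smallMap.value.get(k).map(sv => (k, (v, sv)))           // inner join: keep only keys present on both sides
    }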
Hey Kevin,
I assume you want to trigger the map() for a side effect (since you don't
care about the result). To Cody's point, you can use foreach() *instead* of
map(). So instead of e.g. x.map(a => foo(a)).foreach(a => a), you'd run
x.foreach(a => foo(a)).
Best,
-Sven
On Mon, Jan 12, 2015 at 5:13
Is there a way to compute the total number of records in each RDD partition?
So say I had 4 partitions. I'd want to have
partition 0: 100 records
partition 1: 104 records
partition 2: 90 records
partition 3: 140 records
Kevin
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog:
Also a ready-to-use server with Spark MLlib:
http://docs.prediction.io/recommendation/quickstart/
The source code is here:
https://github.com/PredictionIO/PredictionIO/tree/develop/templates/scala-parallel-recommendation
Simon
On Sun, Nov 30, 2014 at 12:17 PM, Pat Ferrel p...@occamsmachete.com
I guess you're not using too many features (e.g. 10m), just that hashing
the index makes it look that way, is that correct?
If so, the simple dictionary that maps your feature index -> rank can be
broadcast and used everywhere, so you can pass MLlib just the feature's
rank as its index.
Reza
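A rough sketch of that remapping; the shape of hashedFeatures and all other names are assumed:
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // hashedFeatures: RDD[(Double, Seq[(Int, Double)])] = (label, (hashedIndex, value) pairs) -- assumed shape
    val ranks: Map[Int, Int] =
      hashedFeatures.flatMap(_._2.map(_._1)).distinct().collect().sorted.zipWithIndex.toMap
    val ranksBc = sc.broadcast(ranks)
    val points = hashedFeatures.map { case (label, feats) =>
      val (idx, vals) = feats.map { case (h, v) => (ranksBc.value(h), v) }.sortBy(_._1).unzip
      LabeledPoint(label, Vectors.sparse(ranksBc.value.size, idx.toArray, vals.toArray))
    }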
On
Hi,
Currently in GradientDescent.scala, weights is constructed as a dense
vector:
initialWeights = Vectors.dense(new Array[Double](numFeatures))
And the numFeatures is determined in the loadLibSVMFile as the max index of
features.
But in the case of using a hash function to compute feature
Sean,
Thanks for the response. Is there some subtle difference between one model
partitioned by N users or N models per each 1 user? I think I'm missing
something with your question.
Looping through the RDD filtering one user at a time would certainly give
me the response that I am hoping for
Cody said "If you don't care about the value that your map produced (because
you're not already collecting or saving it), then is foreach more
appropriate to what you're doing?" but I cannot see it from this thread.
Anyway, I performed a small benchmark to test which function is the most
efficient
Yes, using mapPartitionsWithIndex, e.g. in PySpark:
sc.parallelize(xrange(0,1000), 4).mapPartitionsWithIndex(lambda
idx,iter: ((idx, len(list(iter))),)).collect()
[(0, 250), (1, 250), (2, 250), (3, 250)]
(This is not the most efficient way to get the length of an iterator, but
you get the
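The same thing sketched in Scala, for reference (not from the original message):
    sc.parallelize(0 until 1000, 4)
      .mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
      .collect()
    // e.g. Array((0,250), (1,250), (2,250), (3,250))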
You are right... my code example doesn't work :)
I actually do want a decision tree per user. So, for 1 million users, I
want 1 million trees. We're training against time series data, so there are
still quite a few data points per user. My previous message where I
mentioned RDDs with no length
Hey Pala,
I also find it very hard to get to the bottom of memory issues such as this
one based on what's in the logs (so if you come up with some findings, then
please share here). In the interim, here are a few things you can try:
- Provision more memory per executor. While in theory (and
Hi,
How do I do a broadcast/map join on RDDs? I have a large RDD that I want to
inner join with a small RDD. Instead of having the large RDD repartitioned
and shuffled for the join, I would rather send a copy of the small RDD to each
task, and then perform the join locally.
How would I specify this in
A model partitioned by users?
I mean that if you have a million users surely you don't mean to build a
million models. There would be little data per user right? Sounds like you
have 0 sometimes.
You would typically be generalizing across users not examining them in
isolation. Models are built
Anders,
This could be related to this open ticket:
https://issues.apache.org/jira/browse/SPARK-5077. A call to coalesce() also
fixed that for us as a stopgap.
Best,
-Sven
On Mon, Jan 12, 2015 at 10:18 AM, Anders Arpteg arp...@spotify.com wrote:
Yes sure Sandy, I've checked the logs and it's
I have a Spark application that was assembled using sbt 0.13.7, Scala 2.11,
and Spark 1.2.0. I am running on Mac OS X Yosemite. In build.sbt,
I use "provided" for the Spark dependencies. I can run the application fine
within sbt.
I run into problems when I try to run it from the command line. Here
Yes, this proxy problem is resolved.
"how your build refers to https://github.com/ScrapCodes/sbt-pom-reader.git --
I don't see this repo in the project code base"
I manually downloaded the sbt-pom-reader directory and moved it into
.sbt/0.13/staging/*/
Got it, thanks!
On Tue, Jan 13, 2015 at 2:00 PM, Justin Yip yipjus...@gmail.com wrote:
Xuelin,
There is a function called emptyRDD on the SparkContext
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
which
serves this purpose.
Justin
On Mon, Jan
Xuelin,
There is a function called emptyRDD on the SparkContext
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
which
serves this purpose.
Justin
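For example, a one-line sketch:
    val empty = sc.emptyRDD[Int]   // an RDD with no elements (and no partitions)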
On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi,
I'd like to create a
What about your scenario? Do you need to use lots of broadcasts? If not, it would be
better to focus more on other things.
At this time, there is no better method than TorrentBroadcast. Although it transfers
blocks one by one, after one node gets the data, it can act as a data source
immediately.
Hi, I am trying to read a Parquet file using
val parquetFile = sqlContext.parquetFile("people.parquet")
There is no way to specify that I am interested in reading only some columns
from disk. For example, if the Parquet file has 10 columns, I want to read
only 3 columns from disk.
We have done
Hi all,
I'm trying to figure out how to set this option:
spark.yarn.driver.memoryOverhead on Spark 1.2.0. I found this helpful
overview
http://apache-spark-user-list.1001560.n3.nabble.com/Stable-spark-streaming-app-td14105.html#a14476,
which suggests to launch with
There are two related options:
To solve your problem directly try:
val conf = new SparkConf().set("spark.yarn.driver.memoryOverhead", "1024")
val sc = new SparkContext(conf)
And the second, which increases the overall memory available on the driver, as
part of your spark-submit script add:
Prannoy
I tried this r.saveAsTextFile("home/cloudera/tmp/out1"); it returned without
error. But where was it saved to? The folder "/home/cloudera/tmp/out1" was not
created.
I also tried the following
cd /home/cloudera/tmp/
spark-shell
scala> val r = sc.parallelize(Array("a", "b", "c"))
scala>
Dear All
What I want to do is:
As the data is partitioned across many worker nodes, I want to be able to process
each partition of data as a whole, and then produce my
output using flatMap, for example.
So can I load all of the input records on one worker node and emit any
output
Yes sure Sandy, I've checked the logs and it's not a OOM issue. I've
actually been able to solve the problem finally, and it seems to be an
issue with too many partitions. The repartitioning I tried initially did so
after the union, and then it's too late. By repartitioning as early as
possible,
Not sure I understand correctly, but it sounds like you're looking for
mapPartitions().
-Sven
On Mon, Jan 12, 2015 at 10:17 AM, maherrt mahe...@hotmail.com wrote:
Dear All
What I want to do is:
As the data is partitioned across many worker nodes, I want to be able to
process
each partition of
Dear Spark Users,
I googled the web for several hours now but I don't find a solution for my
problem. So maybe someone from this list can help.
I have an RDD of case classes, generated from CSV files with Spark. When I used
the distinct operator, there were still duplicates. So I investigated
Hi Akhil,
Thank you for the pointers. Below is how we are saving data to Cassandra.
javaFunctions(rddToSave).writerBuilder(datapipelineKeyspace,
datapipelineOutputTable, mapToRow(Sample.class))
The data we are saving at this stage is ~200 million rows.
How do we control application threads
Is this in the Spark shell? Case classes don't work correctly in the Spark
shell unfortunately (though they do work in the Scala shell) because we change
the way lines of code compile to allow shipping functions across the network.
The best way to get case classes in there is to compile them