Hi All,
I wanted to try SparkR. Do we need R preinstalled on all the nodes of the
cluster before installing the SparkR package? Please guide me on how to proceed
with this. As of now, I work with R only on a single node.
Please suggest.
Thanks
Stuti Awasthi
Ah, right. So only the launch script has changed. Everything else is still
essentially binary compatible?
Well, that makes it too easy! Thanks!
On Wed, Jun 18, 2014 at 2:35 PM, Patrick Wendell pwend...@gmail.com wrote:
Actually, you'll just want to clone the 1.0 branch and then use the
spark-ec2
Thanks, I hope this problem will go away once I upgrade to Spark 1.0, where we
can ship cluster-wide classpaths using the spark-submit command.
Gerard,
We haven't done a test of Calliope vs. a driver.
The thing is, Calliope builds on C* Thrift (and the latest build on the
DataStax driver), so its performance for simple writes will be similar to any
existing driver. But that is not the use case for Calliope.
It is built to be used from
Hi Andrew,
Strangely, in my Spark (1.0.0 compiled against Hadoop 2.4.0) log, it says
file not found. I'll try again.
Jianshi
On Wed, Jun 18, 2014 at 12:36 PM, Andrew Ash and...@andrewash.com wrote:
In Spark you can use the normal globs supported by Hadoop's FileSystem,
which are documented
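For instance, a minimal sketch (hypothetical paths) of such globs with
sc.textFile; the same patterns are handed to Hadoop's FileSystem:
val days = sc.textFile("hdfs://domain/logs/2014-06-*")           // wildcard
val months = sc.textFile("hdfs://domain/data/month=2014{01,02}") // alternation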
Hi all,
Thanks for the reply. I'm using parquetFile as input; is that a problem? In
hadoop fs -ls, the path
(hdfs://domain/user/jianshuang/data/parquet/table/month=2014*)
will list all the files.
I'll test it again.
Jianshi
On Wed, Jun 18, 2014 at 2:23 PM, Jianshi Huang
Hello Xiangrui,
Thanks for sharing the roadmap. It really helped.
Regards,
Jayati
There is nothing special about CDH5 Spark in this regard. CDH 5.0.x has
Spark 0.9.0, and the imminent next release will have 1.0.0 + upstream
patches.
You're simply accessing a class that was not present in 0.9.0, but is
present after that:
Hi,
Thanks Andrew and Daniel for the response.
Setting spark.shuffle.spill to false didn't make any difference: 5 days of data
completed in 6 minutes, while 10 days was stuck after around 1 hour.
Daniel, in my current use case I can't read all the files into a single RDD. But
I have another use case where I did it
Hi,
Could your problem come from the fact that you run your tests in parallel?
If you are running Spark in local mode, you cannot have concurrent Spark
instances running. This means that your tests instantiating a SparkContext
cannot run in parallel. The easiest fix is to tell sbt not to run tests in
parallel.
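For example, in build.sbt (a minimal sketch, sbt 0.13-style):
parallelExecution in Test := false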
We just merged a feature into master that lets you print the schema or view
it as a string (printSchema() and schemaTreeString on SchemaRDD).
There is also this JIRA targeting 1.1 for presenting a nice programmatic API
for this information: https://issues.apache.org/jira/browse/SPARK-2179
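A minimal sketch of those calls (names as described above; this is on master,
not in the 1.0.0 release), assuming an existing SQLContext:
val people = sqlContext.parquetFile("people.parquet") // hypothetical file
people.printSchema()                    // pretty-prints the schema tree
val s: String = people.schemaTreeString // the same tree as a String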
On
You cannot assume that caching would always reduce the execution time,
especially if the data-set is large. It appears that if too much memory is
used for caching, then less memory is left for the actual computation
itself. There has to be a balance between the two.
Page 33 of this thesis from
I guess this is a basic question about the usage of reduce. Please shed some
light, thank you!
In the test application, I create a DStream by connecting to a socket.
Then I want to count the RDDs in the DStream which match another
reference RDD.
Below is the Java code for my application.
==
public class TestSparkStreaming {
public static void main(String[] args) {
Hi,
We are trying to implement a BSP model in Spark with the help of GraphX.
One thing I encountered is the Pregel operator in the Graph class. But what I
fail to understand is how the Master and Workers need to be assigned (as in
BSP), and how barrier synchronization would happen. The Pregel operator
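For reference, a minimal sketch of the Pregel operator (the standard
single-source shortest paths example, assuming graph: Graph[_, Double] with
Double edge weights); each superstep acts as an implicit barrier, and a vertex
that receives no messages is skipped in the next superstep:
import org.apache.spark.graphx._
val sourceId: VertexId = 1L // hypothetical source vertex
val init = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = init.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vertex program
  triplet =>                                        // send messages along edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b))                         // merge incoming messages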
Patrick,
My team is using shuffle consolidation but not speculation. We are also
using persist(DISK_ONLY) for caching.
Here are some config changes that are in our work-in-progress.
We've been trying for 2 weeks to get our production flow (maybe around
50-70 stages, a few forks and joins with
Hello everybody,
Xiangrui, thanks for the link to the roadmap. I saw it is planned to implement
LDA in MLlib 1.1. What do you think about PLSA?
I understand that LDA is more popular now, but recent research shows that
modifications of PLSA sometimes perform better [1]. Furthermore, the most
Hi Gaurav, thanks for your pointer. The observation in the link is (at
least qualitatively) similar to mine.
Now the question is: if I do have big data (40 GB, cached size 60 GB) and
even big memory (192 GB), I cannot benefit from the RDD cache, and should
persist on disk and leverage the filesystem
if I do have big data (40 GB, cached size 60 GB) and even big memory (192
GB), I cannot benefit from the RDD cache, and should persist on disk and
leverage the filesystem cache?
The answer to the question of whether to persist (spill-over) data on disk
is not always immediately clear, because generally
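A minimal sketch of the storage levels being weighed here (hypothetical path):
import org.apache.spark.storage.StorageLevel
val data = sc.textFile("hdfs://domain/big/dataset")
data.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, spills the rest to disk
// data.persist(StorageLevel.DISK_ONLY)        // or skip the in-memory cache entirely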
Is that month= syntax something special, or do your files actually have
that string as part of their name?
On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi all,
Thanks for the reply. I'm using parquetFile as input, is that a problem?
In hadoop fs -ls, the
Hi Nicholas,
month= is for Hive to auto-discover the partitions. It's part of the path of
my files.
Jianshi
On Wed, Jun 18, 2014 at 11:52 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Is that month= syntax something special, or do your files actually have
that string as part of
I wonder if that’s the problem. Is there an equivalent hadoop fs -ls
command you can run that returns the same files you want but doesn’t have
that month= string?
On Wed, Jun 18, 2014 at 12:25 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi Nicholas,
month= is for Hive to auto discover
Hi All,
Has anyone run into the same problem? Looking at the source code in the
official release (rc11), this property setting is false by default; however,
I'm seeing the .sparkStaging folder remain on HDFS, causing the disk to fill up
pretty fast since SparkContext deploys
Forgot to mention that I am using spark-submit to submit jobs, and a
verbose-mode printout looks like this with the SparkPi example. The
.sparkStaging won't be deleted. My thought is that this should be part of the
staging and should be cleaned up as well when sc gets terminated.
By the way, any idea how to sync the Spark config dir with the other nodes in
the cluster?
~santhosh
Disabling parallelExecution has worked for me.
Other alternatives I’ve tried that also work include:
1. Using a lock (see the sketch below) – this will let tests execute in
parallel except for those using a SparkContext. If you have a large number of
tests that could execute in parallel, this can shave off some
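A minimal sketch of the lock approach from option 1 (ScalaTest, with assumed
names):
object SparkTestLock // shared across suites; only SparkContext tests serialize on it
class MySparkSuite extends org.scalatest.FunSuite {
  test("counting with a SparkContext") {
    SparkTestLock.synchronized {
      val sc = new org.apache.spark.SparkContext("local", "test")
      try assert(sc.parallelize(1 to 10).count() == 10)
      finally sc.stop()
    }
  }
}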
In my unit tests I have a base class that all my tests extend that has a
setup and teardown method that they inherit. They look something like this:
var spark: SparkContext = _
@Before
def setUp() {
  Thread.sleep(100L) // this seems to give spark more time to reset from the previous test
  spark = new SparkContext("local", "test")
}
@After
def tearDown() { spark.stop(); spark = null } // teardown assumed: stop the context between tests
OS X / Homebrew users,
It looks like you can now download Spark simply by doing:
brew install apache-spark
I’m new to Homebrew, so I’m not too sure how people are intended to use
this. I’m guessing this would just be a convenient way to get the latest
release onto your workstation, and from
Hi Naftali,
Yes you're right. For now please add a column of ones. We are working on
adding a weighted regularization term, and exposing the scala intercept
option in the python binding.
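A minimal sketch of that workaround (assuming an RDD[LabeledPoint] named
points, a hypothetical name):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
val withBias = points.map { p =>
  // the constant 1.0 column lets the weight vector learn an intercept term
  LabeledPoint(p.label, Vectors.dense(p.features.toArray :+ 1.0))
}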
Best,
Reza
On Mon, Jun 16, 2014 at 12:19 PM, Naftali Harris naft...@affirm.com wrote:
Hi everyone,
The
Interesting, does anyone know the people over there who set it up? It would be
good if Apache itself could publish packages there, though I’m not sure what’s
involved. Since Spark just depends on Java and Python it should be easy for us
to update.
Matei
On Jun 18, 2014, at 1:37 PM, Nick
Thanks Reza! :-D
Naftali
On Wed, Jun 18, 2014 at 1:47 PM, Reza Zadeh r...@databricks.com wrote:
Hi Naftali,
Yes you're right. For now please add a column of ones. We are working on
adding a weighted regularization term, and exposing the scala intercept
option in the python binding.
Cool.
Looked at the Pull Requests, the upgrade to 1.0.0 was just merged
yesterday. https://github.com/Homebrew/homebrew/pull/30231
https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb
On Wed, Jun 18, 2014 at 1:57 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
I am trying to process a file that contains 4 log lines (not very long) and
then write my parsed out case classes to a destination folder, and I get
the following error:
java.lang.OutOfMemoryError: Java heap space
at
Agreed, it would be better if Apache controlled or managed this directly.
I think making such a change is just a matter of opening a new issue
https://github.com/Homebrew/homebrew/issues/new on the Homebrew issue
tracker. I believe that's how Spark made it in there in the first place--it
was just
Matei,
You might want to comment on that issue Sherl linked to, or perhaps this one
https://github.com/Homebrew/homebrew/issues/30228, to ask about how
Apache can manage this going forward. I know that mikemcquaid
https://github.com/mikemcquaid is very active on the Homebrew repo and is
one of
What's the advantage of Apache maintaining the brew installer vs users?
Apache handling it means more work on this dev team, but probably a better
experience for brew users. Just wanted to weigh pros/cons before
committing to support this installation method.
Andrew
On Wed, Jun 18, 2014 at
Wait, so the file only has four lines and the job is running out of heap
space? Can you share the code you're running that does the processing?
I'd guess that you're doing some intense processing on every line, but just
writing parsed case classes back to disk sounds very lightweight.
I
On Wed,
Hi,
I have a 5 million record, 300 column data set.
I am running a Spark job in yarn-cluster mode with the following args:
--driver-memory 11G --executor-memory 11G --executor-cores 16 --num-executors
500
The Spark job replaces all categorical variables with some integers.
I am getting the below
Hi to all,
in my use case I'd like to receive events and call an external service as
they pass through. Is it possible to limit the number of simultaneous calls
to that service (to avoid DoS-ing it) using Spark Streaming? If so, limiting
the rate implies possible buffer growth... how can I control the
Ok, that patch does fix the key lookup exception. However, I'm curious about
the time validity check in isValidTime (
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala#L264
)
Why does (time - zeroTime) have to be a multiple of the slide
You can add a back-pressure-enabled component in front that feeds data into
Spark. This component can control the input rate to Spark.
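For example, a minimal sketch (not Spark-specific; names are assumptions) of
such a component, using a bounded queue so the producer blocks instead of
letting the buffer grow:
import java.util.concurrent.ArrayBlockingQueue
val buffer = new ArrayBlockingQueue[String](10000) // bounded: put() blocks when full
// producer thread: buffer.put(event)  -- applies back pressure upstream
// consumer thread: write buffer.take() to the socket that Spark Streaming reads from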
On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote:
Hi to all,
in my use case I'd like to receive events and call an external
Thanks for the quick reply, Soumya. Unfortunately I'm a newbie with
Spark... what do you mean? Is there any reference for how to do that?
On Thu, Jun 19, 2014 at 12:24 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:
You can add a back pressured enabled component in front that feeds data
into
Hi all,
I am setting up a system where Spark contexts would be created by a web
server that would handle the computation and return the results. I have the
following code (in Python):
os.environ['SPARK_HOME'] = "/home/spark/spark-1.0.0-bin-hadoop2/"
sc =
I have a flow that ends with saveAsTextFile() to HDFS.
It seems all the expected files per partition have been written out, based
on the number of part files and the file sizes.
But the driver logs show 2 tasks still not completed, with no activity,
and the worker logs show no activity for
The following is a simplified example of what I am trying to accomplish.
Say I have an RDD of objects like this:
{ country: "USA", name: "Franklin", age: 24, hits: 224 }
{ country: "USA", name: "Bob", age: 55, hits: 108 }
{ country: "France", name: "Remi", age:
Hi Nick,
Instead of using reduceByKey(), you might want to look into using
aggregateByKey(), which allows you to return a different value type U
instead of the input value type V for each input tuple (K, V). You can
define U to be a datatype that holds both the average and total and have
seqOp
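A minimal sketch in Scala (hypothetical data; aggregateByKey is in master
post-1.0.0, per the pull request mentioned in the reply below), carrying
(sum, count) as the U type to get both the total and the average per key:
val hits = sc.parallelize(Seq(("USA", 224L), ("USA", 108L), ("France", 68L)))
val sumCount = hits.aggregateByKey((0L, 0L))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into (sum, count)
  (a, b) => (a._1 + b._1, a._2 + b._2))   // combOp: merge partial (sum, count) pairs
val totalAndAvg = sumCount.mapValues { case (sum, n) => (sum, sum.toDouble / n) }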
Ah, this looks like exactly what I need! It looks like this was recently added
into PySpark https://github.com/apache/spark/pull/705/files#diff-6 (and
Spark Core), but it's not in the 1.0.0 release.
Thank you.
Nick
On Wed, Jun 18, 2014 at 7:42 PM, Doris Xin doris.s@gmail.com wrote:
Hi
This looks like a job for Spark SQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
  MyRecord("USA", "Bob", 55, 108),
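A sketch of how the example presumably continues, using the 1.0-era
registerAsTable call:
  // ... remaining records elided ...
data.registerAsTable("records")
sql("SELECT country, AVG(age), SUM(hits) FROM records GROUP BY country").collect()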
I was going to suggest the same thing :).
On Jun 18, 2014, at 4:56 PM, Evan R. Sparks evan.spa...@gmail.com wrote:
This looks like a job for Spark SQL!
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class MyRecord(country: String, name: String, age: Int,
That’s pretty neat! So I guess if you start with an RDD of objects, you’d
first do something like RDD.map(lambda x: Record(x['field_1'],
x['field_2'], ...)) in order to register it as a table, and from there run
your aggregates. Very nice.
On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks
If your input data is JSON, you can also try out the recently merged
in initial JSON support:
https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
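A minimal sketch of that support (method names per the linked commit and the
1.0-era API; hypothetical path):
val people = sqlContext.jsonFile("hdfs://domain/data/people.json") // schema inferred from the JSON
people.printSchema()
people.registerAsTable("people")
sql("SELECT name FROM people WHERE age >= 21")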
On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
That’s pretty neat! So I guess if you
Looks like eventually there was some type of reset or timeout, and the tasks
have been reassigned. I'm guessing they'll keep failing until the max failure
count.
The machine it disconnected from was a remote machine, though I've seen
such failures from connections to itself with other problems. The
This is exciting! Here is the relevant alpha doc
http://yhuai.github.io/site/sql-programming-guide.html#json-datasets for
this feature, for others reading this. I'm going to try this out.
Will this be released with 1.1.0?
On Wed, Jun 18, 2014 at 8:31 PM, Zongheng Yang zonghen...@gmail.com
Flavio - I'm new to Spark as well, but I've done stream processing using
other frameworks. My comments below are not Spark Streaming specific. Maybe
someone who knows more can provide better insights.
I read your post on my phone and I believe my answer doesn't completely
address the issue you have
Hi Bharath,
This is related to SPARK-1112, for which we have already found the root cause.
I will let you know when this is fixed.
Best,
Xiangrui
On Tue, Jun 17, 2014 at 7:37 PM, Bharath Ravi Kumar reachb...@gmail.com wrote:
Couple more points:
1) The inexplicable stalling of execution with large
Thanks. I'll await the fix to re-run my test.
On Thu, Jun 19, 2014 at 8:28 AM, Xiangrui Meng men...@gmail.com wrote:
Hi Bharath,
This is related to SPARK-1112, for which we have already found the root cause.
I will let you know when this is fixed.
Best,
Xiangrui
On Tue, Jun 17, 2014 at 7:37 PM,
-- Forwarded message --
From: Ghousia ghousia.ath...@gmail.com
Date: Wed, Jun 18, 2014 at 5:41 PM
Subject: BSP realization on Spark
To: user@spark.apache.org
Hi,
We are trying to implement a BSP model in Spark with the help of GraphX.
One thing I encountered is the Pregel operator
Hi all,
I have a doubt regarding the options in spark-env.sh. I set the following
values in the file on the master and 2 workers:
SPARK_WORKER_MEMORY=7g
SPARK_EXECUTOR_MEMORY=6g
SPARK_DAEMON_JAVA_OPTS+="-Dspark.akka.timeout=30 -Dspark.akka.frameSize=1 -Dspark.blockManagerHeartBeatMs=80"
Hi Roy,
Thanks for your help. I wrote a small code snippet that reproduces the
problem.
Could you help me read through it and see if I did anything wrong?
Thanks!
def main(args: Array[String]) {
  val conf = new SparkConf().setAppName("TEST")
    .setMaster("local[4]")