A grep for SPARK_MASTER_IP shows that sbin/start-master.sh and
sbin/start-slaves.sh are the only scripts that use it.
Yet, for example, in CDH5 the Spark master is started from
/etc/init.d/spark-master by running bin/spark-class. Does that mean
SPARK_MASTER_IP is simply ignored? It looks like that to me.
There's no need to initialize StateDStream. Take a look at the example
StatefulNetworkWordCount.scala; it's part of the Spark source code.
Is it always true that whenever we apply operations on an RDD, we get
another RDD?
Or does it depend on the return type of the operation?
On Sat, Sep 13, 2014 at 9:45 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:
An RDD is a fault-tolerant distributed structure. It is the primary
Shixiong,
These two snippets behave differently in Scala.
In the second snippet, you define a variable named m, and the right-hand
side is evaluated as part of the definition.
In other words, the variable is replaced by the pre-computed value of
Array(1.0) in the subsequent code.
So in the second
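A minimal illustration of the point (not the original snippets, which are not
quoted here): a val evaluates its right-hand side once, at the point of
definition, and later uses simply reuse that pre-computed value.

    def mkArray() = { println("evaluated"); Array(1.0) }  // helper, for illustration only
    val m = mkArray()    // "evaluated" is printed here, exactly once
    m.mkString(",")      // subsequent uses reuse the pre-computed Array(1.0)
    m.mkString(",")      // no re-evaluation happens here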
This is all covered in
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
By definition, RDD transformations take an RDD to another RDD; actions
produce some other type as a value on the driver program.
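A minimal illustration of the distinction, using made-up data:

    val rdd = sc.parallelize(Seq(1, 2, 3))   // RDD[Int]
    val doubled = rdd.map(_ * 2)             // transformation: yields another RDD
    val total = doubled.reduce(_ + _)        // action: returns a plain Int (12) on the driver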
On Fri, Sep 12, 2014 at 11:15 PM, Deep Pradhan
You can cache data in memory and query it using Spark Job Server.
Most folks dump data down to a queue/db for retrieval.
You can also batch up data and store it into Parquet partitions, then query it
using another Spark SQL shell; the JDBC driver in Spark SQL is part of 1.1, I believe.
--
Regards,
Mayur
Take for example this:
val lines = sc.textFile(args(0))
val nodes = lines.map(s => {
  val fields = s.split("\\s+")
  (fields(0), fields(1))
}).distinct().groupByKey().cache()
val nodeSizeTuple = nodes.map(node => (node._1.toInt, node._2.size))
val rootNode =
Thanks Xiangrui. This file already exists w/o escapes. I could probably try
to preprocess it and add the escaping.
On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng men...@gmail.com wrote:
I wrote an input format for Redshift tables unloaded via UNLOAD with the
ESCAPE option:
Again, RDD operations are of two basic varieties: transformations, which
produce further RDDs, and actions, which return values to the driver
program. You've used several RDD transformations and then finally the
top(1) action, which returns an array of one element to your driver
program. That
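A small sketch of the top(1) case, with made-up data:

    val counts = sc.parallelize(Seq(("a", 3), ("b", 7), ("c", 5)))
    val best: Array[(String, Int)] = counts.top(1)  // an action: a one-element Array on the driver, not an RDD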
I also found that
https://github.com/apache/spark/commit/8f6e2e9df41e7de22b1d1cbd524e20881f861dd0
had resolved this issue, but it seems that the correct code snippet is not in
master or the 1.1 release.
2014-09-13 17:12 GMT+08:00 Yanbo Liang yanboha...@gmail.com:
Hi All,
I found that the LogisticRegressionWithLBFGS interface is not consistent
with LogisticRegressionWithSGD in master and the 1.1 release.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L199
In the above code snippet,
Hi Yanbo,
We made the change here
https://github.com/apache/spark/commit/5d25c0b74f6397d78164b96afb8b8cbb1b15cfbd
Those APIs for setting the parameters are very difficult to maintain, so we
decided not to provide them. In the next release, Spark 1.2, we will have a
better API design for parameter setting.
Hi,
We all know that RDDs are immutable.
There are not enough operations to achieve anything and everything on
RDDs.
Take this, for example:
I want an Array of Bytes filled with zeros, which should change during the
program; some elements of that Array should change to 1.
If I make an RDD with
Hello,
I am facing performance issues with reduceByKey. I know that this topic
has already been covered, but I did not really find answers to my question.
I am using reduceByKey to remove entries with identical keys, using, as the
reduce function, (a, b) => a. It seems to be a relatively
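A minimal sketch of the pattern being described, with made-up key/value pairs:

    val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))
    // keep one arbitrary value per key; which of "a"/"b" survives is unspecified
    val deduped = pairs.reduceByKey((a, b) => a)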
However, the cache is not guaranteed to remain: if other jobs are launched
in the cluster and require more memory than what's left in the overall
caching memory, previously cached RDDs will be discarded.
Using an off-heap cache like Tachyon as a dump repository can help.
In general, I'd say that using a
If you are just looking for distinct keys, .keys.distinct() should be
much better.
On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme julien.ca...@gmail.com wrote:
Hello,
I am facing performance issues with reduceByKey. I know that this topic has
already been covered, but I did not really find
Hi,
Jerry said "I'm guessing", so maybe the thing to try is to check whether his
guess is correct.
What about running sudo lsof | grep metrics.properties? I imagine you
should be able to see it if the file was found and read. If Jerry is
right, then I think you will NOT see it.
Next, how about
I need to remove objects with duplicate keys, but I need the whole object.
Objects which have the same key are not necessarily equal, though (but I can
drop any of the ones that have an identical key).
2014-09-13 12:50 GMT+02:00 Sean Owen so...@cloudera.com:
If you are just looking for distinct
Your app is the running Spark Streaming system. It would be up to you
to build some mechanism that lets you cause it to call stop() in
response to some signal from you.
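One possible mechanism (a sketch only, not something from this thread): a
background thread that polls for an external marker file and then stops the
context gracefully. The path is hypothetical, and ssc is assumed to be your
StreamingContext.

    import org.apache.hadoop.fs.{FileSystem, Path}

    val stopMarker = new Path("hdfs:///tmp/stop-my-streaming-app")  // hypothetical marker file
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    new Thread(new Runnable {
      def run(): Unit = {
        while (!fs.exists(stopMarker)) Thread.sleep(10000)  // poll every 10s
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }).start()
    ssc.start()
    ssc.awaitTermination()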
On Fri, Sep 12, 2014 at 3:59 PM, stanley wangshua...@yahoo.com wrote:
In spark streaming programming document
No, your error is right there in the logs. Unset SPARK_CLASSPATH.
On Fri, Sep 12, 2014 at 10:20 PM, freedafeng freedaf...@yahoo.com wrote:
: org.apache.spark.SparkException: Found both spark.driver.extraClassPath
and SPARK_CLASSPATH. Use only the former.
Hi,
I too am having a problem with compiling Spark from source. However, my
problem is different. I downloaded the latest version (1.1.0) and ran ./sbt/sbt
assembly from the command line. I end up with the following error:
[info] SHA-1: 20abd673d1e0690a6d5b64951868eef8d332d084
[info] Packaging
I had looked at that.
If I have a set of saved word counts from a previous run and want to load
them in the next run, what is the best way to do it?
I am thinking of hacking the Spark code to have an initial RDD in
StateDStream,
and using that for the first batch.
On Fri, Sep 12, 2014 at 11:04
This is more concise:
x.groupBy(_.fieldtobekey).values.map(_.head)
... but I doubt it's faster.
If all objects with the same fieldtobekey are within the same
partition, then yes, I imagine your biggest speedup comes from
exploiting that. How about ...
x.mapPartitions(_.map(obj =>
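A fuller sketch of that mapPartitions idea, with a hypothetical Record type and
field name, assuming x is an RDD[Record] and that all records sharing a key sit
in one partition:

    case class Record(fieldtobekey: String, payload: String)

    val deduped = x.mapPartitions { iter =>
      val seen = scala.collection.mutable.HashSet[String]()
      iter.filter(r => seen.add(r.fieldtobekey))  // keep only the first record seen per key
    }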
Have you tried using RDD.map() to transform some of the RDD elements from 0
to 1? Why doesn’t that work? That’s how you change data in Spark, by
defining a new RDD that’s a transformation of an old one.
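A minimal sketch of that suggestion, assuming an RDD of bytes and an arbitrary
rule for which elements become 1:

    val zeros = sc.parallelize(Array.fill[Byte](10)(0))  // hypothetical starting data
    // build a new RDD instead of mutating; here, flip the elements at even indices
    val flipped = zeros.zipWithIndex().map { case (b, i) =>
      if (i % 2 == 0) 1.toByte else b
    }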
On Sat, Sep 13, 2014 at 5:39 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
We all
bq. [error] (repl/compile:compile) Compilation failed
Can you pastebin more of the output?
Cheers
Upgraded to 1.1 and the issue is resolved. Thanks.
I still wonder if there is a better way to approach a large attribute
dataset.
On Fri, Sep 12, 2014 at 12:20 PM, Prashant Sharma scrapco...@gmail.com
wrote:
What is your Spark version? This was fixed, I suppose. Can you try it
with the latest
OK, mapPartitions seems to be the way to go. Thanks for the help!
On 13 Sep 2014 16:41, Sean Owen so...@cloudera.com wrote:
This is more concise:
x.groupBy(_.fieldtobekey).values.map(_.head)
... but I doubt it's faster.
If all objects with the same fieldtobekey are within the same
Howdy doody Spark Users,
I’d like to somehow write out a single RDD to multiple paths in one go.
Here’s an example.
I have an RDD of (key, value) pairs like this:
a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
a.collect()
[('N', 'Nick'), ('N', 'Nancy'),
Hi Ted,
Thanks for the prompt reply :)
please find details of the issue at this URL: http://pastebin.com/Xt0hZ38q
Kind Regards
I'm not sure what you mean by previous run. Is it the previous batch, or a
previous run of spark-submit?
If it's the previous batch (Spark Streaming creates a batch every batch
interval), then there's nothing to do.
If it's a previous run of spark-submit (assuming you are able to save the
result somewhere),
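A minimal sketch of the save-and-reload idea (hypothetical variable names and
path; wordCounts is assumed to be an RDD[(String, Int)], and the streaming
wiring is left out):

    // at the end of the previous run: persist the final word counts
    wordCounts.saveAsObjectFile("hdfs:///tmp/wordcounts-state")  // hypothetical path

    // at the start of the next run: read them back as the starting state
    val initialCounts = sc.objectFile[(String, Int)]("hdfs:///tmp/wordcounts-state")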
bq. [error] File name too long
It is not clear which file(s) loadfiles was loading.
Is the filename in an earlier part of the output?
Cheers
On Sat, Sep 13, 2014 at 10:58 AM, kkptninja kkptni...@gmail.com wrote:
Hi Ted,
Thanks for the prompt reply :)
please find details of the issue at this
Hi All:
We know some memory in Spark is used for computing (e.g.,
spark.shuffle.memoryFraction) and some is used for caching RDDs for future
use (e.g., spark.storage.memoryFraction).
Is there any existing workload which can utilize both of them during its
running life cycle? I want to do some
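For reference, a small sketch of how those two fractions are set (the values
shown are the 1.x defaults, for illustration):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.memoryFraction", "0.2")  // fraction of heap for shuffle aggregation buffers
      .set("spark.storage.memoryFraction", "0.6")  // fraction of heap for cached RDD storage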
Can you try sbt/sbt clean first?
On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu yuzhih...@gmail.com wrote:
bq. [error] File name too long
It is not clear which file(s) loadfiles was loading.
Is the filename in an earlier part of the output?
Cheers
On Sat, Sep 13, 2014 at 10:58 AM, kkptninja
Thanks for the pointers. I meant a previous run of spark-submit.
For 1: This would be a bit more computation in every batch.
2: It's a good idea, but it may be inefficient to retrieve each value.
In general, for a generic state machine, the initialization and input
sequence are critical for
Hi Koert,
Thanks for reporting this. These tests have been flaky even on the master
branch for a long time. You can safely disregard these test failures, as
the root cause is port collisions from the many SparkContexts we create
over the course of the entire test. There is a patch that fixes this
val file =
sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt")
1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate
from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "'
and fxCurCode = '" +