I'm just getting started with Spark SQL and DataFrames in 1.3.0.
I notice that the Spark API shows a different syntax for referencing
columns in a dataframe than the Spark SQL Programming Guide.
For instance, the API docs for the select method show this:
df.select($"colA", $"colB")
Whereas the
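(For comparison, here is a minimal sketch of the equivalent styles in 1.3-era Scala; the JSON file and the name/age columns are just placeholders I made up:)
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)        // sc is an existing SparkContext
import sqlContext.implicits._              // brings the $"colName" syntax into scope
val df = sqlContext.jsonFile("people.json")
df.select($"name", $"age").show()          // API-doc style: $ string interpolator
df.select(df("name"), df("age")).show()    // programming-guide style: df("col")
df.select("name", "age").show()            // plain column-name strings also work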
I'm hoping someone can clear up some confusion for me.
When I view the Spark 1.0 docs online (http://spark.apache.org/docs/1.0.0/)
they are different than the docs which are packaged with the Spark 1.0.0
download (spark-1.0.0.tgz).
In particular, in the online docs, there's a single merged Spark
for each element of your RDD?
Nick
On Tue, May 6, 2014 at 3:31 PM, Diana Carroll dcarr...@cloudera.com wrote:
What should I do if I want to log something as part of a task?
This is what I tried. To set up a logger, I followed the advice here:
http://py4j.sourceforge.net/faq.html#how
Hey all, trying to set up a pretty simple streaming app and getting some
weird behavior.
First, a non-streaming job that works fine: I'm trying to pull out lines
of a log file that match a regex, for which I've set up a function:
def getRequestDoc(s: String): String = {
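A minimal sketch of what such a function and job might look like (the regex, the function body, and the weblogs path are my own placeholders, not the original code):
import org.apache.spark.{SparkConf, SparkContext}
object GrepLogs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GrepLogs"))
    val requestPattern = """(?:GET|POST) (\S+)""".r      // hypothetical pattern
    def getRequestDoc(s: String): String =
      requestPattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")
    val docs = sc.textFile("weblogs/*.log")              // path is an assumption
      .map(getRequestDoc)
      .filter(_.nonEmpty)
    docs.take(10).foreach(println)
    sc.stop()
  }
}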
What should I do if I want to log something as part of a task?
This is what I tried. To set up a logger, I followed the advice here:
http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off
import logging
logger = logging.getLogger("py4j")
logger.setLevel(logging.INFO)
recollection of other discussions on this topic on the list. However, going
back and looking at the programming guide, this is not the way the
cache/persist behavior is described. Does the guide need to be updated?
On Fri, May 2, 2014 at 9:04 AM, Diana Carroll dcarr...@cloudera.com wrote:
I'm just Posty
Anyone have any guidance on using a broadcast variable to ship data to
workers vs. an RDD?
Like, say I'm joining web logs in an RDD with user account data. I could
keep the account data in an RDD or if it's small, a broadcast variable
instead. How small is small? Small enough that I know it
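A rough sketch of the two options, with made-up paths and record layouts; the usual rule of thumb is that the broadcast table has to fit comfortably in each executor's memory, since every worker gets a full copy:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
object BroadcastVsJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastVsJoin"))
    val logs = sc.textFile("weblogs").map(line => (line.split(" ")(0), line))   // (userId, line)
    val accountLines = sc.textFile("accounts").map { line =>
      val f = line.split(","); (f(0), f(1))                                     // (userId, name)
    }
    // Option 1: collect the small table and broadcast it; the join is a map-side lookup, no shuffle.
    val accounts = sc.broadcast(accountLines.collect().toMap)
    val joinedViaBroadcast = logs.map { case (user, line) =>
      (user, (line, accounts.value.get(user)))
    }
    // Option 2: keep the account data as an RDD and do a regular join, which shuffles both sides.
    val joinedViaRdd = logs.join(accountLines)
    println(joinedViaBroadcast.count() + " " + joinedViaRdd.count())
    sc.stop()
  }
}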
I'm guessing your shell stopping when it attempts to connect to the RM is
not related to that warning. You'll get that message out of the box from
Spark if you don't have HADOOP_HOME set correctly. I'm using CDH 5.0
installed in default locations, and got rid of the warning by setting
Hi everyone. I'm trying to run some of the Spark example code, and most of
it appears to be undocumented (unless I'm missing something). Can someone
help me out?
I'm particularly interested in running SparkALS, which wants parameters:
M U F iter slices
What are these variables? They appear to
the bottleneck step...but converting double to float is one way to scale
it even further...
Thanks.
Deb
On Mon, Apr 28, 2014 at 10:30 AM, Diana Carroll dcarr...@cloudera.com wrote:
Hi everyone. I'm trying to run some of the Spark example code, and most
of it appears to be undocumented (unless I'm
...
On Mon, Apr 28, 2014 at 10:47 AM, Diana Carroll dcarr...@cloudera.com wrote:
Thanks, Deb. But I'm looking at org.apache.spark.examples.SparkALS,
which is not in the mllib examples, and does not take any file parameters.
I don't see the class you refer to in the examples ...however, if I did
want
I'm trying to understand when I would want to checkpoint an RDD rather than
just persist to disk.
Every reference I can find to checkpoint relates to Spark Streaming. But
the method is defined in the core Spark library, not Streaming.
Does it exist solely for streaming, or are there
Checkpoint clears dependencies. You might need checkpoint to cut a
long lineage in iterative algorithms. -Xiangrui
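To make that concrete, here is a sketch of the pattern I think Xiangrui is describing (the checkpoint directory and the loop body are made up):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel
object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")        // directory is an assumption
    var ranks = sc.parallelize(1 to 1000000).map(i => (i % 100, 1.0))
    for (i <- 1 to 50) {                                   // lineage grows by a stage each iteration
      ranks = ranks.reduceByKey(_ + _).mapValues(_ * 0.85)
      if (i % 10 == 0) {
        ranks.persist(StorageLevel.DISK_ONLY)  // keeps the data, but the full lineage is still tracked
        ranks.checkpoint()                     // writes to reliable storage and truncates the lineage
      }
    }
    println(ranks.count())
    sc.stop()
  }
}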
On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll dcarr...@cloudera.com
wrote:
I'm trying to understand when I would want to checkpoint an RDD rather
than
just persist to disk.
Every
I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb
Given the size, and that it is a single file, I assumed it would only be in
a single partition. But when I cache it, I can see in the Spark App UI
that it actually splits it into two partitions:
[image: Inline image 1]
Is this
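For what it's worth, sc.textFile takes a minimum-partitions argument that defaults to 2, which would explain two partitions even for a tiny file; a quick check in spark-shell (file name made up):
val tiny = sc.textFile("tiny.txt")          // default minPartitions is min(defaultParallelism, 2)
println(tiny.partitions.size)               // typically 2
val onePart = sc.textFile("tiny.txt", 1)    // explicitly ask for a single partition
println(onePart.partitions.size)            // 1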
I'm looking at the Tuning Guide suggestion to use Kryo instead of default
serialization. My questions:
Does pyspark use Java serialization by default, as Scala Spark does? If
so, then...
can I use Kryo with pyspark instead? The instructions say I should
register my classes with the Kryo
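For the Scala side, the Tuning Guide's registration looks roughly like this (the WebLog class is a made-up stand-in; as far as I understand, PySpark serializes Python objects with its own pickle-based serializers, so Kryo mainly affects the JVM side):
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
import org.apache.spark.{SparkConf, SparkContext}
case class WebLog(ip: String, url: String, bytes: Long)   // hypothetical class to register
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[WebLog])
  }
}
object KryoDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoDemo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")
    val sc = new SparkContext(conf)
    // cached and shuffled WebLog objects now go through Kryo
    sc.stop()
  }
}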
Not sure what data you are sending in. You could try calling
lines.print() instead which should just output everything that comes in
on the stream. Just to test that your socket is receiving what you think
you are sending.
On Mon, Mar 31, 2014 at 12:18 PM, eric perler
If you are learning about Spark Streaming, as I am, you've probably used
netcat (nc) as mentioned in the Spark Streaming programming guide. I
wanted something a little more useful, so I modified the
ClickStreamGenerator code to make a very simple script that simply reads a
file off disk and passes
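Something along these lines, I'd guess -- a minimal stand-in for netcat that serves a file's lines to whatever connects (e.g. ssc.socketTextStream("localhost", 9999)); the file name, port, and sleep interval are arbitrary:
import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source
object FileOverSocket {
  def main(args: Array[String]): Unit = {
    val Array(path, port) = args                 // e.g. weblog.txt 9999
    val server = new ServerSocket(port.toInt)
    val socket = server.accept()                 // block until the streaming receiver connects
    val out = new PrintWriter(socket.getOutputStream, true)
    for (line <- Source.fromFile(path).getLines()) {
      out.println(line)
      Thread.sleep(100)                          // throttle so each batch gets some data
    }
    out.close()
    socket.close()
    server.close()
  }
}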
The API docs for ssc.awaitTermination say simply "Wait for the execution to
stop. Any exceptions that occurs during the execution will be thrown in
this thread."
Can someone help me understand what this means? What causes execution to
stop? Why do we need to wait for that to happen?
I tried
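My reading of that doc sentence, as a sketch: start() returns immediately and the batches run on background threads, so awaitTermination() is what keeps the driver's main thread alive, and a failure inside a batch job is rethrown there:
ssc.start()                    // returns right away; processing happens in the background
try {
  ssc.awaitTermination()       // blocks until ssc.stop() is called or a batch job fails
} catch {
  case e: Exception =>
    println("Streaming computation failed: " + e)   // the batch's exception surfaces here
}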
I'm working with spark streaming using spark-shell, and hoping folks could
answer a few questions I have.
I'm doing WordCount on a socket stream:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.Seconds
var
I'm trying to understand Spark streaming, hoping someone can help.
I've kinda-sorta got a version of Word Count running, and it looks like
this:
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
object StreamingWordCount {
def
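Filling in the rest from memory, a complete sketch of that word count would look something like this (batch interval, host, and port are arbitrary choices; in spark-shell you'd build the StreamingContext from the existing sc instead):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))          // 2-second batches
    val lines = ssc.socketTextStream("localhost", 9999)       // e.g. fed by nc -lk 9999
    val wordCounts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}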
Thanks, Tathagata, very helpful. A follow-up question below...
On Wed, Mar 26, 2014 at 2:27 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
*Answer 3:* You can do something like
wordCounts.foreachRDD((rdd: RDD[...], time: Time) => {
if (rdd.take(1).size == 1) {
// There exists
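The snippet is cut off here; my guess at how it might continue (the element type and the output path are assumptions on my part):
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
  if (rdd.take(1).size == 1) {
    // There exists at least one element in this batch, so it is safe to act on it,
    // e.g. write it out with a name derived from the batch time.
    rdd.saveAsTextFile("counts-" + time.milliseconds)
  }
})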
Has anyone successfully followed the instructions on the Quick Start page
of the Spark home page to run a standalone Scala application? I can't,
and I figure I must be missing something obvious!
I'm trying to follow the instructions here as close to word for word as
possible:
shows). But for that
part of the guide, it's not any different than building a scala app.
On Mon, Mar 24, 2014 at 3:44 PM, Diana Carroll dcarr...@cloudera.com
wrote:
Has anyone successfully followed the instructions on the Quick Start
page of
the Spark home page to run a standalone Scala
://www.scala-sbt.org/
On Mon, Mar 24, 2014 at 4:00 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
Hi, Diana,
See my inlined answer
--
Nan Zhu
On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote:
Has anyone successfully followed the instructions on the Quick Start page
of the Spark
with the sbt project subdirectory but you can
read independently about sbt and what it can do, above is the bare minimum.
Let me know if that helped.
Ognen
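For reference, the bare-minimum standalone app from the Quick Start of that era looks roughly like this -- a sketch from memory, with the Spark and Scala version numbers as assumptions you'd match to your install:
// ./simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"   // match your installed Spark

// ./src/main/scala/SimpleApp.scala
import org.apache.spark.{SparkConf, SparkContext}
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Simple Application").setMaster("local"))
    val logData = sc.textFile("README.md").cache()   // any local text file will do
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}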
On 3/24/14, 2:44 PM, Diana Carroll wrote:
Has anyone successfully followed the instructions on the Quick Start page
of the Spark home
it to your ~/.bashrc or ~/.bash_profile
or ??? depending on the system you are using. If you are on Windows, sorry,
I can't offer any help there ;)
Ognen
On 3/24/14, 3:16 PM, Diana Carroll wrote:
Thanks Ognen.
Unfortunately I'm not able to follow your instructions either. In
particular
, 2014 at 4:30 PM, Diana Carroll dcarr...@cloudera.com
wrote:
Yeah, that's exactly what I did. Unfortunately it doesn't work:
$SPARK_HOME/sbt/sbt package
awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for reading (No such file or directory)
Attempting to fetch sbt
Ognen:
I don't know why your process is hanging, sorry. But I do know that the
way saveAsTextFile works is that you give it a path to a directory, not a
file. The file is saved in multiple parts, corresponding to the
partitions. (part-0, part-1 etc.)
(Presumably it does this because it
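In other words, something like this (paths made up), where the argument names a directory and each partition becomes one part file inside it:
import org.apache.spark.SparkContext._            // pair-RDD operations in 1.0-era Spark
val counts = sc.textFile("weblogs")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("counts_out")               // creates counts_out/part-00000, part-00001, ...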
install.
On Monday, March 24, 2014, Nan Zhu zhunanmcg...@gmail.com wrote:
I realize I never read the document carefully, and I can't find anywhere that
the Spark documentation suggests using the Spark-distributed sbt.
Best,
--
Nan Zhu
On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote
18, 2014, at 7:49 AM, Diana Carroll dcarr...@cloudera.com wrote:
Well, if anyone is still following this, I've gotten the following code
working which in theory should allow me to parse whole XML files: (the
problem was that I can't return the tree iterator directly. I have to call
iter(). Why
, at 7:49 AM, Diana Carroll dcarr...@cloudera.com wrote:
Well, if anyone is still following this, I've gotten the following code
working which in theory should allow me to parse whole XML files: (the
problem was that I can't return the tree iterator directly. I have to call
iter(). Why?)
import
of the string s. How big
can my file be before converting it to a string becomes problematic?
On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll dcarr...@cloudera.com wrote:
Thanks, Matei.
In the context of this discussion, it would seem mapPartitions is
essential, because it's the only way I'm
partition
instead of an array. You can then return an iterator or list of objects to
produce from that.
Matei
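A small sketch of that pattern (file name and record layout invented): mapPartitions hands you an Iterator over one partition's elements and expects an iterator of outputs back, so each input can yield zero, one, or many records, and any per-partition setup happens once:
val records = sc.textFile("data.txt")             // path is an assumption
val parsed = records.mapPartitions { lines =>
  // per-partition setup (e.g. constructing a parser) would go here, once
  lines.flatMap { line =>
    val fields = line.split(",")
    if (fields.length >= 2) Iterator((fields(0), fields(1))) else Iterator.empty
  }
}
println(parsed.count())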
On Mar 17, 2014, at 10:46 AM, Diana Carroll dcarr...@cloudera.com wrote:
Thanks Matei. That makes sense. I have here a dataset of many many
smallish XML files, so using