Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Diana Carroll
I'm just getting started with Spark SQL and DataFrames in 1.3.0. I notice that the Spark API shows a different syntax for referencing columns in a dataframe than the Spark SQL Programming Guide. For instance, the API docs for the select method show this: df.select($"colA", $"colB") Whereas the
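
For reference, a minimal sketch of the column-reference styles in question, against the Spark 1.3 Scala API (the data here is made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ColumnRefStyles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ColumnRefStyles"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // brings the $"name" syntax into scope

    val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("colA", "colB")

    df.select($"colA", $"colB").show()        // $-interpolator, as in the API docs
    df.select(df("colA"), df("colB")).show()  // Column via the DataFrame's apply
    df.select("colA", "colB").show()          // plain string column names
  }
}

All three select the same columns; the $"name" form just requires the implicits import shown above.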

Spark 1.0 docs out of sync?

2014-06-30 Thread Diana Carroll
I'm hoping someone can clear up some confusion for me. When I view the Spark 1.0 docs online (http://spark.apache.org/docs/1.0.0/) they are different than the docs which are packaged with the Spark 1.0.0 download (spark-1.0.0.tgz). In particular, in the online docs, there's a single merged Spark

Re: logging in pyspark

2014-05-14 Thread Diana Carroll
for each element of your RDD? Nick On Tue, May 6, 2014 at 3:31 PM, Diana Carroll dcarr...@cloudera.com wrote: What should I do if I want to log something as part of a task? This is what I tried. To set up a logger, I followed the advice here: http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off

NotSerializableException in Spark Streaming

2014-05-14 Thread Diana Carroll
Hey all, trying to set up a pretty simple streaming app and getting some weird behavior. First, a non-streaming job that works fine: I'm trying to pull out lines of a log file that match a regex, for which I've set up a function: def getRequestDoc(s: String): String = {
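
The function body is truncated above, so here is only a hedged sketch of the non-streaming job's shape (the regex and input path are hypothetical, not from the original post). Defining the function on a standalone object keeps the closure shipped to executors from capturing a non-serializable enclosing instance, a common source of NotSerializableException:

import org.apache.spark.{SparkConf, SparkContext}

object LogDocs {
  // Hypothetical pattern; the original regex wasn't shown.
  private val docPattern = """GET\s+(\S+)""".r

  def getRequestDoc(s: String): String =
    docPattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogDocs"))
    sc.textFile("weblogs/*")        // hypothetical input path
      .map(getRequestDoc)
      .filter(_.nonEmpty)
      .take(10)
      .foreach(println)
  }
}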

logging in pyspark

2014-05-06 Thread Diana Carroll
What should I do if I want to log something as part of a task? This is what I tried. To set up a logger, I followed the advice here: http://py4j.sourceforge.net/faq.html#how-to-turn-logging-on-off logger = logging.getLogger("py4j") logger.setLevel(logging.INFO)

Re: performance improvement on second operation...without caching?

2014-05-05 Thread Diana Carroll
recollection of other discussions on this topic on the list. However, going back and looking at the programming guide, this is not the way the cache/persist behavior is described. Does the guide need to be updated? On Fri, May 2, 2014 at 9:04 AM, Diana Carroll dcarr...@cloudera.com wrote: I'm just

when to use broadcast variables

2014-05-02 Thread Diana Carroll
Anyone have any guidance on using a broadcast variable to ship data to workers vs. an RDD? Like, say I'm joining web logs in an RDD with user account data. I could keep the account data in an RDD or if it's small, a broadcast variable instead. How small is small? Small enough that I know it
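
A sketch of the two options under discussion, with hypothetical paths and record formats; the broadcast version only makes sense when the collected account table fits comfortably in each executor's memory:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD operations (join, collectAsMap)

object LogsVsAccounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LogsVsAccounts"))

    val logs = sc.textFile("weblogs/*")
      .map(line => (line.split(" ")(0), line))   // keyed by user id (hypothetical format)

    // Option 1: keep the account data as an RDD and join; works at any
    // scale but incurs a shuffle of both datasets.
    val accounts = sc.textFile("accounts.csv")
      .map(_.split(",")).map(f => (f(0), f(1)))
    val joined = logs.join(accounts)

    // Option 2: collect the account data to the driver and broadcast it;
    // it is shipped once per worker and the "join" becomes a local lookup.
    val accountMap = sc.broadcast(accounts.collectAsMap())
    val joinedLocally = logs.map { case (id, line) =>
      (id, (line, accountMap.value.get(id)))
    }

    joinedLocally.take(5).foreach(println)
  }
}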

Re: the spark configuage

2014-04-30 Thread Diana Carroll
I'm guessing your shell stopping when it attempts to connect to the RM is not related to that warning. You'll get that message out of the box from Spark if you don't have HADOOP_HOME set correctly. I'm using CDH 5.0 installed in default locations, and got rid of the warning by setting

running SparkALS

2014-04-28 Thread Diana Carroll
Hi everyone. I'm trying to run some of the Spark example code, and most of it appears to be undocumented (unless I'm missing something). Can someone help me out? I'm particularly interested in running SparkALS, which wants parameters: M U F iter slices What are these variables? They appear to

Re: running SparkALS

2014-04-28 Thread Diana Carroll
the bottleneck step...but making double to float is one way to scale it even further... Thanks. Deb On Mon, Apr 28, 2014 at 10:30 AM, Diana Carroll dcarr...@cloudera.com wrote: Hi everyone. I'm trying to run some of the Spark example code, and most of it appears to be undocumented (unless I'm

Re: running SparkALS

2014-04-28 Thread Diana Carroll
... On Mon, Apr 28, 2014 at 10:47 AM, Diana Carroll dcarr...@cloudera.com wrote: Thanks, Deb. But I'm looking at org.apache.spark.examples.SparkALS, which is not in the mllib examples, and does not take any file parameters. I don't see the class you refer to in the examples ...however, if I did want

checkpointing without streaming?

2014-04-21 Thread Diana Carroll
I'm trying to understand when I would want to checkpoint an RDD rather than just persist to disk. Every reference I can find to checkpoint relates to Spark Streaming. But the method is defined in the core Spark library, not Streaming. Does it exist solely for streaming, or are there

Re: checkpointing without streaming?

2014-04-21 Thread Diana Carroll
: Checkpoint clears dependencies. You might need checkpoint to cut a long lineage in iterative algorithms. -Xiangrui On Mon, Apr 21, 2014 at 11:34 AM, Diana Carroll dcarr...@cloudera.com wrote: I'm trying to understand when I would want to checkpoint an RDD rather than just persist to disk. Every
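
A sketch of the lineage-cutting pattern Xiangrui describes; the checkpoint directory and the iterative update step are made up:

import org.apache.spark.{SparkConf, SparkContext}

object LineageCut {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageCut"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // hypothetical location

    var data = sc.parallelize(1 to 1000000).map(_.toDouble)
    for (i <- 1 to 100) {
      data = data.map(_ * 1.001)   // stand-in for one iteration's update
      if (i % 10 == 0) {
        data.checkpoint()  // marks the RDD; dependencies are dropped once saved
        data.count()       // an action forces the checkpoint to actually run
      }
    }
    println(data.sum())
  }
}

Without the periodic checkpoint, the lineage grows by one map per iteration; persist alone keeps the data handy but never truncates that chain.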

partitioning of small data sets

2014-04-15 Thread Diana Carroll
I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions. Is this
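
That count matches textFile's default: its optional second argument, the minimum number of partitions, defaults to min(defaultParallelism, 2), so even a tiny single file is split in two. A quick check, assuming the sc provided by spark-shell:

val tiny = sc.textFile("tiny.txt", 1)   // ask for a single partition
println(tiny.partitions.length)         // 1, instead of the default 2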

using Kryo with pyspark?

2014-04-14 Thread Diana Carroll
I'm looking at the Tuning Guide suggestion to use Kryo instead of default serialization. My questions: Does pyspark use Java serialization by default, as Scala spark does? If so, then... can I use Kryo with pyspark instead? The instructions say I should register my classes with the Kryo

Re: network wordcount example

2014-03-31 Thread Diana Carroll
Not sure what data you are sending in. You could try calling lines.print() instead which should just output everything that comes in on the stream. Just to test that your socket is receiving what you think you are sending. On Mon, Mar 31, 2014 at 12:18 PM, eric perler

streaming: code to simulate a network socket data source

2014-03-28 Thread Diana Carroll
If you are learning about Spark Streaming, as I am, you've probably used netcat (nc) as mentioned in the Spark Streaming programming guide. I wanted something a little more useful, so I modified the ClickStreamGenerator code to make a very simple script that simply reads a file off disk and passes
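
The script itself isn't shown here; what follows is a hedged reconstruction of the idea, with names and argument handling of my choosing: listen on a port and, once a receiver connects, replay a file line by line at a controlled pace.

import java.io.PrintWriter
import java.net.ServerSocket
import scala.io.Source

object FileSocketSource {
  def main(args: Array[String]): Unit = {
    val Array(file, port, msPerLine) = args
    val server = new ServerSocket(port.toInt)
    while (true) {
      val socket = server.accept()   // wait for a receiver to connect
      val out = new PrintWriter(socket.getOutputStream, true)
      for (line <- Source.fromFile(file).getLines()) {
        out.println(line)            // push one line...
        Thread.sleep(msPerLine.toLong)   // ...then pace the stream
      }
      socket.close()
    }
  }
}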

spark streaming: what is awaitTermination()?

2014-03-27 Thread Diana Carroll
The API docs for ssc.awaitTermination say simply "Wait for the execution to stop. Any exceptions that occurs during the execution will be thrown in this thread." Can someone help me understand what this means? What causes execution to stop? Why do we need to wait for that to happen? I tried
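
A minimal app showing where the call sits; the comments reflect my reading of the semantics rather than anything confirmed in this thread:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AwaitDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AwaitDemo")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.socketTextStream("localhost", 9999).print()

    ssc.start()             // processing runs on background threads
    ssc.awaitTermination()  // blocks the driver's main thread; without it,
                            // main() would return while batches are still
                            // running. It returns when stop() is called, or
                            // rethrows an error that killed the computation.
  }
}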

spark streaming and the spark shell

2014-03-27 Thread Diana Carroll
I'm working with spark streaming using spark-shell, and hoping folks could answer a few questions I have. I'm doing WordCount on a socket stream: import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.Seconds var
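
The code is cut off above; here is a sketch of the shape such a shell session usually takes (batch interval, host, and port are my choices, not necessarily the original's). It assumes the sc provided by spark-shell:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

val ssc = new StreamingContext(sc, Seconds(5))
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()

One shell-specific wrinkle: ssc.awaitTermination() blocks the REPL, so when experimenting interactively it's common to leave it out and just watch the batches print.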

streaming questions

2014-03-26 Thread Diana Carroll
I'm trying to understand Spark streaming, hoping someone can help. I've kinda-sorta got a version of Word Count running, and it looks like this: import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ object StreamingWordCount { def

Re: streaming questions

2014-03-26 Thread Diana Carroll
Thanks, Tathagata, very helpful. A follow-up question below... On Wed, Mar 26, 2014 at 2:27 PM, Tathagata Das tathagata.das1...@gmail.com wrote: *Answer 3:* You can do something like wordCounts.foreachRDD((rdd: RDD[...], time: Time) => { if (rdd.take(1).size == 1) { // There exists
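
Filling in around the quoted fragment, a hedged sketch: it assumes wordCounts is the pair DStream from a word count like the one sketched earlier, and the output path is made up:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

wordCounts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
  if (rdd.take(1).size == 1) {                          // batch is non-empty
    rdd.saveAsTextFile(s"counts-${time.milliseconds}")  // hypothetical sink
  }
}

The take(1) probe is the cheap way to test for emptiness: it only pulls one element rather than counting the whole RDD.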

quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
Has anyone successfully followed the instructions on the Quick Start page of the Spark home page to run a standalone Scala application? I can't, and I figure I must be missing something obvious! I'm trying to follow the instructions here as close to word for word as possible:
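
For context, the Quick Start of that era expects a one-file sbt build alongside src/main/scala/SimpleApp.scala; a sketch of its shape (version numbers are illustrative, not necessarily what the guide pinned):

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"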

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
shows). But for that part of the guide, it's not any different than building a scala app. On Mon, Mar 24, 2014 at 3:44 PM, Diana Carroll dcarr...@cloudera.com wrote: Has anyone successfully followed the instructions on the Quick Start page of the Spark home page to run a standalone Scala

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
http://www.scala-sbt.org/ On Mon, Mar 24, 2014 at 4:00 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Diana, See my inlined answer -- Nan Zhu On Monday, March 24, 2014 at 3:44 PM, Diana Carroll wrote: Has anyone successfully followed the instructions on the Quick Start page of the Spark

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
with the sbt project subdirectory but you can read independently about sbt and what it can do, above is the bare minimum. Let me know if that helped. Ognen On 3/24/14, 2:44 PM, Diana Carroll wrote: Has anyone successfully followed the instructions on the Quick Start page of the Spark home

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
it to your ~/.bashrc or ~/.bash_profile or ??? depending on the system you are using. If you are on Windows, sorry, I can't offer any help there ;) Ognen On 3/24/14, 3:16 PM, Diana Carroll wrote: Thanks Ognen. Unfortunately I'm not able to follow your instructions either. In particular

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
, 2014 at 4:30 PM, Diana Carroll dcarr...@cloudera.com wrote: Yeah, that's exactly what I did. Unfortunately it doesn't work: $SPARK_HOME/sbt/sbt package awk: cmd. line:1: fatal: cannot open file `./project/build.properties' for reading (No such file or directory) Attempting to fetch sbt

Re: Writing RDDs to HDFS

2014-03-24 Thread Diana Carroll
Ognen: I don't know why your process is hanging, sorry. But I do know that the way saveAsTextFile works is that you give it a path to a directory, not a file. The file is saved in multiple parts, corresponding to the partitions (part-00000, part-00001, etc.). (Presumably it does this because it
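
A quick demonstration of that behavior, assuming the sc provided by spark-shell and a writable working directory:

val data = sc.parallelize(1 to 100, 2)   // two partitions
data.saveAsTextFile("out")               // "out" is created as a directory
// out/ now holds part-00000 and part-00001 (one file per partition) plus,
// typically, a _SUCCESS marker; sc.textFile("out") reads them back as one RDD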

Re: quick start guide: building a standalone scala program

2014-03-24 Thread Diana Carroll
install. On Monday, March 24, 2014, Nan Zhu zhunanmcg...@gmail.com wrote: I found that I never read the document carefully, and I never found that the Spark documentation suggests you use the Spark-distributed sbt... Best, -- Nan Zhu On Monday, March 24, 2014 at 5:47 PM, Diana Carroll wrote

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
18, 2014, at 7:49 AM, Diana Carroll dcarr...@cloudera.com wrote: Well, if anyone is still following this, I've gotten the following code working which in theory should allow me to parse whole XML files: (the problem was that I can't return the tree iterator directly. I have to call iter(). Why

Re: example of non-line oriented input data?

2014-03-19 Thread Diana Carroll
, at 7:49 AM, Diana Carroll dcarr...@cloudera.com wrote: Well, if anyone is still following this, I've gotten the following code working which in theory should allow me to parse whole XML files: (the problem was that I can't return the tree iterator directly. I have to call iter(). Why?) import

Re: example of non-line oriented input data?

2014-03-18 Thread Diana Carroll
of the string s. How big can my file be before converting it to a string becomes problematic? On Tue, Mar 18, 2014 at 9:41 AM, Diana Carroll dcarr...@cloudera.com wrote: Thanks, Matei. In the context of this discussion, it would seem mapPartitions is essential, because it's the only way I'm

Re: example of non-line oriented input data?

2014-03-17 Thread Diana Carroll
partition instead of an array. You can then return an iterator or list of objects to produce from that. Matei On Mar 17, 2014, at 10:46 AM, Diana Carroll dcarr...@cloudera.com wrote: Thanks Matei. That makes sense. I have here a dataset of many many smallish XML files, so using
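
This thread is about pyspark, but the contract Matei describes is easiest to see in the Scala signature: mapPartitions hands the function an Iterator over one partition's elements and expects an iterator of results back. A self-contained sketch, with made-up data:

import org.apache.spark.{SparkConf, SparkContext}

object PerPartitionParse {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PerPartitionParse"))
    val nums = sc.parallelize(1 to 10, 2)

    // The function runs once per partition; whatever Iterator it returns
    // (built lazily or from a List) becomes that partition's output.
    val sums = nums.mapPartitions { it => Iterator(it.sum) }
    sums.collect().foreach(println)   // 15 and 40, one value per partition
  }
}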