Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
intrinsic reasons for this to be impossible? Sorry again for the giant mail, and thanks for any insights! Andras -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com

Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
comment? Sent from Windows Mail *From:* Dean Wampler deanwamp...@gmail.com *Sent:* Thursday, April 10, 2014 7:39 AM *To:* Spark Users user@spark.apache.org *Cc:* Daniel Darabos daniel.dara...@lynxanalytics.com, Andras Barjak andras.bar...@lynxanalytics.com Spark has been endorsed by Cloudera

Re: Using Spark for Divide-and-Conquer Algorithms

2014-04-11 Thread Dean Wampler
and Distributed Systems Shanghai Jiao Tong University Email: yanzhe...@gmail.com Sent with Sparrow http://www.sparrowmailapp.com/?sig -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com

Re: Hybrid GPU CPU computation

2014-04-11 Thread Dean Wampler
, 2014 at 2:38 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, I'm just wondering if hybrid GPU/CPU computation is something that is feasible with Spark? And what would be the best way to do it? Cheers, Jaonary -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http

Re: K-means with large K

2014-04-28 Thread Dean Wampler
of this? Thanks, Dave -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com

My talk on Spark: The Next Top (Compute) Model

2014-04-30 Thread Dean Wampler
I meant to post this last week, but this is a talk I gave at the Philly ETE conf. last week: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model Also here: http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf dean -- Dean Wampler, Ph.D. Typesafe

Re: My talk on Spark: The Next Top (Compute) Model

2014-05-01 Thread Dean Wampler
/GCS). Why configure Hadoop if you don't have to? On Thu, May 1, 2014 at 12:25 AM, Dean Wampler deanwamp...@gmail.com wrote: I meant to post this last week, but this is a talk I gave at the Philly ETE conf. last week: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model

Re: My talk on Spark: The Next Top (Compute) Model

2014-05-01 Thread Dean Wampler
have to. On Thu, May 1, 2014 at 12:25 AM, Dean Wampler deanwamp...@gmail.com wrote: I meant to post this last week, but this is a talk I gave at the Philly ETE conf. last week: http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model Also here: http

Re: Spark Training

2014-05-01 Thread Dean Wampler
in context: Spark Training http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Training-tp5166.html Sent from the Apache Spark User List mailing list archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Dean Wampler, Ph.D. Typesafe @deanwampler http

Re: Announcing Spark 1.0.0

2014-05-30 Thread Dean Wampler
scratch the surface - check out the release notes here: http://spark.apache.org/releases/spark-release-1-0-0.html Note that since release artifacts were posted recently, certain mirrors may not have working downloads for a few hours. - Patrick -- Dean Wampler, Ph.D. Typesafe @deanwampler

Re: Spark vs Google cloud dataflow

2014-06-27 Thread Dean Wampler
in the Hadoop ecosystem. I think Dataflows is more than that but yeah that seems to be some of the 'language'. It is similar in that it is a distributed collection abstraction. -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com

Re: Recommended pipeline automation tool? Oozie?

2014-07-15 Thread Dean Wampler
: http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -- Dean Wampler, Ph.D. Typesafe @deanwampler http://typesafe.com http://polyglotprogramming.com

Re: Issue with Spark on EC2 using spark-ec2 script

2014-08-01 Thread Dean Wampler
It looked like you were running in standalone mode (master set to local[4]). That's how I ran it. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Can you post your whole SBT build file(s)? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Sep 10, 2014 at 6:48

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Sorry, I meant any *other* SBT files. However, what happens if you remove the line: exclude("org.eclipse.jetty.orbit", "javax.servlet") dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com

Re: SparkSQL Thriftserver in Mesos

2014-09-22 Thread Dean Wampler
. https://spark.apache.org/docs/latest/running-on-mesos.html Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Mon, Sep 22

Re: scala Vector vs mllib Vector

2014-10-04 Thread Dean Wampler
(the unchanged parts) to make efficient copies. Also, Scala Vector isn't designed to represent sparse vectors. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com

Re: [SQL] Self join with ArrayType columns problems

2015-01-26 Thread Dean Wampler
You are creating a HiveContext, then using the sql method instead of hql. Is that deliberate? The code doesn't work if you replace HiveContext with SQLContext. Lots of exceptions are thrown, but I don't have time to investigate now. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd

Re: Spark Project Fails to run multicore in local mode.

2015-01-08 Thread Dean Wampler
archive http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com. -- Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: Error while installing Spark 1.3.0 on local machine

2015-03-22 Thread Dean Wampler
Any particular reason you're not just downloading a build from http://spark.apache.org/downloads.html Even if you aren't using Hadoop, any of those builds will work. If you want to build from source, the Maven build is more reliable. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd

Re: can distinct transform applied on DStream?

2015-03-22 Thread Dean Wampler
aDstream.transform(_.distinct()) will only make the elements of each RDD in the DStream distinct, not for the whole DStream globally. Is that what you're seeing? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http
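
The per-batch versus global distinction can be sketched with plain Scala collections. This is only a local model, not Spark code: each inner `Seq` stands in for one batch (one RDD of the DStream), and the object and method names are hypothetical. It suggests why a global distinct needs state carried from batch to batch.

```scala
// Plain Scala collections standing in for a DStream: each inner Seq is one
// batch (one RDD). No Spark dependency; all names here are illustrative.
object PerBatchDistinct {
  // Analogue of aDstream.transform(_.distinct()): dedupe inside each batch only.
  def distinctPerBatch[A](batches: Seq[Seq[A]]): Seq[Seq[A]] =
    batches.map(_.distinct)

  // A global dedupe needs state carried across batches (here, a running Set).
  def distinctAcrossBatches[A](batches: Seq[Seq[A]]): Seq[Seq[A]] = {
    val (_, out) = batches.foldLeft((Set.empty[A], Vector.empty[Seq[A]])) {
      case ((seen, acc), batch) =>
        val fresh = batch.distinct.filterNot(seen.contains)
        (seen ++ fresh, acc :+ fresh)
    }
    out
  }
}
```

With batches `Seq(1, 1, 2)` and `Seq(2, 3, 3)`, the per-batch version still emits `2` in both batches, while the stateful version emits it only once.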

Re: How Does aggregate work

2015-03-22 Thread Dean Wampler
+ ... (2 + (2 + (2 + 0 + p_1) + p_2) + p_3) ...) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Sun, Mar 22, 2015
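
The excerpt is truncated, so the following is an assumption about what the formula illustrates: that the zero value passed to `aggregate` seeds the fold in every partition and also seeds the final merge. A minimal local model in plain Scala (hypothetical names, nested `Seq`s standing in for partitions):

```scala
object AggregateSketch {
  // Local model of rdd.aggregate(zero)(seqOp, combOp): `zero` seeds the fold
  // inside every partition and also seeds the final merge of partition results.
  def aggregate[A, B](partitions: Seq[Seq[A]], zero: B)(
      seqOp: (B, A) => B,
      combOp: (B, B) => B): B =
    partitions.map(_.foldLeft(zero)(seqOp)).foldLeft(zero)(combOp)
}
```

Under this model, a non-neutral zero such as 2 is counted once per partition plus once more in the merge, which is why the zero value should normally be the identity for both operations.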

Re: How to deploy binary dependencies to workers?

2015-03-24 Thread Dean Wampler
Both spark-submit and spark-shell have a --jars option for passing additional jars to the cluster. They will be added to the appropriate classpaths. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
the first element (being careful that the partition isn't empty!) and then determine which of those first lines has the header info. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http
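
As a rough sketch of that idea in plain Scala (no Spark; each inner `Seq` models one partition, and the `marker` prefix test for recognizing the header is purely an assumption for illustration):

```scala
object FirstLines {
  // Each inner Seq models one partition of a text file's lines.
  // take(1) yields nothing for an empty partition, so no exception is thrown.
  def firstLinePerPartition(partitions: Seq[Seq[String]]): Seq[String] =
    partitions.flatMap(_.take(1))

  // Hypothetical header test: assumes the header line starts with `marker`.
  def findHeader(partitions: Seq[Seq[String]], marker: String): Option[String] =
    firstLinePerPartition(partitions).find(_.startsWith(marker))
}
```

In Spark the same shape would presumably go inside a per-partition operation, with the candidate first lines collected to the driver for the final check.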

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
For the Spark SQL parts, 1.3 breaks backwards compatibility, because before 1.3, Spark SQL was considered experimental where API changes were allowed. So, H2O and ADA compatible with 1.2.X might not work with 1.3. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
, but that shouldn't be this issue. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25, 2015 at 12:09 PM, roni

Re: newbie quesiton - spark with mesos

2015-03-25 Thread Dean Wampler
:51849/), Path(/user/MapOutputTracker)] It's trying to connect to an Akka actor on itself, using the loopback address. Try changing SPARK_LOCAL_IP to the publicly routable IP address. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do

Re: Date and decimal datatype not working

2015-03-25 Thread Dean Wampler
this is such a common problem, I usually define a parse method that converts input text to the desired schema. It catches parse exceptions like this and reports the bad line at least. If you can return a default long in this case, say 0, that makes it easier to return something. dean Dean Wampler, Ph.D. Author
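
A minimal sketch of such a parse method, assuming a hypothetical two-field `id,amount` line format (the thread's real schema is not shown in the excerpt):

```scala
import scala.util.{Failure, Success, Try}

object SafeParse {
  // Hypothetical record type; the real schema is not shown in the thread.
  final case class Record(id: String, amount: Long)

  // Parse "id,amount". On a missing or malformed amount, report the bad line
  // on stderr and fall back to a default of 0.
  def parse(line: String): Record = {
    val fields = line.split(",", -1).map(_.trim)
    val id = fields.headOption.getOrElse("")
    Try(fields(1).toLong) match {
      case Success(n) => Record(id, n)
      case Failure(e) =>
        Console.err.println(s"Bad line '$line': $e")
        Record(id, 0L)
    }
  }
}
```

Returning a default keeps the result type simple; returning `Option[Record]` and dropping the `None`s is an alternative when a silent 0 would distort later aggregates.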

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
Yes, that's the problem. The RDD class exists in both binary jar files, but the signatures probably don't match. The bottom line, as always for tools like this, is that you can't mix versions. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073

Re: Saving Dstream into a single file

2015-03-23 Thread Dean Wampler
You can use the coalesce method to reduce the number of partitions. You can reduce to one if the data is not too big. Then write the output. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com

Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread Dean Wampler
In 1.2 it's a member of SchemaRDD and it becomes available on RDD (through an implicit conversion) when you add a SQLContext, like so: val sqlContext = new SQLContext(sc); import sqlContext._ In 1.3, the method has moved to the new DataFrame type. Dean Wampler, Ph.D. Author: Programming Scala

Re: Optimal solution for getting the header from CSV with Spark

2015-03-24 Thread Dean Wampler
needed to satisfy the limit. In this case, it will trivially stop at the first. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: upgrade from spark 1.2.1 to 1.3 on EC2 cluster and problems

2015-03-25 Thread Dean Wampler
them as the same user. Or look at what the EC2 scripts do. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Mar 25

Re: newbie quesiton - spark with mesos

2015-03-23 Thread Dean Wampler
. Actually only one would be enough, but the default number of partitions will be used. I believe 8 is the default for Mesos. For local mode (local[*]), it's the number of cores. You can also set the property spark.default.parallelism. HTH, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd

Re: Getting around Serializability issues for types not in my control

2015-03-23 Thread Dean Wampler
closures passed to Spark methods, but that's probably not what you want. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: Converting SparkSQL query to Scala query

2015-03-23 Thread Dean Wampler
SQL keyword. HTH, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Mon, Mar 23, 2015 at 11:42 AM, nishitd nishitde

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Dean Wampler
It failed to find the class org.apache.spark.sql.catalyst.ScalaReflection in the Spark SQL library. Make sure it's in the classpath and the version is correct, too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly

Re: How to learn Spark ?

2015-04-02 Thread Dean Wampler
I have a self-study workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: ArrayBuffer within a DataFrame

2015-04-03 Thread Dean Wampler
to the left of the => pattern matches on the input tuples. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Thu, Apr 2, 2015 at 10

Re: Need some guidance

2015-04-13 Thread Dean Wampler
That appears to work, with a few changes to get the types correct: input.distinct().combineByKey((s: String) => 1, (agg: Int, s: String) => agg + 1, (agg1: Int, agg2: Int) => agg1 + agg2) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073
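
To see what the three functions in that call do, here is a simplified local model in plain Scala (no Spark; hypothetical names). It treats the whole input as one partition, so the third function, which merges per-partition results, is left out:

```scala
object CombineByKeySketch {
  // Local, single-"partition" model of combineByKey: `create` starts an
  // accumulator for a key seen for the first time, `merge` folds each further
  // value into it. Merging accumulators across partitions is omitted here.
  def combineByKey[K, V, C](pairs: Seq[(K, V)])(create: V => C, merge: (C, V) => C): Map[K, C] =
    pairs.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(merge(_, v)).getOrElse(create(v)))
    }

  // Same shape as the call quoted above: count the distinct values per key.
  def distinctValueCounts(input: Seq[(String, String)]): Map[String, Int] =
    combineByKey(input.distinct)((_: String) => 1, (agg: Int, _: String) => agg + 1)
}
```

For input `("a","x"), ("a","x"), ("a","y"), ("b","z")` this yields 2 for `a` and 1 for `b`.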

Re: Need some guidance

2015-04-13 Thread Dean Wampler
of the second (each CompactBuffer). An alternative pattern match syntax would be: scala> val i2 = i1.map { case (key, buffer) => (key, buffer.size) } This should work as long as none of the CompactBuffers are too large, which could happen for extremely large data sets. dean Dean Wampler, Ph.D. Author

Re: Spark Scala Version?

2015-04-21 Thread Dean Wampler
Without the rest of your code it's hard to make sense of errors. Why do you need to use reflection? Make sure you use the same Scala versions throughout and 2.10.4 is recommended. That's still the official version for Spark, even though provisional support for 2.11 exists. Dean Wampler, Ph.D

Re: Instantiating/starting Spark jobs programmatically

2015-04-23 Thread Dean Wampler
and interoperating with external processes. Perhaps Java has something similar these days? dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Dean Wampler
set, if necessary. HOWEVER, it actually returns a CompactBuffer. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L444 Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly

Re: Spark Cluster Setup

2015-04-24 Thread Dean Wampler
It's mostly manual. You could try automating with something like Chef, of course, but there's nothing already available in terms of automation. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com

Re: Spark Cluster Setup

2015-04-24 Thread Dean Wampler
The convention for standalone cluster is to use Zookeeper to manage master failover. http://spark.apache.org/docs/latest/spark-standalone.html Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com

Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Apr 29, 2015 at 6:19 AM, Anshul Singhle ans...@betaglide.com wrote

Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Wed, Apr 29, 2015 at 6:25 AM, selim namsi selim.na...@gmail.com wrote

Re: A problem of using spark streaming to capture network packets

2015-04-29 Thread Dean Wampler
I would use the ps command on each machine while the job is running to confirm that every process involved is running as root. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Dean Wampler
Are the tasks on the slaves also running as root? If not, that might explain the problem. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: A problem of using spark streaming to capture network packets

2015-04-28 Thread Dean Wampler
= Pcaps.findAllDevs(); dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com On Mon, Apr 27, 2015 at 4:03 AM, Hai Shan Wu wuh

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
, i do not see it. On Sun, May 3, 2015 at 9:15 PM, Dean Wampler deanwamp...@gmail.com wrote: IMHO, you are trying waaay too hard to optimize work on what is really a small data set. 25G, even 250G, is not that much data, especially if you've spent a month trying to get something to work

Re: Questions about Accumulators

2015-05-03 Thread Dean Wampler
last week I wrote one that used a hash map to track the latest timestamps seen for specific keys. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler
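
The custom accumulator itself is not shown in the excerpt. As a sketch of the core logic only, assuming string keys and `Long` timestamps (both hypothetical), the add and merge operations might look like this in plain Scala:

```scala
object LatestTimestamps {
  type Latest = Map[String, Long]

  // Record one observation, keeping the larger timestamp for the key.
  def add(acc: Latest, key: String, ts: Long): Latest =
    acc.updated(key, math.max(ts, acc.getOrElse(key, Long.MinValue)))

  // Merge two partial results. Taking the max per key is associative and
  // commutative, which is what an accumulator's merge step needs.
  def merge(a: Latest, b: Latest): Latest =
    b.foldLeft(a) { case (acc, (k, ts)) => add(acc, k, ts) }
}
```

Wiring these two functions into Spark's accumulator API is version-specific and is deliberately left out here.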

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning and this talk by Michael Armbrust for example, http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf. dean Dean Wampler, Ph.D. Author: Programming

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
all the optimizations: Kryo, partitionBy, etc. Just use the simplest code you can. Make it work first. Then, if it really isn't fast enough, look for actual evidence of bottlenecks and optimize those. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product

Re: Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread Dean Wampler
Note that each JSON object has to be on a single line in the files. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http://polyglotprogramming.com

Re: value toDF is not a member of RDD object

2015-05-12 Thread Dean Wampler
It's the import statement Olivier showed that makes the method available. Note that you can also use `sqlContext.createDataFrame(myRDD)`, without the need for the import statement. I personally prefer this approach. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com

Re: Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Dean Wampler
) 21).show() I tested and both the $"column" and df("column") syntax works, but I'm wondering which is *preferred*. Is one the original and one a new feature we should be using? Thanks, Diana (Spark Curriculum Developer for Cloudera) -- Dean Wampler, Ph.D. Author: Programming Scala, 2nd

Re: Spark on Windows

2015-04-16 Thread Dean Wampler
If you're running Hadoop, too, now that Hortonworks supports Spark, you might be able to use their distribution. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com

Re: task not serialize

2015-04-06 Thread Dean Wampler
. Whatever you can do to make this work like table scans and joins will probably be most efficient. dean On 7 April 2015 at 03:33, Dean Wampler deanwamp...@gmail.com wrote: The log instance won't be serializable, because it will have a file handle to write to. Try defining another static method

Re: task not serialize

2015-04-06 Thread Dean Wampler
connection, same problem. You can't suppress the warning because it's actually an error. The VoidFunction can't be serialized to send it over the cluster's network. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http

Re: conversion from java collection type to scala JavaRDDObject

2015-04-05 Thread Dean Wampler
The runtime attempts to serialize everything required by records, and also any lambdas/closures you use. Small, simple types are less likely to run into this problem. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe

Re: task not serialize

2015-04-07 Thread Dean Wampler
foreach() runs in parallel across the cluster, like map, flatMap, etc. You'll only run into problems if you call collect(), which brings the entire RDD into memory in the driver program. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do

Re: FlatMapPair run for longer time

2015-04-07 Thread Dean Wampler
the way to Scala, so all that noisy code shrinks down to simpler expressions. You'll be surprised how helpful that is for comprehending your code and reasoning about it. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe

Re: conversion from java collection type to scala JavaRDDObject

2015-04-04 Thread Dean Wampler
Without the rest of your code, it's hard to know what might be unserializable. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: workers no route to host

2015-04-02 Thread Dean Wampler
spark.apache.org. Even the Hadoop builds there will work okay, as they don't actually attempt to run Hadoop commands. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com

Re: conversion from java collection type to scala JavaRDDObject

2015-04-02 Thread Dean Wampler
Use JavaSparkContext.parallelize. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Dean Wampler
Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com
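
The arithmetic behind `local[3]` can be written down as a tiny helper. The names are hypothetical and this is only the rule of thumb stated above (one core per input stream, plus at least one for processing), not a Spark API:

```scala
object LocalCores {
  // One dedicated core per receiver-based input stream, plus at least one
  // core left over for the actual processing.
  def minCores(inputStreams: Int, processingCores: Int = 1): Int = {
    require(inputStreams >= 0 && processingCores >= 1, "invalid core counts")
    inputStreams + processingCores
  }

  // Master string for local mode with the minimum workable core count.
  def localMaster(inputStreams: Int): String =
    s"local[${minCores(inputStreams)}]"
}
```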

Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-02 Thread Dean Wampler
calling take(1) to grab the first element should also work, even if the RDD is empty. (It will return an empty array in that case, but not throw an exception.) dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http

Re: How to learn Spark ?

2015-04-02 Thread Dean Wampler
You're welcome. Two limitations to know about: 1. I haven't updated it to 1.3 2. It uses Scala for all examples (my bias ;), so less useful if you don't want to use Scala. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Dean Wampler
to process it. What's your streaming batch window size? See also here for ideas: http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#performance-tuning Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http

Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-01 Thread Dean Wampler
Is it possible tbBER is empty? If so, it shouldn't fail like this, of course. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: Reading a large file (binary) into RDD

2015-04-03 Thread Dean Wampler
This might be overkill for your needs, but the scodec parser combinator library might be useful for creating a parser. https://github.com/scodec/scodec Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http

Re: UNRESOLVED DEPENDENCIES while building Spark 1.3.0

2015-04-04 Thread Dean Wampler
Use the Maven build instead. From the README in the git repo (https://github.com/apache/spark): mvn -DskipTests clean package Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http

Re: Spark 1.3.1 On Mesos Issues.

2015-06-01 Thread Dean Wampler
running Spark in Mesos, but accessing data in MapR-FS? Perhaps the MapR shim library doesn't support Spark 1.3.1. HTH, dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http

Please add the Chicago Spark Users' Group to the community page

2015-07-06 Thread Dean Wampler
Here's our home page: http://www.meetup.com/Chicago-Spark-Users/ Thanks, Dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler http

Re: Tasks randomly stall when running on mesos

2015-05-25 Thread Dean Wampler
: spark.mesos.coarse true Or, from this page http://spark.apache.org/docs/latest/running-on-mesos.html, set the property in a SparkConf object used to construct the SparkContext: conf.set("spark.mesos.coarse", "true") dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http

Re: Recommended Scala version

2015-05-26 Thread Dean Wampler
Most of the 2.11 issues are being resolved in Spark 1.4. For a while, the Spark project has published maven artifacts that are compiled with 2.11 and 2.10, although the downloads at http://spark.apache.org/downloads.html are still all for 2.10. Dean Wampler, Ph.D. Author: Programming Scala, 2nd

Re: Run scala code with spark submit

2015-08-20 Thread Dean Wampler
at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Spark - Eclipse IDE - Maven

2015-07-29 Thread Dean Wampler
If you don't mind using SBT with your Scala instead of Maven, you can see the example I created here: https://github.com/deanwampler/spark-workshop It can be loaded into Eclipse or IntelliJ Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073

Re: Multiple operations on same DStream in Spark Streaming

2015-07-28 Thread Dean Wampler
integer, then do the filtering and final averaging downstream if you can, i.e., where you actually need the final value. If you need it on every batch iteration, then you'll have to do a reduce per iteration. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product
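
The start of this reply is cut off, so the following reads it as "carry a running sum and count, and divide only where the final value is needed"; that reading is an assumption. A plain-Scala sketch with hypothetical names:

```scala
object RunningAverage {
  // Per-batch partial result: (sum, count). Cheap to combine across batches.
  def summarize(batch: Seq[Int]): (Long, Long) =
    batch.foldLeft((0L, 0L)) { case ((s, c), x) => (s + x, c + 1) }

  // Combine two partial results; associative, so batches can merge in any grouping.
  def combine(a: (Long, Long), b: (Long, Long)): (Long, Long) =
    (a._1 + b._1, a._2 + b._2)

  // Divide only where the final value is needed; None when nothing was seen.
  def average(total: (Long, Long)): Option[Double] =
    if (total._2 == 0) None else Some(total._1.toDouble / total._2)
}
```

Averaging per-batch averages directly would weight small batches too heavily, which is one reason to keep the (sum, count) pair until the end.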

Re: Clustetr setup for SPARK standalone application:

2015-07-28 Thread Dean Wampler
either the master service isn't running or isn't reachable over your network. Is hadoopm0 publicly routable? Is port 7077 blocked? As a test, can you telnet to it? telnet hadoopm0 7077 Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do

Re: How to set log level in spark-submit ?

2015-07-30 Thread Dean Wampler
Did you use an absolute path in $path_to_file? I just tried this with spark-shell v1.4.1 and it worked for me. If the URL is wrong, you should see an error message from log4j that it can't find the file. For windows it would be something like file:/c:/path/to/file, I believe. Dean Wampler, Ph.D

[POWERED BY] Please add Typesafe to the list of organizations

2015-07-31 Thread Dean Wampler
Typesafe (http://typesafe.com). We provide commercial support for Spark on Mesos and Mesosphere DCOS. We contribute to Spark's Mesos integration and Spark Streaming enhancements. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly

Re: spark config

2015-08-07 Thread Dean Wampler
That's the correct URL. Recent change? The last time I looked, earlier this week, it still had the obsolete artifactory URL for URL1 ;) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Dean Wampler
It should work fine. I have an example script here: https://github.com/deanwampler/spark-workshop/blob/master/src/main/scala/sparkworkshop/SparkSQLParquet10-script.scala (Spark 1.4.X) What does "I am failing to do so" mean? Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http

Re: Spark-submit fails when jar is in HDFS

2015-08-09 Thread Dean Wampler
Also, Spark on Mesos supports cluster mode: http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
Following Hadoop conventions, Spark won't overwrite an existing directory. You need to provide a unique output path every time you run the program, or delete or rename the target directory before you run the job. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http
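
One simple way to get a unique output path per run is to append a run identifier to a base path. The helper below is a hypothetical sketch in plain Scala; the timestamp default and the `-` separator are arbitrary choices for illustration:

```scala
object OutputPaths {
  // Append a run identifier so that each run writes to a fresh directory.
  // Defaults to the current time in milliseconds; any unique id would do.
  def uniquePath(base: String, runId: Long = System.currentTimeMillis()): String = {
    val trimmed = base.stripSuffix("/")
    s"$trimmed-$runId"
  }
}
```

The resulting string would then be passed as the output path, for example to saveAsTextFile, in place of a fixed directory name.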

Re: question about spark streaming

2015-08-10 Thread Dean Wampler
. Is that really a mandatory requirement for this problem? HTH, dean

Re: Spark Cassandra Connector issue

2015-08-10 Thread Dean Wampler
Add the other Cassandra dependencies (dse.jar, spark-cassandra-connector-java_2.10) to your --jars argument on the command line.

Re: Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Dean Wampler
org.apache.spark.streaming.twitter.TwitterInputDStream is a small class. You could write your own that lets you change the filters at run time. Then provide a mechanism in your app, like periodic polling of a database table or file for the list of filters.
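
The polling mechanism suggested above could look roughly like the following sketch. It is standard-library Python only and does not touch the Spark Streaming API. The class name `FilterList`, the one-term-per-line file format, and the injectable `clock` parameter are all assumptions made for illustration. A custom input stream would call `current()` whenever it is about to (re)build its filter query.

```python
import time


class FilterList:
    """Keeps a list of filter terms loaded from a text file, one term per line.

    The file is re-read at most once every `interval_s` seconds, so callers
    can ask for the filters as often as they like without hitting the disk.
    """

    def __init__(self, path, interval_s=60.0, clock=time.monotonic):
        self.path = path
        self.interval_s = interval_s
        self._clock = clock          # injectable so the timing can be tested
        self._last_read = None       # clock value at the last reload
        self._filters = []

    def current(self):
        """Return the filter terms, reloading the file if the interval has elapsed."""
        now = self._clock()
        if self._last_read is None or now - self._last_read >= self.interval_s:
            with open(self.path, encoding="utf-8") as f:
                # Skip blank lines and strip surrounding whitespace.
                self._filters = [line.strip() for line in f if line.strip()]
            self._last_read = now
        return list(self._filters)
```

Polling a database table instead of a file would only change the body of the reload step; the interval check stays the same.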

Re: EC2 cluster doesn't work saveAsTextFile

2015-08-10 Thread Dean Wampler
So, just before running the job, if you run the HDFS command hdfs dfs -ls hdfs://172.31.42.10:54310/./weblogReadResult at a shell prompt, does it say the path doesn't exist?

Re: Spark Cassandra Connector issue

2015-08-10 Thread Dean Wampler
where HelloWorld is found. Confusing, yes it is... dean

Re: Comparison between Standalone mode and YARN mode

2015-07-27 Thread Dean Wampler
of ZooKeeper if you need master failover. Hence, you don't see it often in production scenarios. The Spark page on cluster deployments has more details: http://spark.apache.org/docs/latest/cluster-overview.html dean

Re: Spark Streaming Checkpointing solutions

2015-07-21 Thread Dean Wampler
of network overhead. In some situations, a high-performance file system appliance, e.g., NAS, could suffice. My $0.02, dean

Re: Programmatically launch several hundred Spark Streams in parallel

2015-07-24 Thread Dean Wampler
to write output?

Re: Mesos + Spark

2015-07-24 Thread Dean Wampler
is running, then you can use the Spark web UI on that machine to see what the Spark job is doing. dean

Re: Mesos + Spark

2015-07-22 Thread Dean Wampler
, if that works for your every 10-min. need.

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Dean Wampler
/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/Partition/Column for full details. dean

Re: Mesos + Spark

2015-07-24 Thread Dean Wampler
You can certainly start jobs without Chronos, but to automatically restart finished jobs or to run jobs at specific times or periods, you'll want something like Chronos. dean
