Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Steve Nunez
Interesting. What were the Hive settings? Specifically, it would be useful to know if this was Hive on Tez. - Steve From: Sanjay Subramanian Reply-To: Sanjay Subramanian Date: Thursday, June 18, 2015 at 11:08 To: user@spark.apache.org Subject: Spark-sql versus Impala

Re: Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
...@mc10inc.com Date: Sunday, January 25, 2015 at 17:17 To: Steve Nunez snu...@hortonworks.com, user@spark.apache.org Subject: Re: Pairwise Processing of a List So you've got a point

Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
Spark Experts, I've got a list of points, List[(Float, Float)], that represent (x,y) coordinate pairs, and I need to sum the distance. It's easy enough to compute the distance: case class Point(x: Float, y: Float) { def distance(other: Point): Float = sqrt(pow(x - other.x, 2) + pow(y -
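
A minimal sketch of the pairwise sum described in the post, for a local (non-distributed) list only. The helper name pathLength and the enclosing object PathLength are hypothetical, and distance returns Double rather than the Float in the post, since scala.math.sqrt and scala.math.pow work on Double. It pairs each point with its successor by zipping the list against itself shifted by one:

```scala
import scala.math.{pow, sqrt}

// Mirrors the Point case class from the post. Assumption: distance returns
// Double here, because scala.math.sqrt and pow both operate on Double.
final case class Point(x: Float, y: Float) {
  def distance(other: Point): Double =
    sqrt(pow(x - other.x, 2) + pow(y - other.y, 2))
}

object PathLength {
  // Hypothetical helper: sums the distances between consecutive points.
  def pathLength(coords: List[(Float, Float)]): Double = {
    val points = coords.map { case (x, y) => Point(x, y) }
    // Zipping with the list shifted by one yields (p0, p1), (p1, p2), ...
    // Empty and single-element lists produce no pairs, so the sum is 0.0.
    points.zip(points.drop(1)).map { case (a, b) => a.distance(b) }.sum
  }
}
```

Calling PathLength.pathLength(List((0f, 0f), (3f, 4f), (3f, 0f))) adds the lengths of the two segments in order. This relies on the list's ordering, so it would not carry over unchanged to an RDD, where element order across partitions has to be handled explicitly.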

Directory / File Reading Patterns

2015-01-17 Thread Steve Nunez
Hello Users, I've got a real-world use case that seems common enough that its pattern would be documented somewhere, but I can't find any references to a simple solution. The challenge is that data is getting dumped into a directory structure, and that directory structure itself contains

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Steve Nunez
Great stuff. Wonderful to see such progress in so short a time. How about some links to code and instructions so that these benchmarks can be reproduced? Regards, - Steve From: Debasish Das debasish.da...@gmail.com Date: Friday, October 10, 2014 at 8:17 To: Matei Zaharia

FW: Reference Accounts Large Node Deployments

2014-08-28 Thread Steve Nunez
Anyone? No customers using streaming at scale? From: Steve Nunez snu...@hortonworks.com Date: Wednesday, August 27, 2014 at 9:08 To: user@spark.apache.org user@spark.apache.org Subject: Reference Accounts Large Node Deployments All, Does anyone have specific references to customers

Reference Accounts Large Node Deployments

2014-08-27 Thread Steve Nunez
All, Does anyone have specific references to customers, use cases and large-scale deployments of Spark Streaming? By 'large scale' I mean both throughput and number of nodes. I'm attempting an objective comparison of Streaming and Storm and while this data is known for Storm, there appears to

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
I don’t think there is an hwx profile, but there probably should be. - Steve From: Patrick Wendell pwend...@gmail.com Date: Monday, August 4, 2014 at 10:08 To: Ron's Yahoo! zlgonza...@yahoo.com Cc: Ron's Yahoo! zlgonza...@yahoo.com.invalid, Steve Nunez snu...@hortonworks.com, user

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
purist but just that I am not sure these are things that the project can meaningfully bother with. It makes sense to set vendor repos in the pom for convenience, and makes sense to run smoke tests in Jenkins against particular versions. $0.02 Sean On Mon, Aug 4, 2014 at 6:21 PM, Steve Nunez snu

MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Steve Nunez
).distinct.count Cheers, - Steve Nunez

Emacs Setup Anyone?

2014-07-24 Thread Steve Nunez
Anyone out there have a good configuration for emacs? Scala-mode sort of works, but I'd love to see a fully-supported spark-mode with an inferior shell. Searching didn't turn up much of anything. Any emacs users out there? What setup are you using? Cheers, - SteveN

Re: Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Steve Nunez
I'm also in the early stages of setting up long-running Spark jobs. The easiest way I've found is to set up a cluster and submit the job via YARN. Then I can come back and check in on progress when I need to. Seems the trick is tuning the queue priority and YARN preemption to get the job to run in a