Re: Spark-sql versus Impala versus Hive

2015-06-18 Thread Steve Nunez
Interesting. What were the Hive settings? Specifically, it would be useful to
know whether this was Hive on Tez.

- Steve

From: Sanjay Subramanian
Reply-To: Sanjay Subramanian
Date: Thursday, June 18, 2015 at 11:08
To: user@spark.apache.org
Subject: Spark-sql versus Impala versus Hive

I just published the results of my findings here:
https://bigdatalatte.wordpress.com/2015/06/18/spark-sql-versus-impala-versus-hive/




Re: Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
Not combinations, linear distances. E.g., given List[ (x1,y1), (x2,y2),
(x3,y3) ], compute the sum of:

the distance between (x1,y1) and (x2,y2), and
the distance between (x2,y2) and (x3,y3).

Imagine that the list of coordinate points comes from a GPS and describes a trip.
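
For what it's worth, a minimal sketch of that consecutive-pairs idea in plain
Scala (illustrative data; no Spark needed while the trip fits in a local List):

// Pair each point with its successor, then sum the segment lengths.
val trip = List((0.0f, 0.0f), (3.0f, 4.0f), (6.0f, 8.0f))

def dist(a: (Float, Float), b: (Float, Float)): Double =
  math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

val totalDistance = trip.zip(trip.tail).map { case (p, q) => dist(p, q) }.sum
// trip.sliding(2) would work equally well for the pairing step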

- Steve

From: Joseph Lust jl...@mc10inc.com
Date: Sunday, January 25, 2015 at 17:17
To: Steve Nunez snu...@hortonworks.com, user@spark.apache.org
Subject: Re: Pairwise Processing of a List

So you've got a point A and you want the sum of distances between it and all 
other points? Or am I misunderstanding you?

// target point; could be a Broadcast variable sent to all workers
val tarPt = (10, 20)
val pts = Seq((2, 2), (3, 3), (2, 3), (10, 2))
val rdd = sc.parallelize(pts)
rdd.map( pt => Math.sqrt(Math.pow(tarPt._1 - pt._1, 2) + Math.pow(tarPt._2 - pt._2, 2)) )
   .reduce( (d1, d2) => d1 + d2 )

-Joe

From: Steve Nunez snu...@hortonworks.com
Date: Sunday, January 25, 2015 at 7:32 PM
To: user@spark.apache.org
Subject: Pairwise Processing of a List

Spark Experts,

I've got a list of points: List[(Float, Float)] that represents (x,y)
coordinate pairs, and I need to sum the distances. It's easy enough to compute
the distance between two points:

import scala.math.{pow, sqrt}

case class Point(x: Float, y: Float) {
  def distance(other: Point): Float =
    sqrt(pow(x - other.x, 2) + pow(y - other.y, 2)).toFloat
}

(in this case I create a 'Point' class, but the maths are the same).

What I can't figure out is the 'right' way to sum distances between all the 
points. I can make this work by traversing the list with a for loop and using 
indices, but this doesn't seem right.

Anyone know a clever way to process a List[(Float, Float)] in a pairwise fashion?

Regards,
- Steve





Pairwise Processing of a List

2015-01-25 Thread Steve Nunez
Spark Experts,

I've got a list of points: List[(Float, Float)] that represents (x,y)
coordinate pairs, and I need to sum the distances. It's easy enough to compute
the distance between two points:

import scala.math.{pow, sqrt}

case class Point(x: Float, y: Float) {
  def distance(other: Point): Float =
    sqrt(pow(x - other.x, 2) + pow(y - other.y, 2)).toFloat
}

(in this case I create a 'Point' class, but the maths are the same).

What I can't figure out is the 'right' way to sum distances between all the 
points. I can make this work by traversing the list with a for loop and using 
indices, but this doesn't seem right.

Anyone know a clever way to process a List[(Float, Float)] in a pairwise fashion?

Regards,
- Steve
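
For the distributed case, a hedged sketch using MLlib's experimental sliding()
helper (org.apache.spark.mllib.rdd.RDDFunctions, present in recent releases,
1.1+ if memory serves); sc is the usual SparkContext and the data is
illustrative. Consecutive pairing only makes sense if the RDD preserves the
trip's point ordering:

// Window the RDD into consecutive pairs, then sum the segment lengths.
import org.apache.spark.mllib.rdd.RDDFunctions._

val points = sc.parallelize(Seq((0.0f, 0.0f), (3.0f, 4.0f), (6.0f, 8.0f)), numSlices = 1)

val total = points
  .sliding(2)                    // RDD[Array[(Float, Float)]] of consecutive pairs
  .map { case Array(p, q) =>
    math.sqrt(math.pow(p._1 - q._1, 2) + math.pow(p._2 - q._2, 2))
  }
  .reduce(_ + _)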




Directory / File Reading Patterns

2015-01-17 Thread Steve Nunez
Hello Users,

I've got a real-world use case that seems common enough that its pattern would 
be documented somewhere, but I can't find any references to a simple solution. 
The challenge is that data is getting dumped into a directory structure, and 
that directory structure itself contains features that I need in my model. For 
example:

bank_code/
  Trader/
    Day-1.csv
    Day-2.csv
    ...

Each CSV file contains a list of all the trades made by that individual each
day. The problem is that the bank and trader should be part of the feature set,
i.e. we need the RDD to look like:
(bank, trader, day, list-of-trades)

Anyone got any elegant solutions for doing this?
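
One possible sketch, assuming the layout above and Spark's wholeTextFiles
(Spark 1.0+), which reads each file whole and keys it by its path; the root
path, glob, and line parsing below are illustrative placeholders, and this
only suits files small enough to hold in memory one record at a time:

// (path, content) pairs; recover bank/trader/day from the path itself.
val trades = sc.wholeTextFiles("hdfs:///data/trades/*/*/*.csv").map {
  case (path, content) =>
    val parts  = path.split("/")
    val bank   = parts(parts.length - 3)
    val trader = parts(parts.length - 2)
    val day    = parts.last.stripSuffix(".csv")
    (bank, trader, day, content.split("\n").toList)  // parse lines into trade records as needed
}

For very large per-trader files, a custom Hadoop InputFormat or per-directory
textFile calls that tag records with their path would avoid the whole-file reads.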

Cheers,
- SteveN






Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Steve Nunez
Great stuff. Wonderful to see such progress in so short a time.

How about some links to code and instructions so that these benchmarks can
be reproduced?

Regards,
- Steve

From:  Debasish Das debasish.da...@gmail.com
Date:  Friday, October 10, 2014 at 8:17
To:  Matei Zaharia matei.zaha...@gmail.com
Cc:  user user@spark.apache.org, dev d...@spark.apache.org
Subject:  Re: Breaking the previous large-scale sort record with Spark

 Awesome news Matei !
 
 Congratulations to the databricks team and all the community members...
 
 On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some pretty
 cool news for the project, which is that we've been able to use Spark to
 break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x
 fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world
 record by sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23
 minutes on 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few weeks,
 along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In
 addition, we'd really like to thank Amazon's EC2 team for providing the
 machines to make this possible. Finally, this result would of course not be
 possible without the many many other contributions, testing and feature
 requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and it's
 thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 





FW: Reference Accounts & Large Node Deployments

2014-08-28 Thread Steve Nunez
Anyone? No customers using streaming at scale?


From:  Steve Nunez snu...@hortonworks.com
Date:  Wednesday, August 27, 2014 at 9:08
To:  user@spark.apache.org user@spark.apache.org
Subject:  Reference Accounts & Large Node Deployments

 All,
 
 Does anyone have specific references to customers, use cases and large-scale
 deployments of Spark Streaming? By 'large scale' I mean both throughput and
 number of nodes. I'm attempting an objective comparison of Streaming and Storm
 and while this data is known for Storm, there appears to be little for Spark
 Streaming. If you know of any such deployments, please post them here because
 I am sure I'm not the only one wondering about this. If customer
 confidentiality prevents mentioning them by name, consider identifying them by
 industry, e.g. 'telco doing X with streaming using Y nodes'.
 
 Any information at all will be welcome. I'll feed back a summary and/or update
 a wiki page once I collate the information.
 
 Cheers,
 - Steve
 
 
 





Reference Accounts & Large Node Deployments

2014-08-27 Thread Steve Nunez
All,

Does anyone have specific references to customers, use cases and large-scale
deployments of Spark Streaming? By 'large scale' I mean both throughput and
number of nodes. I'm attempting an objective comparison of Streaming and
Storm and while this data is known for Storm, there appears to be little for
Spark Streaming. If you know of any such deployments, please post them here
because I am sure I'm not the only one wondering about this. If customer
confidentiality prevents mentioning them by name, consider identifying them
by industry, e.g. 'telco doing X with streaming using Y nodes'.

Any information at all will be welcome. I'll feed back a summary and/or
update a wiki page once I collate the information.

Cheers,
- Steve








Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
I don’t think there is an hwx profile, but there probably should be.

- Steve

From:  Patrick Wendell pwend...@gmail.com
Date:  Monday, August 4, 2014 at 10:08
To:  Ron's Yahoo! zlgonza...@yahoo.com
Cc:  Ron's Yahoo! zlgonza...@yahoo.com.invalid, Steve Nunez
snu...@hortonworks.com, user@spark.apache.org, d...@spark.apache.org
d...@spark.apache.org
Subject:  Re: Issues with HDP 2.4.0.2.1.3.0-563

Ah I see, yeah you might need to set hadoop.version and yarn.version. I
thought the profile set this automatically.


On Mon, Aug 4, 2014 at 10:02 AM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
 I meant yarn and hadoop defaulted to 1.0.4 so the yarn build fails since 1.0.4
 doesn’t exist for yarn...
 
 Thanks,
 Ron
 
 On Aug 4, 2014, at 10:01 AM, Ron's Yahoo! zlgonza...@yahoo.com wrote:
 
 That failed since it defaulted the versions for yarn and hadoop
 I’ll give it a try with just 2.4.0 for both yarn and hadoop…
 
 Thanks,
 Ron
 
 On Aug 4, 2014, at 9:44 AM, Patrick Wendell pwend...@gmail.com wrote:
 
 Can you try building without any of the special `hadoop.version` flags and
 just building only with -Phadoop-2.4? In the past users have reported issues
 trying to build random spot versions... I think HW is supposed to be
 compatible with the normal 2.4.0 build.
 
 
 On Mon, Aug 4, 2014 at 8:35 AM, Ron's Yahoo! zlgonza...@yahoo.com.invalid
 wrote:
 Thanks, I ensured that $SPARK_HOME/pom.xml had the HDP repository under the
 repositories element. I also confirmed that if the build couldn’t find the
 version, it would fail fast so it seems as if it’s able to get the versions
 it needs to build the distribution.
 I ran the following (generated from make-distribution.sh), but it did not
 address the problem, while building with an older version
 (2.4.0.2.1.2.0-402) worked. Any other thing I can try?
 
 mvn clean package -Phadoop-2.4 -Phive -Pyarn
 -Dyarn.version=2.4.0.2.1.2.0-563 -Dhadoop.version=2.4.0.2.1.3.0-563
 -DskipTests
 
 
 Thanks,
 Ron
 
 
 On Aug 4, 2014, at 7:13 AM, Steve Nunez snu...@hortonworks.com wrote:
 
 Provided you've got the HWX repo in your pom.xml, you can build with this
 line:
 
 mvn -Pyarn -Phive -Phadoop-2.4 -Dhadoop.version=2.4.0.2.1.1.0-385
 -DskipTests clean package
 
 I haven't tried building a distro, but it should be similar.
 
 
 - SteveN
 
 On 8/4/14, 1:25, Sean Owen so...@cloudera.com wrote:
 
 For any Hadoop 2.4 distro, yes, set hadoop.version but also set
 -Phadoop-2.4.
 http://spark.apache.org/docs/latest/building-with-maven.html
 
 On Mon, Aug 4, 2014 at 9:15 AM, Patrick Wendell pwend...@gmail.com
 wrote:
 For hortonworks, I believe it should work to just link against the
 corresponding upstream version. I.e. just set the Hadoop version to
 2.4.0
 
 Does that work?
 
 - Patrick
 
 
 On Mon, Aug 4, 2014 at 12:13 AM, Ron's Yahoo!
 zlgonza...@yahoo.com.invalid
 wrote:
 
 Hi,
  Not sure whose issue this is, but if I run make-distribution using
 HDP
 2.4.0.2.1.3.0-563 as the hadoop version (replacing it in
 make-distribution.sh), I get a strange error with the exception below.
 If I
 use a slightly older version of HDP (2.4.0.2.1.2.0-402) with
 make-distribution, using the generated assembly all works fine for me.
 Either 1.0.0 or 1.0.1 will work fine.
 
  Should I file a JIRA or is this a known issue?
 
 Thanks,
 Ron
 
 Exception in thread "main" org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 0.0:0 failed 1 times, most recent failure:
 Exception failure in TID 0 on host localhost:
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
     org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
     org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
     org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
     org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
     org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
     org.apache.spark.scheduler.Task.run(Task.scala:51)
     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     java.lang.Thread.run(Thread.java:745)

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Steve Nunez
Hmm. Fair enough. I hadn't given that answer much thought and on
reflection think you're right in that a profile would just be a bad hack.



On 8/4/14, 10:35, Sean Owen so...@cloudera.com wrote:

What would such a profile do though? In general building for a
specific vendor version means setting hadoop.version and/or
yarn.version. Any hard-coded value is unlikely to match what a
particular user needs. Setting protobuf versions and so on is already
done by the generic profiles.

In a similar vein, I am not clear on why there's a mapr profile in the
build. Its versions are about to be out of date and won't work with
upcoming HBase changes, for example.

(Elsewhere in the build I think it wouldn't hurt to clear out
cloudera-specific profiles and releases too -- they're not in the pom
but are in the distribution script. It's the vendor's problem.)

This isn't any argument about being purist but just that I am not sure
these are things that the project can meaningfully bother with.

It makes sense to set vendor repos in the pom for convenience, and
makes sense to run smoke tests in Jenkins against particular versions.

$0.02
Sean

On Mon, Aug 4, 2014 at 6:21 PM, Steve Nunez snu...@hortonworks.com
wrote:
 I don't think there is an hwx profile, but there probably should be.





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



MovieLensALS - Scala Pattern Magic

2014-08-04 Thread Steve Nunez
Can one of the Scala experts please explain this bit of pattern magic from
the Spark ML tutorial: _._2.user ?

As near as I can tell, this is applying the _2 function to the wildcard, and
then applying the 'user' function to that. In a similar way the 'product'
function is applied in the next line, yet these functions don't seem to
exist anywhere in the project, nor are they used anywhere else in the code.
It almost makes sense, but not quite. Code below:


val ratings = sc.textFile(new File(movieLensHomeDir, "ratings.dat").toString).map { line =>
  val fields = line.split("::")
  // format: (timestamp % 10, Rating(userId, movieId, rating))
  (fields(3).toLong % 10,
   Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
}
…
val numRatings = ratings.count
val numUsers = ratings.map(_._2.user).distinct.count
val numMovies = ratings.map(_._2.product).distinct.count
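
For reference, _._2.user is just placeholder syntax for an anonymous function:
each element of ratings is a (Long, Rating) tuple, ._2 picks the Rating
(presumably org.apache.spark.mllib.recommendation.Rating, as in the MLlib
tutorial), and .user / .product are fields of that case class. A sketch of an
equivalent, more explicit form:

// Same counts, written with an explicit pattern match instead of _._2.user
val numUsersExplicit  = ratings.map { case (_, rating) => rating.user }.distinct.count
val numMoviesExplicit = ratings.map { case (_, rating) => rating.product }.distinct.count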

Cheers,
- Steve Nunez





Emacs Setup Anyone?

2014-07-24 Thread Steve Nunez
Anyone out there have a good configuration for emacs? Scala-mode sort of
works, but I'd love to see a fully-supported spark-mode with an inferior
shell. Searching didn't turn up much of anything.

Any emacs users out there? What setup are you using?

Cheers,
- SteveN







Re: Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Steve Nunez
I'm also in the early stages of setting up long-running Spark jobs. The easiest
way I've found is to set up a cluster and submit the job via YARN. Then I can
come back and check in on progress when I need to. The trick seems to be tuning
the queue priority and YARN preemption so the job runs in a reasonable
amount of time without disrupting the other jobs.

- SteveN


From:  Chris Schneider ch...@christopher-schneider.com
Reply-To:  user@spark.apache.org
Date:  Wednesday, July 23, 2014 at 7:39
To:  user@spark.apache.org
Subject:  Cluster submit mode - only supported on Yarn?

We are starting to use Spark, but we don't have any existing infrastructure
related to big-data, so we decided to setup the standalone cluster, rather
than mess around with Yarn or Mesos.

But it appears like the driver program has to stay up on the client for the
full duration of the job (client mode).

What is the simplest way to setup cluster submission mode, to allow our
client boxes to submit jobs and then move on with the other work they need
to do without keeping a potentially long running java process up?

Thanks,
Chris





