branch-1.2 has been cut

2014-11-03 Thread Patrick Wendell
Hi All,

I've just cut the release branch for Spark 1.2, consistent with the
end of the scheduled feature window for the release. New commits to
master will need to be explicitly merged into branch-1.2 in order to
be in the release.

This begins the transition into a QA period for Spark 1.2, with a
focus on testing and fixes. A few smaller features may still go in as
folks wrap up loose ends in the next 48 hours (or for developments in
alpha components).

To help with QA, I'll try to package up a SNAPSHOT release soon for
community testing; this worked well when testing Spark 1.1 before
official votes started. I might give it a few days to allow committers
to merge in back-logged fixes and other patches that were punted to
after the feature freeze.

Thanks to everyone who helped author and review patches over the last few weeks!

- Patrick




Re: sbt scala compiler crashes on spark-sql

2014-11-03 Thread Imran Rashid
Thanks everyone, that worked. I had just been cleaning the sql project,
which wasn't enough, but after a full clean of everything it's happy now.

Just in case this helps anybody else come up with steps to reproduce: for
me the error was always in DataTypeConversions.scala, and I think it
*might* have started after I did a maven build as well.


Re: branch-1.2 has been cut

2014-11-03 Thread Nicholas Chammas
Minor question, but when would be the right time to update the default
Spark version in the EC2 script?
https://github.com/apache/spark/blob/76386e1a23c55a58c0aeea67820aab2bac71b24b/ec2/spark_ec2.py#L42

On Mon, Nov 3, 2014 at 3:55 AM, Patrick Wendell pwend...@gmail.com wrote:

 Hi All,

 I've just cut the release branch for Spark 1.2, consistent with the
 end of the scheduled feature window for the release. New commits to
 master will need to be explicitly merged into branch-1.2 in order to
 be in the release.

 This begins the transition into a QA period for Spark 1.2, with a
 focus on testing and fixes. A few smaller features may still go in as
 folks wrap up loose ends in the next 48 hours (or for developments in
 alpha components).

 To help with QA, I'll try to package up a SNAPSHOT release soon for
 community testing; this worked well when testing Spark 1.1 before
 official votes started. I might give it a few days to allow committers
 to merge in back-logged fixes and other patches that were punted to
 after the feature freeze.

 Thanks to everyone who helped author and review patches over the last few
 weeks!

 - Patrick





Re: Surprising Spark SQL benchmark

2014-11-03 Thread ozgun
Hey Patrick,

It's Ozgun from Citus Data. We'd like to make these benchmark results fair,
and have tried different config settings for SparkSQL over the past month.
We picked the best config settings we could find, and also contacted the
Spark users list about running TPC-H numbers.

http://goo.gl/IU5Hw0
http://goo.gl/WQ1kML
http://goo.gl/ihLzgh

We also received advice at the Spark Summit '14 to wait until v1.1, and
therefore re-ran our tests on SparkSQL 1.1. On the specific optimizations,
Marco and Samay from our team have much more context, and I'll let them
answer your questions on the different settings we tried.

Our intent is to be fair and not misrepresent SparkSQL's performance. On
that front, we used publicly available documentation and user lists, and
spent about a month trying to get the best Spark performance results. If
there are specific optimizations we should have applied and missed, we'd
love to be involved with the community in re-running the numbers.

Is this email thread the best place to continue the conversation?

Best,
Ozgun







Re: matrix factorization cross validation

2014-11-03 Thread Debasish Das
I added a driver for precisionAt(k: Int) to the movielens test cases...
although I am a bit confused by the precisionAt(k: Int) code in
RankingMetrics.scala...

While cross validating, I am really not sure how to set K...

if (labSet.nonEmpty) { val n = math.min(pred.length, k) ... }

If I make k a function of pred.length, i.e. val n = math.min(pred.length,
k*pred.length), then I can vary k between 0 and 1 and choose the sweet spot
for K on a given dataset, but I am not sure whether that is a measure that
makes sense for recommendation...

MAP is something that makes sense, as it is an average over the whole test
set...
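
To make the fractional-k idea concrete, here is a minimal, self-contained
Scala sketch (an illustration only, not the actual RankingMetrics.scala code;
the object and method names are made up):

  object FractionalPrecision {
    // Precision at a *fractional* cutoff: k in (0, 1] is a per-user fraction
    // of that user's predicted list rather than a fixed count.
    def precisionAtFraction(pred: Array[Int], labSet: Set[Int], k: Double): Double = {
      require(k > 0.0 && k <= 1.0, "k must be in (0, 1]")
      if (labSet.isEmpty) 0.0
      else {
        // the cutoff scales with the length of this user's prediction list
        val n = math.max(1, math.ceil(k * pred.length).toInt)
        val hits = pred.take(n).count(labSet.contains)
        hits.toDouble / n
      }
    }

    def main(args: Array[String]): Unit = {
      val pred = Array(1, 5, 3, 9, 2)  // ranked recommendations for one user
      val relevant = Set(1, 3, 7)      // held-out items the user actually liked
      println(precisionAtFraction(pred, relevant, 0.4)) // top 40% => top 2 => 0.5
    }
  }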


On Fri, Oct 31, 2014 at 1:26 AM, Sean Owen so...@cloudera.com wrote:

 No, excepting approximate methods like LSH to figure out the
 relatively small set of candidates for the users in the partition, and
 broadcast or join those.

 On Fri, Oct 31, 2014 at 5:45 AM, Nick Pentreath
 nick.pentre...@gmail.com wrote:
  Sean, re my point earlier do you know a more efficient way to compute
 top k
  for each user, other than to broadcast the item factors?
 
  (I guess one can perhaps use the new asymmetric LSH paper to assist)
 
  —
  Sent from Mailbox
 
 
  On Thu, Oct 30, 2014 at 11:24 PM, Sean Owen so...@cloudera.com wrote:
 
  MAP is effectively an average over all k from 1 to min(# recommendations,
  # items rated). Getting the first recommendations right is more important
  than the last.
 
  On Thu, Oct 30, 2014 at 10:21 PM, Debasish Das 
 debasish.da...@gmail.com
  wrote:
   Does it make sense to have a user-specific K, or is K considered the same
   over all users?
  
   Intuitively, users who watch more movies should get a higher K than the
   others...
  
 
 



MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Debasish Das
Hi,

I am testing MatrixFactorizationModel.predict(user: Int, product: Int), but
the code fails on userFeatures.lookup(user).head.

In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) is called,
and all the test cases use that API...

I can perhaps refactor my code to do the same, but I was wondering whether
people test the lookup(user) version of the code...

Do I need to cache the model to make it work? I think the default right now
is STORAGE_AND_DISK...
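
For reference, a rough sketch of the two call patterns being compared (only
the model.predict calls are the real MLlib API; the helper names and setup
are made up for illustration):

  import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
  import org.apache.spark.rdd.RDD

  // Single-pair API: this is the path that does userFeatures.lookup(user).head
  // internally, which is where the failure above shows up.
  def scoreOne(model: MatrixFactorizationModel, user: Int, product: Int): Double =
    model.predict(user, product)

  // Bulk API used by computeRmse in the examples: it joins against the factor
  // RDDs instead of doing a per-key lookup.
  def scoreMany(model: MatrixFactorizationModel, userProducts: RDD[(Int, Int)]): RDD[Rating] =
    model.predict(userProducts)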

Thanks.
Deb


Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matt Cheah
Hi everyone,

I'm running into more and more cases where too many files are opened when
spark.shuffle.consolidateFiles is turned off.

I was wondering if this is a common scenario among the rest of the
community, and if so, whether it is worth turning the setting on by default.
From the documentation, it seems like performance could be hurt on ext3 file
systems. However, what are the concrete performance degradation numbers that
are typically seen? A 2x slowdown in the average job? 3x? Also, what causes
the performance degradation on ext3 file systems specifically?
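
For anyone reproducing this, a minimal sketch of flipping the setting per
application (the app name is made up; the setting stays off unless enabled):

  import org.apache.spark.{SparkConf, SparkContext}

  // Opt in to shuffle file consolidation explicitly for this application.
  val conf = new SparkConf()
    .setAppName("shuffle-consolidation-test")
    .set("spark.shuffle.consolidateFiles", "true")
  val sc = new SparkContext(conf)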

Thanks,

-Matt Cheah









Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have 
better performance while creating fewer files. So I'd suggest trying that too.
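
A minimal sketch of opting into the sort-based shuffle Matei mentions (the
app name is made up; in the 1.1 line the hash-based manager is still the
default):

  import org.apache.spark.{SparkConf, SparkContext}

  // Select the sort-based shuffle, which writes roughly one sorted output
  // file per map task instead of one file per reducer.
  val conf = new SparkConf()
    .setAppName("sort-shuffle-test")
    .set("spark.shuffle.manager", "sort")
  val sc = new SparkContext(conf)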

Matei

 On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote:
 
 Hey Matt,
 
 There's some prior work that compares consolidation performance on some
 medium-scale workload:
 http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
 
 There we noticed about 2x performance degradation in the reduce phase on
 ext3. I am not aware of any other concrete numbers. Maybe others have more
 experiences to add.
 
 -Andrew
 
 2014-11-03 17:26 GMT-08:00 Matt Cheah mch...@palantir.com:
 
 Hi everyone,
 
 I'm running into more and more cases where too many files are opened when
 spark.shuffle.consolidateFiles is turned off.
 
  I was wondering if this is a common scenario among the rest of the
  community, and if so, whether it is worth turning the setting on by default.
  From the documentation, it seems like performance could be hurt on ext3 file
  systems. However, what are the concrete performance degradation numbers that
  are typically seen? A 2x slowdown in the average job? 3x? Also, what causes
  the performance degradation on ext3 file systems specifically?
 
 Thanks,
 
 -Matt Cheah
 
 
 





Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Matei Zaharia
(BTW, this had a bug with negative hash codes in 1.1.0, so you should try
branch-1.1 for it.)

Matei

 On Nov 3, 2014, at 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have 
 better performance while creating fewer files. So I'd suggest trying that too.
 
 Matei
 
 On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote:
 
 Hey Matt,
 
 There's some prior work that compares consolidation performance on some
 medium-scale workload:
 http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
 
 There we noticed about 2x performance degradation in the reduce phase on
 ext3. I am not aware of any other concrete numbers. Maybe others have more
 experiences to add.
 
 -Andrew
 
 2014-11-03 17:26 GMT-08:00 Matt Cheah mch...@palantir.com:
 
 Hi everyone,
 
 I'm running into more and more cases where too many files are opened when
 spark.shuffle.consolidateFiles is turned off.
 
  I was wondering if this is a common scenario among the rest of the
  community, and if so, whether it is worth turning the setting on by default.
  From the documentation, it seems like performance could be hurt on ext3 file
  systems. However, what are the concrete performance degradation numbers that
  are typically seen? A 2x slowdown in the average job? 3x? Also, what causes
  the performance degradation on ext3 file systems specifically?
 
 Thanks,
 
 -Matt Cheah
 
 
 
 





Re: Spark shuffle consolidateFiles performance degradation numbers

2014-11-03 Thread Zach Fry
Hey Andrew, Matei,

Thanks for responding.

For some more context, we were running into "Too many open files" issues,
which we saw immediately after the collect phase (about 30 seconds into a
run) on a decently sized dataset (14 MM rows). The ulimit set in spark-env
was 256,000, which we believe should have been enough, but even with it set
at that number we were still seeing issues.
Can you comment on what a good ulimit should be in these cases?

We believe what might have caused this is that some process got orphaned
without cleaning up its open file handles. However, other than anecdotal
evidence and some speculation, we don't have much evidence to expand on this
further.

We were wondering if we could get some more information about how many
files get opened during a shuffle. We discussed that it is going to be
around N x M, where N is the number of map tasks and M is the number of
reducers. Does this sound about right?
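
A quick back-of-the-envelope sketch of that estimate (the consolidated
formula is an approximation, assuming file groups are shared per executor
core, and the numbers below are made up):

  // Without consolidation, each map task writes one file per reduce partition.
  def filesWithoutConsolidation(mapTasks: Int, reduceTasks: Int): Long =
    mapTasks.toLong * reduceTasks

  // With consolidation, concurrently running tasks on the same core reuse a
  // file group, so the count scales with cores rather than with map tasks.
  def filesWithConsolidation(executors: Int, coresPerExecutor: Int, reduceTasks: Int): Long =
    executors.toLong * coresPerExecutor * reduceTasks

  // e.g. 2000 map tasks x 500 reducers = 1,000,000 files without consolidation,
  // versus 10 executors x 16 cores x 500 reducers = 80,000 with it.
  println(filesWithoutConsolidation(2000, 500))  // 1000000
  println(filesWithConsolidation(10, 16, 500))   // 80000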


Are there any other considerations we should be aware of when setting
spark.shuffle.consolidateFiles to true?

Thanks, 
Zach Fry
Palantir | Developer Support Engineer
z...@palantir.com | 650.226.6338



On 11/3/14 6:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will
have better performance while creating fewer files. So I'd suggest trying
that too.

Matei

 On Nov 3, 2014, at 6:12 PM, Andrew Or and...@databricks.com wrote:
 
 Hey Matt,
 
 There's some prior work that compares consolidation performance on some
 medium-scale workload:
 
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
 
 There we noticed about 2x performance degradation in the reduce phase on
 ext3. I am not aware of any other concrete numbers. Maybe others have
more
 experiences to add.
 
 -Andrew
 
 2014-11-03 17:26 GMT-08:00 Matt Cheah mch...@palantir.com:
 
 Hi everyone,
 
  I'm running into more and more cases where too many files are opened when
  spark.shuffle.consolidateFiles is turned off.
 
  I was wondering if this is a common scenario among the rest of the
  community, and if so, whether it is worth turning the setting on by default.
  From the documentation, it seems like performance could be hurt on ext3 file
  systems. However, what are the concrete performance degradation numbers that
  are typically seen? A 2x slowdown in the average job? 3x? Also, what causes
  the performance degradation on ext3 file systems specifically?
 
 Thanks,
 
 -Matt Cheah
 
 
 






Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-03 Thread Xiangrui Meng
Was the user present in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui
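
A rough sketch of the kind of guard being suggested (an illustration only,
not the actual MLlib change; the helper name is made up):

  import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

  // Check the lookup results before taking .head, and return NaN when the
  // user (or product) id is absent from the trained model.
  def safePredict(model: MatrixFactorizationModel, user: Int, product: Int): Double = {
    val userVec = model.userFeatures.lookup(user).headOption
    val prodVec = model.productFeatures.lookup(product).headOption
    (userVec, prodVec) match {
      case (Some(u), Some(p)) => u.zip(p).map { case (a, b) => a * b }.sum // dot product
      case _                  => Double.NaN
    }
  }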

On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote:
 Hi,

 I am testing MatrixFactorizationModel.predict(user: Int, product: Int), but
 the code fails on userFeatures.lookup(user).head.

 In computeRmse, MatrixFactorizationModel.predict(RDD[(Int, Int)]) is called,
 and all the test cases use that API...

 I can perhaps refactor my code to do the same, but I was wondering whether
 people test the lookup(user) version of the code...

 Do I need to cache the model to make it work? I think the default right now
 is STORAGE_AND_DISK...

 Thanks.
 Deb
