Re: Temporary jenkins issue

2015-02-08 Thread Josh Rosen
It looks like this may be fixed soon in Jenkins:

https://issues.jenkins-ci.org/browse/JENKINS-25446
https://github.com/jenkinsci/flaky-test-handler-plugin/pull/1

On February 2, 2015 at 7:38:19 PM, Patrick Wendell (pwend...@gmail.com) wrote:

Hey All, 

I made a change to the Jenkins configuration that caused most builds 
to fail (attempting to enable a new plugin), I've reverted the change 
effective about 10 minutes ago. 

If you've seen recent build failures like below, this was caused by 
that change. Sorry about that. 

 
ERROR: Publisher
com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver
aborted due to exception
java.lang.NoSuchMethodError:
hudson.model.AbstractBuild.getTestResultAction()Lhudson/tasks/test/AbstractTestResultAction;
    at com.google.jenkins.flakyTestHandler.plugin.FlakyTestResultAction.init(FlakyTestResultAction.java:78)
    at com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver.perform(JUnitFlakyResultArchiver.java:89)
    at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
    at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
    at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
    at hudson.model.Build$BuildExecution.post2(Build.java:183)
    at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
    at hudson.model.Run.execute(Run.java:1784)
    at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
    at hudson.model.ResourceController.execute(ResourceController.java:89)
    at hudson.model.Executor.run(Executor.java:240)
 

- Patrick 

- 
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
For additional commands, e-mail: dev-h...@spark.apache.org 



Re: Data source API | sizeInBytes should be to *Scan

2015-02-08 Thread Aniket Bhatnagar
Thanks for looking into this. If this is true, isn't it an issue today? The
default implementation of sizeInBytes is 1 + the broadcast threshold. So, if
Catalyst's cardinality estimation estimates even a small filter
selectivity, it will result in broadcasting the relation. Therefore,
shouldn't the default be much higher than the broadcast threshold?

Also, since the default implementation of sizeInBytes already exists in
BaseRelation, I am not sure why the same/similar default implementation
can't be provided in the *Scan-specific sizeInBytes functions, with
Catalyst always trusting the size returned by the data source API (the
default implementation being to never broadcast). Another thing that could
be done is to have sizeInBytes return Option[Long] so that Catalyst
explicitly knows when the data source was able to estimate the size. The
reason I would push for sizeInBytes in the *Scan interfaces is that at times
the data source implementation can predict the output size more accurately.
For example, data source implementations for MongoDB, Elasticsearch,
Cassandra, etc. can easily use filter pushdowns to query the underlying
storage and predict the size. Such predictions will be more accurate than
Catalyst's. Therefore, if it's not a fundamental change in Catalyst, I would
think this makes sense.
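
To make the Option[Long] idea concrete, here is a rough Scala sketch of what a
scan-level size hint could look like. This is purely illustrative: SizedScan and
sizeInBytesForScan are made-up names, not part of the actual data source API;
only Filter below is the existing org.apache.spark.sql.sources.Filter type.

import org.apache.spark.sql.sources.Filter

// Hypothetical trait, sketched for discussion only.
trait SizedScan {
  // None means "no estimate available": Catalyst would then fall back to a
  // conservative default (e.g. never broadcast) instead of trusting a guess.
  def sizeInBytesForScan(
      requiredColumns: Array[String],
      filters: Array[Filter]): Option[Long]
}

A MongoDB/Elasticsearch/Cassandra relation could implement this by issuing a
count or stats query with the pushed-down filters and returning Some(bytes).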


Thanks,
Aniket


On Sat, Feb 7, 2015, 4:50 AM Reynold Xin r...@databricks.com wrote:

 We thought about this today after seeing this email. I actually built a
 patch for this (adding filter/column to data source stat estimation), but
 ultimately dropped it due to the potential problems the change could cause.

 The main problem I see is that column pruning/predicate pushdowns are
 advisory, i.e. the data source might or might not apply those filters.

 Without significantly complicating the data source API, it is hard for the
 optimizer (and future cardinality estimation) to know whether the
 filter/column pushdowns are advisory, and whether to incorporate that in
 cardinality estimation.

 Imagine this scenario: a data source applies a filter and estimates the
 filter's selectivity at 0.1, so the data set is reduced to 10% of its
 size. Catalyst's own cardinality estimation estimates the filter
 selectivity at 0.1 again, and thus the estimated data size is now 1% of the
 original data size, lower than some threshold. Catalyst decides to
 broadcast the table. The actual table size, however, is 10x the estimate.





 On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 Hi Spark SQL committers

 I have started experimenting with data sources API and I was wondering if
 it makes sense to move the method sizeInBytes from BaseRelation to Scan
  interfaces. This is because a relation may be able to leverage filter
  pushdown to estimate size, potentially making a very large relation
 broadcast-able. Thoughts?

 Aniket





Re: Improving metadata in Spark JIRA

2015-02-08 Thread Patrick Wendell
I think we already have a YARN component.

https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20YARN

I don't think JIRA allows it to be mandatory, but if it does, that
would be useful.

On Sat, Feb 7, 2015 at 5:08 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 By the way, isn't it possible to make the Component field mandatory when
 people open new issues? Shouldn't we do that?

 Btw Patrick, don't we need a YARN component? I think our JIRA components
 should roughly match the components on the PR dashboard.

 Nick

 On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell pwend...@gmail.com
 wrote:

 Per Nick's suggestion I added two components:

 1. Spark Submit
 2. Spark Scheduler

 I figured I would just add these since if we decide later we don't
 want them, we can simply merge them into Spark Core.

 On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Do we need some new components to be added to the JIRA project?
 
   Like:
  
   - scheduler
   - YARN
   - spark-submit
   - ...?
 
  Nick
 
 
  On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  +9000 on cleaning up JIRA.
 
  Thank you Sean for laying out some specific things to tackle. I will
  assist with this.
 
  Regarding email, I think Sandy is right. I only get JIRA email for
  issues
  I'm watching.
 
  Nick
 
  On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com
  wrote:
 
  JIRA updates don't go to this list, they go to
  iss...@spark.apache.org.
  I
  don't think many are signed up for that list, and those that are
  probably
  have a flood of emails anyway.
 
  So I'd definitely be in favor of any JIRA cleanup that you're up for.
 
  -Sandy
 
  On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote:
 
   I've wasted no time in wielding the commit bit to complete a number
   of
   small, uncontroversial changes. I wouldn't commit anything that
   didn't
   already appear to have review, consensus and little risk, but please
   let me know if anything looked a little too bold, so I can
   calibrate.
  
  
   Anyway, I'd like to continue some small house-cleaning by improving
   the state of JIRA's metadata, in order to let it give us a little
   clearer view on what's happening in the project:
  
   a. Add Component to every (open) issue that's missing one
   b. Review all Critical / Blocker issues to de-escalate ones that
   seem
   obviously neither
   c. Correct open issues that list a Fix version that has already been
   released
   d. Close all issues Resolved for a release that has already been
  released
  
   The problem with doing so is that it will create a tremendous amount
   of email to the list, like, several hundred. It's possible to make
   bulk changes and suppress e-mail though, which could be done for all
   but b.
  
   Better to suppress the emails when making such changes? or just not
   bother on some of these?
  
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 
 

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Improving metadata in Spark JIRA

2015-02-08 Thread Nicholas Chammas
Oh derp, missed the YARN component.

JIRA does allow admins to make fields mandatory:
https://confluence.atlassian.com/display/JIRA/Specifying+Field+Behavior#SpecifyingFieldBehavior-Makingafieldrequiredoroptional

Nick

On Sat Feb 07 2015 at 5:23:10 PM Patrick Wendell pwend...@gmail.com wrote:

 I think we already have a YARN component.

 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20YARN

 I don't think JIRA allows it to be mandatory, but if it does, that
 would be useful.

 On Sat, Feb 7, 2015 at 5:08 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  By the way, isn't it possible to make the Component field mandatory
 when
  people open new issues? Shouldn't we do that?
 
  Btw Patrick, don't we need a YARN component? I think our JIRA components
  should roughly match the components on the PR dashboard.
 
  Nick
 
  On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell pwend...@gmail.com
  wrote:
 
  Per Nick's suggestion I added two components:
 
  1. Spark Submit
  2. Spark Scheduler
 
  I figured I would just add these since if we decide later we don't
  want them, we can simply merge them into Spark Core.
 
  On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas
  nicholas.cham...@gmail.com wrote:
   Do we need some new components to be added to the JIRA project?
  
   Like:
  
   - scheduler
   - YARN
   - spark-submit
   - ...?
  
   Nick
  
  
   On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas 
   nicholas.cham...@gmail.com wrote:
  
   +9000 on cleaning up JIRA.
  
   Thank you Sean for laying out some specific things to tackle. I will
   assist with this.
  
   Regarding email, I think Sandy is right. I only get JIRA email for
   issues
   I'm watching.
  
   Nick
  
   On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com
 
   wrote:
  
   JIRA updates don't go to this list, they go to
   iss...@spark.apache.org.
   I
   don't think many are signed up for that list, and those that are
   probably
   have a flood of emails anyway.
  
   So I'd definitely be in favor of any JIRA cleanup that you're up
 for.
  
   -Sandy
  
   On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com
 wrote:
  
I've wasted no time in wielding the commit bit to complete a
 number
of
small, uncontroversial changes. I wouldn't commit anything that
didn't
already appear to have review, consensus and little risk, but
 please
let me know if anything looked a little too bold, so I can
calibrate.
   
   
Anyway, I'd like to continue some small house-cleaning by
 improving
the state of JIRA's metadata, in order to let it give us a little
clearer view on what's happening in the project:
   
a. Add Component to every (open) issue that's missing one
b. Review all Critical / Blocker issues to de-escalate ones that
seem
obviously neither
c. Correct open issues that list a Fix version that has already
 been
released
d. Close all issues Resolved for a release that has already been
   released
   
The problem with doing so is that it will create a tremendous
 amount
of email to the list, like, several hundred. It's possible to make
bulk changes and suppress e-mail though, which could be done for
 all
but b.
   
Better to suppress the emails when making such changes? or just
 not
bother on some of these?
   
   

 -
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
   
   
  
  



Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
I would build OpenBLAS yourself, since good BLAS performance comes from
getting cache sizes, etc. set up correctly for your particular hardware -
this is often a very tricky process (see, e.g. ATLAS), but we found that on
relatively modern Xeon chips, OpenBLAS builds quickly and yields
performance competitive with MKL.

To make sure the right library is getting used, you have to make sure it's
first on the search path - export LD_LIBRARY_PATH=/path/to/blas/library.so
will do the trick here.

For some examples of getting netlib-java setup on an ec2 node and some
example benchmarking code we ran a while back, see:
https://github.com/shivaram/matrix-bench

In particular - build-openblas-ec2.sh shows you how to build the library
and set up symlinks correctly, and scala/run-netlib.sh shows you how to get
the path setup and get that library picked up by netlib-java.

In this way - you could probably get cuBLAS set up to be used by
netlib-java as well.
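
As a quick sanity check (nothing Spark-specific, BLAS.getInstance is just the
standard netlib-java entry point), something like this prints which
implementation actually got loaded at runtime:

// Prints the concrete netlib-java backend in use, e.g. NativeSystemBLAS
// (system OpenBLAS/MKL), NativeRefBLAS, or F2jBLAS (the pure-Java fallback).
object BlasCheck {
  def main(args: Array[String]): Unit = {
    println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
    println(com.github.fommil.netlib.LAPACK.getInstance().getClass.getName)
  }
}

If this prints F2jBLAS even though OpenBLAS is installed, the native library
is not being found on the path described above.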

- Evan

On Fri, Feb 6, 2015 at 5:43 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

  Evan, could you elaborate on how to force BIDMat and netlib-java to
 load the right BLAS? For netlib, there are a few JVM flags, such
 as -Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I
 can force it to use the Java implementation. I am not sure how to force it
 to use a specific BLAS (not a specific wrapper for BLAS).



 Btw. I have installed openblas (yum install openblas), so I suppose that
 netlib is using it.



 *From:* Evan R. Sparks [mailto:evan.spa...@gmail.com]
 *Sent:* Friday, February 06, 2015 5:19 PM
 *To:* Ulanov, Alexander
 *Cc:* Joseph Bradley; dev@spark.apache.org

 *Subject:* Re: Using CUDA within Spark / boosting linear algebra



 Getting breeze to pick up the right blas library is critical for
 performance. I recommend using OpenBLAS (or MKL, if you already have it).
 It might make sense to force BIDMat to use the same underlying BLAS library
 as well.



 On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:

 Hi Evan, Joseph

 I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
 than netlib-java+breeze (sorry for the weird table formatting):

 |A*B  size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64|
 Breeze+Netlib-java f2jblas |
 +---+
 |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
 |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
 |1x1*1x1 | 23,78046632 | 445,0935211 | 1569,233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will make tests with Cuda. I need to install new Cuda version for
 this purpose.

 Do you have any ideas why breeze-netlib with native blas is so much slower
 than BIDMat MKL?

 Best regards, Alexander

 From: Joseph Bradley [mailto:jos...@databricks.com]
 Sent: Thursday, February 05, 2015 5:29 PM
 To: Ulanov, Alexander
 Cc: Evan R. Sparks; dev@spark.apache.org

 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Hi Alexander,

 Using GPUs with Spark would be very exciting.  Small comment: Concerning
 your question earlier about keeping data stored on the GPU rather than
 having to move it between main memory and GPU memory on each iteration, I
 would guess this would be critical to getting good performance.  If you
 could do multiple local iterations before aggregating results, then the
 cost of data movement to the GPU could be amortized (and I believe that is
 done in practice).  Having Spark be aware of the GPU and using it as
 another part of memory sounds like a much bigger undertaking.

 Joseph

 On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:
 Thank you for explanation! I’ve watched the BIDMach presentation by John
 Canny and I am really inspired by his talk and comparisons with Spark MLlib.

 I am very interested to find out what will be better within Spark: BIDMat
 or netlib-java with CPU or GPU natives. Could you suggest a fair way to
 benchmark them? Currently I do benchmarks on artificial neural networks in
 batch mode. While it is not a “pure” test of linear algebra, it involves
 some other things that are essential to machine learning.

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, February 05, 2015 1:29 PM
 To: Ulanov, Alexander
 Cc: dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I'd be surprised if BIDMat+OpenBLAS was significantly faster than
 netlib-java+OpenBLAS, but if it is much faster it's probably due to data
 layout and fewer levels of indirection - it's definitely a worthwhile
 experiment to run. The main speedups I've seen from using it come from
 highly optimized GPU code for linear algebra. I know that in the past Canny
 has gone as far as to write custom GPU kernels for performance-critical
 regions of code.[1]

 BIDMach is highly 

[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Patrick Wendell
This vote passes with 5 +1 votes (3 binding) and no 0 or -1 votes.

+1 Votes:
Krishna Sankar
Sean Owen*
Chip Senkbeil
Matei Zaharia*
Patrick Wendell*

0 Votes:
(none)

-1 Votes:
(none)

On Fri, Feb 6, 2015 at 5:12 PM, Patrick Wendell pwend...@gmail.com wrote:
 I'll add a +1 as well.

 On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 +1

 Tested on Mac OS X.

 Matei


 On Feb 2, 2015, at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!

 The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1065/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

 Changes from rc2:
 A single patch fixing a windows issue.

 Please vote on releasing this package as Apache Spark 1.2.1!

 The vote is open until Friday, February 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...

 For a list of fixes in this release, see http://s.apache.org/Mpn.

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Data source API | sizeInBytes should be to *Scan

2015-02-08 Thread Reynold Xin
We thought about this today after seeing this email. I actually built a
patch for this (adding filter/column to data source stat estimation), but
ultimately dropped it due to the potential problems the change could cause.

The main problem I see is that column pruning/predicate pushdowns are
advisory, i.e. the data source might or might not apply those filters.

Without significantly complicating the data source API, it is hard for the
optimizer (and future cardinality estimation) to know whether the
filter/column pushdowns are advisory, and whether to incorporate that in
cardinality estimation.

Imagine this scenario: a data source applies a filter and estimates the
filter's selectivity at 0.1, so the data set is reduced to 10% of its
size. Catalyst's own cardinality estimation estimates the filter
selectivity at 0.1 again, and thus the estimated data size is now 1% of the
original data size, lower than some threshold. Catalyst decides to
broadcast the table. The actual table size, however, is 10x the estimate.
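
To spell out the arithmetic (a rough sketch; the absolute sizes are made up):

// Double-counted selectivity from the scenario above.
val originalSize        = 100L * 1024 * 1024   // say the relation is 100 MB
val sourceSelectivity   = 0.1                  // filter already applied by the data source
val catalystSelectivity = 0.1                  // Catalyst estimates the same filter again
val actualSize       = originalSize * sourceSelectivity    // 10 MB is really scanned
val catalystEstimate = actualSize * catalystSelectivity    // 1 MB is estimated
// If the broadcast threshold sits between 1 MB and 10 MB, Catalyst broadcasts
// a table that is really 10x larger than it thinks.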





On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com
 wrote:

 Hi Spark SQL committers

 I have started experimenting with data sources API and I was wondering if
 it makes sense to move the method sizeInBytes from BaseRelation to Scan
 interfaces. This is because a relation may be able to leverage filter
 pushdown to estimate size, potentially making a very large relation
 broadcast-able. Thoughts?

 Aniket



Re: Improving metadata in Spark JIRA

2015-02-08 Thread Nicholas Chammas
By the way, isn't it possible to make the Component field mandatory when
people open new issues? Shouldn't we do that?

Btw Patrick, don't we need a YARN component? I think our JIRA components
should roughly match the components on the PR dashboard
https://spark-prs.appspot.com/.

Nick

On Fri Feb 06 2015 at 12:25:52 PM Patrick Wendell pwend...@gmail.com
wrote:

 Per Nick's suggestion I added two components:

 1. Spark Submit
 2. Spark Scheduler

 I figured I would just add these since if we decide later we don't
 want them, we can simply merge them into Spark Core.

 On Fri, Feb 6, 2015 at 11:53 AM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
  Do we need some new components to be added to the JIRA project?
 
  Like:
 
  - scheduler
  - YARN
  - spark-submit
  - ...?
 
  Nick
 
 
  On Fri Feb 06 2015 at 10:50:41 AM Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  +9000 on cleaning up JIRA.
 
  Thank you Sean for laying out some specific things to tackle. I will
  assist with this.
 
  Regarding email, I think Sandy is right. I only get JIRA email for
 issues
  I'm watching.
 
  Nick
 
  On Fri Feb 06 2015 at 9:52:58 AM Sandy Ryza sandy.r...@cloudera.com
  wrote:
 
  JIRA updates don't go to this list, they go to iss...@spark.apache.org
 .
  I
  don't think many are signed up for that list, and those that are
 probably
  have a flood of emails anyway.
 
  So I'd definitely be in favor of any JIRA cleanup that you're up for.
 
  -Sandy
 
  On Fri, Feb 6, 2015 at 6:45 AM, Sean Owen so...@cloudera.com wrote:
 
   I've wasted no time in wielding the commit bit to complete a number
 of
   small, uncontroversial changes. I wouldn't commit anything that
 didn't
   already appear to have review, consensus and little risk, but please
   let me know if anything looked a little too bold, so I can calibrate.
  
  
   Anyway, I'd like to continue some small house-cleaning by improving
   the state of JIRA's metadata, in order to let it give us a little
   clearer view on what's happening in the project:
  
   a. Add Component to every (open) issue that's missing one
   b. Review all Critical / Blocker issues to de-escalate ones that seem
   obviously neither
   c. Correct open issues that list a Fix version that has already been
   released
   d. Close all issues Resolved for a release that has already been
  released
  
   The problem with doing so is that it will create a tremendous amount
   of email to the list, like, several hundred. It's possible to make
   bulk changes and suppress e-mail though, which could be done for all
   but b.
  
   Better to suppress the emails when making such changes? or just not
   bother on some of these?
  
   
 -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
  
 
 



Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Patrick Wendell
I'll add a +1 as well.

On Fri, Feb 6, 2015 at 2:38 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 +1

 Tested on Mac OS X.

 Matei


 On Feb 2, 2015, at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote:

 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!

 The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1065/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

 Changes from rc2:
 A single patch fixing a windows issue.

 Please vote on releasing this package as Apache Spark 1.2.1!

 The vote is open until Friday, February 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...

 For a list of fixes in this release, see http://s.apache.org/Mpn.

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Evan R. Sparks
Getting breeze to pick up the right blas library is critical for
performance. I recommend using OpenBLAS (or MKL, if you already have it).
It might make sense to force BIDMat to use the same underlying BLAS library
as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Hi Evan, Joseph

 I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
 than netlib-java+breeze (sorry for the weird table formatting):

 |A*B  size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64|
 Breeze+Netlib-java f2jblas |
 +---+
 |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
 |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
 |1x1*1x1 | 23,78046632 | 445,0935211 | 1569,233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will make tests with Cuda. I need to install new Cuda version for
 this purpose.

 Do you have any ideas why breeze-netlib with native blas is so much slower
 than BIDMat MKL?

 Best regards, Alexander

 From: Joseph Bradley [mailto:jos...@databricks.com]
 Sent: Thursday, February 05, 2015 5:29 PM
 To: Ulanov, Alexander
 Cc: Evan R. Sparks; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Hi Alexander,

 Using GPUs with Spark would be very exciting.  Small comment: Concerning
 your question earlier about keeping data stored on the GPU rather than
 having to move it between main memory and GPU memory on each iteration, I
 would guess this would be critical to getting good performance.  If you
 could do multiple local iterations before aggregating results, then the
 cost of data movement to the GPU could be amortized (and I believe that is
 done in practice).  Having Spark be aware of the GPU and using it as
 another part of memory sounds like a much bigger undertaking.

 Joseph

 On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:
 Thank you for explanation! I’ve watched the BIDMach presentation by John
 Canny and I am really inspired by his talk and comparisons with Spark MLlib.

 I am very interested to find out what will be better within Spark: BIDMat
 or netlib-java with CPU or GPU natives. Could you suggest a fair way to
 benchmark them? Currently I do benchmarks on artificial neural networks in
 batch mode. While it is not a “pure” test of linear algebra, it involves
 some other things that are essential to machine learning.

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, February 05, 2015 1:29 PM
 To: Ulanov, Alexander
 Cc: dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I'd be surprised if BIDMat+OpenBLAS was significantly faster than
 netlib-java+OpenBLAS, but if it is much faster it's probably due to data
 layout and fewer levels of indirection - it's definitely a worthwhile
 experiment to run. The main speedups I've seen from using it come from
 highly optimized GPU code for linear algebra. I know that in the past Canny
 has gone as far as to write custom GPU kernels for performance-critical
 regions of code.[1]

 BIDMach is highly optimized for single node performance or performance on
 small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
 batched in that way) the performance tends to fall off. Canny argues for
 hardware/software codesign and as such prefers machine configurations that
 are quite different than what we find in most commodity cluster nodes -
 e.g. 10 disk channels and 4 GPUs.

 In contrast, MLlib was designed for horizontal scalability on commodity
 clusters and works best on very big datasets - order of terabytes.

 For the most part, these projects developed concurrently to address
 slightly different use cases. That said, there may be bits of BIDMach we
 could repurpose for MLlib - keep in mind we need to be careful about
 maintaining cross-language compatibility for our Java and Python-users,
 though.

 - Evan

 [1] - http://arxiv.org/abs/1409.5402
 [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

 On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:
 Hi Evan,

 Thank you for suggestion! BIDMat seems to have terrific speed. Do you know
 what makes them faster than netlib-java?

 The same group has BIDMach library that implements machine learning. For
 some examples they use Caffe convolutional neural network library owned by
 another group in Berkeley. Could you elaborate on how these all might be
 connected with Spark Mllib? If you take BIDMat for linear algebra why don’t
 you take BIDMach for optimization and learning?

 Best regards, Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, February 05, 2015 12:09 PM
 To: Ulanov, Alexander
 Cc: dev@spark.apache.org
 

Spark SQL Window Functions

2015-02-08 Thread Evan R. Sparks
Currently there's no standard way of handling time series data in Spark. We
were kicking around some ideas in the lab today and one thing that came up
was SQL Window Functions as a way to support them and query over time
series (do things like moving average, etc.)
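
For concreteness, the kind of query we have in mind would look roughly like
this (hypothetical table and column names; standard SQL window syntax as in
Hive or Postgres; assumes a SQLContext named sqlContext is in scope):

// Hypothetical: a 7-row moving average per key over a time series.
val movingAvg = sqlContext.sql("""
  SELECT key, time, value,
         AVG(value) OVER (
           PARTITION BY key
           ORDER BY time
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
         ) AS moving_avg
  FROM events
""")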

These don't seem to be implemented in Spark SQL yet, but there's some
discussion on JIRA (https://issues.apache.org/jira/browse/SPARK-3587)
asking for them, and there have also been a couple of pull requests -
https://github.com/apache/spark/pull/3703 and
https://github.com/apache/spark/pull/2953.

Is any work currently underway here?


Pull Requests on github

2015-02-08 Thread fommil
Hi all,

I'm the author of netlib-java and I noticed that the documentation in MLlib
was out of date and misleading, so I submitted a pull request on github
which will hopefully make things easier for everybody to understand the
benefits of system optimised natives and how to use them :-)

  https://github.com/apache/spark/pull/4448

However, it looks like there are a *lot* of outstanding PRs and that this is
just a mirror repository.

Will somebody please look at my PR and merge into the canonical source (and
let me know)?

Best regards,
Sam



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Pull-Requests-on-github-tp10502.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
Evan, could you elaborate on how to force BIDMat and netlib-java to load
the right BLAS? For netlib, there are a few JVM flags, such as
-Dcom.github.fommil.netlib.BLAS=com.github.fommil.netlib.F2jBLAS, so I can
force it to use the Java implementation. I am not sure how to force it to use a
specific BLAS (not a specific wrapper for BLAS).

Btw. I have installed openblas (yum install openblas), so I suppose that netlib 
is using it.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Friday, February 06, 2015 5:19 PM
To: Ulanov, Alexander
Cc: Joseph Bradley; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Getting breeze to pick up the right blas library is critical for performance. I 
recommend using OpenBLAS (or MKL, if you already have it). It might make sense 
to force BIDMat to use the same underlying BLAS library as well.

On Fri, Feb 6, 2015 at 4:42 PM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Hi Evan, Joseph

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than
netlib-java+breeze (sorry for the weird table formatting):

|A*B  size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64| 
Breeze+Netlib-java f2jblas |
+---+
|100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
|1x1*1x1 | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, 
Scala 2.11.

Later I will make tests with Cuda. I need to install new Cuda version for this 
purpose.

Do you have any ideas why breeze-netlib with native blas is so much slower than 
BIDMat MKL?

Best regards, Alexander

From: Joseph Bradley 
[mailto:jos...@databricks.com]
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting.  Small comment: Concerning your 
question earlier about keeping data stored on the GPU rather than having to 
move it between main memory and GPU memory on each iteration, I would guess 
this would be critical to getting good performance.  If you could do multiple 
local iterations before aggregating results, then the cost of data movement to 
the GPU could be amortized (and I believe that is done in practice).  Having 
Spark be aware of the GPU and using it as another part of memory sounds like a 
much bigger undertaking.

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Thank you for explanation! I’ve watched the BIDMach presentation by John Canny 
and I am really inspired by his talk and comparisons with Spark MLlib.

I am very interested to find out what will be better within Spark: BIDMat or 
netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark 
them? Currently I do benchmarks on artificial neural networks in batch mode. 
While it is not a “pure” test of linear algebra, it involves some other things 
that are essential to machine learning.

From: Evan R. Sparks 
[mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than
netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout 
and fewer levels of indirection - it's definitely a worthwhile experiment to 
run. The main speedups I've seen from using it come from highly optimized GPU 
code for linear algebra. I know that in the past Canny has gone as far as to 
write custom GPU kernels for performance-critical regions of code.[1]

BIDMach is highly optimized for single node performance or performance on small 
clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in 
that way) the performance tends to fall off. Canny argues for hardware/software 
codesign and as such prefers machine configurations that are quite different 
than what we find in most commodity cluster nodes - e.g. 10 disk cahnnels and 4 
GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity 
clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address slightly 
different use cases. That said, there may be bits of BIDMach we could repurpose 
for MLlib - keep in mind we need to be careful about maintaining cross-language 
compatibility for our Java and Python-users, though.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 

Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread Matei Zaharia
+1

Tested on Mac OS X.

Matei


 On Feb 2, 2015, at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Please vote on releasing the following candidate as Apache Spark version 
 1.2.1!
 
 The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97
 
 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3/
 
 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc
 
 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1065/
 
 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/
 
 Changes from rc2:
 A single patch fixing a windows issue.
 
 Please vote on releasing this package as Apache Spark 1.2.1!
 
 The vote is open until Friday, February 06, at 05:00 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.
 
 [ ] +1 Release this package as Apache Spark 1.2.1
 [ ] -1 Do not release this package because ...
 
 For a list of fixes in this release, see http://s.apache.org/Mpn.
 
 To learn more about Apache Spark, please see
 http://spark.apache.org/
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC3)

2015-02-08 Thread WangTaoTheTonic
Should we merge this commit into branch-1.2 too?

https://github.com/apache/spark/commit/2483c1efb6429a7d8a20c96d18ce2fec93a1aff9



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC3-tp10405p10503.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Unit tests

2015-02-08 Thread Patrick Wendell
Hey All,

The tests are in a not-amazing state right now due to a few compounding factors:

1. We've merged a large volume of patches recently.
2. The load on jenkins has been relatively high, exposing races and
other behavior not seen at lower load.

For those not familiar, the main issue is flaky (non-deterministic)
test failures. Right now I'm trying to prioritize keeping the
PullRequestBuilder in good shape, since development will be blocked if it
is down.

For other tests, let's try to keep filing JIRAs when we see issues
and use the flaky-test label (see http://bit.ly/1yRif9S).

I may contact people regarding specific tests. This is a very high
priority to get in good shape. This kind of thing is no one's fault
but just the result of a lot of concurrent development, and everyone
needs to pitch in to get back in a good place.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Ulanov, Alexander
Hi Evan, Joseph

I did a few matrix multiplication tests and BIDMat seems to be ~10x faster than
netlib-java+breeze (sorry for the weird table formatting):

|A*B  size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64| 
Breeze+Netlib-java f2jblas | 
+---+
|100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
|1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
|1x1*1x1 | 23,78046632 | 445,0935211 | 1569,233228 |

Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19 Linux, 
Scala 2.11.
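
For reference, a minimal Breeze timing sketch of this kind of measurement
(illustrative only: arbitrary sizes and warm-up count, not necessarily the
exact harness behind the numbers above):

import breeze.linalg._

// Which BLAS actually backs the multiply depends on what netlib-java
// resolves from the library path (native OpenBLAS/MKL vs. the f2j fallback).
def timeMultiply(n: Int, warmups: Int = 3): Double = {
  val a = DenseMatrix.rand(n, n)
  val b = DenseMatrix.rand(n, n)
  for (_ <- 0 until warmups) a * b        // warm up the JIT and native libraries
  val start = System.nanoTime()
  val c = a * b
  require(c.rows == n)                    // keep the result live
  (System.nanoTime() - start) / 1e9       // seconds
}

Seq(100, 1000).foreach(n => println(s"${n}x$n * ${n}x$n: ${timeMultiply(n)} s"))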

Later I will make tests with Cuda. I need to install new Cuda version for this 
purpose. 

Do you have any ideas why breeze-netlib with native blas is so much slower than 
BIDMat MKL?

Best regards, Alexander

From: Joseph Bradley [mailto:jos...@databricks.com] 
Sent: Thursday, February 05, 2015 5:29 PM
To: Ulanov, Alexander
Cc: Evan R. Sparks; dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

Hi Alexander,

Using GPUs with Spark would be very exciting.  Small comment: Concerning your 
question earlier about keeping data stored on the GPU rather than having to 
move it between main memory and GPU memory on each iteration, I would guess 
this would be critical to getting good performance.  If you could do multiple 
local iterations before aggregating results, then the cost of data movement to 
the GPU could be amortized (and I believe that is done in practice).  Having 
Spark be aware of the GPU and using it as another part of memory sounds like a 
much bigger undertaking.

Joseph

On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com 
wrote:
Thank you for explanation! I’ve watched the BIDMach presentation by John Canny 
and I am really inspired by his talk and comparisons with Spark MLlib.

I am very interested to find out what will be better within Spark: BIDMat or 
netlib-java with CPU or GPU natives. Could you suggest a fair way to benchmark 
them? Currently I do benchmarks on artificial neural networks in batch mode. 
While it is not a “pure” test of linear algebra, it involves some other things 
that are essential to machine learning.

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 1:29 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd be surprised if BIDMat+OpenBLAS was significantly faster than
netlib-java+OpenBLAS, but if it is much faster it's probably due to data layout 
and fewer levels of indirection - it's definitely a worthwhile experiment to 
run. The main speedups I've seen from using it come from highly optimized GPU 
code for linear algebra. I know that in the past Canny has gone as far as to 
write custom GPU kernels for performance-critical regions of code.[1]

BIDMach is highly optimized for single node performance or performance on small 
clusters.[2] Once data doesn't fit easily in GPU memory (or can be batched in 
that way) the performance tends to fall off. Canny argues for hardware/software 
codesign and as such prefers machine configurations that are quite different 
than what we find in most commodity cluster nodes - e.g. 10 disk channels and 4
GPUs.

In contrast, MLlib was designed for horizontal scalability on commodity 
clusters and works best on very big datasets - order of terabytes.

For the most part, these projects developed concurrently to address slightly 
different use cases. That said, there may be bits of BIDMach we could repurpose 
for MLlib - keep in mind we need to be careful about maintaining cross-language 
compatibility for our Java and Python-users, though.

- Evan

[1] - http://arxiv.org/abs/1409.5402
[2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander 
alexander.ula...@hp.com wrote:
Hi Evan,

Thank you for suggestion! BIDMat seems to have terrific speed. Do you know what 
makes them faster than netlib-java?

The same group has BIDMach library that implements machine learning. For some 
examples they use Caffe convolutional neural network library owned by another 
group in Berkeley. Could you elaborate on how these all might be connected with 
Spark Mllib? If you take BIDMat for linear algebra why don’t you take BIDMach 
for optimization and learning?

Best regards, Alexander

From: Evan R. Sparks 
[mailto:evan.spa...@gmail.com]
Sent: Thursday, February 05, 2015 12:09 PM
To: Ulanov, Alexander
Cc: dev@spark.apache.org
Subject: Re: Using CUDA within Spark / boosting linear algebra

I'd expect that we can make GPU-accelerated BLAS faster than CPU blas in many 
cases.

You might consider taking a look at the codepaths that BIDMat 
(https://github.com/BIDData/BIDMat) takes and comparing them to 
netlib-java/breeze. John Canny et. al. have done a bunch of work optimizing 

Re: Using CUDA within Spark / boosting linear algebra

2015-02-08 Thread Nicholas Chammas
Lemme butt in randomly here and say there is an interesting discussion on
this Spark PR https://github.com/apache/spark/pull/4448 about
netlib-java, JBLAS, Breeze, and other things I know nothing of, that y'all
may find interesting. Among the participants is the author of netlib-java.

On Sun Feb 08 2015 at 2:48:19 AM Ulanov, Alexander alexander.ula...@hp.com
wrote:

 Hi Evan, Joseph

 I did a few matrix multiplication tests and BIDMat seems to be ~10x faster
 than netlib-java+breeze (sorry for the weird table formatting):

 |A*B  size | BIDMat MKL | Breeze+Netlib-java native_system_linux_x86-64|
 Breeze+Netlib-java f2jblas |
 +---+
 |100x100*100x100 | 0,00205596 | 0,03810324 | 0,002556 |
 |1000x1000*1000x1000 | 0,018320947 | 0,51803557 |1,638475459 |
 |1x1*1x1 | 23,78046632 | 445,0935211 | 1569,233228 |

 Configuration: Intel(R) Xeon(R) CPU E31240 3.3 GHz, 6GB RAM, Fedora 19
 Linux, Scala 2.11.

 Later I will make tests with Cuda. I need to install new Cuda version for
 this purpose.

 Do you have any ideas why breeze-netlib with native blas is so much slower
 than BIDMat MKL?

 Best regards, Alexander

 From: Joseph Bradley [mailto:jos...@databricks.com]
 Sent: Thursday, February 05, 2015 5:29 PM
 To: Ulanov, Alexander
 Cc: Evan R. Sparks; dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 Hi Alexander,

 Using GPUs with Spark would be very exciting.  Small comment: Concerning
 your question earlier about keeping data stored on the GPU rather than
 having to move it between main memory and GPU memory on each iteration, I
 would guess this would be critical to getting good performance.  If you
 could do multiple local iterations before aggregating results, then the
 cost of data movement to the GPU could be amortized (and I believe that is
 done in practice).  Having Spark be aware of the GPU and using it as
 another part of memory sounds like a much bigger undertaking.

 Joseph

 On Thu, Feb 5, 2015 at 4:59 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:
 Thank you for explanation! I’ve watched the BIDMach presentation by John
 Canny and I am really inspired by his talk and comparisons with Spark MLlib.

 I am very interested to find out what will be better within Spark: BIDMat
 or netlib-java with CPU or GPU natives. Could you suggest a fair way to
 benchmark them? Currently I do benchmarks on artificial neural networks in
 batch mode. While it is not a “pure” test of linear algebra, it involves
 some other things that are essential to machine learning.

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, February 05, 2015 1:29 PM
 To: Ulanov, Alexander
 Cc: dev@spark.apache.org
 Subject: Re: Using CUDA within Spark / boosting linear algebra

 I'd be surprised if BIDMat+OpenBLAS was significantly faster than
 netlib-java+OpenBLAS, but if it is much faster it's probably due to data
 layout and fewer levels of indirection - it's definitely a worthwhile
 experiment to run. The main speedups I've seen from using it come from
 highly optimized GPU code for linear algebra. I know that in the past Canny
 has gone as far as to write custom GPU kernels for performance-critical
 regions of code.[1]

 BIDMach is highly optimized for single node performance or performance on
 small clusters.[2] Once data doesn't fit easily in GPU memory (or can be
 batched in that way) the performance tends to fall off. Canny argues for
 hardware/software codesign and as such prefers machine configurations that
 are quite different than what we find in most commodity cluster nodes -
 e.g. 10 disk channels and 4 GPUs.

 In contrast, MLlib was designed for horizontal scalability on commodity
 clusters and works best on very big datasets - order of terabytes.

 For the most part, these projects developed concurrently to address
 slightly different use cases. That said, there may be bits of BIDMach we
 could repurpose for MLlib - keep in mind we need to be careful about
 maintaining cross-language compatibility for our Java and Python-users,
 though.

 - Evan

 [1] - http://arxiv.org/abs/1409.5402
 [2] - http://eecs.berkeley.edu/~hzhao/papers/BD.pdf

 On Thu, Feb 5, 2015 at 1:00 PM, Ulanov, Alexander alexander.ula...@hp.com
 wrote:
 Hi Evan,

 Thank you for suggestion! BIDMat seems to have terrific speed. Do you know
 what makes them faster than netlib-java?

 The same group has BIDMach library that implements machine learning. For
 some examples they use Caffe convolutional neural network library owned by
 another group in Berkeley. Could you elaborate on how these all might be
 connected with Spark Mllib? If you take BIDMat for linear algebra why don’t
 you take BIDMach for optimization and learning?

 Best regards, Alexander

 From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
 Sent: Thursday, February 05, 2015 12:09 PM
 To: Ulanov, 

Re: Pull Requests on github

2015-02-08 Thread Akhil Das
You can open a JIRA issue pointing to this PR to get it processed faster. :)

Thanks
Best Regards

On Sat, Feb 7, 2015 at 7:07 AM, fommil sam.halli...@gmail.com wrote:

 Hi all,

 I'm the author of netlib-java and I noticed that the documentation in MLlib
 was out of date and misleading, so I submitted a pull request on github
 which will hopefully make things easier for everybody to understand the
 benefits of system optimised natives and how to use them :-)

   https://github.com/apache/spark/pull/4448

 However, it looks like there are a *lot* of outstanding PRs and that this
 is
 just a mirror repository.

 Will somebody please look at my PR and merge into the canonical source (and
 let me know)?

 Best regards,
 Sam



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Pull-Requests-on-github-tp10502.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Spark SQL Window Functions

2015-02-08 Thread Reynold Xin
This is the original ticket:
https://issues.apache.org/jira/browse/SPARK-1442

I believe it will happen, one way or another :)


On Fri, Feb 6, 2015 at 5:29 PM, Evan R. Sparks evan.spa...@gmail.com
wrote:

 Currently there's no standard way of handling time series data in Spark. We
 were kicking around some ideas in the lab today and one thing that came up
 was SQL Window Functions as a way to support them and query over time
 series (do things like moving average, etc.)

 These don't seem to be implemented in Spark SQL yet, but there's some
 discussion on JIRA (https://issues.apache.org/jira/browse/SPARK-3587)
 asking for them, and there have also been a couple of pull requests -
 https://github.com/apache/spark/pull/3703 and
 https://github.com/apache/spark/pull/2953.

 Is any work currently underway here?



Re: Welcoming three new committers

2015-02-08 Thread Likun (Jacky)
Congratulations guys! Keep helping this awesome community.

BR,
Jacky Li

- Sent from Smartisan T1 -

On Feb 4, 2015, at 6:36 AM, Matei Zaharia matei.zaha...@gmail.com wrote:

Hi all,

The PMC recently voted to add three new committers: Cheng Lian, Joseph Bradley 
and Sean Owen. All three have been major contributors to Spark in the past 
year: Cheng on Spark SQL, Joseph on MLlib, and Sean on ML and many pieces 
throughout Spark Core. Join me in welcoming them as committers!

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org