Re: Why spark-submit works with package not with jar

2024-05-06 Thread David Rabinowitz
Hi,

It seems this library is several years old. Have you considered using the
Google provided connector? You can find it in
https://github.com/GoogleCloudDataproc/spark-bigquery-connector

Regards,
David Rabinowitz

On Sun, May 5, 2024 at 6:07 PM Jeff Zhang  wrote:

> Are you sure com.google.api.client.http.HttpRequestInitializer is in
> the spark-bigquery-latest.jar? It may instead be in the transitive dependencies
> of spark-bigquery_2.11.
>
> On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh 
> wrote:
>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand
>> expert opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>>
>> -- Forwarded message -
>> From: Mich Talebzadeh 
>> Date: Tue, 20 Oct 2020 at 16:50
>> Subject: Why spark-submit works with package not with jar
>> To: user @spark 
>>
>>
>> Hi,
>>
>> I have a scenario that I use in Spark submit as follows:
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>
>> As you can see the jar files needed are added.
>>
>>
>> This comes back with the error message below:
>>
>>
>> Creating model test.weights_MODEL
>>
>> java.lang.NoClassDefFoundError:
>> com/google/api/client/http/HttpRequestInitializer
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>
>>   at
>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>
>>   ... 76 elided
>>
>> Caused by: java.lang.ClassNotFoundException:
>> com.google.api.client.http.HttpRequestInitializer
>>
>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>>
>>
>> So there is an issue with finding the class, although the jar file used,
>>
>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>
>> contains it.
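Since a jar is just a zip archive, a claim like this is easy to verify mechanically by listing the jar's entries. A minimal sketch in Python — the demo builds a tiny stand-in jar rather than reading the real files under /home/hduser/jars, so the entry names are illustrative only:

```python
import zipfile

def jar_has_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar (a zip archive) bundles the given class file."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()

# Stand-in jar bundling one class for the demo; a real connector jar
# can be checked for com.google.api.client classes the same way.
with zipfile.ZipFile("demo.jar", "w") as jar:
    jar.writestr("com/samelamin/spark/bigquery/BigQuerySQLContext.class", b"")

print(jar_has_class("demo.jar",
                    "com.samelamin.spark.bigquery.BigQuerySQLContext"))    # True
print(jar_has_class("demo.jar",
                    "com.google.api.client.http.HttpRequestInitializer"))  # False
```

If the failing class turns out to be absent from every jar passed to --jars, the NoClassDefFoundError points at an unshipped transitive dependency rather than a broken jar.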
>>
>>
>> Now if I remove the above jar file and replace it with the same version
>> as a package, it works!
>>
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
>> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>>
>>
>> I have read the write-ups about how packages search the Maven
>> repositories etc., but I am not convinced why using the package should make
>> the difference between failure and success. In other words, when should one
>> use a package rather than a jar?
>>
>>
>> Any ideas will be appreciated.
>>
>>
>> Thanks
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: JDK version support policy?

2023-06-13 Thread David Li
Thanks all for the discussion here. Based on this I think we'll stick with Java 
8 for now and then upgrade to Java 11 around or after Spark 4.

On Thu, Jun 8, 2023, at 07:17, Sean Owen wrote:
> Noted, but for that you'd simply run your app on Java 17. If Spark works, and 
> your app's dependencies work on Java 17 because you compile it for 17 (and 
> jakarta.* classes for example) then there's no issue.
> 
> On Thu, Jun 8, 2023 at 3:13 AM Martin Andersson  
> wrote:
>> There are some reasons to drop Java 11 as well. Java 17 included a large 
>> change, breaking backwards compatibility with their transition from Java EE 
>> to Jakarta EE 
>> <https://blogs.oracle.com/javamagazine/post/transition-from-java-ee-to-jakarta-ee>.
>>  This means that any users using Spark 4.0 together with Spring 6.x or any 
>> recent version of servlet containers such as Tomcat or Jetty will experience 
>> issues. (For security reasons it's beneficial to float your dependencies to 
>> the latest version of these libraries/frameworks)
>> 
>> I'm not explicitly saying Java 11 should be dropped in Spark 4, just thought 
>> I'd bring this issue to your attention.
>> 
>> Best Regards, Martin
>> 
>> 
>> *From:* Jungtaek Lim 
>> *Sent:* Wednesday, June 7, 2023 23:19
>> *To:* Sean Owen 
>> *Cc:* Dongjoon Hyun ; Holden Karau 
>> ; dev 
>> *Subject:* Re: JDK version support policy?
>>  
>> 
>> 
>> 
>> 
>> 
>> +1 to drop Java 8 but +1 to set the lowest support version to Java 11.
>> 
>> Considering the security-updates-only phase, 11 LTS will not be EOLed
>> for a very long time. Unless dropping it is coupled with other deps which
>> require bumping the JDK version (hope someone can bring up a list), it
>> doesn't seem to buy much. And given the strong backward compatibility the
>> JDK provides, that's less likely.
>> 
>> Purely from the project's source-code view, does anyone know how much
>> benefit we can get from picking 17 rather than 11? I lost track, but some
>> of the JDK proposals are mostly catching up with other languages, which
>> doesn't make us particularly happy since Scala has provided those features
>> for years.
>> 
>> On Thu, Jun 8, 2023 at 2:35 AM, Sean Owen wrote:
>>> I also generally perceive that, after Java 9, there is much less breaking 
>>> change. So working on Java 11 probably means it works on 20, or can be 
>>> easily made to without pain. Like I think the tweaks for Java 17 were quite 
>>> small. 
>>> 
>>> Targeting Java >11 excludes Java 11 users and probably wouldn't buy much. 
>>> Keeping the support probably doesn't interfere with working on much newer 
>>> JVMs either. 
>>> 
>>> On Wed, Jun 7, 2023, 12:29 PM Holden Karau  wrote:
>>>> So JDK 11 is still supported in OpenJDK until 2026; I'm not sure we're
>>>> going to see enough folks moving to JRE 17 by the Spark 4 release. Unless
>>>> we have a strong benefit from dropping 11 support, I'd be inclined to keep
>>>> it.
>>>> 
>>>> On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun  wrote:
>>>>> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0, too.
>>>>> 
>>>>> Dongjoon.
>>>>> 
>>>>> On 2023/06/07 02:42:19 yangjie01 wrote:
>>>>> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only 
>>>>> > support Java 17 and the upcoming Java 21.
>>>>> > 
>>>>> > From: Denny Lee 
>>>>> > Date: Wednesday, June 7, 2023 07:10
>>>>> > To: Sean Owen 
>>>>> > Cc: David Li , "dev@spark.apache.org" 
>>>>> > 
>>>>> > Subject: Re: JDK version support policy?
>>>>> > 
>>>>> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the 
>>>>> > fast-paced (positive) updates to Arrow, eh?!
>>>>> > 
>>>>> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen 
>>>>> > mailto:sro...@gmail.com>> wrote:
>>>>> > I haven't followed this discussion closely, but I think we could/should 
>>>>> > drop Java 8 in Spark 4.0, which is up next after 3.5?
>>>>> > 
>>>>> > On Tue, Jun 6, 2023 at 2:44 PM David Li 
>>>>> > mailto:lidav...@apache.org>> wrote:
>>>>

JDK version support policy?

2023-06-06 Thread David Li
Hello Spark developers,

I'm from the Apache Arrow project. We've discussed Java version support [1], 
and crucially, whether to continue supporting Java 8 or not. As Spark is a big 
user of Arrow in Java, I was curious what Spark's policy here was.

If Spark intends to stay on Java 8, for instance, we may also want to stay on 
Java 8 or otherwise provide some supported version of Arrow for Java 8.

We've seen dependencies dropping or planning to drop support. gRPC may drop 
Java 8 at any time [2], possibly this September [3], which may affect Spark 
(due to Spark Connect). And today we saw that Arrow had issues running tests 
with Mockito on Java 20, but we couldn't update Mockito since it had dropped 
Java 8 support. (We pinned the JDK version in that CI pipeline for now.)

So at least, I am curious if Arrow could start the long process of migrating 
Java versions without impacting Spark, or if we should continue to cooperate. 
Arrow Java doesn't see quite so much activity these days, so it's not quite 
critical, but it's possible that these dependency issues will start to affect 
us more soon. And looking forward, Java is working on APIs that should also 
allow us to ditch the --add-opens flag requirement too.

[1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj
[2]: https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md
[3]: https://github.com/grpc/grpc-java/issues/9386

SPARK-22256

2020-12-11 Thread David McWhorter
Hello, my name is David McWhorter and I created a new pull request to address 
the SPARK-22256 ticket at https://github.com/apache/spark/pull/30739. This 
change adds a memory overhead setting for the spark driver running on mesos. 
This is a reopening of a prior pull request that was never merged. I am 
wondering if I could request that someone review the change and merge it if it 
looks acceptable? I'm happy to do any further testing or make further changes 
if they are needed but it looked good from the prior review.

Many thanks,
David McWhorter

Unsubscribe

2020-12-08 Thread David Zhou
Unsubscribe


[DISCUSS] Reducing memory usage of toPandas with Arrow "self_destruct" option

2020-09-10 Thread David Li
Hello all,

We've been working with PySpark and Pandas, and have found that to
convert a dataset using N bytes of memory to Pandas, we need to have
2N bytes free, even with the Arrow optimization enabled. The
fundamental reason is ARROW-3789[1]: Arrow does not free the Arrow
table until conversion finishes, so there are 2 copies of the dataset
in memory.

We'd like to improve this by taking advantage of the Arrow
"self_destruct" option available in Arrow >= 0.16. When converting a
suitable[*] Arrow table to a Pandas dataframe, it avoids the
worst-case 2x memory usage, with something more like ~25% overhead
instead, by freeing the columns in the Arrow table after converting
each column instead of at the end of conversion.

Does this sound like a desirable optimization to have in Spark? If so,
how should it be exposed to users? As discussed below, there are cases
where a user may or may not want it enabled.

Here's a proof-of-concept patch, along with a demonstration, and a
comparison of memory usage (via memory_profiler[2]) with and without
the flag enabled:
https://gist.github.com/lidavidm/289229caa022358432f7deebe26a9bd3

There are some cases where you may _not_ want this optimization,
however, so the patch leaves it as a toggle. Is this the API we'd
want, or would we prefer a different API (e.g. a configuration flag)?

The reason we may not want this enabled by default is that the related
split_blocks option is more likely to find zero-copy opportunities,
which will result in the Pandas dataframe being backed by immutable
buffers. Some Pandas operations will error in these cases, e.g. [3].
Also, to minimize memory usage, we set use_threads=False to convert
each column sequentially, rather than in parallel, but this slows down
the conversion somewhat. One option here may be to set self_destruct
by default, but relegate the other two options (which further save
memory) to a toggle, and I can measure the impact of this if desired.

[1]: https://issues.apache.org/jira/browse/ARROW-3789
[2]: https://github.com/pythonprofilers/memory_profiler
[3]: https://github.com/pandas-dev/pandas/issues/35530
[*] See my comment in https://issues.apache.org/jira/browse/ARROW-9878.

Thanks,
David

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Spark 3.0 and ORC 1.6

2020-01-28 Thread David Christle
Hi all,

I am a heavy user of Spark at LinkedIn, and am excited about the ZStandard 
compression option recently incorporated into ORC 1.6. I would love to explore 
using it for storing/querying of large (>10 TB) tables for my own disk I/O 
intensive workloads, and other users & companies may be interested in adopting 
ZStandard more broadly, since it seems to offer faster compression speeds at 
higher compression ratios with better multi-threaded support than zlib/Snappy. 
At scale, improvements of even ~10% on disk and/or compute, hopefully just from 
setting the “orc.compress” flag to a different value, could translate into 
palpable gains in capacity/cost cluster wide without requiring broad 
engineering migrations. See a somewhat recent FB Engineering blog post on the 
topic for their reported experiences: 
https://engineering.fb.com/core-data/zstandard/

Do we know if ORC 1.6.x will make the cut for Spark 3.0?

A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to 1.5.8, 
but I don’t have a good understanding of how difficult incorporating ORC 1.6.x 
into Spark will be. For instance, in the PRs for enabling Java Zstd in ORC 
(https://github.com/apache/orc/pull/306 & 
https://github.com/apache/orc/pull/412), some additional work/discussion around 
Hadoop shims occurred to maintain compatibility across different versions of 
Hadoop (e.g. 2.7) and aircompressor (a library containing Java implementations 
of various compression codecs, so that dependence on Hadoop 2.9 is not 
required). Again, these may be non-issues, but I wanted to kindle discussion 
around whether this can make the cut for 3.0, since I imagine it’s a major 
upgrade many users will focus on migrating to once released.

Kind regards,
David Christle


Re: [Kubernetes] Resource requests and limits for Driver and Executor Pods

2018-03-30 Thread David Vogelbacher
Thanks for linking that PR Kimoon.


It actually does mostly address the issue I was referring to. As the issue I 
linked in my first email states, one physical cpu might not be enough to 
execute a task in a performant way.

 

So if I set spark.executor.cores=1 and spark.task.cpus=1 , I will get 1 core 
from Kubernetes and execute one task per Executor and run into performance 
problems.

Being able to specify `spark.kubernetes.executor.cores=1.2` would fix the issue 
(1.2 is just an example).
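To make that concrete, a hypothetical submission along the lines discussed here (spark.kubernetes.executor.cores is the property name proposed in PR #20553 and is an assumption; the name that ultimately ships may differ):

```
# One task per executor, but request 1.2 CPUs from Kubernetes so the
# task's extra threads are not starved (fractional request; the limit
# can still be left unlimited).
spark-submit \
  --master k8s://https://<api-server>:<port> \
  --conf spark.executor.cores=1 \
  --conf spark.task.cpus=1 \
  --conf spark.kubernetes.executor.cores=1.2 \
  ...
```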


I am curious as to why you, Yinan, would want to use this property to request 
less than 1 physical cpu (that is how it sounds to me on the PR). 

Do you have testing that indicates that less than 1 physical CPU is enough for 
executing tasks?

 

In the end it boils down to the question proposed by Yinan:

> A relevant question is should Spark on Kubernetes really be opinionated on 
> how to set the cpu request and limit and even try to determine this 
> automatically?

 

And I completely agree with your answer Kimoon, we should provide sensible 
defaults and make it configurable, as Yinan’s PR does. 

The only remaining question would then be what a sensible default for 
spark.kubernetes.executor.cores would be. Seeing that I wanted more than 1 and 
Yinan wants less, leaving it at 1 might be best.

 

Thanks,

David 

 

From: Kimoon Kim <kim...@pepperdata.com>
Date: Friday, March 30, 2018 at 4:28 PM
To: Yinan Li <liyinan...@gmail.com>
Cc: David Vogelbacher <dvogelbac...@palantir.com>, "dev@spark.apache.org" 
<dev@spark.apache.org>
Subject: Re: [Kubernetes] Resource requests and limits for Driver and Executor 
Pods

 

I see. Good to learn the interaction between spark.task.cpus and 
spark.executor.cores. But am I right to say that PR #20553 can still be used as 
an additional knob on top of those two? Say a user wants 1.5 cores per executor 
from Kubernetes, not the rounded-up integer value 2? 

 

> A relevant question is should Spark on Kubernetes really be opinionated on 
> how to set the cpu request and limit and even try to determine this 
> automatically?

 

Personally, I don't see how this can be auto-determined at all. I think the 
best we can do is to come up with sensible default values for the most common 
case, and provide and well-document other knobs for edge cases.


Thanks,

Kimoon

 

On Fri, Mar 30, 2018 at 12:37 PM, Yinan Li <liyinan...@gmail.com> wrote:

PR #20553 [github.com] is more for allowing users to use a fractional value for 
cpu requests. The existing spark.executor.cores is sufficient for specifying 
more than one cpus. 




> One way to solve this could be to request more than 1 core from Kubernetes 
> per task. The exact amount we should request is unclear to me (it largely 
> depends on how many threads actually get spawned for a task).

A good indication is spark.task.cpus, and on average how many tasks are 
expected to run by a single executor at any point in time. If each executor is 
only expected to run one task at most at any point in time, 
spark.executor.cores can be set to be equal to spark.task.cpus.

A relevant question is should Spark on Kubernetes really be opinionated on how 
to set the cpu request and limit and even try to determine this automatically?

 

On Fri, Mar 30, 2018 at 11:40 AM, Kimoon Kim <kim...@pepperdata.com> wrote:

> Instead of requesting `[driver,executor].memory`, we should just request 
> `[driver,executor].memory + [driver,executor].memoryOverhead `. I think this 
> case is a bit clearer than the CPU case, so I went ahead and filed an issue 
> [issues.apache.org] with more details and made a PR [github.com].

I think this suggestion makes sense. 

 

> One way to solve this could be to request more than 1 core from Kubernetes 
> per task. The exact amount we should request is unclear to me (it largely 
> depends on how many threads actually get spawned for a task).

 

I wonder if this is being addressed by PR #20553 [github.com] written by Yinan. 
Yinan? 


Thanks,

Kimoon

 

On Thu, Mar 29, 2018 at 5:14 PM, David Vogelbacher <dvogelbac...@palantir.com> 
wrote:

Hi,

 

At the moment driver and executor pods are created using the following requests 
and limits:

         CPU                                     Memory
Request  [driver,executor].cores                 [driver,executor].memory
Limit    Unlimited (but can be specified         [driver,executor].memory +
         using spark.[driver,executor].cores)    [driver,executor].memoryOverhead

 

Specifying the requests like this leads to problems if the pods only get the 
requested amount of resources and nothing of the optional (limit) resources, as 
it can happen in a fully utilized cluster.

 

For memory:

Let’s say we have a node with 100GiB memory and 5 pods with 20 GiB memory and 5 
GiB memoryOverhead. 

At the beginning all 5 pods use 20 GiB of memory and all is well. If a pod then 
starts using its overhead memory it will get killed, as there is no more memory 
available, even though we told Spark that it can use 25 GiB of memory.

[Kubernetes] Resource requests and limits for Driver and Executor Pods

2018-03-29 Thread David Vogelbacher
Hi,

 

At the moment driver and executor pods are created using the following requests 
and limits:

         CPU                                     Memory
Request  [driver,executor].cores                 [driver,executor].memory
Limit    Unlimited (but can be specified         [driver,executor].memory +
         using spark.[driver,executor].cores)    [driver,executor].memoryOverhead

 

Specifying the requests like this leads to problems if the pods only get the 
requested amount of resources and nothing of the optional (limit) resources, as 
it can happen in a fully utilized cluster.

 

For memory:

Let’s say we have a node with 100GiB memory and 5 pods with 20 GiB memory and 5 
GiB memoryOverhead. 

At the beginning all 5 pods use 20 GiB of memory and all is well. If a pod then 
starts using its overhead memory it will get killed, as there is no more memory 
available, even though we told Spark that it can use 25 GiB of memory.

 

Instead of requesting `[driver,executor].memory`, we should just request 
`[driver,executor].memory + [driver,executor].memoryOverhead `.

I think this case is a bit clearer than the CPU case, so I went ahead and filed 
an issue with more details and made a PR.

 

For CPU:

As it turns out, there can be performance problems if we only have 
`executor.cores` available (which means we have one core per task). This was 
raised here and is the reason that the cpu limit was set to unlimited.

This issue stems from the fact that in general there will be more than one 
thread per task, resulting in performance impacts if there is only one core 
available.

However, I am not sure that just setting the limit to unlimited is the best 
solution because it means that even if the Kubernetes cluster can perfectly 
satisfy the resource requests, performance might be very bad.

 

I think we should guarantee that an executor is able to do its work well 
(without performance issues or getting killed - as could happen in the memory 
case) with the resources it gets guaranteed from Kubernetes.

 

One way to solve this could be to request more than 1 core from Kubernetes per 
task. The exact amount we should request is unclear to me (it largely depends 
on how many threads actually get spawned for a task). 

We would need to find a way to determine this somehow automatically or at least 
come up with a better default value than 1 core per task.

 

Does somebody have ideas or thoughts on how to solve this best?

 

Best,

David





RE: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread David Hodeffi
I am not aware of any problem with that.
Anyway, if you run a Spark application you already have multiple jobs, so it 
makes sense that this is not a problem.

Thanks David.

From: Naveen [mailto:hadoopst...@gmail.com]
Sent: Wednesday, December 21, 2016 9:18 AM
To: dev@spark.apache.org; u...@spark.apache.org
Subject: Launching multiple spark jobs within a main spark job.

Hi Team,

Is it OK to spawn multiple Spark jobs within a main Spark job? My main Spark 
job's driver, which was launched on a YARN cluster, will do some preprocessing 
and, based on that, needs to launch multiple Spark jobs on the YARN cluster. 
Not sure if this is the right pattern.

Please share your thoughts.
The sample code I have is below, for better understanding:
-

object MainSparkJob {

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(...)

    // Fetch from Hive using HiveContext
    // Fetch from HBase

    // Spawn multiple Futures, each launching a child Spark job
    val future1 = Future {
      val sparkJob = new SparkLauncher(...).launch()
      sparkJob.waitFor()
    }

    // Similarly, future2 to futureN.

    future1.onComplete { ... }
  }

} // end of MainSparkJob
--

Confidentiality: This communication and any attachments are intended for the 
above-named persons only and may be confidential and/or legally privileged. Any 
opinions expressed in this communication are not necessarily those of NICE 
Actimize. If this communication has come to you in error you must take no 
action based on it, nor must you copy or show it to anyone; please 
delete/destroy and inform the sender by e-mail immediately.  
Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
Viruses: Although we have taken steps toward ensuring that this e-mail and 
attachments are free from any virus, we advise that in keeping with good 
computing practice the recipient should ensure they are actually virus free.


Re: SPARK-13843 and future of streaming backends

2016-03-25 Thread David Nalley

> As far as group / artifact name compatibility, at least in the case of
> Kafka we need different artifact names anyway, and people are going to
> have to make changes to their build files for spark 2.0 anyway.   As
> far as keeping the actual classes in org.apache.spark to not break
> code despite the group name being different, I don't know whether that
> would be enforced by maven central, just looked at as poor taste, or
> ASF suing for trademark violation :)


Sonatype has strict instructions to only permit org.apache.* artifacts to 
originate from repository.apache.org. Exceptions to that must be approved by 
the VP, Infrastructure. 
--
Sent via Pony Mail for dev@spark.apache.org. 




Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread David Russell
Hi Ben,

> My company uses Lambda to do simple data moving and processing using Python
> scripts. I can see using Spark instead for the data processing would make it
> into a real production level platform.

That may be true. Spark has first class support for Python which
should make your life easier if you do go this route. Once you've
fleshed out your ideas I'm sure folks on this mailing list can provide
helpful guidance based on their real world experience with Spark.

> Does this pave the way into replacing
> the need of a pre-instantiated cluster in AWS or bought hardware in a
> datacenter?

In a word, no. SAMBA is designed to extend-not-replace the traditional
Spark computation and deployment model. At its most basic, the
traditional Spark computation model distributes data and computations
across worker nodes in the cluster.

SAMBA simply allows some of those computations to be performed by AWS
Lambda rather than locally on your worker nodes. There are I believe a
number of potential benefits to using SAMBA in some circumstances:

1. It can help reduce some of the workload on your Spark cluster by
moving that workload onto AWS Lambda, an infrastructure on-demand
compute service.

2. It allows Spark applications written in Java or Scala to make use
of libraries and features offered by Python and JavaScript (Node.js)
today, and potentially, more libraries and features offered by
additional languages in the future as AWS Lambda language support
evolves.

3. It provides a simple, clean API for integration with REST APIs that
may be a benefit to Spark applications that form part of a broader
data pipeline or solution.

> If so, then this would be a great efficiency and make an easier
> entry point for Spark usage. I hope the vision is to get rid of all cluster
> management when using Spark.

You might find one of the hosted Spark platform solutions such as
Databricks or Amazon EMR that handle cluster management for you a good
place to start. At least in my experience, they got me up and running
without difficulty.

David




[ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-01 Thread David Russell
Hi all,

Just sharing news of the release of a newly available Spark package, SAMBA
<https://github.com/onetapbeyond/lambda-spark-executor>.

https://github.com/onetapbeyond/lambda-spark-executor

SAMBA is an Apache Spark package offering seamless integration with the AWS
Lambda <https://aws.amazon.com/lambda/> compute service for Spark batch and
streaming applications on the JVM.

Within traditional Spark deployments RDD tasks are executed using fixed
compute resources on worker nodes within the Spark cluster. With SAMBA,
application developers can delegate selected RDD tasks to execute using
on-demand AWS Lambda compute infrastructure in the cloud.

Not unlike the recently released ROSE
<https://github.com/onetapbeyond/opencpu-spark-executor> package that
extends the capabilities of traditional Spark applications with support for
CRAN R analytics, SAMBA provides another (hopefully) useful extension for
Spark application developers on the JVM.

SAMBA Spark Package: https://github.com/onetapbeyond/lambda-spark-executor
<https://github.com/onetapbeyond/lambda-spark-executor>
ROSE Spark Package: https://github.com/onetapbeyond/opencpu-spark-executor
<https://github.com/onetapbeyond/opencpu-spark-executor>

Questions, suggestions, feedback welcome.

David

-- 
"All that is gold does not glitter, not all those who wander are lost."


Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread David Russell
Hi Richard,

Thanks for providing the background on your application.

> the user types or copy-pastes his R code,
> the system should then send this R code (using ROSE) to R

Unfortunately this type of ad hoc R analysis is not supported. ROSE supports 
the execution of any R function or script within an existing R package on CRAN, 
Bioc, or github. It does not support the direct execution of arbitrary blocks 
of R code as you described.

You may want to look at [DeployR](http://deployr.revolutionanalytics.com/), 
it's an open source R integration server that provides APIs in Java, JavaScript 
and .NET that can easily support your use case. The outputs of your DeployR 
integration could then become inputs to your data processing system.

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 13 2016 3:18 am
UTC Time: January 13 2016 8:18 am
From: rsiebel...@gmail.com
To: themarchoffo...@protonmail.com
CC: 
m...@vijaykiran.com,cjno...@gmail.com,u...@spark.apache.org,dev@spark.apache.org


Hi David,

the use case is that we're building a data processing system with an intuitive 
user interface where Spark is used as the data processing framework.
We would like to provide a HTML user interface to R where the user types or 
copy-pastes his R code, the system should then send this R code (using ROSE) to 
R, process it and give the results back to the user. The RDD would be used so 
that the data can be further processed by the system but we would like to also 
show or be able to show the messages printed to STDOUT and also the images 
(plots) that are generated by R. The plots seem to be available in the OpenCPU 
API, see below

[Inline image 1]

So the case is not that we're trying to process millions of images but rather 
that we would like to show the generated plots (like a regression plot) that's 
generated in R to the user. There could be several plots generated by the code, 
but certainly not thousands or even hundreds, only a few.

Hope that this would be possible using ROSE because it seems a really good fit,
thanks in advance,
Richard



On Wed, Jan 13, 2016 at 3:39 AM, David Russell <themarchoffo...@protonmail.com> 
wrote:

Hi Richard,


> Would it be possible to access the session API from within ROSE,
> to get for example the images that are generated by R / openCPU

Technically it would be possible although there would be some potentially 
significant runtime costs per task in doing so, primarily those related to 
extracting image data from the R session, serializing and then moving that data 
across the cluster for each and every image.

From a design perspective ROSE was intended to be used within Spark scale 
applications where R object data was seen as the primary task output. An output 
in a format that could be rapidly serialized and easily processed. Are there 
real world use cases where Spark scale applications capable of generating 10k, 
100k, or even millions of image files would actually need to capture and store 
images? If so, how practically speaking, would these images ever be used? I'm 
just not sure. Maybe you could describe your own use case to provide some 
insights?


> and the logging to stdout that is logged by R?

If you are referring to the R console output (generated within the R session 
during the execution of an OCPUTask) then this data could certainly 
(optionally) be captured and returned on an OCPUResult. Again, can you provide 
any details for how you might use this console output in a real world 
application?

As an aside, for simple standalone Spark applications that will only ever run 
on a single host (no cluster) you could consider using an alternative library 
called fluent-r. This library is also available under my GitHub repo, [see 
here](https://github.com/onetapbeyond/fluent-r). The fluent-r library already 
has support for the retrieval of R objects, R console output and R graphics 
device image/plots. However, it is not as lightweight as ROSE and is not 
designed to work in a clustered environment. ROSE, on the other hand, is 
designed for scale.


David

"All that is gold does not glitter, Not all those who wander are lost."




 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.


Local Time: January 12 2016 6:56 pm
UTC Time: January 12 2016 11:56 pm
From: rsiebel...@gmail.com
To: m...@vijaykiran.com
CC: 
cjno...@gmail.com,themarchoffo...@protonmail.com,u...@spark.apache.org,dev@spark.apache.org



Hi,

this looks great and seems to be very usable.
Would it be possible to access the session API from within ROSE, to get for 
example the images that are generated by R / openCPU and the logging to stdout 
that is logged by R?

thanks in advance,
Richard



On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran <m...@vijaykiran.com> wrote:

I think it would be this: https:/

Re: Eigenvalue solver

2016-01-12 Thread David Hall
(I don't know anything spark specific, so I'm going to treat it like a
Breeze question...)

As I understand it, Spark uses ARPACK via Breeze for SVD, and presumably
the same approach can be used for EVD. Basically, you make a function that
multiplies your "matrix" (which might be represented
implicitly/distributed, whatever) by a breeze.linalg.DenseVector.

This is the Breeze implementation for sparse SVD (which is fully generic
and might be hard to follow if you're not used to Breeze/typeclass-heavy
Scala...)

https://github.com/dlwh/breeze/blob/aa958688c428db581d853fd92eb35e82f80d8b5c/math/src/main/scala/breeze/linalg/functions/svd.scala#L205-L205

The difference between SVD and EVD in ARPACK (to a first approximation) is
that you need to multiply by A.t * A * x for SVD, and just A * x for EVD.

The basic idea is to implement a Breeze UFunc eig.Impl3 implicit following
the svd code (or you could just copy out the body of the function and
specialize it). The signature you're looking to implement is:

implicit def Eig_Sparse_Impl[Mat](implicit mul: OpMulMatrix.Impl2[Mat,
DenseVector[Double], DenseVector[Double]],
  dimImpl: dim.Impl[Mat, (Int, Int)])
  : eig.Impl3[Mat, Int, Double, EigenvalueResult] = {

The type parameters of Impl3 are: the matrix type, the number of
eigenvalues you want, a tolerance, and the result type. If you implement
this signature, then you can call eig on anything that can be multiplied by
a dense vector and that implements dim (to get the number of outputs).

(You'll need to define the EigenvalueResult class to be what you want. I
don't immediately know how to unpack ARPACK's answers, but you might look
at this scipy thing:
https://github.com/thomasnat1/cdcNewsRanker/blob/71b0ff3989d5191dc6a78c40c4a7a9967cbb0e49/venv/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py#L1049
)
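The matrix-free recipe above can be sketched with scipy's ARPACK wrapper (the same arpack.py linked above). This is an illustrative stand-in, not the Breeze code: the solver is handed only a matvec closure, exactly as the eig.Impl3 implicit would be, and never sees the matrix itself.

```python
# Sketch of matrix-free EVD via ARPACK: the solver only sees x -> S @ x.
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

n = 50
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
S = A @ A.T + n * np.eye(n)  # symmetric positive definite test matrix

# The "matrix" could be distributed or implicit; ARPACK only needs matvec.
op = LinearOperator((n, n), matvec=lambda x: S @ x)

vals, vecs = eigsh(op, k=3, which="LM")  # 3 largest-magnitude eigenvalues

# Agreement with a dense eigensolve confirms the matrix-free result.
dense_top3 = np.sort(np.linalg.eigvalsh(S))[-3:]
assert np.allclose(np.sort(vals), dense_top3)
```

For SVD, the same operator would instead compute x -> A.t * (A * x), per the note above.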

I'm happy to help more if you decide to go this route, here, or on the
scala-breeze google group, or on github.

-- David


On Tue, Jan 12, 2016 at 10:28 AM, Lydia Ickler <ickle...@googlemail.com>
wrote:

> Hi,
>
> I wanted to know if there are any implementations yet within the Machine
> Learning Library or generally that can efficiently solve eigenvalue
> problems?
> Or if not do you have suggestions on how to approach a parallel execution
> maybe with BLAS or Breeze?
>
> Thanks in advance!
> Lydia
>
>
> Von meinem iPhone gesendet
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


ROSE: Spark + R on the JVM.

2016-01-12 Thread David
Hi all,

I'd like to share news of the recent release of a new Spark package, 
[ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor).

ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.

The project is available and documented [on 
GitHub](https://github.com/onetapbeyond/opencpu-spark-executor) and I would 
encourage you to [take a 
look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, 
questions etc very welcome.

David

"All that is gold does not glitter, Not all those who wander are lost."

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Corey,

> Would you mind providing a link to the github?

Sure, here is the github link you're looking for:

https://github.com/onetapbeyond/opencpu-spark-executor

David

"All that is gold does not glitter, Not all those who wander are lost."



 Original Message 
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 12 2016 12:32 pm
UTC Time: January 12 2016 5:32 pm
From: cjno...@gmail.com
To: themarchoffo...@protonmail.com
CC: u...@spark.apache.org,dev@spark.apache.org



David,
Thank you very much for announcing this! It looks like it could be very useful. 
Would you mind providing a link to the github?



On Tue, Jan 12, 2016 at 10:03 AM, David <themarchoffo...@protonmail.com> wrote:

Hi all,

I'd like to share news of the recent release of a new Spark package, ROSE.

ROSE is a Scala library offering access to the full scientific computing power 
of the R programming language to Apache Spark batch and streaming applications 
on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is 
designed to let Scala and Java developers use R from Spark.

The project is available and documented on GitHub and I would encourage you to 
take a look. Any feedback, questions etc very welcome.

David

"All that is gold does not glitter, Not all those who wander are lost."

Re: [discuss] dropping Python 2.6 support

2016-01-11 Thread David Chin
FWIW, RHEL 6 still uses Python 2.6, although 2.7.8 and 3.3.2 are available
through Red Hat Software Collections. See:
https://www.softwarecollections.org/en/

I run an academic compute cluster on RHEL 6. We do, however, provide Python
2.7.x and 3.5.x via modulefiles.

On Tue, Jan 5, 2016 at 8:45 AM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> +1
>
> Red Hat supports Python 2.6 on RHEL 5 until 2020
> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>, but
> otherwise yes, Python 2.6 is ancient history and the core Python developers
> stopped supporting it in 2013. RHEL 5 is not a good enough reason to
> continue support for Python 2.6 IMO.
>
> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
> currently do).
>
> Nick
>
> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <allenzhang...@126.com> wrote:
>
>> plus 1,
>>
>> we are currently using python 2.7.2 in production environment.
>>
>>
>>
>>
>>
>> 在 2016-01-05 18:11:45,"Meethu Mathew" <meethu.mat...@flytxt.com> 写道:
>>
>> +1
>> We use Python 2.7
>>
>> Regards,
>>
>> Meethu Mathew
>>
>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>>> Does anybody here care about us dropping support for Python 2.6 in Spark
>>> 2.0?
>>>
>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>>


-- 
David Chin, Ph.D.
david.c...@drexel.eduSr. Systems Administrator, URCF, Drexel U.
http://www.drexel.edu/research/urcf/
https://linuxfollies.blogspot.com/
+1.215.221.4747 (mobile)
https://github.com/prehensilecode


Differing performance in self joins

2015-08-26 Thread David Smith
I've noticed that two queries, which return identical results, have very
different performance. I'd be interested in any hints about how avoid
problems like this.

The DataFrame df contains a string field series and an integer eday, the
number of days since (or before) the 1970-01-01 epoch.

I'm doing some analysis over a sliding date window and, for now, avoiding
UDAFs. I'm therefore using a self join. First, I create 

val laggard = df.withColumnRenamed("series",
"p_series").withColumnRenamed("eday", "p_eday")

Then, the following query runs in 16s:

df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") ===
(laggard("p_eday") + 1))).count

while the following query runs in 4 - 6 minutes:

df.join(laggard, (df("series") === laggard("p_series")) && ((df("eday") -
laggard("p_eday")) === 1)).count

It's worth noting that the series term is necessary to keep the query from
doing a complete cartesian product over the data.

Ideally, I'd like to look at lags of more than one day, but the following is
equally slow:

df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") -
laggard("p_eday")).between(1, 7)).count

Any advice about the general principle at work here would be welcome.
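A hedged guess at the general principle (an assumption, not verified against Spark's planner): a predicate of the form eday === p_eday + 1 is an equality between a column and a computable expression, so it can drive a hash or sort-merge join, whereas (eday - p_eday) === 1 exposes no equality key and must be checked pair by pair. In miniature, outside Spark:

```python
# Toy contrast between an equi-join plan and a predicate-scan plan.
from collections import defaultdict

df_rows = [("s1", 10), ("s1", 11), ("s2", 11)]      # (series, eday)
laggard_rows = [("s1", 9), ("s1", 10), ("s2", 10)]  # (p_series, p_eday)

# Fast plan: build a hash index on (series, eday) and probe with
# (p_series, p_eday + 1) -- O(n + m).
index = defaultdict(list)
for series, eday in df_rows:
    index[(series, eday)].append((series, eday))
hash_join = [(l, r) for l in laggard_rows for r in index[(l[0], l[1] + 1)]]

# Slow plan: no usable equality key on eday, so every pair surviving the
# series match must be tested -- effectively O(n * m).
loop_join = [(l, r) for l in laggard_rows for r in df_rows
             if l[0] == r[0] and r[1] - l[1] == 1]

assert sorted(hash_join) == sorted(loop_join)  # same answer, different cost
```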

Thanks,
David



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Differing-performance-in-self-joins-tp13864.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-18 Thread David Hall
sure.

On Wed, Mar 18, 2015 at 12:19 AM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi David,

 We are stress testing breeze.optimize.proximal and nnls...if you are
 cutting a release now, we will need another release soon once we get the
 runtime optimizations in place and merged to breeze.

 Thanks.
 Deb
  On Mar 15, 2015 9:39 PM, David Hall david.lw.h...@gmail.com wrote:

 snapshot is pushed. If you verify I'll publish the new artifacts.

 On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa 
 yuu.ishikawa+sp...@gmail.com
 wrote:

  David Hall who is a breeze creator told me that it's a bug. So, I made a
  jira
  ticket about this issue. We need to upgrade breeze from 0.11.1 to
 0.11.2 or
  later in order to fix the bug, when the new version of breeze will be
  released.
 
  [SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA
  https://issues.apache.org/jira/browse/SPARK-6341
 
  Thanks,
  Yu Ishikawa
 
 
 
  -
  -- Yu Ishikawa
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11058.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 




Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-17 Thread David Hall
ping?

On Sun, Mar 15, 2015 at 9:38 PM, David Hall david.lw.h...@gmail.com wrote:

 snapshot is pushed. If you verify I'll publish the new artifacts.

 On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com
  wrote:

 David Hall who is a breeze creator told me that it's a bug. So, I made a
 jira
 ticket about this issue. We need to upgrade breeze from 0.11.1 to 0.11.2
 or
 later in order to fix the bug, when the new version of breeze will be
 released.

 [SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA
 https://issues.apache.org/jira/browse/SPARK-6341

 Thanks,
 Yu Ishikawa



 -
 -- Yu Ishikawa
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11058.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org





Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-15 Thread David Hall
snapshot is pushed. If you verify I'll publish the new artifacts.

On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com
wrote:

 David Hall who is a breeze creator told me that it's a bug. So, I made a
 jira
 ticket about this issue. We need to upgrade breeze from 0.11.1 to 0.11.2 or
 later in order to fix the bug, when the new version of breeze will be
 released.

 [SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA
 https://issues.apache.org/jira/browse/SPARK-6341

 Thanks,
 Yu Ishikawa



 -
 -- Yu Ishikawa
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11058.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Implementing TinkerPop on top of GraphX

2015-01-15 Thread David Robinson
I am new to Spark and GraphX, however, I use Tinkerpop backed graphs and
think the idea of using Tinkerpop as the API for GraphX is a great idea and
hope you are still headed in that direction.  I noticed that Tinkerpop 3 is
moving into the Apache family:
http://wiki.apache.org/incubator/TinkerPopProposal  which might alleviate
concerns about having an API definition outside of Spark.

Thanks,




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Implementing-TinkerPop-on-top-of-GraphX-tp9169p10126.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: spark-yarn_2.10 1.2.0 artifacts

2014-12-22 Thread David McWhorter

Thank you, Sean, using spark-network-yarn seems to do the trick.

On 12/19/2014 12:13 PM, Sean Owen wrote:

I believe spark-yarn does not exist from 1.2 onwards. Have a look at
spark-network-yarn for where some of that went, I believe.

On Fri, Dec 19, 2014 at 5:09 PM, David McWhorter mcwhor...@ccri.com wrote:

Hi all,

Thanks for your work on spark!  I am trying to locate spark-yarn jars for
the new 1.2.0 release.  The jars for spark-core, etc, are on maven central,
but the spark-yarn jars are missing.

Confusingly and perhaps relatedly, I also can't seem to get the spark-yarn
artifact to install on my local computer when I run 'mvn -Pyarn -Phadoop-2.2
-Dhadoop.version=2.2.0 -DskipTests clean install'.  At the install plugin
stage, maven reports:

[INFO] --- maven-install-plugin:2.5.1:install (default-install) @
spark-yarn_2.10 ---
[INFO] Skipping artifact installation

Any help or insights into how to use spark-yarn_2.10 1.2.0 in a maven build
would be appreciated.

David

--

David McWhorter
Software Engineer
Commonwealth Computer Research, Inc.
1422 Sachem Place, Unit #1
Charlottesville, VA 22901
mcwhor...@ccri.com | 434.299.0090x204


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



--

David McWhorter
Software Engineer
Commonwealth Computer Research, Inc.
1422 Sachem Place, Unit #1
Charlottesville, VA 22901
mcwhor...@ccri.com | 434.299.0090x204


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



spark-yarn_2.10 1.2.0 artifacts

2014-12-19 Thread David McWhorter

Hi all,

Thanks for your work on spark!  I am trying to locate spark-yarn jars 
for the new 1.2.0 release.  The jars for spark-core, etc, are on maven 
central, but the spark-yarn jars are missing.


Confusingly and perhaps relatedly, I also can't seem to get the 
spark-yarn artifact to install on my local computer when I run 'mvn 
-Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean install'.  
At the install plugin stage, maven reports:


[INFO] --- maven-install-plugin:2.5.1:install (default-install) @ 
spark-yarn_2.10 ---

[INFO] Skipping artifact installation

Any help or insights into how to use spark-yarn_2.10 1.2.0 in a maven 
build would be appreciated.


David

--

David McWhorter
Software Engineer
Commonwealth Computer Research, Inc.
1422 Sachem Place, Unit #1
Charlottesville, VA 22901
mcwhor...@ccri.com | 434.299.0090x204


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: EC2 clusters ready in launch time + 30 seconds

2014-10-06 Thread David Rowe
I agree with this - there is also the issue of different sized masters and
slaves, and numbers of executors for hefty machines (e.g. r3.8xlarges),
tagging of instances and volumes (we use this for cost attribution at my
workplace), and running in VPCs.

I think it might be useful to take a layered approach: the first step
could be getting a good reliable image produced - Nick's ticket - then
doing some work on the launch script.

Regarding the EMR like service - I think I heard that AWS is planning to
add spark support to EMR, but as usual there's nothing firm until it's
released.


On Tue, Oct 7, 2014 at 7:48 AM, Daniil Osipov daniil.osi...@shazam.com
wrote:

 I've also been looking at this. Basically, the Spark EC2 script is
 excellent for small development clusters of several nodes, but isn't
 suitable for production. It handles instance setup in a single threaded
 manner, while it can easily be parallelized. It also doesn't handle failure
 well, ex when an instance fails to start or is taking too long to respond.

 Our desire was to have an equivalent to Amazon EMR[1] API that would
 trigger Spark jobs, including specified cluster setup. I've done some work
 towards that end, and it would benefit from an updated AMI greatly.

 Dan

 [1]

 http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html

 On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com
  wrote:

  Thanks for posting that script, Patrick. It looks like a good place to
  start.
 
  Regarding Docker vs. Packer, as I understand it you can use Packer to
  create Docker containers at the same time as AMIs and other image types.
 
  Nick
 
 
  On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com
  wrote:
 
   Hey All,
  
   Just a couple notes. I recently posted a shell script for creating the
   AMI's from a clean Amazon Linux AMI.
  
   https://github.com/mesos/spark-ec2/blob/v3/create_image.sh
  
   I think I will update the AMI's soon to get the most recent security
   updates. For spark-ec2's purpose this is probably sufficient (we'll
   only need to re-create them every few months).
  
   However, it would be cool if someone wanted to tackle providing a more
   general mechanism for defining Spark-friendly images that can be
   used more generally. I had thought that docker might be a good way to
   go for something like this - but maybe this packer thing is good too.
  
   For one thing, if we had a standard image we could use it to create
   containers for running Spark's unit test, which would be really cool.
   This would help a lot with random issues around port and filesystem
   contention we have for unit tests.
  
   I'm not sure if the long term place for this would be inside the spark
   codebase or a community library or what. But it would definitely be
   very valuable to have if someone wanted to take it on.
  
   - Patrick
  
   On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas
   nicholas.cham...@gmail.com wrote:
FYI: There is an existing issue -- SPARK-3314
https://issues.apache.org/jira/browse/SPARK-3314 -- about
 scripting
   the
creation of Spark AMIs.
   
With Packer, it looks like we may be able to script the creation of
multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from
 a
single Packer template. That's very cool.
   
I'll be looking into this.
   
Nick
   
   
On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas 
   nicholas.cham...@gmail.com
wrote:
   
Thanks for the update, Nate. I'm looking forward to seeing how these
projects turn out.
   
David, Packer looks very, very interesting. I'm gonna look into it
  more
next week.
   
Nick
   
   
On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com
  wrote:
   
Bit of progress on our end, bit of lagging as well.  Our guy
 leading
effort got little bogged down on client project to update hive/sql
   testbed
to latest spark/sparkSQL, also launching public service so we have
   been bit
scattered recently.
   
Will have some more updates probably after next week.  We are
  planning
   on
taking our client work around hive/spark, plus taking over the
 bigtop
automation work to modernize and get that fit for human consumption
   outside
or org.  All our work and puppet modules will be open sourced,
   documented,
hopefully start to rally some other folks around effort that find
 it
   useful
   
Side note, another effort we are looking into is gradle
  tests/support.
We have been leveraging serverspec for some basic infrastructure
   tests, but
with bigtop switching over to gradle builds/testing setup in 0.8 we
   want to
include support for that in our own efforts, probably some stuff
 that
   can
be learned and leveraged in spark world for repeatable/tested
   infrastructure
   
If anyone has any specific automation questions to your environment
  you

Re: Breeze Library usage in Spark

2014-10-03 Thread David Hall
yeah, breeze.storage.Zero was introduced in either 0.8 or 0.9.

On Fri, Oct 3, 2014 at 9:45 AM, Xiangrui Meng men...@gmail.com wrote:

 Did you add a different version of breeze to the classpath? In Spark
 1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze
 version you used is different from the one comes with Spark, you might
 see class not found. -Xiangrui

 On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch learnings.chitt...@gmail.com
 wrote:
  Hi Team,
 
  When I am trying to use DenseMatrix from the breeze library in Spark, it's
  throwing me the following error:
 
  java.lang.NoClassDefFoundError: breeze/storage/Zero
 
 
  Can someone help me on this ?
 
  Thanks,
  Padma Ch
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: BlockManager issues

2014-09-22 Thread David Rowe
I've run into this with large shuffles - I assumed that there was
contention between the shuffle output files and the JVM for memory.
Whenever we start getting these fetch failures, it corresponds with high
load on the machines the blocks are being fetched from, and in some cases
complete unresponsiveness (no ssh etc). Setting the timeout higher, or the
JVM heap lower (as a percentage of total machine memory), seemed to help.



On Mon, Sep 22, 2014 at 8:02 PM, Christoph Sawade 
christoph.saw...@googlemail.com wrote:

 Hey all. We had also the same problem described by Nishkam almost in the
 same big data setting. We fixed the fetch failure by increasing the timeout
 for acks in the driver:

  set("spark.core.connection.ack.wait.timeout", "600") // 10 minutes timeout
  for acks between nodes

 Cheers, Christoph

 2014-09-22 9:24 GMT+02:00 Hortonworks zzh...@hortonworks.com:

  Actually I met similar issue when doing groupByKey and then count if the
  shuffle size is big e.g. 1tb.
 
  Thanks.
 
  Zhan Zhang
 
  Sent from my iPhone
 
   On Sep 21, 2014, at 10:56 PM, Nishkam Ravi nr...@cloudera.com wrote:
  
   Thanks for the quick follow up Reynold and Patrick. Tried a run with
   significantly higher ulimit, doesn't seem to help. The executors have
  35GB
   each. Btw, with a recent version of the branch, the error message is
  fetch
   failures as opposed to too many open files. Not sure if they are
   related.  Please note that the workload runs fine with head set to
  066765d.
   In case you want to reproduce the problem: I'm running slightly
 modified
   ScalaPageRank (with KryoSerializer and persistence level
   memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster.
  
   Thanks,
   Nishkam
  
   On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell pwend...@gmail.com
   wrote:
  
   Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible
   that you are just having more spilling as a result of the patch and so
   the filesystem is opening more files. I would try increasing the
   ulimit.
  
   How much memory do your executors have?
  
   - Patrick
  
   On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell pwend...@gmail.com
 
   wrote:
   Hey the numbers you mentioned don't quite line up - did you mean PR
  2711?
  
   On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin r...@databricks.com
   wrote:
   It seems like you just need to raise the ulimit?
  
  
   On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi nr...@cloudera.com
   wrote:
  
   Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one
 of
   the
   workloads. Tried tracing the problem through change set analysis.
  Looks
   like the offending commit is 4fde28c from Aug 4th for PR1707.
 Please
   see
   SPARK-3633 for more details.
  
   Thanks,
   Nishkam
  
 
  --
  CONFIDENTIALITY NOTICE
  NOTICE: This message is intended for the use of the individual or entity
 to
  which it is addressed and may contain information that is confidential,
  privileged and exempt from disclosure under applicable law. If the reader
  of this message is not the intended recipient, you are hereby notified
 that
  any printing, copying, dissemination, distribution, disclosure or
  forwarding of this communication is strictly prohibited. If you have
  received this communication in error, please contact the sender
 immediately
  and delete it from your system. Thank You.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 



Source code for mining big data with Spark

2014-09-14 Thread David Tung
Hi all,

I watched an impressive Spark demo video by Reynold Xin and Aaron Davidson
on YouTube ( https://www.youtube.com/watch?v=FjhRkfAuU7I ). Can someone
let me know where I can find the source code for the demo? I can't see
the source code in the video clearly.

Thanks in advance

CONFIDENTIALITY CAUTION 
This e-mail and any attachments may be confidential or legally privileged. If 
you received this message in error or are not the intended recipient, you 
should destroy the e-mail message and any attachments or copies, and you are 
prohibited from retaining, distributing, disclosing or using any information 
contained herein. Please inform us of the erroneous delivery by return e-mail. 
Thank you for your cooperation.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Is breeze thread safe in Spark?

2014-09-03 Thread David Hall
mutating operations are not thread safe. Operations that don't mutate
should be thread safe. I can't speak to what Evan said, but I would guess
that the way they're using += should be safe.


On Wed, Sep 3, 2014 at 11:58 AM, RJ Nowling rnowl...@gmail.com wrote:

 David,

 Can you confirm that += is not thread safe but + is?  I'm assuming +
 allocates a new object for the write, while += doesn't.

 Thanks!
 RJ


 On Wed, Sep 3, 2014 at 2:50 PM, David Hall d...@cs.berkeley.edu wrote:

 In general, in Breeze we allocate separate work arrays for each call to
 lapack, so it should be fine. In general concurrent modification isn't
 thread safe of course, but things that ought to be thread safe really
 should be.


 On Wed, Sep 3, 2014 at 10:41 AM, RJ Nowling rnowl...@gmail.com wrote:

  No, it's not in all cases. Since Breeze uses lapack under the hood,
  concurrent changes to memory from different threads are bad.

 There's actually a potential bug in the KMeans code where it uses +=
 instead of +.


 On Wed, Sep 3, 2014 at 1:26 PM, Ulanov, Alexander 
 alexander.ula...@hp.com
 wrote:

  Hi,
 
   Is the breeze library thread safe when called from Spark mllib code, in the
   case when native libs for blas and lapack are used? Might it be an issue
   when running Spark locally?
 
  Best regards, Alexander
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 


 --
 em rnowl...@gmail.com
 c 954.496.2314





 --
 em rnowl...@gmail.com
 c 954.496.2314



Re: Linear CG solver

2014-06-27 Thread David Hall
I have no ideas on benchmarks, but breeze has a CG solver:
https://github.com/scalanlp/breeze/tree/master/math/src/main/scala/breeze/optimize/linear/ConjugateGradient.scala

https://github.com/scalanlp/breeze/blob/e2adad3b885736baf890b306806a56abc77a3ed3/math/src/test/scala/breeze/optimize/linear/ConjugateGradientTest.scala

It's based on the code from TRON, and so I think it's more targeted for
norm-constrained solutions of the CG problem.
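For reference, the plain unconstrained linear CG recurrence that both of the linked solvers build on (a minimal dense sketch, not the TRON-style norm-constrained variant):

```python
# Plain conjugate gradient: solve A x = b for symmetric positive definite A.
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # conjugate direction update
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # SPD test system
b = np.array([1.0, 2.0])
x = cg(A, b)
assert np.allclose(A @ x, b)
```

For kernel SVMs the same loop applies unchanged as long as the operator x -> A @ x is available, which is what makes CG attractive when direct Cholesky solves become too expensive.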








On Fri, Jun 27, 2014 at 5:54 PM, Debasish Das debasish.da...@gmail.com
wrote:

 Hi,

 I am looking for an efficient linear CG to be put inside the Quadratic
 Minimization algorithms we added for Spark mllib.

 With a good linear CG, we should be able to solve kernel SVMs with this
 solver in mllib...

 I use direct solves right now using cholesky decomposition which has higher
 complexity as matrix sizes become large...

 I found out some jblas example code:

 https://github.com/mikiobraun/jblas-examples/blob/master/src/CG.java

 I was wondering if mllib developers have any experience using this solver
 and if this is better than apache commons linear CG ?

 Thanks.
 Deb



Re: mllib vector templates

2014-05-05 Thread David Hall
On Mon, May 5, 2014 at 3:40 PM, DB Tsai dbt...@stanford.edu wrote:

 David,

 Could we use Int, Long, Float as the data feature spaces, and Double for
 optimizer?


Yes. Breeze doesn't allow operations on mixed types, so you'd need to
convert the double vectors to Floats if you wanted, e.g. dot product with
the weights vector.

You might also be interested in FeatureVector, which is just a wrapper
around Array[Int] that emulates an indicator vector. It supports dot
products, axpy, etc.
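The memory savings motivating the quoted request below can be illustrated with numpy dtypes standing in for Scala's Float/Double type parameters (an analogy only):

```python
# float32 feature vectors take half the memory of float64 ones.
import numpy as np

n = 1_000_000
as_double = np.zeros(n, dtype=np.float64)  # Scala Double analogue
as_float = np.zeros(n, dtype=np.float32)   # Scala Float analogue

assert as_double.nbytes == 8 * n
assert as_float.nbytes == 4 * n            # half the memory per value
```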

-- David




 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Mon, May 5, 2014 at 3:06 PM, David Hall d...@cs.berkeley.edu wrote:

  Lbfgs and other optimizers would not work immediately, as they require
  vector spaces over double. Otherwise it should work.
  On May 5, 2014 3:03 PM, DB Tsai dbt...@stanford.edu wrote:
 
   Breeze could take any type (Int, Long, Double, and Float) in the matrix
   template.
  
  
   Sincerely,
  
   DB Tsai
   ---
   My Blog: https://www.dbtsai.com
   LinkedIn: https://www.linkedin.com/in/dbtsai
  
  
   On Mon, May 5, 2014 at 2:56 PM, Debasish Das debasish.da...@gmail.com
   wrote:
  
Is this a breeze issue or breeze can take templates on float /
 double ?
   
If breeze can take templates then it is a minor fix for Vectors.scala
   right
?
   
Thanks.
Deb
   
   
On Mon, May 5, 2014 at 2:45 PM, DB Tsai dbt...@stanford.edu wrote:
   
 +1  Would be nice that we can use different type in Vector.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Mon, May 5, 2014 at 2:41 PM, Debasish Das 
  debasish.da...@gmail.com
 wrote:

  Hi,
 
   Why is the mllib vector using double as the default?
 
   /**
    * Represents a numeric vector, whose index type is Int and value type is
    * Double.
    */
   trait Vector extends Serializable {

     /**
      * Size of the vector.
      */
     def size: Int

     /**
      * Converts the instance to a double array.
      */
     def toArray: Array[Double]
 
   Don't we need a template on float/double? This would give us memory
   savings...
 
  Thanks.
 
  Deb
 

   
  
 

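To make the float/double templating idea concrete, here is a minimal sketch of what a value-type-parameterized vector could look like, using Scala's @specialized annotation to avoid boxing. This is purely illustrative — GenericVector and DenseVec are hypothetical names, not mllib or breeze API:

```scala
// Hypothetical sketch of a vector templated on its value type, using
// @specialized so Float and Double get unboxed primitive implementations.
trait GenericVector[@specialized(Float, Double) T] extends Serializable {
  def size: Int
  def toArray: Array[T]
}

// A Float-backed dense vector holds a primitive float[] under the hood,
// using half the memory of the equivalent Double-backed vector.
class DenseVec[@specialized(Float, Double) T](values: Array[T])
    extends GenericVector[T] {
  def size: Int = values.length
  def toArray: Array[T] = values.clone()
}
```

Usage: `new DenseVec[Float](Array(1.0f, 2.0f))` stores its data as a primitive float array, which is where the memory savings discussed above would come from.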


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-29 Thread David Hall
Yeah, that's probably the easiest though obviously pretty hacky.

I'm surprised that the hessian approximation isn't worse than it is. (As
in, I'd expect error messages.) It's obviously line searching much more, so
the approximation must be worse. You might be interested in this online
bfgs:
http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf

-- David


On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai dbt...@stanford.edu wrote:

 I have a quick hack to understand the behavior of SLBFGS
 (Stochastic-LBFGS): I overwrite the breeze iterations method to get the
 current LBFGS step, ensuring that the objective function is the same during
 the line search step. David, the following is my code; do you have a better
 way to inject into it?

 https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack

 A couple of findings,

 1) miniBatch (using the rdd sample api) for each iteration is slower than full
 data training when the full data is cached, probably because sample is not
 efficient in Spark.

 2) Since in the line search steps, we use the same sample of data (the
 same objective function), the SLBFGS actually converges well.

 3) For the news20 dataset, with 0.05 miniBatch size, it takes 14 SLBFGS steps
 (29 data iterations, 74.5 seconds) to converge to loss < 0.10. For LBFGS
 with full data training, it takes 9 LBFGS steps (12 data iterations, 37.6
 seconds) to converge to loss < 0.10.

 It seems that as long as the noisy gradient happens in different SLBFGS
 steps, it still works.

 (ps, I also tried using a different sample of data in the line search step,
 and it just doesn't work as we expect.)



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Mon, Apr 28, 2014 at 8:55 AM, David Hall d...@cs.berkeley.edu wrote:

 That's right.

 FWIW, caching should be automatic now, but it might be that the version of
 Breeze you're using doesn't do that yet.

 Also, in breeze.util._ there's an implicit that adds a tee method to
 Iterator, and also a last method. Both are useful for things like this.

 -- David


 On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai dbt...@stanford.edu wrote:

 I think I figured it out. Instead of calling minimize and recording the
 loss in the DiffFunction, I should do the following.

 val states = lbfgs.iterations(new CachedDiffFunction(costFun),
   initialWeights.toBreeze.toDenseVector)
 states.foreach(state => lossHistory.append(state.value))

 All the losses in states should be decreasing now. Am I right?



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai dbt...@stanford.edu wrote:

 Also, how many rejected steps will terminate the optimization
 process? How is that related to numberOfImprovementFailures?

 Thanks.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai dbt...@stanford.edu wrote:

 Hi David,

 I'm recording the loss history in the DiffFunction implementation, and
 that's why the rejected step is also recorded in my loss history.

 Is there any api in Breeze LBFGS to get the history which already
 excludes the rejected steps? Or should I just call the iterations method and
 check iteratingShouldStop instead?

 Thanks.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, Apr 25, 2014 at 3:10 PM, David Hall d...@cs.berkeley.eduwrote:

 LBFGS will not take a step that sends the objective value up. It
 might try a step that is too big and reject it, so if you're just 
 logging
 everything that gets tried by LBFGS, you could see that. The iterations
 method of the minimizer should never return an increasing objective 
 value.
 If you're regularizing, are you including the regularizer in the 
 objective
 value computation?

 GD is almost never worth your time.

 -- David

 On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai dbt...@stanford.edu wrote:

 Another interesting benchmark.

 *News20 dataset - 0.14M row, 1,355,191 features, 0.034% non-zero
 elements.*

 LBFGS converges in 70 seconds, while GD seems to be not progressing.

 Dense feature vectors would be too big to fit in memory, so I only
 conducted the sparse benchmark.

 I saw that sometimes the loss bumps up, which is weird to me. Since
 the cost function of logistic regression is convex, it should be
 monotonically decreasing.  David, any suggestion?

 The detail figure:

 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf


 *Rcv1 dataset - 6.8M row, 677,399 features, 0.15% non-zero elements.*

 LBFGS converges in 25 seconds, while GD also seems to be not
 progressing

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-28 Thread David Hall
That's right.

 FWIW, caching should be automatic now, but it might be that the version of
 Breeze you're using doesn't do that yet.

 Also, in breeze.util._ there's an implicit that adds a tee method to
 Iterator, and also a last method. Both are useful for things like this.

-- David

On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai dbt...@stanford.edu wrote:

  I think I figured it out. Instead of calling minimize and recording the loss
  in the DiffFunction, I should do the following.

  val states = lbfgs.iterations(new CachedDiffFunction(costFun),
    initialWeights.toBreeze.toDenseVector)
  states.foreach(state => lossHistory.append(state.value))

 All the losses in states should be decreasing now. Am I right?



 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai dbt...@stanford.edu wrote:

  Also, how many rejected steps will terminate the optimization
  process? How is that related to numberOfImprovementFailures?

 Thanks.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai dbt...@stanford.edu wrote:

 Hi David,

 I'm recording the loss history in the DiffFunction implementation, and
 that's why the rejected step is also recorded in my loss history.

  Is there any api in Breeze LBFGS to get the history which already
  excludes the rejected steps? Or should I just call the iterations method and
  check iteratingShouldStop instead?

 Thanks.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, Apr 25, 2014 at 3:10 PM, David Hall d...@cs.berkeley.eduwrote:

 LBFGS will not take a step that sends the objective value up. It might
 try a step that is too big and reject it, so if you're just logging
 everything that gets tried by LBFGS, you could see that. The iterations
 method of the minimizer should never return an increasing objective value.
 If you're regularizing, are you including the regularizer in the objective
 value computation?

 GD is almost never worth your time.

 -- David

 On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai dbt...@stanford.edu wrote:

 Another interesting benchmark.

 *News20 dataset - 0.14M row, 1,355,191 features, 0.034% non-zero
 elements.*

 LBFGS converges in 70 seconds, while GD seems to be not progressing.

  Dense feature vectors would be too big to fit in memory, so I only
  conducted the sparse benchmark.

  I saw that sometimes the loss bumps up, which is weird to me. Since
  the cost function of logistic regression is convex, it should be
  monotonically decreasing.  David, any suggestion?

 The detail figure:

 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf


 *Rcv1 dataset - 6.8M row, 677,399 features, 0.15% non-zero elements.*

 LBFGS converges in 25 seconds, while GD also seems to be not
 progressing.

  I only conducted the sparse benchmark for the same reason. I also saw the
  loss bump up for an unknown reason.

 The detail figure:

 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai dbt...@stanford.edu wrote:

 rcv1.binary is too sparse (0.15% non-zero elements), so dense format
 will not run due to out of memory. But sparse format runs really well.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai dbt...@stanford.edu wrote:

 I'm doing the timer in runMiniBatchSGD after  val numExamples =
 data.count()

 See the following. Running rcv1 dataset now, and will update soon.

  val startTime = System.nanoTime()
  for (i <- 1 to numIterations) {
    // Sample a subset (fraction miniBatchFraction) of the total data
    // compute and sum up the subgradients on this subset (this is one map-reduce)
    val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
      .aggregate((BDV.zeros[Double](weights.size), 0.0))(
        seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
          val l = gradient.compute(features, label, weights,
            Vectors.fromBreeze(grad))
          (grad, loss + l)
        },
        combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
          (grad1 += grad2, loss1 + loss2)
        })

    /**
     * NOTE(Xinghao): lossSum is computed using the weights from
     * the previous iteration
     * and regVal
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-25 Thread David Hall
LBFGS will not take a step that sends the objective value up. It might try
a step that is too big and reject it, so if you're just logging
everything that gets tried by LBFGS, you could see that. The iterations
method of the minimizer should never return an increasing objective value.
If you're regularizing, are you including the regularizer in the objective
value computation?

GD is almost never worth your time.

-- David

On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai dbt...@stanford.edu wrote:

 Another interesting benchmark.

 *News20 dataset - 0.14M row, 1,355,191 features, 0.034% non-zero elements.*

 LBFGS converges in 70 seconds, while GD seems to be not progressing.

  Dense feature vectors would be too big to fit in memory, so I only conducted
  the sparse benchmark.

  I saw that sometimes the loss bumps up, which is weird to me. Since the
  cost function of logistic regression is convex, it should be monotonically
  decreasing.  David, any suggestion?

 The detail figure:

 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf


 *Rcv1 dataset - 6.8M row, 677,399 features, 0.15% non-zero elements.*

 LBFGS converges in 25 seconds, while GD also seems to be not progressing.

  I only conducted the sparse benchmark for the same reason. I also saw the
  loss bump up for an unknown reason.

 The detail figure:

 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai dbt...@stanford.edu wrote:

 rcv1.binary is too sparse (0.15% non-zero elements), so dense format
 will not run due to out of memory. But sparse format runs really well.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai dbt...@stanford.edu wrote:

 I'm doing the timer in runMiniBatchSGD after  val numExamples =
 data.count()

 See the following. Running rcv1 dataset now, and will update soon.

  val startTime = System.nanoTime()
  for (i <- 1 to numIterations) {
    // Sample a subset (fraction miniBatchFraction) of the total data
    // compute and sum up the subgradients on this subset (this is one map-reduce)
    val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
      .aggregate((BDV.zeros[Double](weights.size), 0.0))(
        seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
          val l = gradient.compute(features, label, weights,
            Vectors.fromBreeze(grad))
          (grad, loss + l)
        },
        combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
          (grad1 += grad2, loss1 + loss2)
        })

    /**
     * NOTE(Xinghao): lossSum is computed using the weights from the previous
     * iteration and regVal is the regularization value computed in the
     * previous iteration as well.
     */
    stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
    val update = updater.compute(
      weights, Vectors.fromBreeze(gradientSum / miniBatchSize),
      stepSize, i, regParam)
    weights = update._1
    regVal = update._2
    timeStamp.append(System.nanoTime() - startTime)
  }






 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng men...@gmail.com wrote:

 I don't understand why sparse falls behind dense so much at the very
 first iteration. I didn't see count() is called in

 https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
 . Maybe you have local uncommitted changes.

 Best,
 Xiangrui

 On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai dbt...@stanford.edu wrote:
  Hi Xiangrui,
 
  Yes, I'm using yarn-cluster mode, and I did check # of executors I
 specified
  are the same as the actual running executors.
 
   For caching and materialization, I have the timer in the optimizer after
   calling count(); as a result, the time for materialization in cache isn't
   in the benchmark.
 
  The difference you saw is actually from dense feature or sparse
 feature
  vector. For LBFGS and GD dense feature, you can see the first
 iteration
  takes the same time. It's true for GD.
 
  I'm going to run rcv1.binary which only has 0.15% non-zero elements to
  verify the hypothesis.
 
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng men...@gmail.com
 wrote

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
On Wed, Apr 23, 2014 at 9:30 PM, Evan Sparks evan.spa...@gmail.com wrote:

 What is the number of non zeroes per row (and number of features) in the
 sparse case? We've hit some issues with breeze sparse support in the past
 but for sufficiently sparse data it's still pretty good.


Any chance you remember what the problems were? I'm sure it could be
better, but it's good to know where improvements need to happen.

-- David



  On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu wrote:
 
  Hi all,
 
   I'm benchmarking Logistic Regression in MLlib using the newly added
  optimizers LBFGS and GD. I'm using the same dataset and the same methodology
  as in this paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
  
   I want to know how Spark scales while adding workers, and how optimizers
  and input format (sparse or dense) impact performance.
 
  The benchmark code can be found here,
 https://github.com/dbtsai/spark-lbfgs-benchmark
 
  The first dataset I benchmarked is a9a which only has 2.2MB. I
 duplicated the dataset, and made it 762MB to have 11M rows. This dataset
 has 123 features and 11% of the data are non-zero elements.
 
  In this benchmark, all the dataset is cached in memory.
 
  As we expect, LBFGS converges faster than GD, and at some point, no
 matter how we push GD, it will converge slower and slower.
 
   However, it's surprising that the sparse format runs slower than the dense
  format. I did see that the sparse format takes a significantly smaller amount
  of memory in caching the RDD, but sparse is 40% slower than dense. I think
  sparse should be fast, since when we compute x * wT, x is sparse, so we can
  do it faster. I wonder if there is anything I'm doing wrong.
 
  The attachment is the benchmark result.
 
  Thanks.
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai



Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
Was the weight vector sparse? The gradients? Or just the feature vectors?


On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai dbt...@dbtsai.com wrote:

 The figure showing the Log-Likelihood vs Time can be found here.


 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf

 Let me know if you can not open it.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman 
 shiva...@eecs.berkeley.edu wrote:

  I don't think the attachment came through in the list. Could you upload
  the results somewhere and link to them ?
 
 
  On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai dbt...@dbtsai.com wrote:
 
   123 features per row, and on average, 89% are zeros.
  On Apr 23, 2014 9:31 PM, Evan Sparks evan.spa...@gmail.com wrote:
 
   What is the number of non zeroes per row (and number of features) in
 the
   sparse case? We've hit some issues with breeze sparse support in the
  past
   but for sufficiently sparse data it's still pretty good.
  
On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu wrote:
   
Hi all,
   
I'm benchmarking Logistic Regression in MLlib using the newly added
   optimizer LBFGS and GD. I'm using the same dataset and the same
  methodology
   in this paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
   
 I want to know how Spark scales while adding workers, and how optimizers
    and input format (sparse or dense) impact performance.
   
The benchmark code can be found here,
   https://github.com/dbtsai/spark-lbfgs-benchmark
   
The first dataset I benchmarked is a9a which only has 2.2MB. I
   duplicated the dataset, and made it 762MB to have 11M rows. This
 dataset
   has 123 features and 11% of the data are non-zero elements.
   
In this benchmark, all the dataset is cached in memory.
   
As we expect, LBFGS converges faster than GD, and at some point, no
   matter how we push GD, it will converge slower and slower.
   
 However, it's surprising that the sparse format runs slower than the dense
    format. I did see that the sparse format takes a significantly smaller
    amount of memory in caching the RDD, but sparse is 40% slower than dense.
    I think sparse should be fast, since when we compute x * wT, x is sparse,
    so we can do it faster. I wonder if there is anything I'm doing wrong.
   
The attachment is the benchmark result.
   
Thanks.
   
Sincerely,
   
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
  
 
 
 



Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
On Wed, Apr 23, 2014 at 10:18 PM, DB Tsai dbt...@dbtsai.com wrote:

 ps, it doesn't make sense to have the weight and gradient sparse unless
 there is a strong L1 penalty.


Sure, I was just checking the obvious things. Have you run it through a
profiler to see where the problem is?




 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Wed, Apr 23, 2014 at 10:17 PM, DB Tsai dbt...@dbtsai.com wrote:
  In mllib, the weight, and gradient are dense. Only feature is sparse.
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Wed, Apr 23, 2014 at 10:16 PM, David Hall d...@cs.berkeley.edu
 wrote:
  Was the weight vector sparse? The gradients? Or just the feature
 vectors?
 
 
  On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai dbt...@dbtsai.com wrote:
 
  The figure showing the Log-Likelihood vs Time can be found here.
 
 
 
 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
 
  Let me know if you can not open it.
 
  Sincerely,
 
  DB Tsai
  ---
  My Blog: https://www.dbtsai.com
  LinkedIn: https://www.linkedin.com/in/dbtsai
 
 
  On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman 
  shiva...@eecs.berkeley.edu wrote:
 
   I don't think the attachment came through in the list. Could you
 upload
   the results somewhere and link to them ?
  
  
   On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai dbt...@dbtsai.com wrote:
  
    123 features per row, and on average, 89% are zeros.
   On Apr 23, 2014 9:31 PM, Evan Sparks evan.spa...@gmail.com
 wrote:
  
What is the number of non zeroes per row (and number of features)
 in
the
sparse case? We've hit some issues with breeze sparse support in
 the
   past
but for sufficiently sparse data it's still pretty good.
   
 On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu
 wrote:

 Hi all,

 I'm benchmarking Logistic Regression in MLlib using the newly
 added
optimizer LBFGS and GD. I'm using the same dataset and the same
   methodology
in this paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf

  I want to know how Spark scales while adding workers, and how
    optimizers and input format (sparse or dense) impact performance.

 The benchmark code can be found here,
https://github.com/dbtsai/spark-lbfgs-benchmark

 The first dataset I benchmarked is a9a which only has 2.2MB. I
duplicated the dataset, and made it 762MB to have 11M rows. This
dataset
has 123 features and 11% of the data are non-zero elements.

 In this benchmark, all the dataset is cached in memory.

 As we expect, LBFGS converges faster than GD, and at some
 point, no
matter how we push GD, it will converge slower and slower.

  However, it's surprising that the sparse format runs slower than the
 dense format. I did see that the sparse format takes a significantly
 smaller amount of memory in caching the RDD, but sparse is 40% slower
 than dense. I think sparse should be fast, since when we compute x * wT,
 x is sparse, so we can do it faster. I wonder if there is anything I'm
 doing wrong.

 The attachment is the benchmark result.

 Thanks.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai
   
  
  
  
 
 



Re: RFC: varargs in Logging.scala?

2014-04-11 Thread David Hall
Another usage that's nice is:

logDebug {
  val timeS = timeMillis / 1000.0
  s"Time: $timeS"
}

which can be useful for more complicated expressions.


On Thu, Apr 10, 2014 at 5:55 PM, Michael Armbrust mich...@databricks.comwrote:

 BTW...

  You can do calculations in string interpolation:
  s"Time: ${timeMillis / 1000}s"

  Or use format strings.
  f"Float with two decimal places: $floatValue%.2f"

 More info:
 http://docs.scala-lang.org/overviews/core/string-interpolation.html


 On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust mich...@databricks.com
 wrote:

  Hi Marcelo,
 
  Thanks for bringing this up here, as this has been a topic of debate
  recently.  Some thoughts below.
 
   ... all of them suffer from the fact that the log message needs to be
   built even though it might not be used.
 
 
  This is not true of the current implementation (and this is actually why
  Spark has a logging trait instead of just using a logger directly.)
 
  If you look at the original function signatures:
 
   protected def logDebug(msg: => String) ...
 
 
   The => implies that we are passing the msg by name instead of by value.
   Under the covers, scala is creating a closure that can be used to
   calculate the log message, only if it's actually required.  This does
   result in a significant performance improvement, but still requires
   allocating an object for the closure.  The bytecode is really something
   like this:
 
   val logMessage = new Function0[String]() {
     def apply() = "Log message " + someExpensiveComputation()
   }
   log.debug(logMessage)
 
 
  In Catalyst and Spark SQL we are using the scala-logging package, which
  uses macros to automatically rewrite all of your log statements.
 
   You write: logger.debug(s"Log message $someExpensiveComputation")
 
  You get:
 
   if (logger.debugEnabled) {
     val logMsg = "Log message " + someExpensiveComputation()
     logger.debug(logMsg)
   }
 
  IMHO, this is the cleanest option (and is supported by Typesafe).  Based
  on a micro-benchmark, it is also the fastest:
 
  std logging: 19885.48ms
  spark logging 914.408ms
  scala logging 729.779ms
 
  Once the dust settles from the 1.0 release, I'd be in favor of
  standardizing on scala-logging.
 
  Michael
 

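The by-name mechanism Michael describes is easy to demonstrate in plain Scala. The sketch below uses a hypothetical Log class, not Spark's actual Logging trait, to show that the message expression is never evaluated when the level is disabled:

```scala
// Minimal sketch of Spark-style lazy logging via a by-name parameter.
// `Log` is a hypothetical stand-in for the real Logging trait.
class Log(enabled: Boolean) {
  // msg: => String is passed by name: the string is only built if needed.
  def logDebug(msg: => String): Unit = if (enabled) println(msg)
}

var evaluations = 0
def expensive(): String = { evaluations += 1; "details" }

val quiet = new Log(enabled = false)
quiet.logDebug("computed: " + expensive()) // closure allocated, never invoked

val chatty = new Log(enabled = true)
chatty.logDebug("computed: " + expensive()) // evaluated and printed once
```

After both calls, expensive() has run exactly once: the disabled logger allocated a closure but never invoked it, which is the trade-off against the macro approach discussed above.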


Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-30 Thread David Hall
On Sun, Mar 30, 2014 at 2:01 PM, Debasish Das debasish.da...@gmail.comwrote:

 Hi David,

 I have started to experiment with BFGS solvers for Spark GLM over large
 scale data...

 I am also looking to add a good QP solver in breeze that can be used in
 Spark ALS for constraint solves...More details on that soon...

 I could not load the breeze 0.7 code into eclipse... There is a folder called
 natives in master but there is no code in that...all the code is in
 src/main/scala...

 I added the eclipse plugin:

 addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.6.0")

 addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.2.0")

 But it seems the project is set to use idea...

 Could you please explain the dev methodology for breeze? My idea is to do
 the solver work in breeze, as that's the right place, and get it into Spark
 through Xiangrui's WIP on sparse data and breeze support...


It would be great to have a QP Solver: I don't know if you know about this
library: http://www.joptimizer.com/

I'm not quite sure what you mean by dev methodology. If you just mean how
to get code into Breeze, just send a PR to scalanlp/breeze. Unit tests are
good for something nontrivial like this. Maybe some basic documentation.



 Thanks.
 Deb



 On Fri, Mar 7, 2014 at 12:46 AM, DB Tsai dbt...@alpinenow.com wrote:

  Hi Xiangrui,
 
   I think it doesn't matter whether we use Fortran/Breeze/RISO for the
   optimizers, since optimization only takes < 1% of the time. Most of the
   time is in the gradientSum and lossSum parallel computation.
 
  Sincerely,
 
  DB Tsai
  Machine Learning Engineer
  Alpine Data Labs
  --
  Web: http://alpinenow.com/
 
 
  On Thu, Mar 6, 2014 at 7:10 PM, Xiangrui Meng men...@gmail.com wrote:
   Hi DB,
  
   Thanks for doing the comparison! What were the running times for
   fortran/breeze/riso?
  
   Best,
   Xiangrui
  
   On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote:
   Hi David,
  
   I can converge to the same result with your breeze LBFGS and Fortran
   implementations now. Probably, I made some mistakes when I tried
   breeze before. I apologize that I claimed it's not stable.
  
   See the test case in BreezeLBFGSSuite.scala
   https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS
  
   This is training multinomial logistic regression against iris dataset,
   and both optimizers can train the models with 98% training accuracy.
  
   There are two issues to use Breeze in Spark,
  
   1) When the gradientSum and lossSum are computed distributively in
   custom defined DiffFunction which will be passed into your optimizer,
    Spark will complain that the LBFGS class is not serializable. In
    BreezeLBFGS.scala, I have to convert the RDD to an array to make it work
    locally. It should be easy to fix by just having LBFGS implement
    Serializable.
  
   2) Breeze computes redundant gradient and loss. See the following log
   from both Fortran and Breeze implementations.
  
   Thanks.
  
   Fortran:
   Iteration -1: loss 1.3862943611198926, diff 1.0
   Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
   Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
   Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
   Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
   Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
   Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
   Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
   Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
   Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
   Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
   Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627
  
   Breeze:
   Iteration -1: loss 1.3862943611198926, diff 1.0
   Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
   WARNING: Failed to load implementation from:
   com.github.fommil.netlib.NativeSystemBLAS
   Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
   WARNING: Failed to load implementation from:
   com.github.fommil.netlib.NativeRefBLAS
   Iteration 0: loss 1.3862943611198926, diff 0.0
   Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352
   Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126
   Iteration 3: loss 1.1242501524477688, diff 0.0
   Iteration 4: loss 1.1242501524477688, diff 0.0
   Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336
   Iteration 6: loss 1.0930151243303563, diff 0.0
   Iteration 7: loss 1.0930151243303563, diff 0.0
   Iteration 8: loss 1.054036932835569, diff 0.03566113127440601
   Iteration 9: loss 1.054036932835569, diff 0.0
   Iteration 10: loss 1.054036932835569, diff 0.0
   Iteration 11: loss 0.9907956302751622, diff 0.0507649459571
   Iteration 12: loss 0.9907956302751622, diff 0.0
   Iteration 13: loss 0.9907956302751622, diff 0.0
   Iteration 14: loss 0.9184205380342829, diff

Re: Making RDDs Covariant

2014-03-22 Thread David Hall
On Sat, Mar 22, 2014 at 8:59 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 The problem I was talking about is when you try to use typeclass converters
 and make them contravariant/covariant for input/output. Something like:

 trait Reader[-I, +O] { def read(i: I): O }

 Doing this, you soon have implicit collisions and philosophical concerns
 about what it means to serialize/deserialize a Parent class and a Child
 class...



You should (almost) never make a typeclass param contravariant. It's almost
certainly not what you want:

https://issues.scala-lang.org/browse/SI-2509

-- David


graphx samples in Java

2014-03-21 Thread David Soroko
Hi

Where can I find the equivalent of the graphx examples
(http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html#examples) in
Java? For example, how does the following translate to Java?


val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))





thanks

--david



Code documentation

2014-03-15 Thread David Thomas
Is there any documentation available that explains the code architecture,
which could help a new Spark framework developer?


Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-06 Thread David Hall
On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote:

 Hi David,

 I can converge to the same result with your breeze LBFGS and Fortran
 implementations now. Probably, I made some mistakes when I tried
 breeze before. I apologize that I claimed it's not stable.

 See the test case in BreezeLBFGSSuite.scala
 https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS

 This is training multinomial logistic regression against iris dataset,
 and both optimizers can train the models with 98% training accuracy.


great to hear! There were some bugs in LBFGS about 6 months ago, so
depending on the last time you tried it, it might indeed have been bugged.



 There are two issues to use Breeze in Spark,

 1) When the gradientSum and lossSum are computed distributively in
 custom defined DiffFunction which will be passed into your optimizer,
 Spark will complain LBFGS class is not serializable. In
 BreezeLBFGS.scala, I've to convert RDD to array to make it work
 locally. It should be easy to fix by just having LBFGS to implement
 Serializable.


I'm not sure why Spark should be serializing LBFGS? Shouldn't it live on
the controller node? Or is this a per-node thing?

But no problem to make it serializable.



 2) Breeze computes redundant gradient and loss. See the following log
 from both Fortran and Breeze implementations.


Err, yeah. I should probably have LBFGS do this automatically, but there's
a CachedDiffFunction that gets rid of the redundant calculations.
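For readers without the breeze source handy, the trick behind CachedDiffFunction can be sketched in a few dependency-free lines (the class name and counter below are illustrative, not breeze's actual implementation):

```scala
// Illustrative sketch of the memoization behind breeze's CachedDiffFunction:
// remember the last (point -> (loss, gradient)) pair so re-evaluating the
// objective at the same point (as line searches often do) costs nothing.
class CachingObjective(f: Vector[Double] => (Double, Vector[Double])) {
  private var last: Option[(Vector[Double], (Double, Vector[Double]))] = None
  var evaluations = 0 // how many times the expensive objective actually ran

  def apply(x: Vector[Double]): (Double, Vector[Double]) = last match {
    case Some((cx, res)) if cx == x => res // cache hit: skip recomputation
    case _ =>
      evaluations += 1
      val res = f(x)
      last = Some((x, res))
      res
  }
}
```

Wrapping the objective this way means a line search that revisits the same step pays for the underlying function only once, which removes exactly the duplicate loss/gradient evaluations visible in the Breeze log below.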

-- David



 Thanks.

 Fortran:
 Iteration -1: loss 1.3862943611198926, diff 1.0
 Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
 Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
 Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
 Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
 Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
 Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
 Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
 Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
 Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
 Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
 Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627

 Breeze:
 Iteration -1: loss 1.3862943611198926, diff 1.0
 Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
 WARNING: Failed to load implementation from:
 com.github.fommil.netlib.NativeSystemBLAS
 Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
 WARNING: Failed to load implementation from:
 com.github.fommil.netlib.NativeRefBLAS
 Iteration 0: loss 1.3862943611198926, diff 0.0
 Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352
 Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126
 Iteration 3: loss 1.1242501524477688, diff 0.0
 Iteration 4: loss 1.1242501524477688, diff 0.0
 Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336
 Iteration 6: loss 1.0930151243303563, diff 0.0
 Iteration 7: loss 1.0930151243303563, diff 0.0
 Iteration 8: loss 1.054036932835569, diff 0.03566113127440601
 Iteration 9: loss 1.054036932835569, diff 0.0
 Iteration 10: loss 1.054036932835569, diff 0.0
 Iteration 11: loss 0.9907956302751622, diff 0.0507649459571
 Iteration 12: loss 0.9907956302751622, diff 0.0
 Iteration 13: loss 0.9907956302751622, diff 0.0
 Iteration 14: loss 0.9184205380342829, diff 0.07304737423337761
 Iteration 15: loss 0.9184205380342829, diff 0.0
 Iteration 16: loss 0.9184205380342829, diff 0.0
 Iteration 17: loss 0.8259870936519939, diff 0.1006438117513297
 Iteration 18: loss 0.8259870936519939, diff 0.0
 Iteration 19: loss 0.8259870936519939, diff 0.0
 Iteration 20: loss 0.6327447552109576, diff 0.233952934583647
 Iteration 21: loss 0.6327447552109576, diff 0.0
 Iteration 22: loss 0.6327447552109576, diff 0.0
 Iteration 23: loss 0.5534101162436362, diff 0.12538154276652747
 Iteration 24: loss 0.5534101162436362, diff 0.0
 Iteration 25: loss 0.5534101162436362, diff 0.0
 Iteration 26: loss 0.40450200866125635, diff 0.2690732137675816
 Iteration 27: loss 0.40450200866125635, diff 0.0
 Iteration 28: loss 0.40450200866125635, diff 0.0
 Iteration 29: loss 0.30788249908237314, diff 0.23885980452569502

 Sincerely,

 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/


 On Wed, Mar 5, 2014 at 2:00 PM, David Hall d...@cs.berkeley.edu wrote:
  On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote:
 
  Hi David,
 
  On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote:
   I'm happy to help fix any problems. I've verified at points that the
   implementation gives the exact same sequence of iterates for a few
  different
   functions (with a particular line search) as the c port of lbfgs. So
 I'm
  a
   little surprised it fails where Fortran succeeds... but only a little.
  This
   was fixed late last year.
  I'm working

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread David Hall
On Wed, Mar 5, 2014 at 8:50 AM, Debasish Das debasish.da...@gmail.com wrote:

 Hi David,

 Few questions on breeze solvers:

 1. I feel the right place to add the useful parts of RISO LBFGS (based on
 Professor Nocedal's Fortran code) is breeze. It will involve stress
 testing breeze LBFGS on large sparse datasets and contributing fixes
 back, based on what we learn from RISO LBFGS.

 You agree on that, right?


That would be great.


 2. Normally for experiments like (1), I fix the line search. What's
 your preferred line search in breeze BFGS? I will also use that.

 Moré-Thuente and strong Wolfe with backtracking helped me in the past.


The default is a Strong Wolfe search:
https://github.com/scalanlp/breeze/blob/master/src/main/scala/breeze/optimize/StrongWolfe.scala
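For reference, the strong Wolfe conditions that such a search enforces, in the standard notation of Nocedal and Wright (not quoted from the breeze source), are:

```latex
% Sufficient decrease (Armijo) condition:
f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \, \nabla f_k^{\top} p_k
% Strong curvature condition:
\left| \nabla f(x_k + \alpha p_k)^{\top} p_k \right|
  \le c_2 \left| \nabla f_k^{\top} p_k \right|,
\qquad 0 < c_1 < c_2 < 1
```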



 3. The paper that you sent also says L-BFGS-B is better on box
 constraints. I feel it's worthwhile to have both solvers because many
 practical problems need box constraints, or have complex constraints that
 can be reformulated as unconstrained or box-constrained problems.


 Example use-cases for me are automatic feature extraction from photo/video
 frames using factorization.

 I will compare L-BFGS-B vs constrained QN method that you have in Breeze
 within an analysis similar to 1.


Great. Yeah I fully expect l-bfgs-b to be better on box constraints. It
turns out that this other method looked easier to implement and gave
reasonably good results even with box constraints.



 4. Do you have a road-map for adding CG solvers to breeze? A linear CG solver
 to compare with BLAS posv seems like a good use case for me in mllib ALS.
 DB sent a paper by Professor Ng which shows the effectiveness of CG and
 BFGS over SGD in the email chain.


We have a linear CG solver (
https://github.com/scalanlp/breeze/blob/master/src/main/scala/breeze/optimize/linear/ConjugateGradient.scala)
which is used in the Truncated Newton solver. (
https://github.com/scalanlp/breeze/blob/master/src/main/scala/breeze/optimize/TruncatedNewtonMinimizer.scala)
But I've not implemented nonlinear CG, and honestly hadn't planned on it,
per below.
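For intuition, here is a rough, dependency-free sketch of what a matrix-free linear CG solver like the one linked above does for a symmetric positive-definite system Ax = b (the names and signatures are invented for illustration; breeze's solver works against its own linear-operator abstraction):

```scala
// Dependency-free sketch of linear conjugate gradient for SPD Ax = b.
// The matrix is only touched through a matvec closure, which is what
// makes matrix-free uses (e.g. truncated Newton) possible.
object CG {
  type Vec = Array[Double]

  def dot(x: Vec, y: Vec): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  // a * x + y, elementwise
  def axpy(a: Double, x: Vec, y: Vec): Vec =
    x.zip(y).map { case (xi, yi) => a * xi + yi }

  def solve(matvec: Vec => Vec, b: Vec,
            iters: Int = 100, tol: Double = 1e-10): Vec = {
    var x = Array.fill(b.length)(0.0)
    var r = b.clone()           // residual b - A x0, with x0 = 0
    var p = r.clone()           // initial search direction
    var rsOld = dot(r, r)
    var i = 0
    while (i < iters && math.sqrt(rsOld) > tol) {
      val ap = matvec(p)
      val alpha = rsOld / dot(p, ap)
      x = axpy(alpha, p, x)     // step along the search direction
      r = axpy(-alpha, ap, r)   // update the residual
      val rsNew = dot(r, r)
      p = axpy(rsNew / rsOld, p, r) // new A-conjugate direction
      rsOld = rsNew
      i += 1
    }
    x
  }
}
```

Because the solver only needs the `matvec` closure, a truncated Newton method can feed it Hessian-vector products without ever materializing the Hessian.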



 I believe on non-convex problems like Matrix Factorization, CG family might
 converge to a better solution than BFGS. No way to know till we experiment
 on the datasets :-)


I've never had good experience with CG, and a vision colleague of mine
found that LBFGS was better on his (non-convex) stuff, but I could be
easily persuaded.

Thanks!

-- David



 Thanks.
 Deb



 On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote:

  Just subscribing to this list, so apologies for quoting weirdly and any
  other
  etiquette offenses.
 
 
  DB Tsai wrote
   Hi Deb,
  
   I had tried breeze L-BFGS algorithm, and when I tried it couple weeks
   ago, it's not as stable as the fortran implementation. I guessed the
   problem is in the line search related thing. Since we may bring breeze
   dependency for the sparse format support as you pointed out, we can
   just try to fix the L-BFGS in breeze, and we can get OWL-QN and
   L-BFGS-B.
  
   What do you think?
 
  I'm happy to help fix any problems. I've verified at points that the
  implementation gives the exact same sequence of iterates for a few
  different
  functions (with a particular line search) as the c port of lbfgs. So I'm
 a
  little surprised it fails where Fortran succeeds... but only a little.
 This
  was fixed late last year.
 
  OWL-QN seems to mostly be stable, but probably deserves more testing.
  Presumably it has whatever defects my LBFGS does. (It's really pretty
  straightforward to implement given an L-BFGS)
 
  We don't provide an L-BFGS-B implementation. We do have a more general
  constrained qn method based on
  http://jmlr.org/proceedings/papers/v5/schmidt09a/schmidt09a.pdf (which
  uses
  a L-BFGS type update as part of the algorithm). From the experiments in
  their paper, it's likely to not work as well for bound constraints, but
 can
  do things that lbfgsb can't.
 
  Again, let me know what I can help with.
 
  -- David Hall
 
 
  On Mon, Mar 3, 2014 at 3:52 PM, DB Tsai <dbtsai@...> wrote:
   Hi Deb,
  
   a.  OWL-QN for solving L1 natively in BFGS
   Based on what I saw from
  
 
 https://github.com/tjhunter/scalanlp-core/blob/master/learn/src/main/scala/breeze/optimize/OWLQN.scala
   , it seems that it's not difficult to implement OWL-QN once LBFGS is
   done.
  
  
   b.  Bound constraints in BFGS : I saw you have converted the fortran
   code.
   Is there a license issue ? I can help in getting that up to speed as
   well.
    I tried to convert the Fortran L-BFGS-B implementation to
    java using f2j; the translated code is just a mess, and it
    doesn't work at all. There is no license issue here. Any idea about
    how to approach this?
  
   c. Few variants of line searches : I will discuss on it.
   For the dbtsai-lbfgs branch seems like it already got merged by
 Jenkins.
   I don't think

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread David Hall
On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote:

 Hi David,

 On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote:
  I'm happy to help fix any problems. I've verified at points that the
  implementation gives the exact same sequence of iterates for a few
 different
  functions (with a particular line search) as the c port of lbfgs. So I'm
 a
  little surprised it fails where Fortran succeeds... but only a little.
 This
  was fixed late last year.
 I'm working on a reproducible test case using breeze vs fortran
 implementation to show the problem I've run into. The test will be in
 one of the test cases in my Spark fork, is it okay for you to
 investigate the issue? Or do I need to make it as a standalone test?



Um, as long as it wouldn't be too hard to pull out.



 Will send you the test later today.

 Thanks.

 Sincerely,

 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/



Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread David Hall
I did not. They would be nice to have.


On Wed, Mar 5, 2014 at 5:21 PM, Debasish Das debasish.da...@gmail.com wrote:

 David,

 There used to be standard BFGS testcases in Professor Nocedal's
 package...did you stress test the solver with them ?

 If not I will shoot him an email for them.

 Thanks.
 Deb



 On Wed, Mar 5, 2014 at 2:00 PM, David Hall d...@cs.berkeley.edu wrote:

  On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote:
 
   Hi David,
  
   On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote:
I'm happy to help fix any problems. I've verified at points that the
implementation gives the exact same sequence of iterates for a few
   different
functions (with a particular line search) as the c port of lbfgs. So
  I'm
   a
little surprised it fails where Fortran succeeds... but only a
 little.
   This
was fixed late last year.
   I'm working on a reproducible test case using breeze vs fortran
   implementation to show the problem I've run into. The test will be in
   one of the test cases in my Spark fork, is it okay for you to
   investigate the issue? Or do I need to make it as a standalone test?
  
 
 
  Um, as long as it wouldn't be too hard to pull out.
 
 
  
   Will send you the test later today.
  
   Thanks.
  
   Sincerely,
  
   DB Tsai
   Machine Learning Engineer
   Alpine Data Labs
   --
   Web: http://alpinenow.com/