Re: Why spark-submit works with package not with jar
Hi,

It seems this library is several years old. Have you considered using the Google-provided connector? You can find it at https://github.com/GoogleCloudDataproc/spark-bigquery-connector

Regards,
David Rabinowitz

On Sun, May 5, 2024 at 6:07 PM Jeff Zhang wrote:
> Are you sure com.google.api.client.http.HttpRequestInitializer is in
> the spark-bigquery-latest.jar, or it may be in the transitive dependencies
> of spark-bigquery_2.11?
>
> On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh wrote:
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand
>> expert opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>> ---------- Forwarded message ---------
>> From: Mich Talebzadeh
>> Date: Tue, 20 Oct 2020 at 16:50
>> Subject: Why spark-submit works with package not with jar
>> To: user @spark
>>
>> Hi,
>>
>> I have a scenario that I use in Spark submit as follows:
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>
>> As you can see, the jar files needed are added.
>>
>> This comes back with an error message as below:
>>
>> Creating model test.weights_MODEL
>> java.lang.NoClassDefFoundError: com/google/api/client/http/HttpRequestInitializer
>>   at com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>   at com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>   at com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>   ... 76 elided
>> Caused by: java.lang.ClassNotFoundException: com.google.api.client.http.HttpRequestInitializer
>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>
>> So there is an issue with finding the class, although the jar file used,
>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar, has it.
>>
>> Now if I remove the above jar file and replace it with the same version
>> but as a package, it works!
>>
>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
>> --packages com.github.samelamin:spark-bigquery_2.11:0.2.6
>>
>> I have read the write-ups about packages searching the Maven
>> repositories etc. I am not convinced why using the package should make so
>> much difference between a failure and success. In other words, when should
>> one use a package rather than a jar?
>>
>> Any ideas will be appreciated.
>>
>> Thanks
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.

--
Best Regards

Jeff Zhang
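For anyone hitting the same thing: the practical difference is that --jars ships only the exact files listed, while --packages resolves the named artifact from Maven/Ivy together with its whole transitive dependency tree, which is how google-http-client (the jar that actually contains HttpRequestInitializer) ends up on the classpath. A toy sketch of the two behaviours, with a made-up dependency graph (the artifact names below are illustrative, not the real POM contents):

```python
# Toy illustration (hypothetical dependency graph, not the real POMs):
# --jars behaves like "take exactly what I listed"; --packages behaves
# like "resolve the listed artifact plus everything it depends on".

DEPS = {
    "spark-bigquery_2.11:0.2.6": ["google-cloud-bigquery", "google-api-client"],
    "google-api-client": ["google-http-client"],   # contains HttpRequestInitializer
    "google-cloud-bigquery": [],
    "google-http-client": [],
}

def jars_classpath(listed):
    """--jars: only the artifacts named on the command line."""
    return set(listed)

def packages_classpath(listed):
    """--packages: the listed artifacts plus their transitive closure."""
    resolved, stack = set(), list(listed)
    while stack:
        artifact = stack.pop()
        if artifact not in resolved:
            resolved.add(artifact)
            stack.extend(DEPS.get(artifact, []))
    return resolved

listed = ["spark-bigquery_2.11:0.2.6"]
print("google-http-client" in jars_classpath(listed))      # False -> NoClassDefFoundError
print("google-http-client" in packages_classpath(listed))  # True
```

This also matches Jeff's question above: if the class only lives in a transitive dependency, the --jars run never sees it unless that jar is listed explicitly as well.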
Re: JDK version support policy?
Thanks all for the discussion here. Based on this I think we'll stick with Java 8 for now and then upgrade to Java 11 around or after Spark 4.

On Thu, Jun 8, 2023, at 07:17, Sean Owen wrote:
> Noted, but for that you'd simply run your app on Java 17. If Spark works, and
> your app's dependencies work on Java 17 because you compile it for 17 (and
> jakarta.* classes for example) then there's no issue.
>
> On Thu, Jun 8, 2023 at 3:13 AM Martin Andersson wrote:
>> There are some reasons to drop Java 11 as well. Java 17 included a large
>> change, breaking backwards compatibility with the transition from Java EE
>> to Jakarta EE
>> <https://blogs.oracle.com/javamagazine/post/transition-from-java-ee-to-jakarta-ee>.
>> This means that any users using Spark 4.0 together with Spring 6.x or any
>> recent version of servlet containers such as Tomcat or Jetty will experience
>> issues. (For security reasons it's beneficial to float your dependencies to
>> the latest version of these libraries/frameworks.)
>>
>> I'm not explicitly saying Java 11 should be dropped in Spark 4, just thought
>> I'd bring this issue to your attention.
>>
>> Best Regards, Martin
>>
>> *From:* Jungtaek Lim
>> *Sent:* Wednesday, June 7, 2023 23:19
>> *To:* Sean Owen
>> *Cc:* Dongjoon Hyun; Holden Karau; dev
>> *Subject:* Re: JDK version support policy?
>>
>> +1 to drop Java 8 but +1 to set the lowest supported version to Java 11.
>>
>> Considering the phase for only security updates, 11 LTS would not be EOLed
>> for a very long time. Unless that's coupled with other deps which require
>> bumping the JDK version (hope someone can bring up a list), it doesn't seem
>> to buy much. And given the strong backward compatibility JDK provides,
>> that's less likely.
>>
>> Purely from the project's source code view, does anyone know how much
>> benefit we can leverage by picking up 17 rather than 11? I lost track,
>> but some of their proposals are more about catching up with other
>> languages, which doesn't make us that happy since Scala has provided
>> them for years.
>>
>> On Thu, Jun 8, 2023 at 2:35 AM, Sean Owen wrote:
>>> I also generally perceive that, after Java 9, there is much less breaking
>>> change. So working on Java 11 probably means it works on 20, or can be
>>> easily made to without pain. Like I think the tweaks for Java 17 were
>>> quite small.
>>>
>>> Targeting Java >11 excludes Java 11 users and probably wouldn't buy much.
>>> Keeping the support probably doesn't interfere with working on much newer
>>> JVMs either.
>>>
>>> On Wed, Jun 7, 2023, 12:29 PM Holden Karau wrote:
>>>> So JDK 11 is still supported in OpenJDK until 2026. I'm not sure if we're
>>>> going to see enough folks moving to JRE 17 by the Spark 4 release; unless
>>>> we have a strong benefit from dropping 11 support I'd be inclined to
>>>> keep it.
>>>>
>>>> On Tue, Jun 6, 2023 at 9:08 PM Dongjoon Hyun wrote:
>>>>> I'm also +1 on dropping both Java 8 and 11 in Apache Spark 4.0.
>>>>>
>>>>> Dongjoon.
>>>>>
>>>>> On 2023/06/07 02:42:19 yangjie01 wrote:
>>>>> > +1 on dropping Java 8 in Spark 4.0, and I even hope Spark 4.0 can only
>>>>> > support Java 17 and the upcoming Java 21.
>>>>> >
>>>>> > From: Denny Lee
>>>>> > Date: Wednesday, June 7, 2023 07:10
>>>>> > To: Sean Owen
>>>>> > Cc: David Li, "dev@spark.apache.org"
>>>>> > Subject: Re: JDK version support policy?
>>>>> >
>>>>> > +1 on dropping Java 8 in Spark 4.0, saying this as a fan of the
>>>>> > fast-paced (positive) updates to Arrow, eh?!
>>>>> >
>>>>> > On Tue, Jun 6, 2023 at 4:02 PM Sean Owen wrote:
>>>>> > I haven't followed this discussion closely, but I think we
>>>>> > could/should drop Java 8 in Spark 4.0, which is up next after 3.5?
>>>>> >
>>>>> > On Tue, Jun 6, 2023 at 2:44 PM David Li wrote:
JDK version support policy?
Hello Spark developers,

I'm from the Apache Arrow project. We've discussed Java version support [1], and crucially, whether to continue supporting Java 8 or not. As Spark is a big user of Arrow in Java, I was curious what Spark's policy here was. If Spark intends to stay on Java 8, for instance, we may also want to stay on Java 8 or otherwise provide some supported version of Arrow for Java 8.

We've seen dependencies dropping or planning to drop support. gRPC may drop Java 8 at any time [2], possibly this September [3], which may affect Spark (due to Spark Connect). And today we saw that Arrow had issues running tests with Mockito on Java 20, but we couldn't update Mockito since it had dropped Java 8 support. (We pinned the JDK version in that CI pipeline for now.)

So at least, I am curious if Arrow could start the long process of migrating Java versions without impacting Spark, or if we should continue to cooperate. Arrow Java doesn't see quite so much activity these days, so it's not quite critical, but it's possible that these dependency issues will start to affect us more soon. And looking forward, Java is working on APIs that should also allow us to ditch the --add-opens flag requirement too.

[1]: https://lists.apache.org/thread/phpgpydtt3yrgnncdyv4qdq1gf02s0yj
[2]: https://github.com/grpc/proposal/blob/master/P5-jdk-version-support.md
[3]: https://github.com/grpc/grpc-java/issues/9386
SPARK-22256
Hello, my name is David McWhorter and I created a new pull request to address the SPARK-22256 ticket at https://github.com/apache/spark/pull/30739. This change adds a memory overhead setting for the spark driver running on mesos. This is a reopening of a prior pull request that was never merged. I am wondering if I could request that someone review the change and merge it if it looks acceptable? I'm happy to do any further testing or make further changes if they are needed but it looked good from the prior review. Many thanks, David McWhorter
[DISCUSS] Reducing memory usage of toPandas with Arrow "self_destruct" option
Hello all,

We've been working with PySpark and Pandas, and have found that to convert a dataset using N bytes of memory to Pandas, we need to have 2N bytes free, even with the Arrow optimization enabled. The fundamental reason is ARROW-3789 [1]: Arrow does not free the Arrow table until conversion finishes, so there are 2 copies of the dataset in memory.

We'd like to improve this by taking advantage of the Arrow "self_destruct" option available in Arrow >= 0.16. When converting a suitable[*] Arrow table to a Pandas dataframe, it avoids the worst-case 2x memory usage, with something more like ~25% overhead instead, by freeing the columns in the Arrow table after converting each column instead of at the end of conversion.

Does this sound like a desirable optimization to have in Spark? If so, how should it be exposed to users? As discussed below, there are cases where a user may or may not want it enabled.

Here's a proof-of-concept patch, along with a demonstration, and a comparison of memory usage (via memory_profiler [2]) with and without the flag enabled: https://gist.github.com/lidavidm/289229caa022358432f7deebe26a9bd3

There are some cases where you may _not_ want this optimization, however, so the patch leaves it as a toggle. Is this the API we'd want, or would we prefer a different API (e.g. a configuration flag)? The reason we may not want this enabled by default is that the related split_blocks option is more likely to find zero-copy opportunities, which will result in the Pandas dataframe being backed by immutable buffers. Some Pandas operations will error in these cases, e.g. [3]. Also, to minimize memory usage, we set use_threads=False to convert each column sequentially, rather than in parallel, but this slows down the conversion somewhat.

One option here may be to set self_destruct by default, but relegate the other two options (which further save memory) to a toggle, and I can measure the impact of this if desired.
[1]: https://issues.apache.org/jira/browse/ARROW-3789
[2]: https://github.com/pythonprofilers/memory_profiler
[3]: https://github.com/pandas-dev/pandas/issues/35530

[*] See my comment in https://issues.apache.org/jira/browse/ARROW-9878.

Thanks,
David

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
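To make the memory behavior above concrete, here is a toy model (my own sketch, not the actual Arrow implementation; the real knob is pyarrow's `Table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)`) of why freeing each Arrow column right after it is converted caps the worst-case footprint:

```python
# Toy model of peak memory during Arrow -> Pandas conversion.
# Each column costs 1 unit in Arrow form and 1 unit in Pandas form.

def peak_memory(n_cols, self_destruct):
    peak = live_arrow = n_cols   # the whole Arrow table starts in memory
    live_pandas = 0
    for _ in range(n_cols):
        live_pandas += 1                      # converted copy of one column
        peak = max(peak, live_arrow + live_pandas)
        if self_destruct:
            live_arrow -= 1                   # free the Arrow column immediately
    if not self_destruct:
        live_arrow = 0                        # Arrow table freed only at the end
    return peak

print(peak_memory(100, self_destruct=False))  # 200 -> the 2x worst case
print(peak_memory(100, self_destruct=True))   # 101 -> roughly N plus one column
```

The real overhead is higher than this toy "N + 1 column" figure (the ~25% quoted above) because pandas block consolidation still copies some data, but the shape of the saving is the same.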
Spark 3.0 and ORC 1.6
Hi all,

I am a heavy user of Spark at LinkedIn, and am excited about the ZStandard compression option recently incorporated into ORC 1.6. I would love to explore using it for storing/querying of large (>10 TB) tables for my own disk I/O intensive workloads, and other users & companies may be interested in adopting ZStandard more broadly, since it seems to offer faster compression speeds at higher compression ratios with better multi-threaded support than zlib/Snappy. At scale, improvements of even ~10% on disk and/or compute, hopefully just from setting the "orc.compress" flag to a different value, could translate into palpable gains in capacity/cost cluster wide without requiring broad engineering migrations. See a somewhat recent FB Engineering blog post on the topic for their reported experiences: https://engineering.fb.com/core-data/zstandard/

Do we know if ORC 1.6.x will make the cut for Spark 3.0? A recent PR (https://github.com/apache/spark/pull/26669) updated ORC to 1.5.8, but I don't have a good understanding of how difficult incorporating ORC 1.6.x into Spark will be. For instance, in the PRs for enabling Java Zstd in ORC (https://github.com/apache/orc/pull/306 & https://github.com/apache/orc/pull/412), some additional work/discussion around Hadoop shims occurred to maintain compatibility across different versions of Hadoop (e.g. 2.7) and aircompressor (a library containing Java implementations of various compression codecs, so that dependence on Hadoop 2.9 is not required).

Again, these may be non-issues, but I wanted to kindle discussion around whether this can make the cut for 3.0, since I imagine it's a major upgrade many users will focus on migrating to once released.

Kind regards,
David Christle
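For readers wanting to try this once a Spark build carries ORC 1.6+: the hoped-for change really is just a codec switch at write time. The snippet below is an untested configuration sketch; `spark` and `df` stand for an existing SparkSession and DataFrame, and zstd availability depends on the underlying ORC/Hadoop versions as discussed above.

```python
# Untested sketch: write an ORC table with ZStandard instead of the default
# codec. Assumes the ORC 1.6 zstd work discussed above has landed in the
# Spark build; `df` and `spark` are an existing DataFrame and SparkSession.
df.write.option("compression", "zstd").orc("/tables/events_zstd")

# Equivalent via SQL, using the data source option form of the same flag:
spark.sql("""
  CREATE TABLE events_zstd
  USING ORC
  OPTIONS (compression 'zstd')
  AS SELECT * FROM events
""")
```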
Re: [Kubernetes] Resource requests and limits for Driver and Executor Pods
Thanks for linking that PR Kimoon. It actually does mostly address the issue I was referring to.

As the issue I linked in my first email states, one physical cpu might not be enough to execute a task in a performant way. So if I set spark.executor.cores=1 and spark.task.cpus=1, I will get 1 core from Kubernetes, execute one task per executor, and run into performance problems. Being able to specify `spark.kubernetes.executor.cores=1.2` would fix the issue (1.2 is just an example).

I am curious as to why you, Yinan, would want to use this property to request less than 1 physical cpu (that is how it sounds to me on the PR). Do you have testing that indicates that less than 1 physical CPU is enough for executing tasks?

In the end it boils down to the question proposed by Yinan:

> A relevant question is should Spark on Kubernetes really be opinionated on
> how to set the cpu request and limit and even try to determine this
> automatically?

And I completely agree with your answer Kimoon, we should provide sensible defaults and make it configurable, as Yinan's PR does. The only remaining question would then be what a sensible default for spark.kubernetes.executor.cores would be. Seeing that I wanted more than 1 and Yinan wants less, leaving it at 1 might be best.

Thanks,
David

From: Kimoon Kim <kim...@pepperdata.com>
Date: Friday, March 30, 2018 at 4:28 PM
To: Yinan Li <liyinan...@gmail.com>
Cc: David Vogelbacher <dvogelbac...@palantir.com>, "dev@spark.apache.org" <dev@spark.apache.org>
Subject: Re: [Kubernetes] Resource requests and limits for Driver and Executor Pods

I see. Good to learn the interaction between spark.task.cpus and spark.executor.cores. But am I right to say that PR #20553 can still be used as an additional knob on top of those two? Say a user wants 1.5 cores per executor from Kubernetes, not the rounded-up integer value 2?
> A relevant question is should Spark on Kubernetes really be opinionated on
> how to set the cpu request and limit and even try to determine this
> automatically?

Personally, I don't see how this can be auto-determined at all. I think the best we can do is to come up with sensible default values for the most common case, and provide, and well-document, other knobs for edge cases.

Thanks,
Kimoon

On Fri, Mar 30, 2018 at 12:37 PM, Yinan Li <liyinan...@gmail.com> wrote:

PR #20553 [github.com] is more for allowing users to use a fractional value for cpu requests. The existing spark.executor.cores is sufficient for specifying more than one cpu.

> One way to solve this could be to request more than 1 core from Kubernetes
> per task. The exact amount we should request is unclear to me (it largely
> depends on how many threads actually get spawned for a task).

A good indication is spark.task.cpus, and on average how many tasks are expected to be run by a single executor at any point in time. If each executor is only expected to run one task at most at any point in time, spark.executor.cores can be set equal to spark.task.cpus.

A relevant question is should Spark on Kubernetes really be opinionated on how to set the cpu request and limit and even try to determine this automatically?

On Fri, Mar 30, 2018 at 11:40 AM, Kimoon Kim <kim...@pepperdata.com> wrote:

> Instead of requesting `[driver,executor].memory`, we should just request
> `[driver,executor].memory + [driver,executor].memoryOverhead`. I think this
> case is a bit clearer than the CPU case, so I went ahead and filed an issue
> [issues.apache.org] with more details and made a PR [github.com].

I think this suggestion makes sense.

> One way to solve this could be to request more than 1 core from Kubernetes
> per task. The exact amount we should request is unclear to me (it largely
> depends on how many threads actually get spawned for a task).
I wonder if this is being addressed by PR #20553 [github.com] written by Yinan. Yinan?

Thanks,
Kimoon

On Thu, Mar 29, 2018 at 5:14 PM, David Vogelbacher <dvogelbac...@palantir.com> wrote:

Hi,

At the moment driver and executor pods are created using the following requests and limits:

             CPU                                       Memory
  Request    [driver,executor].cores                   [driver,executor].memory
  Limit      Unlimited (but can be specified using     [driver,executor].memory +
             spark.[driver,executor].cores)            [driver,executor].memoryOverhead

Specifying the requests like this leads to problems if the pods only get the requested amount of resources and nothing of the optional (limit) resources, as can happen in a fully utilized cluster.

For memory: Let's say we have a node with 100 GiB memory and 5 pods with 20 GiB memory and 5 GiB memoryOverhead. At the beginning all 5 pods use 20 GiB of memory and all is well. If a pod then starts using its overhead memory it will get killed as there is no more memory available, even though we told Spark that it can use 25 GiB of memory.
[Kubernetes] Resource requests and limits for Driver and Executor Pods
Hi,

At the moment driver and executor pods are created using the following requests and limits:

             CPU                                       Memory
  Request    [driver,executor].cores                   [driver,executor].memory
  Limit      Unlimited (but can be specified using     [driver,executor].memory +
             spark.[driver,executor].cores)            [driver,executor].memoryOverhead

Specifying the requests like this leads to problems if the pods only get the requested amount of resources and nothing of the optional (limit) resources, as can happen in a fully utilized cluster.

For memory: Let's say we have a node with 100 GiB memory and 5 pods with 20 GiB memory and 5 GiB memoryOverhead. At the beginning all 5 pods use 20 GiB of memory and all is well. If a pod then starts using its overhead memory it will get killed as there is no more memory available, even though we told Spark that it can use 25 GiB of memory.

Instead of requesting `[driver,executor].memory`, we should just request `[driver,executor].memory + [driver,executor].memoryOverhead`. I think this case is a bit clearer than the CPU case, so I went ahead and filed an issue with more details and made a PR.

For CPU: As it turns out, there can be performance problems if we only have `executor.cores` available (which means we have one core per task). This was raised here and is the reason that the cpu limit was set to unlimited. This issue stems from the fact that in general there will be more than one thread per task, resulting in performance impacts if there is only one core available. However, I am not sure that just setting the limit to unlimited is the best solution because it means that even if the Kubernetes cluster can perfectly satisfy the resource requests, performance might be very bad.

I think we should guarantee that an executor is able to do its work well (without performance issues or getting killed, as could happen in the memory case) with the resources it gets guaranteed from Kubernetes. One way to solve this could be to request more than 1 core from Kubernetes per task.
The exact amount we should request is unclear to me (it largely depends on how many threads actually get spawned for a task). We would need to find a way to determine this somehow automatically, or at least come up with a better default value than 1 core per task.

Does somebody have ideas or thoughts on how to solve this best?

Best,
David
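To summarize the proposal in code form (my own sketch, not the implementation in the PR; `cores_per_task` models the suggested more-than-one-core-per-task multiplier and is a hypothetical knob):

```python
def pod_resources(executor_memory_gib, memory_overhead_gib,
                  task_cpus, concurrent_tasks, cores_per_task=1.0):
    """Sketch of the proposed request sizing for an executor pod.

    memory: request heap + overhead up front, so the pod can never be
    killed for using memory Spark was told it could use.
    cpu: derive the request from spark.task.cpus times the expected number
    of concurrent tasks, times a per-task core multiplier (e.g. 1.2) to
    cover extra threads spawned per task.
    """
    memory_request = executor_memory_gib + memory_overhead_gib
    cpu_request = task_cpus * concurrent_tasks * cores_per_task
    return {"memory_request_gib": memory_request, "cpu_request": cpu_request}

# The example from the thread: 20 GiB heap + 5 GiB overhead, one task at a time.
print(pod_resources(20, 5, task_cpus=1, concurrent_tasks=1, cores_per_task=1.2))
```

With these inputs the sketch requests 25 GiB of memory and 1.2 cpus, i.e. the memory fix proposed above plus the fractional cpu request that PR #20553 would make expressible.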
RE: Launching multiple spark jobs within a main spark job.
I am not familiar with any problem with that. Anyway, if you run a Spark application you would have multiple jobs, which makes sense that it is not a problem.

Thanks,
David.

From: Naveen [mailto:hadoopst...@gmail.com]
Sent: Wednesday, December 21, 2016 9:18 AM
To: dev@spark.apache.org; u...@spark.apache.org
Subject: Launching multiple spark jobs within a main spark job.

Hi Team,

Is it ok to spawn multiple spark jobs within a main spark job? My main spark job's driver, which was launched on a yarn cluster, will do some preprocessing and based on it, it needs to launch multiple spark jobs on the yarn cluster. Not sure if this is the right pattern. Please share your thoughts. Sample code I have is as below for better understanding:

object Mainsparkjob {

  def main(...) {
    val sc = new SparkContext(..)

    // Fetch from hive.. using hivecontext
    // Fetch from hbase

    // spawning multiple Futures..
    val future1 = Future {
      val sparkjob = SparkLauncher(...).launch()
      sparkjob.waitFor()
    }

    // Similarly, future2 to futureN.

    future1.onComplete { ... }
  }
} // end of Mainsparkjob

Confidentiality: This communication and any attachments are intended for the above-named persons only and may be confidential and/or legally privileged. Any opinions expressed in this communication are not necessarily those of NICE Actimize. If this communication has come to you in error you must take no action based on it, nor must you copy or show it to anyone; please delete/destroy and inform the sender by e-mail immediately. Monitoring: NICE Actimize may monitor incoming and outgoing e-mails. Viruses: Although we have taken steps toward ensuring that this e-mail and attachments are free from any virus, we advise that in keeping with good computing practice the recipient should ensure they are actually virus free.
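For what it's worth, the orchestration pattern in Naveen's sketch (the main driver does some preprocessing, then spawns child jobs and waits on their futures) can be outlined language-agnostically. The sketch below is plain Python purely to illustrate the control flow; the actual SparkLauncher/spark-submit invocation is stubbed out as a placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def launch_child_job(name):
    # Stand-in for SparkLauncher(...).launch() followed by waitFor(); a real
    # driver would invoke spark-submit here and block on the child process.
    return f"{name}: finished"

def main_spark_job(job_names):
    # Preprocessing in the main driver would happen here, then each child
    # job is submitted as a future and the driver waits for all of them.
    with ThreadPoolExecutor(max_workers=len(job_names)) as pool:
        futures = [pool.submit(launch_child_job, n) for n in job_names]
        return [f.result() for f in futures]

print(main_spark_job(["job1", "job2"]))
```

As David notes above, from YARN's point of view each child is just another independently submitted application, so nothing special happens beyond ordinary multi-job scheduling.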
Re: SPARK-13843 and future of streaming backends
> As far as group / artifact name compatibility, at least in the case of
> Kafka we need different artifact names anyway, and people are going to
> have to make changes to their build files for spark 2.0 anyway. As
> far as keeping the actual classes in org.apache.spark to not break
> code despite the group name being different, I don't know whether that
> would be enforced by maven central, just looked at as poor taste, or
> ASF suing for trademark violation :)

Sonatype has strict instructions to only permit org.apache.* to originate from repository.apache.org. Exceptions to that must be approved by VP, Infrastructure.
Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda
Hi Ben, > My company uses Lamba to do simple data moving and processing using python > scripts. I can see using Spark instead for the data processing would make it > into a real production level platform. That may be true. Spark has first class support for Python which should make your life easier if you do go this route. Once you've fleshed out your ideas I'm sure folks on this mailing list can provide helpful guidance based on their real world experience with Spark. > Does this pave the way into replacing > the need of a pre-instantiated cluster in AWS or bought hardware in a > datacenter? In a word, no. SAMBA is designed to extend-not-replace the traditional Spark computation and deployment model. At it's most basic, the traditional Spark computation model distributes data and computations across worker nodes in the cluster. SAMBA simply allows some of those computations to be performed by AWS Lambda rather than locally on your worker nodes. There are I believe a number of potential benefits to using SAMBA in some circumstances: 1. It can help reduce some of the workload on your Spark cluster by moving that workload onto AWS Lambda, an infrastructure on-demand compute service. 2. It allows Spark applications written in Java or Scala to make use of libraries and features offered by Python and JavaScript (Node.js) today, and potentially, more libraries and features offered by additional languages in the future as AWS Lambda language support evolves. 3. It provides a simple, clean API for integration with REST APIs that may be a benefit to Spark applications that form part of a broader data pipeline or solution. > If so, then this would be a great efficiency and make an easier > entry point for Spark usage. I hope the vision is to get rid of all cluster > management when using Spark. You might find one of the hosted Spark platform solutions such as Databricks or Amazon EMR that handle cluster management for you a good place to start. 
At least in my experience, they got me up and running without difficulty.

David
[ANNOUNCE] New SAMBA Package = Spark + AWS Lambda
Hi all,

Just sharing news of the release of a newly available Spark package, SAMBA <https://github.com/onetapbeyond/lambda-spark-executor>.

SAMBA is an Apache Spark package offering seamless integration with the AWS Lambda <https://aws.amazon.com/lambda/> compute service for Spark batch and streaming applications on the JVM.

Within traditional Spark deployments RDD tasks are executed using fixed compute resources on worker nodes within the Spark cluster. With SAMBA, application developers can delegate selected RDD tasks to execute using on-demand AWS Lambda compute infrastructure in the cloud.

Not unlike the recently released ROSE <https://github.com/onetapbeyond/opencpu-spark-executor> package that extends the capabilities of traditional Spark applications with support for CRAN R analytics, SAMBA provides another (hopefully) useful extension for Spark application developers on the JVM.

SAMBA Spark Package: https://github.com/onetapbeyond/lambda-spark-executor
ROSE Spark Package: https://github.com/onetapbeyond/opencpu-spark-executor

Questions, suggestions, feedback welcome.

David

--
"All that is gold does not glitter,
 Not all those who wander are lost."
Re: ROSE: Spark + R on the JVM.
Hi Richard,

Thanks for providing the background on your application.

> the user types or copy-pastes his R code,
> the system should then send this R code (using ROSE) to R

Unfortunately this type of ad hoc R analysis is not supported. ROSE supports the execution of any R function or script within an existing R package on CRAN, Bioc, or GitHub. It does not support the direct execution of arbitrary blocks of R code as you described.

You may want to look at DeployR (http://deployr.revolutionanalytics.com/), an open source R integration server that provides APIs in Java, JavaScript and .NET that can easily support your use case. The outputs of your DeployR integration could then become inputs to your data processing system.

David

"All that is gold does not glitter,
 Not all those who wander are lost."

Original Message
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 13 2016 3:18 am
UTC Time: January 13 2016 8:18 am
From: rsiebel...@gmail.com
To: themarchoffo...@protonmail.com
CC: m...@vijaykiran.com, cjno...@gmail.com, u...@spark.apache.org, dev@spark.apache.org

Hi David,

the use case is that we're building a data processing system with an intuitive user interface where Spark is used as the data processing framework. We would like to provide a HTML user interface to R where the user types or copy-pastes his R code; the system should then send this R code (using ROSE) to R, process it and give the results back to the user. The RDD would be used so that the data can be further processed by the system, but we would like to also show, or be able to show, the messages printed to STDOUT and also the images (plots) that are generated by R. The plots seem to be available in the OpenCPU API, see below:

[inline image: screenshot of plot resources exposed by the OpenCPU session API]

So the case is not that we're trying to process millions of images but rather that we would like to show the generated plots (like a regression plot) that's generated in R to the user.
There could be several plots generated by the code, but certainly not thousands or even hundreds, only a few.

Hope that this would be possible using ROSE because it seems a really good fit, thanks in advance,
Richard

On Wed, Jan 13, 2016 at 3:39 AM, David Russell <themarchoffo...@protonmail.com> wrote:

Hi Richard,

> Would it be possible to access the session API from within ROSE,
> to get for example the images that are generated by R / openCPU

Technically it would be possible, although there would be some potentially significant runtime costs per task in doing so, primarily those related to extracting image data from the R session, serializing and then moving that data across the cluster for each and every image.

From a design perspective ROSE was intended to be used within Spark scale applications where R object data was seen as the primary task output. An output in a format that could be rapidly serialized and easily processed. Are there real world use cases where Spark scale applications capable of generating 10k, 100k, or even millions of image files would actually need to capture and store images? If so, how, practically speaking, would these images ever be used? I'm just not sure. Maybe you could describe your own use case to provide some insights?

> and the logging to stdout that is logged by R?

If you are referring to the R console output (generated within the R session during the execution of an OCPUTask) then this data could certainly (optionally) be captured and returned on an OCPUResult. Again, can you provide any details for how you might use this console output in a real world application?

As an aside, for simple standalone Spark applications that will only ever run on a single host (no cluster) you could consider using an alternative library called fluent-r. This library is also available under my GitHub repo, see here: https://github.com/onetapbeyond/fluent-r.
The fluent-r library already has support for the retrieval of R objects, R console output and R graphics device images/plots. However it is not as lightweight as ROSE and is not designed to work in a clustered environment. ROSE on the other hand is designed for scale.

David

"All that is gold does not glitter,
 Not all those who wander are lost."

Original Message
Subject: Re: ROSE: Spark + R on the JVM.
Local Time: January 12 2016 6:56 pm
UTC Time: January 12 2016 11:56 pm
From: rsiebel...@gmail.com
To: m...@vijaykiran.com
CC: cjno...@gmail.com, themarchoffo...@protonmail.com, u...@spark.apache.org, dev@spark.apache.org

Hi,

this looks great and seems to be very usable. Would it be possible to access the session API from within ROSE, to get for example the images that are generated by R / openCPU and the logging to stdout that is logged by R?

thanks in advance,
Richard

On Tue, Jan 12, 2016 at 10:16 PM, Vijay Kiran <m...@vijaykiran.com> wrote:

I think it would be this: https:/
Re: Eigenvalue solver
(I don't know anything Spark specific, so I'm going to treat it like a Breeze question...) As I understand it, Spark uses ARPACK via Breeze for SVD, and presumably the same approach can be used for EVD. Basically, you make a function that multiplies your "matrix" (which might be represented implicitly/distributed, whatever) by a breeze.linalg.DenseVector. This is the Breeze implementation for sparse SVD (which is fully generic and might be hard to follow if you're not used to Breeze/typeclass-heavy Scala...) https://github.com/dlwh/breeze/blob/aa958688c428db581d853fd92eb35e82f80d8b5c/math/src/main/scala/breeze/linalg/functions/svd.scala#L205-L205 The difference between SVD and EVD in ARPACK (to a first approximation) is that you need to multiply by A.t * A * x for SVD, and just A * x for EVD. The basic idea is to implement a Breeze UFunc eig.Impl3 implicit following the svd code (or you could just copy out the body of the function and specialize it.) The signature you're looking to implement is: implicit def Eig_Sparse_Impl[Mat](implicit mul: OpMulMatrix.Impl2[Mat, DenseVector[Double], DenseVector[Double]], dimImpl: dim.Impl[Mat, (Int, Int)]) : eig.Impl3[Mat, Int, Double, EigenvalueResult] = { The type parameters of Impl3 are: the matrix type, the number of eigenvalues you want, a tolerance, and the result type. If you implement this signature, then you can call eig on anything that can be multiplied by a dense vector and that implements dim (to get the number of outputs). (You'll need to define the EigenvalueResult class to be what you want. I don't immediately know how to unpack ARPACK's answers, but you might look at this scipy thing: https://github.com/thomasnat1/cdcNewsRanker/blob/71b0ff3989d5191dc6a78c40c4a7a9967cbb0e49/venv/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py#L1049 ) I'm happy to help more if you decide to go this route, here, or on the scala-breeze google group, or on github. 
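The key requirement in the reply above is only a matrix-vector product: ARPACK never needs the matrix itself, just a function `x => A * x`. As a minimal, runnable illustration of that access pattern (this is plain power iteration, not ARPACK, and it finds only the dominant eigenpair):

```scala
import breeze.linalg.{DenseVector, norm}

// Power iteration against an opaque `x => A * x` function -- the same
// interface ARPACK's reverse-communication loop requires. The operator
// behind `mul` could be a distributed matrix, as long as it can
// multiply a DenseVector.
def powerIteration(mul: DenseVector[Double] => DenseVector[Double],
                   n: Int,
                   iters: Int = 200): (Double, DenseVector[Double]) = {
  var v = DenseVector.fill(n)(1.0 / math.sqrt(n))
  var lambda = 0.0
  for (_ <- 0 until iters) {
    val w = mul(v)
    lambda = v dot w        // Rayleigh-quotient estimate of the eigenvalue
    v = w / norm(w)
  }
  (lambda, v)
}
```

Swapping this loop for calls into ARPACK (as svd.scala does) is what the proposed eig.Impl3 implicit would encapsulate.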
-- David On Tue, Jan 12, 2016 at 10:28 AM, Lydia Ickler <ickle...@googlemail.com> wrote: > Hi, > > I wanted to know if there are any implementations yet within the Machine > Learning Library or generally that can efficiently solve eigenvalue > problems? > Or if not do you have suggestions on how to approach a parallel execution > maybe with BLAS or Breeze? > > Thanks in advance! > Lydia > > > Von meinem iPhone gesendet > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
ROSE: Spark + R on the JVM.
Hi all, I'd like to share news of the recent release of a new Spark package, [ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor). ROSE is a Scala library offering access to the full scientific computing power of the R programming language to Apache Spark batch and streaming applications on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is designed to let Scala and Java developers use R from Spark. The project is available and documented [on GitHub](https://github.com/onetapbeyond/opencpu-spark-executor) and I would encourage you to [take a look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, questions etc very welcome. David "All that is gold does not glitter, Not all those who wander are lost."
Re: ROSE: Spark + R on the JVM.
Hi Corey, > Would you mind providing a link to the github? Sure, here is the github link you're looking for: https://github.com/onetapbeyond/opencpu-spark-executor David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: Re: ROSE: Spark + R on the JVM. Local Time: January 12 2016 12:32 pm UTC Time: January 12 2016 5:32 pm From: cjno...@gmail.com To: themarchoffo...@protonmail.com CC: u...@spark.apache.org,dev@spark.apache.org David, Thank you very much for announcing this! It looks like it could be very useful. Would you mind providing a link to the github? On Tue, Jan 12, 2016 at 10:03 AM, David <themarchoffo...@protonmail.com> wrote: Hi all, I'd like to share news of the recent release of a new Spark package, ROSE. ROSE is a Scala library offering access to the full scientific computing power of the R programming language to Apache Spark batch and streaming applications on the JVM. Where Apache SparkR lets data scientists use Spark from R, ROSE is designed to let Scala and Java developers use R from Spark. The project is available and documented on GitHub and I would encourage you to take a look. Any feedback, questions etc very welcome. David "All that is gold does not glitter, Not all those who wander are lost."
Re: [discuss] dropping Python 2.6 support
FWIW, RHEL 6 still uses Python 2.6, although 2.7.8 and 3.3.2 are available through Red Hat Software Collections. See: https://www.softwarecollections.org/en/ I run an academic compute cluster on RHEL 6. We do, however, provide Python 2.7.x and 3.5.x via modulefiles. On Tue, Jan 5, 2016 at 8:45 AM, Nicholas Chammas <nicholas.cham...@gmail.com > wrote: > +1 > > Red Hat supports Python 2.6 on RHEL 5 until 2020 > <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>, but > otherwise yes, Python 2.6 is ancient history and the core Python developers > stopped supporting it in 2013. RHEL 5 is not a good enough reason to > continue support for Python 2.6 IMO. > > We should aim to support Python 2.7 and Python 3.3+ (which I believe we > currently do). > > Nick > > On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <allenzhang...@126.com> wrote: > >> plus 1, >> >> we are currently using python 2.7.2 in production environment. >> >> >> >> >> >> On 2016-01-05 18:11:45, "Meethu Mathew" <meethu.mat...@flytxt.com> wrote: >> >> +1 >> We use Python 2.7 >> >> Regards, >> >> Meethu Mathew >> >> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <r...@databricks.com> wrote: >> >>> Does anybody here care about us dropping support for Python 2.6 in Spark >>> 2.0? >>> >>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json >>> parsing) when compared with Python 2.7. Some libraries that Spark depends on >>> stopped supporting 2.6. We can still convince the library maintainers to >>> support 2.6, but it will be extra work. I'm curious if anybody still uses >>> Python 2.6 to run Spark. >>> >>> Thanks. >>> >>> >>> >> -- David Chin, Ph.D. david.c...@drexel.edu Sr. Systems Administrator, URCF, Drexel U. http://www.drexel.edu/research/urcf/ https://linuxfollies.blogspot.com/ +1.215.221.4747 (mobile) https://github.com/prehensilecode
Differing performance in self joins
I've noticed that two queries, which return identical results, have very different performance. I'd be interested in any hints about how to avoid problems like this. The DataFrame df contains a string field series and an integer eday, the number of days since (or before) the 1970-01-01 epoch. I'm doing some analysis over a sliding date window and, for now, avoiding UDAFs. I'm therefore using a self join. First, I create val laggard = df.withColumnRenamed("series", "p_series").withColumnRenamed("eday", "p_eday") Then, the following query runs in 16s: df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") === (laggard("p_eday") + 1))).count while the following query runs in 4 - 6 minutes: df.join(laggard, (df("series") === laggard("p_series")) && ((df("eday") - laggard("p_eday")) === 1)).count It's worth noting that the series term is necessary to keep the query from doing a complete cartesian product over the data. Ideally, I'd like to look at lags of more than one day, but the following is equally slow: df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") - laggard("p_eday")).between(1,7)).count Any advice about the general principle at work here would be welcome. Thanks, David -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Differing-performance-in-self-joins-tp13864.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
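A plausible explanation (an assumption about the planner, not verified against the optimizer source): Catalyst can only turn a conjunct into an equi-join key when each side of the `===` references exactly one of the two tables. A sketch of the difference, and of a workaround for multi-day lags:

```scala
// Both conjuncts compare a one-sided expression to a one-sided
// expression (`series` vs `p_series`, `eday` vs `p_eday + 1`), so the
// planner can hash/sort on both keys -- this is the fast query.
val fast = df.join(laggard,
  df("series") === laggard("p_series") &&
  df("eday")   === (laggard("p_eday") + 1))

// Here `df("eday") - laggard("p_eday")` mixes both sides inside one
// expression, so only `series` survives as a join key; the date
// predicate degrades to a per-row filter within each series group.
val slow = df.join(laggard,
  df("series") === laggard("p_series") &&
  (df("eday") - laggard("p_eday")) === 1)

// A range of lags has no single equality, but it can be expanded into
// explicit one-day equalities and unioned, keeping the fast shape:
val window7 = (1 to 7).map { k =>
  df.join(laggard,
    df("series") === laggard("p_series") &&
    df("eday")   === (laggard("p_eday") + k))
}.reduce(_ unionAll _)
```

The union approach trades one slow non-equi join for several fast equi-joins; whether that wins depends on the lag width and data size.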
Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?
sure. On Wed, Mar 18, 2015 at 12:19 AM, Debasish Das debasish.da...@gmail.com wrote: Hi David, We are stress testing breeze.optimize.proximal and nnls...if you are cutting a release now, we will need another release soon once we get the runtime optimizations in place and merged to breeze. Thanks. Deb On Mar 15, 2015 9:39 PM, David Hall david.lw.h...@gmail.com wrote: snapshot is pushed. If you verify I'll publish the new artifacts. On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: David Hall who is a breeze creator told me that it's a bug. So, I made a jira ticket about this issue. We need to upgrade breeze from 0.11.1 to 0.11.2 or later in order to fix the bug, when the new version of breeze will be released. [SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA https://issues.apache.org/jira/browse/SPARK-6341 Thanks, Yu Ishikawa - -- Yu Ishikawa -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11058.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?
ping? On Sun, Mar 15, 2015 at 9:38 PM, David Hall david.lw.h...@gmail.com wrote: snapshot is pushed. If you verify I'll publish the new artifacts. On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: David Hall who is a breeze creator told me that it's a bug. So, I made a jira ticket about this issue. We need to upgrade breeze from 0.11.1 to 0.11.2 or later in order to fix the bug, when the new version of breeze will be released. [SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA https://issues.apache.org/jira/browse/SPARK-6341 Thanks, Yu Ishikawa - -- Yu Ishikawa -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11058.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?
snapshot is pushed. If you verify I'll publish the new artifacts. On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa yuu.ishikawa+sp...@gmail.com wrote: David Hall who is a breeze creator told me that it's a bug. So, I made a jira ticket about this issue. We need to upgrade breeze from 0.11.1 to 0.11.2 or later in order to fix the bug, when the new version of breeze will be released. [SPARK-6341] Upgrade breeze from 0.11.1 to 0.11.2 or later - ASF JIRA https://issues.apache.org/jira/browse/SPARK-6341 Thanks, Yu Ishikawa - -- Yu Ishikawa -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Is-there-any-bugs-to-divide-a-Breeze-sparse-vectors-at-Spark-v1-3-0-rc3-tp11056p11058.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
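For context, the breeze operation that SPARK-6341 tracks is scalar division of a sparse vector. On a fixed breeze (0.11.2 or later) it should behave like this sketch (behavior inferred from the ticket; worth verifying against the breeze version you actually have on the classpath):

```scala
import breeze.linalg.SparseVector

// Element-wise division by a scalar should touch only the stored
// entries and preserve sparsity.
val v = SparseVector(5)(0 -> 2.0, 3 -> 4.0)
val halved = v / 2.0

assert(halved(0) == 1.0 && halved(3) == 2.0)
assert(halved(1) == 0.0)          // untouched implicit zeros stay zero
assert(halved.activeSize == 2)    // still only two stored entries
```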
Re: Implementing TinkerPop on top of GraphX
I am new to Spark and GraphX, however, I use Tinkerpop backed graphs and think the idea of using Tinkerpop as the API for GraphX is a great idea and hope you are still headed in that direction. I noticed that Tinkerpop 3 is moving into the Apache family: http://wiki.apache.org/incubator/TinkerPopProposal which might alleviate concerns about having an API definition outside of Spark. Thanks, -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Implementing-TinkerPop-on-top-of-GraphX-tp9169p10126.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: spark-yarn_2.10 1.2.0 artifacts
Thank you, Sean, using spark-network-yarn seems to do the trick. On 12/19/2014 12:13 PM, Sean Owen wrote: I believe spark-yarn does not exist from 1.2 onwards. Have a look at spark-network-yarn for where some of that went, I believe. On Fri, Dec 19, 2014 at 5:09 PM, David McWhorter mcwhor...@ccri.com wrote: Hi all, Thanks for your work on spark! I am trying to locate spark-yarn jars for the new 1.2.0 release. The jars for spark-core, etc, are on maven central, but the spark-yarn jars are missing. Confusingly and perhaps relatedly, I also can't seem to get the spark-yarn artifact to install on my local computer when I run 'mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean install'. At the install plugin stage, maven reports: [INFO] --- maven-install-plugin:2.5.1:install (default-install) @ spark-yarn_2.10 --- [INFO] Skipping artifact installation Any help or insights into how to use spark-yarn_2.10 1.2.0 in a maven build would be appreciated. David -- David McWhorter Software Engineer Commonwealth Computer Research, Inc. 1422 Sachem Place, Unit #1 Charlottesville, VA 22901 mcwhor...@ccri.com | 434.299.0090x204 - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org -- David McWhorter Software Engineer Commonwealth Computer Research, Inc. 1422 Sachem Place, Unit #1 Charlottesville, VA 22901 mcwhor...@ccri.com | 434.299.0090x204 - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
spark-yarn_2.10 1.2.0 artifacts
Hi all, Thanks for your work on spark! I am trying to locate spark-yarn jars for the new 1.2.0 release. The jars for spark-core, etc, are on maven central, but the spark-yarn jars are missing. Confusingly and perhaps relatedly, I also can't seem to get the spark-yarn artifact to install on my local computer when I run 'mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean install'. At the install plugin stage, maven reports: [INFO] --- maven-install-plugin:2.5.1:install (default-install) @ spark-yarn_2.10 --- [INFO] Skipping artifact installation Any help or insights into how to use spark-yarn_2.10 1.2.0 in a maven build would be appreciated. David -- David McWhorter Software Engineer Commonwealth Computer Research, Inc. 1422 Sachem Place, Unit #1 Charlottesville, VA 22901 mcwhor...@ccri.com | 434.299.0090x204 - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: EC2 clusters ready in launch time + 30 seconds
I agree with this - there is also the issue of different sized masters and slaves, and numbers of executors for hefty machines (e.g. r3.8xlarges), tagging of instances and volumes (we use this for cost attribution at my workplace), and running in VPCs. I think it might be useful to take a layered approach: the first step could be getting a good reliable image produced - Nick's ticket - then doing some work on the launch script. Regarding the EMR-like service - I think I heard that AWS is planning to add Spark support to EMR, but as usual there's nothing firm until it's released. On Tue, Oct 7, 2014 at 7:48 AM, Daniil Osipov daniil.osi...@shazam.com wrote: I've also been looking at this. Basically, the Spark EC2 script is excellent for small development clusters of several nodes, but isn't suitable for production. It handles instance setup in a single threaded manner, while it can easily be parallelized. It also doesn't handle failure well, e.g. when an instance fails to start or is taking too long to respond. Our desire was to have an equivalent to the Amazon EMR[1] API that would trigger Spark jobs, including specified cluster setup. I've done some work towards that end, and it would benefit from an updated AMI greatly. Dan [1] http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for posting that script, Patrick. It looks like a good place to start. Regarding Docker vs. Packer, as I understand it you can use Packer to create Docker containers at the same time as AMIs and other image types. Nick On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell pwend...@gmail.com wrote: Hey All, Just a couple notes. I recently posted a shell script for creating the AMI's from a clean Amazon Linux AMI. https://github.com/mesos/spark-ec2/blob/v3/create_image.sh I think I will update the AMI's soon to get the most recent security updates. 
For spark-ec2's purpose this is probably sufficient (we'll only need to re-create them every few months). However, it would be cool if someone wanted to tackle providing a more general mechanism for defining Spark-friendly images that can be used more generally. I had thought that docker might be a good way to go for something like this - but maybe this packer thing is good too. For one thing, if we had a standard image we could use it to create containers for running Spark's unit test, which would be really cool. This would help a lot with random issues around port and filesystem contention we have for unit tests. I'm not sure if the long term place for this would be inside the spark codebase or a community library or what. But it would definitely be very valuable to have if someone wanted to take it on. - Patrick On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: FYI: There is an existing issue -- SPARK-3314 https://issues.apache.org/jira/browse/SPARK-3314 -- about scripting the creation of Spark AMIs. With Packer, it looks like we may be able to script the creation of multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a single Packer template. That's very cool. I'll be looking into this. Nick On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for the update, Nate. I'm looking forward to seeing how these projects turn out. David, Packer looks very, very interesting. I'm gonna look into it more next week. Nick On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico n...@reactor8.com wrote: Bit of progress on our end, bit of lagging as well. Our guy leading effort got little bogged down on client project to update hive/sql testbed to latest spark/sparkSQL, also launching public service so we have been bit scattered recently. Will have some more updates probably after next week. 
We are planning on taking our client work around hive/spark, plus taking over the bigtop automation work, to modernize and get that fit for human consumption outside our org. All our work and puppet modules will be open sourced and documented; hopefully we'll start to rally some other folks around the effort who find it useful. Side note, another effort we are looking into is gradle tests/support. We have been leveraging serverspec for some basic infrastructure tests, but with bigtop switching over to a gradle builds/testing setup in 0.8 we want to include support for that in our own efforts, probably some stuff that can be learned and leveraged in the spark world for repeatable/tested infrastructure. If anyone has any specific automation questions to your environment you
Re: Breeze Library usage in Spark
yeah, breeze.storage.Zero was introduced in either 0.8 or 0.9. On Fri, Oct 3, 2014 at 9:45 AM, Xiangrui Meng men...@gmail.com wrote: Did you add a different version of breeze to the classpath? In Spark 1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze version you used is different from the one that comes with Spark, you might see class not found. -Xiangrui On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch learnings.chitt...@gmail.com wrote: Hi Team, When I am trying to use DenseMatrix of the breeze library in spark, it's throwing me the following error: java.lang.NoClassDefFoundError: breeze/storage/Zero Can someone help me on this ? Thanks, Padma Ch - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
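A sketch of the dependency alignment Xiangrui describes (version numbers taken from his message; adjust to whatever Spark you actually run):

```scala
// build.sbt -- keep breeze in sync with the version Spark bundles.
// Spark 1.0 shipped breeze 0.7; Spark 1.1 ships 0.9. Per David's note,
// breeze.storage.Zero only exists from 0.8/0.9 on, so an older breeze
// on the classpath of a Spark 1.1 job produces exactly this
// NoClassDefFoundError.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib" % "1.1.0" % "provided",
  "org.scalanlp"     %% "breeze"      % "0.9"
)
```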
Re: BlockManager issues
I've run into this with large shuffles - I assumed that there was contention between the shuffle output files and the JVM for memory. Whenever we start getting these fetch failures, it corresponds with high load on the machines the blocks are being fetched from, and in some cases complete unresponsiveness (no ssh etc). Setting the timeout higher, or the JVM heap lower (as a percentage of total machine memory), seemed to help. On Mon, Sep 22, 2014 at 8:02 PM, Christoph Sawade christoph.saw...@googlemail.com wrote: Hey all. We also had the same problem described by Nishkam, in almost the same big data setting. We fixed the fetch failure by increasing the timeout for acks in the driver: set("spark.core.connection.ack.wait.timeout", "600") // 10 minutes timeout for acks between nodes Cheers, Christoph 2014-09-22 9:24 GMT+02:00 Hortonworks zzh...@hortonworks.com: Actually I met a similar issue when doing groupByKey and then count, if the shuffle size is big, e.g. 1tb. Thanks. Zhan Zhang Sent from my iPhone On Sep 21, 2014, at 10:56 PM, Nishkam Ravi nr...@cloudera.com wrote: Thanks for the quick follow up Reynold and Patrick. Tried a run with significantly higher ulimit, doesn't seem to help. The executors have 35GB each. Btw, with a recent version of the branch, the error message is fetch failures as opposed to too many open files. Not sure if they are related. Please note that the workload runs fine with head set to 066765d. In case you want to reproduce the problem: I'm running a slightly modified ScalaPageRank (with KryoSerializer and persistence level memory_and_disk_ser) on a 30GB input dataset and a 6-node cluster. Thanks, Nishkam On Sun, Sep 21, 2014 at 10:32 PM, Patrick Wendell pwend...@gmail.com wrote: Ah I see it was SPARK-2711 (and PR1707). In that case, it's possible that you are just having more spilling as a result of the patch and so the filesystem is opening more files. I would try increasing the ulimit. How much memory do your executors have? 
- Patrick On Sun, Sep 21, 2014 at 10:29 PM, Patrick Wendell pwend...@gmail.com wrote: Hey the numbers you mentioned don't quite line up - did you mean PR 2711? On Sun, Sep 21, 2014 at 8:45 PM, Reynold Xin r...@databricks.com wrote: It seems like you just need to raise the ulimit? On Sun, Sep 21, 2014 at 8:41 PM, Nishkam Ravi nr...@cloudera.com wrote: Recently upgraded to 1.1.0. Saw a bunch of fetch failures for one of the workloads. Tried tracing the problem through change set analysis. Looks like the offending commit is 4fde28c from Aug 4th for PR1707. Please see SPARK-3633 for more details. Thanks, Nishkam -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
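The ack-timeout fix from Christoph's message, spelled out as a SparkConf sketch (property name exactly as given in the thread; it applies to the connection manager of that Spark era):

```scala
import org.apache.spark.SparkConf

// Give heavily loaded block managers more time to acknowledge fetch
// requests before the sender gives up and reports a fetch failure.
val conf = new SparkConf()
  .setAppName("shuffle-heavy-job")
  .set("spark.core.connection.ack.wait.timeout", "600") // seconds
```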
Source code for mining big data with Spark
Hi all, I watched an impressive Spark demo video by Reynold Xin and Aaron Davidson on youtube ( https://www.youtube.com/watch?v=FjhRkfAuU7I ). Can someone let me know where I can find the source code for the demo? I can't see the source code from the video clearly. Thanks in advance CONFIDENTIALITY CAUTION This e-mail and any attachments may be confidential or legally privileged. If you received this message in error or are not the intended recipient, you should destroy the e-mail message and any attachments or copies, and you are prohibited from retaining, distributing, disclosing or using any information contained herein. Please inform us of the erroneous delivery by return e-mail. Thank you for your cooperation. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Is breeze thread safe in Spark?
mutating operations are not thread safe. Operations that don't mutate should be thread safe. I can't speak to what Evan said, but I would guess that the way they're using += should be safe. On Wed, Sep 3, 2014 at 11:58 AM, RJ Nowling rnowl...@gmail.com wrote: David, Can you confirm that += is not thread safe but + is? I'm assuming + allocates a new object for the write, while += doesn't. Thanks! RJ On Wed, Sep 3, 2014 at 2:50 PM, David Hall d...@cs.berkeley.edu wrote: In general, in Breeze we allocate separate work arrays for each call to lapack, so it should be fine. In general concurrent modification isn't thread safe of course, but things that ought to be thread safe really should be. On Wed, Sep 3, 2014 at 10:41 AM, RJ Nowling rnowl...@gmail.com wrote: No, it's not in all cases. Since Breeze uses lapack under the hood, changes to memory between different threads is bad. There's actually a potential bug in the KMeans code where it uses += instead of +. On Wed, Sep 3, 2014 at 1:26 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: Hi, Is breeze library called thread safe from Spark mllib code in case when native libs for blas and lapack are used? Might it be an issue when running Spark locally? Best regards, Alexander - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org -- em rnowl...@gmail.com c 954.496.2314 -- em rnowl...@gmail.com c 954.496.2314
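David's distinction can be made concrete with a small sketch: `+=` writes into its left operand in place, while `+` allocates a fresh result.

```scala
import breeze.linalg.DenseVector

val shared = DenseVector.zeros[Double](3)

// NOT thread safe: += mutates `shared`'s backing array in place, so
// two threads calling this concurrently race on the same memory.
def accumulateInPlace(x: DenseVector[Double]): Unit = shared += x

// Thread safe: + returns a freshly allocated vector and writes to
// neither operand, so concurrent callers never share mutable state.
def add(a: DenseVector[Double], b: DenseVector[Double]): DenseVector[Double] =
  a + b
```

This is the trade-off behind the KMeans `+=` vs `+` question above: in-place accumulation avoids allocation but is only safe when each accumulator is confined to one thread.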
Re: Linear CG solver
I have no ideas on benchmarks, but breeze has a CG solver: https://github.com/scalanlp/breeze/tree/master/math/src/main/scala/breeze/optimize/linear/ConjugateGradient.scala https://github.com/scalanlp/breeze/blob/e2adad3b885736baf890b306806a56abc77a3ed3/math/src/test/scala/breeze/optimize/linear/ConjugateGradientTest.scala It's based on the code from TRON, and so I think it's more targeted for norm-constrained solutions of the CG problem. On Fri, Jun 27, 2014 at 5:54 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am looking for an efficient linear CG to be put inside the Quadratic Minimization algorithms we added for Spark mllib. With a good linear CG, we should be able to solve kernel SVMs with this solver in mllib... I use direct solves right now using cholesky decomposition which has higher complexity as matrix sizes become large... I found out some jblas example code: https://github.com/mikiobraun/jblas-examples/blob/master/src/CG.java I was wondering if mllib developers have any experience using this solver and if this is better than apache commons linear CG ? Thanks. Deb
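For reference alongside the breeze solver linked above: unconstrained linear CG for a symmetric positive-definite system is short enough to sketch directly. This is the textbook algorithm, not breeze's TRON-style norm-constrained variant.

```scala
import breeze.linalg.{DenseMatrix, DenseVector, norm}

// Unpreconditioned conjugate gradient solving A x = b, where A is
// symmetric positive definite. Each iteration needs one mat-vec.
def cg(A: DenseMatrix[Double], b: DenseVector[Double],
       tol: Double = 1e-10, maxIter: Int = 1000): DenseVector[Double] = {
  var x = DenseVector.zeros[Double](b.length)
  var r = b - A * x          // residual
  var p = r.copy             // search direction
  var rsOld = r dot r
  var i = 0
  while (math.sqrt(rsOld) > tol && i < maxIter) {
    val Ap = A * p
    val alpha = rsOld / (p dot Ap)
    x += p * alpha
    r -= Ap * alpha
    val rsNew = r dot r
    p = r + p * (rsNew / rsOld)
    rsOld = rsNew
    i += 1
  }
  x
}
```

For the kernel-SVM use case in the original question, `A * p` would be replaced by a (possibly distributed) kernel-matrix multiply, which is where CG beats a dense Cholesky solve as sizes grow.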
Re: mllib vector templates
On Mon, May 5, 2014 at 3:40 PM, DB Tsai dbt...@stanford.edu wrote: David, Could we use Int, Long, Float as the data feature spaces, and Double for optimizer? Yes. Breeze doesn't allow operations on mixed types, so you'd need to convert the double vectors to Floats if you wanted, e.g. dot product with the weights vector. You might also be interested in FeatureVector, which is just a wrapper around Array[Int] that emulates an indicator vector. It supports dot products, axpy, etc. -- David Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 5, 2014 at 3:06 PM, David Hall d...@cs.berkeley.edu wrote: Lbfgs and other optimizers would not work immediately, as they require vector spaces over double. Otherwise it should work. On May 5, 2014 3:03 PM, DB Tsai dbt...@stanford.edu wrote: Breeze could take any type (Int, Long, Double, and Float) in the matrix template. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 5, 2014 at 2:56 PM, Debasish Das debasish.da...@gmail.com wrote: Is this a breeze issue or breeze can take templates on float / double ? If breeze can take templates then it is a minor fix for Vectors.scala right ? Thanks. Deb On Mon, May 5, 2014 at 2:45 PM, DB Tsai dbt...@stanford.edu wrote: +1 Would be nice that we can use different type in Vector. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, May 5, 2014 at 2:41 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Why mllib vector is using double as default ? /** * Represents a numeric vector, whose index type is Int and value type is Double. */ trait Vector extends Serializable { /** * Size of the vector. */ def size: Int /** * Converts the instance to a double array. */ def toArray: Array[Double] Don't we need a template on float/double ? This will give us memory savings... Thanks. Deb
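To make the genericity discussed above concrete: breeze operations are parameterized by typeclasses, so code over Float features works today; it's the optimizers (vector spaces over Double, as David notes) that pin mllib to Double. A sketch:

```scala
import breeze.linalg.DenseVector
import breeze.math.Semiring

// Sums the entries of a vector over any element type with a Semiring
// instance -- the same mechanism that would let mllib offer Float
// feature vectors for the memory savings Deb is after.
def sumAll[T](v: DenseVector[T])(implicit ring: Semiring[T]): T =
  v.valuesIterator.foldLeft(ring.zero)((acc, x) => ring.+(acc, x))

val floats  = DenseVector(1.0f, 2.0f, 3.0f) // half the memory of Double
val doubles = DenseVector(1.0, 2.0, 3.0)
assert(sumAll(floats) == 6.0f)
assert(sumAll(doubles) == 6.0)
```

Mixed-type operations are the catch David mentions: before handing Float features to a Double-based optimizer, you'd convert explicitly.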
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Yeah, that's probably the easiest though obviously pretty hacky. I'm surprised that the hessian approximation isn't worse than it is. (As in, I'd expect error messages.) It's obviously line searching much more, so the approximation must be worse. You might be interested in this online bfgs: http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf -- David On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai dbt...@stanford.edu wrote: I have a quick hack to understand the behavior of SLBFGS (Stochastic-LBFGS) by overwriting the breeze iterations method to get the current LBFGS step, to ensure that the objective function is the same during the line search step. David, the following is my code; do you have a better way to inject into it? https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack A couple of findings: 1) miniBatch (using the rdd sample api) for each iteration is slower than full data training when the full data is cached. Probably because sample is not efficient in Spark. 2) Since in the line search steps, we use the same sample of data (the same objective function), the SLBFGS actually converges well. 3) For the news20 dataset, with 0.05 miniBatch size, it takes 14 SLBFGS steps (29 data iterations, 74.5 seconds) to converge to loss 0.10. For LBFGS with full data training, it takes 9 LBFGS steps (12 data iterations, 37.6 seconds) to converge to loss 0.10. It seems that as long as the noisy gradient happens in different SLBFGS steps, it still works. (ps, I also tried using a different sample of data in the line search step, and it just doesn't work as we expect.) Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Apr 28, 2014 at 8:55 AM, David Hall d...@cs.berkeley.edu wrote: That's right. FWIW, caching should be automatic now, but it might be the version of Breeze you're using doesn't do that yet. Also, In breeze.util._ there's an implicit that adds a tee method to iterator, and also a last method. 
Both are useful for things like this. -- David On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai dbt...@stanford.edu wrote: I think I figure it out. Instead of calling minimize, and record the loss in the DiffFunction, I should do the following. val states = lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector) states.foreach(state = lossHistory.append(state.value)) All the losses in states should be decreasing now. Am I right? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai dbt...@stanford.edu wrote: Also, how many failure of rejection will terminate the optimization process? How is it related to numberOfImprovementFailures? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai dbt...@stanford.edu wrote: Hi David, I'm recording the loss history in the DiffFunction implementation, and that's why the rejected step is also recorded in my loss history. Is there any api in Breeze LBFGS to get the history which already excludes the reject step? Or should I just call iterations method and check iteratingShouldStop instead? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Apr 25, 2014 at 3:10 PM, David Hall d...@cs.berkeley.eduwrote: LBFGS will not take a step that sends the objective value up. It might try a step that is too big and reject it, so if you're just logging everything that gets tried by LBFGS, you could see that. The iterations method of the minimizer should never return an increasing objective value. If you're regularizing, are you including the regularizer in the objective value computation? GD is almost never worth your time. -- David On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai dbt...@stanford.edu wrote: Another interesting benchmark. 
*News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.* LBFGS converges in 70 seconds, while GD seems to be making no progress. The dense feature vectors would be too big to fit in memory, so I only conducted the sparse benchmark. I saw that sometimes the loss bumps up, which is weird to me. Since the cost function of logistic regression is convex, it should be monotonically decreasing. David, any suggestion? The detailed figure: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.* LBFGS converges in 25 seconds, while GD also seems to be making no progress
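DB's finding (2) above hinges on every objective evaluation inside one line search seeing the exact same mini-batch, which the thread's Spark code achieves by fixing the sampling seed per outer iteration (data.sample(false, miniBatchFraction, 42 + i)). A minimal standalone sketch of that idea, with a hypothetical sample helper (not Spark's API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SeededSample {
    // Bernoulli sampling with a fixed seed: the same (data, fraction, seed)
    // triple always yields the same subset, so every objective/gradient
    // evaluation inside one line search sees an identical mini-batch.
    static List<Double> sample(List<Double> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<Double> out = new ArrayList<>();
        for (double x : data) {
            if (rng.nextDouble() < fraction) out.add(x);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> data = new ArrayList<>();
        for (int i = 0; i < 1000; i++) data.add((double) i);
        int iteration = 3;
        long seed = 42 + iteration; // same per-iteration seed convention as the thread's runMiniBatchSGD code
        List<Double> a = sample(data, 0.05, seed);
        List<Double> b = sample(data, 0.05, seed); // a "re-evaluation" during line search
        System.out.println(a.equals(b)); // true: identical mini-batch both times
    }
}
```

A fresh seed per outer SLBFGS step (42 + i) then reintroduces gradient noise across steps, which DB's finding (3) suggests is tolerable.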
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
That's right. FWIW, caching should be automatic now, but it might be that the version of Breeze you're using doesn't do that yet. Also, in breeze.util._ there's an implicit that adds a tee method to iterator, and also a last method. Both are useful for things like this. -- David On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai dbt...@stanford.edu wrote: I think I figured it out. Instead of calling minimize and recording the loss in the DiffFunction, I should do the following. val states = lbfgs.iterations(new CachedDiffFunction(costFun), initialWeights.toBreeze.toDenseVector) states.foreach(state => lossHistory.append(state.value)) All the losses in states should be decreasing now. Am I right? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai dbt...@stanford.edu wrote: Also, how many rejection failures will terminate the optimization process? How is that related to numberOfImprovementFailures? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai dbt...@stanford.edu wrote: Hi David, I'm recording the loss history in the DiffFunction implementation, and that's why the rejected steps are also recorded in my loss history. Is there any API in Breeze LBFGS to get the history that already excludes the rejected steps? Or should I just call the iterations method and check iteratingShouldStop instead? Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Apr 25, 2014 at 3:10 PM, David Hall d...@cs.berkeley.edu wrote: LBFGS will not take a step that sends the objective value up. It might try a step that is too big and reject it, so if you're just logging everything that gets tried by LBFGS, you could see that. The iterations method of the minimizer should never return an increasing objective value. 
If you're regularizing, are you including the regularizer in the objective value computation? GD is almost never worth your time. -- David On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai dbt...@stanford.edu wrote: Another interesting benchmark. *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.* LBFGS converges in 70 seconds, while GD seems to be making no progress. The dense feature vectors would be too big to fit in memory, so I only conducted the sparse benchmark. I saw that sometimes the loss bumps up, which is weird to me. Since the cost function of logistic regression is convex, it should be monotonically decreasing. David, any suggestion? The detailed figure: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.* LBFGS converges in 25 seconds, while GD also seems to be making no progress. I only conducted the sparse benchmark for the same reason. I also saw the loss bump up, for unknown reasons. The detailed figure: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai dbt...@stanford.edu wrote: rcv1.binary is too sparse (0.15% non-zero elements), so the dense format will not run; it goes out of memory. But the sparse format runs really well. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai dbt...@stanford.edu wrote: I start the timer in runMiniBatchSGD after val numExamples = data.count() See the following. Running the rcv1 dataset now, and will update soon. 
val startTime = System.nanoTime()
for (i <- 1 to numIterations) {
  // Sample a subset (fraction miniBatchFraction) of the total data;
  // compute and sum up the subgradients on this subset (this is one map-reduce)
  val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
    .aggregate((BDV.zeros[Double](weights.size), 0.0))(
      seqOp = (c, v) => (c, v) match {
        case ((grad, loss), (label, features)) =>
          val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
          (grad, loss + l)
      },
      combOp = (c1, c2) => (c1, c2) match {
        case ((grad1, loss1), (grad2, loss2)) =>
          (grad1 += grad2, loss1 + loss2)
      })
  /**
   * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
   * and regVal
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
LBFGS will not take a step that sends the objective value up. It might try a step that is too big and reject it, so if you're just logging everything that gets tried by LBFGS, you could see that. The iterations method of the minimizer should never return an increasing objective value. If you're regularizing, are you including the regularizer in the objective value computation? GD is almost never worth your time. -- David On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai dbt...@stanford.edu wrote: Another interesting benchmark. *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.* LBFGS converges in 70 seconds, while GD seems to be making no progress. The dense feature vectors would be too big to fit in memory, so I only conducted the sparse benchmark. I saw that sometimes the loss bumps up, which is weird to me. Since the cost function of logistic regression is convex, it should be monotonically decreasing. David, any suggestion? The detailed figure: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.* LBFGS converges in 25 seconds, while GD also seems to be making no progress. I only conducted the sparse benchmark for the same reason. I also saw the loss bump up, for unknown reasons. The detailed figure: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai dbt...@stanford.edu wrote: rcv1.binary is too sparse (0.15% non-zero elements), so the dense format will not run; it goes out of memory. But the sparse format runs really well. 
Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai dbt...@stanford.edu wrote: I start the timer in runMiniBatchSGD after val numExamples = data.count() See the following. Running the rcv1 dataset now, and will update soon.
val startTime = System.nanoTime()
for (i <- 1 to numIterations) {
  // Sample a subset (fraction miniBatchFraction) of the total data;
  // compute and sum up the subgradients on this subset (this is one map-reduce)
  val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
    .aggregate((BDV.zeros[Double](weights.size), 0.0))(
      seqOp = (c, v) => (c, v) match {
        case ((grad, loss), (label, features)) =>
          val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
          (grad, loss + l)
      },
      combOp = (c1, c2) => (c1, c2) match {
        case ((grad1, loss1), (grad2, loss2)) =>
          (grad1 += grad2, loss1 + loss2)
      })
  /**
   * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
   * and regVal is the regularization value computed in the previous iteration as well.
   */
  stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
  val update = updater.compute(
    weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
  weights = update._1
  regVal = update._2
  timeStamp.append(System.nanoTime() - startTime)
}
Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng men...@gmail.com wrote: I don't understand why sparse falls behind dense so much at the very first iteration. I didn't see count() called in https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala . Maybe you have local uncommitted changes. 
Best, Xiangrui On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai dbt...@stanford.edu wrote: Hi Xiangrui, Yes, I'm using yarn-cluster mode, and I did check that the # of executors I specified is the same as the number actually running. For caching and materialization, I have the timer in the optimizer after calling count(); as a result, the time for materialization into cache isn't in the benchmark. The difference you saw is actually from dense versus sparse feature vectors. For LBFGS and GD with dense features, you can see the first iteration takes the same time. It's true for GD. I'm going to run rcv1.binary, which only has 0.15% non-zero elements, to verify the hypothesis. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng men...@gmail.com wrote
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
On Wed, Apr 23, 2014 at 9:30 PM, Evan Sparks evan.spa...@gmail.com wrote: What is the number of non-zeros per row (and number of features) in the sparse case? We've hit some issues with Breeze sparse support in the past, but for sufficiently sparse data it's still pretty good. Any chance you remember what the problems were? I'm sure it could be better, but it's good to know where improvements need to happen. -- David On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu wrote: Hi all, I'm benchmarking logistic regression in MLlib using the newly added optimizers LBFGS and GD. I'm using the same dataset and the same methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf I want to know how Spark scales as workers are added, and how the optimizers and input format (sparse or dense) impact performance. The benchmark code can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated the dataset to make it 762MB, with 11M rows. This dataset has 123 features and 11% of the data are non-zero elements. In this benchmark, the whole dataset is cached in memory. As we expected, LBFGS converges faster than GD, and at some point, no matter how we push GD, it converges slower and slower. However, it's surprising that the sparse format runs slower than the dense format. I did see that the sparse format takes a significantly smaller amount of memory when caching the RDD, but sparse is 40% slower than dense. I think sparse should be faster: when we compute x wT, since x is sparse, we can do it faster. I wonder if there is anything I'm doing wrong. The attachment is the benchmark result. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Was the weight vector sparse? The gradients? Or just the feature vectors? On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai dbt...@dbtsai.com wrote: The figure showing the log-likelihood vs. time can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf Let me know if you cannot open it. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I don't think the attachment came through on the list. Could you upload the results somewhere and link to them? On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai dbt...@dbtsai.com wrote: 123 features per row, and on average, 89% are zeros. On Apr 23, 2014 9:31 PM, Evan Sparks evan.spa...@gmail.com wrote: What is the number of non-zeros per row (and number of features) in the sparse case? We've hit some issues with Breeze sparse support in the past, but for sufficiently sparse data it's still pretty good. On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu wrote: Hi all, I'm benchmarking logistic regression in MLlib using the newly added optimizers LBFGS and GD. I'm using the same dataset and the same methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf I want to know how Spark scales as workers are added, and how the optimizers and input format (sparse or dense) impact performance. The benchmark code can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated the dataset to make it 762MB, with 11M rows. This dataset has 123 features and 11% of the data are non-zero elements. In this benchmark, the whole dataset is cached in memory. As we expected, LBFGS converges faster than GD, and at some point, no matter how we push GD, it converges slower and slower. However, it's surprising that the sparse format runs slower than the dense format. 
I did see that the sparse format takes a significantly smaller amount of memory when caching the RDD, but sparse is 40% slower than dense. I think sparse should be faster: when we compute x wT, since x is sparse, we can do it faster. I wonder if there is anything I'm doing wrong. The attachment is the benchmark result. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
On Wed, Apr 23, 2014 at 10:18 PM, DB Tsai dbt...@dbtsai.com wrote: PS: it doesn't make sense to have the weights and gradient sparse unless there is a strong L1 penalty. Sure, I was just checking the obvious things. Have you run it through a profiler to see where the problem is? Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Apr 23, 2014 at 10:17 PM, DB Tsai dbt...@dbtsai.com wrote: In MLlib, the weights and gradient are dense. Only the features are sparse. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Apr 23, 2014 at 10:16 PM, David Hall d...@cs.berkeley.edu wrote: Was the weight vector sparse? The gradients? Or just the feature vectors? On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai dbt...@dbtsai.com wrote: The figure showing the log-likelihood vs. time can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf Let me know if you cannot open it. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: I don't think the attachment came through on the list. Could you upload the results somewhere and link to them? On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai dbt...@dbtsai.com wrote: 123 features per row, and on average, 89% are zeros. On Apr 23, 2014 9:31 PM, Evan Sparks evan.spa...@gmail.com wrote: What is the number of non-zeros per row (and number of features) in the sparse case? We've hit some issues with Breeze sparse support in the past, but for sufficiently sparse data it's still pretty good. On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu wrote: Hi all, I'm benchmarking logistic regression in MLlib using the newly added optimizers LBFGS and GD. 
I'm using the same dataset and the same methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf I want to know how Spark scales as workers are added, and how the optimizers and input format (sparse or dense) impact performance. The benchmark code can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated the dataset to make it 762MB, with 11M rows. This dataset has 123 features and 11% of the data are non-zero elements. In this benchmark, the whole dataset is cached in memory. As we expected, LBFGS converges faster than GD, and at some point, no matter how we push GD, it converges slower and slower. However, it's surprising that the sparse format runs slower than the dense format. I did see that the sparse format takes a significantly smaller amount of memory when caching the RDD, but sparse is 40% slower than dense. I think sparse should be faster: when we compute x wT, since x is sparse, we can do it faster. I wonder if there is anything I'm doing wrong. The attachment is the benchmark result. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai
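The sparse-vs-dense tradeoff being debated above comes down to how the dot product x · w is computed. A minimal sketch (hypothetical helper names, not MLlib's or Breeze's API): the sparse kernel touches only the stored non-zeros but pays an index-indirection cost per entry, which can outweigh the skipped work when the data is only moderately sparse (a9a is 11% non-zero), while very sparse data (news20 at 0.034%) clearly favors it.

```java
public class DotProducts {
    // Dense dot product: touches every coordinate.
    static double dense(double[] x, double[] w) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += x[i] * w[i];
        return s;
    }

    // Sparse dot product over (indices[k], values[k]) pairs: touches only the
    // stored non-zeros, at the price of one extra index lookup per entry.
    static double sparse(int[] indices, double[] values, double[] w) {
        double s = 0.0;
        for (int k = 0; k < indices.length; k++) s += values[k] * w[indices[k]];
        return s;
    }

    public static void main(String[] args) {
        double[] w = {1.0, 2.0, 3.0, 4.0};
        double[] xDense = {0.0, 5.0, 0.0, 6.0};
        int[] idx = {1, 3};       // sparse encoding of xDense
        double[] val = {5.0, 6.0};
        System.out.println(dense(xDense, w));    // 34.0
        System.out.println(sparse(idx, val, w)); // 34.0, but only 2 multiplies instead of 4
    }
}
```

The dense loop is also much friendlier to the CPU (sequential access, easy vectorization), which is one plausible reason the 11%-dense a9a benchmark came out 40% slower in sparse format.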
Re: RFC: varargs in Logging.scala?
Another usage that's nice is: logDebug { val timeS = timeMillis / 1000.0; s"Time: $timeS" } which can be useful for more complicated expressions. On Thu, Apr 10, 2014 at 5:55 PM, Michael Armbrust mich...@databricks.com wrote: BTW... you can do calculations in string interpolation: s"Time: ${timeMillis / 1000}s" Or use format strings: f"Float with two decimal places: $floatValue%.2f" More info: http://docs.scala-lang.org/overviews/core/string-interpolation.html On Thu, Apr 10, 2014 at 5:46 PM, Michael Armbrust mich...@databricks.com wrote: Hi Marcelo, Thanks for bringing this up here, as this has been a topic of debate recently. Some thoughts below. ... all of them suffer from the fact that the log message needs to be built even though it might not be used. This is not true of the current implementation (and this is actually why Spark has a logging trait instead of just using a logger directly). If you look at the original function signatures: protected def logDebug(msg: => String) ... The => implies that we are passing msg by name instead of by value. Under the covers, Scala is creating a closure that can be used to calculate the log message, only if it's actually required. This does result in a significant performance improvement, but it still requires allocating an object for the closure. The bytecode is really something like this: val logMessage = new Function0[String]() { def apply() = "Log message " + someExpensiveComputation() }; log.debug(logMessage) In Catalyst and Spark SQL we are using the scala-logging package, which uses macros to automatically rewrite all of your log statements. You write: logger.debug(s"Log message $someExpensiveComputation") You get: if (logger.debugEnabled) { val logMsg = "Log message " + someExpensiveComputation(); logger.debug(logMsg) } IMHO, this is the cleanest option (and is supported by Typesafe). 
Based on a micro-benchmark, it is also the fastest: std logging: 19885.48ms; spark logging: 914.408ms; scala logging: 729.779ms. Once the dust settles from the 1.0 release, I'd be in favor of standardizing on scala-logging. Michael
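The by-name-parameter mechanism Michael describes (a compiler-generated Function0 closure that is only invoked when the log level is enabled) has a direct analog in Java's Supplier. A minimal sketch with a hypothetical LazyLogger class, just to make the "message never built when disabled" behavior observable:

```java
import java.util.function.Supplier;

public class LazyLogger {
    static int expensiveCalls = 0; // counts how often the message body is actually built
    final boolean debugEnabled;

    LazyLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

    static String expensiveComputation() {
        expensiveCalls++;
        return "details";
    }

    // Analog of Spark's `logDebug(msg: => String)`: the Supplier plays the role
    // of Scala's compiler-generated Function0 closure. It is only invoked --
    // i.e. the message is only built -- when debug logging is enabled.
    void logDebug(Supplier<String> msg) {
        if (debugEnabled) System.out.println("DEBUG " + msg.get());
    }

    public static void main(String[] args) {
        LazyLogger off = new LazyLogger(false);
        off.logDebug(() -> "Log message " + expensiveComputation());
        System.out.println(expensiveCalls); // 0: never built when disabled

        LazyLogger on = new LazyLogger(true);
        on.logDebug(() -> "Log message " + expensiveComputation());
        System.out.println(expensiveCalls); // 1: built exactly once when enabled
    }
}
```

Note the closure object itself is still allocated on every call, which is exactly the residual cost the scala-logging macro approach eliminates by inlining the `if (debugEnabled)` check at the call site.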
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
On Sun, Mar 30, 2014 at 2:01 PM, Debasish Das debasish.da...@gmail.com wrote: Hi David, I have started to experiment with BFGS solvers for Spark GLM over large-scale data... I am also looking to add a good QP solver in Breeze that can be used in Spark ALS for constrained solves... More details on that soon... I could not load up the Breeze 0.7 code in Eclipse... There is a folder called natives in master but there is no code in it; all the code is in src/main/scala... I added the Eclipse plugins: addSbtPlugin("com.github.mpeltonen" % "sbt-idea" % "1.6.0") addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.2.0") But it seems the project is set up to use IDEA... Could you please explain the dev methodology for Breeze? My idea is to do the solver work in Breeze, as that's the right place, and get it into Spark through Xiangrui's WIP on sparse data and Breeze support... It would be great to have a QP solver: I don't know if you know about this library: http://www.joptimizer.com/ I'm not quite sure what you mean by dev methodology. If you just mean how to get code into Breeze, just send a PR to scalanlp/breeze. Unit tests are good for something nontrivial like this. Maybe some basic documentation. Thanks. Deb On Fri, Mar 7, 2014 at 12:46 AM, DB Tsai dbt...@alpinenow.com wrote: Hi Xiangrui, I think it doesn't matter whether we use Fortran/Breeze/RISO for the optimizer, since optimization only takes 1% of the time. Most of the time is in the gradientSum and lossSum parallel computation. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Thu, Mar 6, 2014 at 7:10 PM, Xiangrui Meng men...@gmail.com wrote: Hi DB, Thanks for doing the comparison! What were the running times for Fortran/Breeze/RISO? Best, Xiangrui On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote: Hi David, I can converge to the same result with your Breeze LBFGS and the Fortran implementation now. Probably I made some mistakes when I tried Breeze before. 
I apologize that I claimed it's not stable. See the test case in BreezeLBFGSSuite.scala: https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS This trains multinomial logistic regression against the iris dataset, and both optimizers can train models with 98% training accuracy. There are two issues with using Breeze in Spark: 1) When the gradientSum and lossSum are computed distributively in the custom DiffFunction that is passed into your optimizer, Spark complains that the LBFGS class is not serializable. In BreezeLBFGS.scala, I had to convert the RDD to an array to make it work locally. It should be easy to fix by just having LBFGS implement Serializable. 2) Breeze computes redundant gradients and losses. See the following logs from the Fortran and Breeze implementations. Thanks.
Fortran:
Iteration -1: loss 1.3862943611198926, diff 1.0
Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627
Breeze:
Iteration -1: loss 1.3862943611198926, diff 1.0
Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS <clinit> WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS <clinit> WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Iteration 0: loss 1.3862943611198926, diff 0.0
Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352 
Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126 Iteration 3: loss 1.1242501524477688, diff 0.0 Iteration 4: loss 1.1242501524477688, diff 0.0 Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336 Iteration 6: loss 1.0930151243303563, diff 0.0 Iteration 7: loss 1.0930151243303563, diff 0.0 Iteration 8: loss 1.054036932835569, diff 0.03566113127440601 Iteration 9: loss 1.054036932835569, diff 0.0 Iteration 10: loss 1.054036932835569, diff 0.0 Iteration 11: loss 0.9907956302751622, diff 0.0507649459571 Iteration 12: loss 0.9907956302751622, diff 0.0 Iteration 13: loss 0.9907956302751622, diff 0.0 Iteration 14: loss 0.9184205380342829, diff
Re: Making RDDs Covariant
On Sat, Mar 22, 2014 at 8:59 AM, Pascal Voitot Dev pascal.voitot@gmail.com wrote: The problem I was talking about is when you try to use typeclass converters and make them contravariant/covariant for input/output. Something like: trait Reader[-I, +O] { def read(i: I): O } Doing this, you soon have implicit collisions and philosophical concerns about what it means to serialize/deserialize a Parent class and a Child class... You should (almost) never make a typeclass parameter contravariant. It's almost certainly not what you want: https://issues.scala-lang.org/browse/SI-2509 -- David
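The contravariance Pascal describes can be made concrete in Java, where variance is expressed per use site with wildcards rather than with Scala's declaration-site -I/+O annotations. A minimal sketch (the Reader interface mirrors the thread's Reader[-I, +O]; all names are illustrative):

```java
public class Variance {
    // Java analog of the thread's `Reader[-I, +O]`: Java generics are
    // invariant at the declaration site, so the -I/+O variance is written
    // per use site with `? super` / `? extends` wildcards.
    interface Reader<I, O> { O read(I in); }

    // Accepts any reader that can consume an Integer (contravariant input)
    // and produce some CharSequence (covariant output).
    static CharSequence apply(Reader<? super Integer, ? extends CharSequence> r, int x) {
        return r.read(x);
    }

    public static void main(String[] args) {
        // A reader of the *parent* type Number is usable where an
        // Integer-reader is expected -- exactly the substitutability that,
        // when the typeclass parameter itself is marked -I in Scala, lets
        // multiple instances apply at once and causes the implicit-resolution
        // ambiguity David warns about (SI-2509).
        Reader<Number, String> numberReader = n -> "n=" + n;
        System.out.println(apply(numberReader, 7)); // n=7
    }
}
```

In Java the caller picks the variance explicitly, so there is no instance-resolution step to become ambiguous; in Scala, baking -I into the typeclass makes every parent-type instance compete for every child type.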
graphx samples in Java
Hi, Where can I find the equivalent of the GraphX examples (http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html#examples) in Java? For example, how does the following translate to Java? val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica", "prof")))) thanks --david
Code documentation
Is there any documentation available that explains the code architecture that can help a new Spark framework developer?
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote: Hi David, I can converge to the same result with your Breeze LBFGS and the Fortran implementation now. Probably I made some mistakes when I tried Breeze before. I apologize that I claimed it's not stable. See the test case in BreezeLBFGSSuite.scala: https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS This trains multinomial logistic regression against the iris dataset, and both optimizers can train models with 98% training accuracy. great to hear! There were some bugs in LBFGS about 6 months ago, so depending on the last time you tried it, it might indeed have been bugged. There are two issues with using Breeze in Spark: 1) When the gradientSum and lossSum are computed distributively in the custom DiffFunction that is passed into your optimizer, Spark complains that the LBFGS class is not serializable. In BreezeLBFGS.scala, I had to convert the RDD to an array to make it work locally. It should be easy to fix by just having LBFGS implement Serializable. I'm not sure why Spark should be serializing LBFGS? Shouldn't it live on the controller node? Or is this a per-node thing? But no problem to make it serializable. 2) Breeze computes redundant gradients and losses. See the following logs from the Fortran and Breeze implementations. Err, yeah. I should probably have LBFGS do this automatically, but there's a CachedDiffFunction that gets rid of the redundant calculations. -- David Thanks. 
Fortran: Iteration -1: loss 1.3862943611198926, diff 1.0 Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352 Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126 Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336 Iteration 3: loss 1.054036932835569, diff 0.03566113127440601 Iteration 4: loss 0.9907956302751622, diff 0.0507649459571 Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761 Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982 Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716 Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277 Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075 Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627 Breeze: Iteration -1: loss 1.3862943611198926, diff 1.0 Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS Iteration 0: loss 1.3862943611198926, diff 0.0 Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352 Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126 Iteration 3: loss 1.1242501524477688, diff 0.0 Iteration 4: loss 1.1242501524477688, diff 0.0 Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336 Iteration 6: loss 1.0930151243303563, diff 0.0 Iteration 7: loss 1.0930151243303563, diff 0.0 Iteration 8: loss 1.054036932835569, diff 0.03566113127440601 Iteration 9: loss 1.054036932835569, diff 0.0 Iteration 10: loss 1.054036932835569, diff 0.0 Iteration 11: loss 0.9907956302751622, diff 0.0507649459571 Iteration 12: loss 0.9907956302751622, diff 0.0 Iteration 13: loss 0.9907956302751622, diff 0.0 Iteration 14: loss 0.9184205380342829, diff 0.07304737423337761 Iteration 15: loss 0.9184205380342829, diff 0.0 Iteration 16: loss 0.9184205380342829, diff 0.0 Iteration 
17: loss 0.8259870936519939, diff 0.1006438117513297 Iteration 18: loss 0.8259870936519939, diff 0.0 Iteration 19: loss 0.8259870936519939, diff 0.0 Iteration 20: loss 0.6327447552109576, diff 0.233952934583647 Iteration 21: loss 0.6327447552109576, diff 0.0 Iteration 22: loss 0.6327447552109576, diff 0.0 Iteration 23: loss 0.5534101162436362, diff 0.12538154276652747 Iteration 24: loss 0.5534101162436362, diff 0.0 Iteration 25: loss 0.5534101162436362, diff 0.0 Iteration 26: loss 0.40450200866125635, diff 0.2690732137675816 Iteration 27: loss 0.40450200866125635, diff 0.0 Iteration 28: loss 0.40450200866125635, diff 0.0 Iteration 29: loss 0.30788249908237314, diff 0.23885980452569502 Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/ On Wed, Mar 5, 2014 at 2:00 PM, David Hall d...@cs.berkeley.edu wrote: On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote: Hi David, On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote: I'm happy to help fix any problems. I've verified at points that the implementation gives the exact same sequence of iterates for a few different functions (with a particular line search) as the c port of lbfgs. So I'm a little surprised it fails where Fortran succeeds... but only a little. This was fixed late last year. I'm working
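The CachedDiffFunction David points to explains the repeated identical loss lines in DB's Breeze log: the line search re-evaluates points the optimizer has already seen, and caching the last evaluation makes those repeats free. A minimal sketch of that idea in Java (memoizing only the last point and its value; Breeze's actual CachedDiffFunction also caches the gradient -- all names here are illustrative):

```java
import java.util.Arrays;

public class CachedObjective {
    static int rawCalls = 0; // counts real evaluations of the objective

    // The underlying (expensive) objective: f(x) = sum of x_i^2.
    static double raw(double[] x) {
        rawCalls++;
        double s = 0.0;
        for (double v : x) s += v * v;
        return s;
    }

    // Remember the last point and its value, so re-evaluating the same point
    // (as a line search often does) costs nothing.
    static double[] lastX = null;
    static double lastF = 0.0;

    static double cached(double[] x) {
        if (lastX == null || !Arrays.equals(lastX, x)) {
            lastX = x.clone();
            lastF = raw(x);
        }
        return lastF;
    }

    public static void main(String[] args) {
        double[] p = {1.0, 2.0};
        System.out.println(cached(p)); // 5.0 (computed)
        System.out.println(cached(p)); // 5.0 (served from the cache)
        System.out.println(rawCalls);  // 1: the second call did no real work
    }
}
```

With this wrapper, the "Iteration k: loss ..., diff 0.0" duplicates in the Breeze log would no longer cost a full distributed gradient/loss pass.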
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
On Wed, Mar 5, 2014 at 8:50 AM, Debasish Das debasish.da...@gmail.com wrote: Hi David, A few questions on Breeze solvers: 1. I feel the right place for adding useful things from RISO LBFGS (based on Professor Nocedal's Fortran code) is Breeze. It will involve stress-testing Breeze LBFGS on large sparse datasets and contributing fixes to the existing Breeze LBFGS from what we learn from RISO LBFGS. You agree on that, right? That would be great. 2. Normally for experiments like 1, I fix the line search. What's your preferred line search in Breeze BFGS? I will also use that. Moré-Thuente and strong Wolfe with backtracking have helped me in the past. The default is a strong Wolfe search: https://github.com/scalanlp/breeze/blob/master/src/main/scala/breeze/optimize/StrongWolfe.scala 3. The paper that you sent also says L-BFGS-B is better on box constraints. I feel it's worthwhile to have both solvers because many practical problems need box constraints, or have complex constraints that can be reformulated as unconstrained or box-constrained problems. Example use cases for me are automatic feature extraction from photo/video frames using factorization. I will compare L-BFGS-B vs. the constrained QN method that you have in Breeze in an analysis similar to 1. Great. Yeah, I fully expect L-BFGS-B to be better on box constraints. It turns out that this other method looked easier to implement and gave reasonably good results even with box constraints. 4. Do you have a roadmap for adding CG solvers in Breeze? A linear CG solver to compare with BLAS posv seems like a good use case for me in MLlib ALS. DB sent a paper by Professor Ng in the email chain which shows the effectiveness of CG and BFGS over SGD. We have a linear CG solver (https://github.com/scalanlp/breeze/blob/master/src/main/scala/breeze/optimize/linear/ConjugateGradient.scala) which is used in the truncated Newton solver. 
( https://github.com/scalanlp/breeze/blob/master/src/main/scala/breeze/optimize/TruncatedNewtonMinimizer.scala) But I've not implemented nonlinear CG, and honestly hadn't planned on it, per below. I believe on non-convex problems like Matrix Factorization, CG family might converge to a better solution than BFGS. No way to know till we experiment on the datasets :-) I've never had good experience with CG, and a vision colleague of mine found that LBFGS was better on his (non-convex) stuff, but I could be easily persuaded. Thanks! -- David Thanks. Deb On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote: Just subscribing to this list, so apologies for quoting weirdly and any other etiquette offenses. DB Tsai wrote Hi Deb, I had tried breeze L-BFGS algorithm, and when I tried it couple weeks ago, it's not as stable as the fortran implementation. I guessed the problem is in the line search related thing. Since we may bring breeze dependency for the sparse format support as you pointed out, we can just try to fix the L-BFGS in breeze, and we can get OWL-QN and L-BFGS-B. What do you think? I'm happy to help fix any problems. I've verified at points that the implementation gives the exact same sequence of iterates for a few different functions (with a particular line search) as the c port of lbfgs. So I'm a little surprised it fails where Fortran succeeds... but only a little. This was fixed late last year. OWL-QN seems to mostly be stable, but probably deserves more testing. Presumably it has whatever defects my LBFGS does. (It's really pretty straightforward to implement given an L-BFGS) We don't provide an L-BFGS-B implementation. We do have a more general constrained qn method based on http://jmlr.org/proceedings/papers/v5/schmidt09a/schmidt09a.pdf (which uses a L-BFGS type update as part of the algorithm). From the experiments in their paper, it's likely to not work as well for bound constraints, but can do things that lbfgsb can't. 
> Again, let me know what I can help with.
>
> -- David Hall

On Mon, Mar 3, 2014 at 3:52 PM, DB Tsai <dbtsai@> wrote:

> Hi Deb,
>
>> a. OWL-QN for solving L1 natively in BFGS
>
> Based on what I saw from
> https://github.com/tjhunter/scalanlp-core/blob/master/learn/src/main/scala/breeze/optimize/OWLQN.scala,
> it seems that it's not difficult to implement OWL-QN once LBFGS is done.
>
>> b. Bound constraints in BFGS: I saw you have converted the Fortran code.
>> Is there a license issue? I can help in getting that up to speed as well.
>
> I tried to convert the Fortran L-BFGS-B implementation to Java using f2j;
> the translated code is just a mess, and it just doesn't work at all.
> There is no license issue here. Any idea about how to approach this?
>
>> c. A few variants of line searches: I will discuss it.
>
> For the dbtsai-lbfgs branch, it seems like it already got merged by
> Jenkins. I don't think
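Since the thread compares a linear CG solve against a direct BLAS posv solve for the ALS normal equations, here is a minimal, self-contained sketch in plain Scala of the classic conjugate gradient iteration for A x = b with A symmetric positive definite. The object and method names are illustrative only and do not match breeze's ConjugateGradient API.

```scala
// Minimal linear conjugate gradient for A x = b, A symmetric positive
// definite. A plain-Scala sketch; names are illustrative, not breeze's API.
// The matrix is supplied as a matrix-vector product, which is what makes
// CG attractive for large sparse systems.
object LinearCG {
  type Vec = Array[Double]

  private def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def solve(matvec: Vec => Vec, b: Vec,
            maxIter: Int = 100, tol: Double = 1e-10): Vec = {
    var x = Array.fill(b.length)(0.0)   // start from x = 0
    var r = b.clone                     // residual b - A x
    var p = r.clone                     // search direction
    var rsOld = dot(r, r)
    var it = 0
    while (it < maxIter && math.sqrt(rsOld) > tol) {
      val ap = matvec(p)
      val alpha = rsOld / dot(p, ap)    // exact minimizer along p
      x = x.zip(p).map { case (xi, pi) => xi + alpha * pi }
      r = r.zip(ap).map { case (ri, ai) => ri - alpha * ai }
      val rsNew = dot(r, r)
      // next direction: new residual made conjugate to previous directions
      p = r.zip(p).map { case (ri, pi) => ri + (rsNew / rsOld) * pi }
      rsOld = rsNew
      it += 1
    }
    x
  }
}
```

For the small dense normal equations in ALS, a direct posv solve is typically fine; CG mainly pays off when A is large, sparse, or only available as a matrix-vector product, as in the truncated Newton setting mentioned above.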
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
> On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote:
>
>> Hi David,
>>
>> On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote:
>>
>>> I'm happy to help fix any problems. I've verified at points that the
>>> implementation gives the exact same sequence of iterates for a few
>>> different functions (with a particular line search) as the C port of
>>> LBFGS. So I'm a little surprised it fails where Fortran succeeds...
>>> but only a little. This was fixed late last year.
>>
>> I'm working on a reproducible test case using the breeze vs. Fortran
>> implementation to show the problem I've run into. The test will be in
>> one of the test cases in my Spark fork; is it okay for you to
>> investigate the issue there? Or do I need to make it a standalone test?
>
> Um, as long as it wouldn't be too hard to pull out.

Will send you the test later today. Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/
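For anyone reproducing such an iterate-by-iterate comparison: the update that the breeze and Fortran implementations share (line search aside) is the L-BFGS two-loop recursion, which applies the inverse-Hessian approximation to a gradient using only the stored curvature pairs. Below is a minimal plain-Scala sketch; the object and method names are illustrative and not breeze's internal API.

```scala
// L-BFGS two-loop recursion, sketched in plain Scala. Names are
// illustrative, not breeze's internals.
object LbfgsTwoLoop {
  type Vec = Array[Double]

  private def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // pairs holds (s_i, y_i) = (x_{i+1} - x_i, grad_{i+1} - grad_i),
  // oldest first. Returns an approximation of H^{-1} g; the search
  // direction is its negation.
  def applyInverseHessian(g: Vec, pairs: Seq[(Vec, Vec)]): Vec = {
    require(pairs.nonEmpty, "need at least one curvature pair")
    var q = g.clone
    val alphas = new Array[Double](pairs.length)
    for (i <- pairs.indices.reverse) {          // first loop: newest to oldest
      val (s, y) = pairs(i)
      val rho = 1.0 / dot(y, s)
      alphas(i) = rho * dot(s, q)
      q = q.zip(y).map { case (qi, yi) => qi - alphas(i) * yi }
    }
    // initial Hessian guess: scale by gamma = s.y / y.y from the newest pair
    val (sN, yN) = pairs.last
    val gamma = dot(sN, yN) / dot(yN, yN)
    var r = q.map(_ * gamma)
    for (i <- pairs.indices) {                  // second loop: oldest to newest
      val (s, y) = pairs(i)
      val rho = 1.0 / dot(y, s)
      val beta = rho * dot(y, r)
      r = r.zip(s).map { case (ri, si) => ri + (alphas(i) - beta) * si }
    }
    r
  }
}
```

With identical history pairs and the same line search, two implementations of this recursion should produce bit-for-bit comparable directions, which is why differences in the reported iterates usually point at the line search rather than the update itself.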
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
I did not. They would be nice to have.

On Wed, Mar 5, 2014 at 5:21 PM, Debasish Das debasish.da...@gmail.com wrote:

> David,
>
> There used to be standard BFGS test cases in Professor Nocedal's
> package... did you stress test the solver with them? If not, I will
> shoot him an email for them.
>
> Thanks.
> Deb
>
> On Wed, Mar 5, 2014 at 2:00 PM, David Hall d...@cs.berkeley.edu wrote:
>
>> On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai dbt...@alpinenow.com wrote:
>>
>>> Hi David,
>>>
>>> On Tue, Mar 4, 2014 at 8:13 PM, dlwh david.lw.h...@gmail.com wrote:
>>>
>>>> I'm happy to help fix any problems. I've verified at points that the
>>>> implementation gives the exact same sequence of iterates for a few
>>>> different functions (with a particular line search) as the C port of
>>>> LBFGS. So I'm a little surprised it fails where Fortran succeeds...
>>>> but only a little. This was fixed late last year.
>>>
>>> I'm working on a reproducible test case using the breeze vs. Fortran
>>> implementation to show the problem I've run into. The test will be in
>>> one of the test cases in my Spark fork; is it okay for you to
>>> investigate the issue there? Or do I need to make it a standalone test?
>>
>> Um, as long as it wouldn't be too hard to pull out.
>
> Will send you the test later today. Thanks.
>
> Sincerely,
>
> DB Tsai
> Machine Learning Engineer
> Alpine Data Labs
> --
> Web: http://alpinenow.com/
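A classic entry in the standard unconstrained test sets discussed here (for example the Moré, Garbow & Hillstrom collection, which overlaps with what Nocedal's packages ship) is the extended Rosenbrock function. Below is a self-contained Scala sketch of its value and gradient, the pair a breeze DiffFunction's calculate would return; the object and method names are just illustrative.

```scala
// Extended Rosenbrock function, a standard stress test for BFGS-type
// solvers: f(x) = sum_i [ 100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2 ],
// with global minimum 0 at x = (1, ..., 1). Names are illustrative.
object Rosenbrock {
  def value(x: Array[Double]): Double =
    (0 until x.length - 1).map { i =>
      val t = x(i + 1) - x(i) * x(i)
      100.0 * t * t + (1.0 - x(i)) * (1.0 - x(i))
    }.sum

  def gradient(x: Array[Double]): Array[Double] = {
    val g = Array.fill(x.length)(0.0)
    for (i <- 0 until x.length - 1) {
      val t = x(i + 1) - x(i) * x(i)
      g(i)     += -400.0 * x(i) * t - 2.0 * (1.0 - x(i))
      g(i + 1) += 200.0 * t
    }
    g
  }
}
```

The narrow curved valley of this function is what stresses the line search, which makes it a useful probe for exactly the kind of breeze-vs-Fortran divergence reported earlier in the thread. The conventional starting point in two dimensions is (-1.2, 1.0).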