Re: Hi, guys, does anyone use Spark in finance market?
Hi, yes, there's definitely interest in Apache Spark among financial institutions. I can't provide specific details, but to answer your survey: "yes" and "more than a few GB!"

Here are a couple of examples showing Spark with financial data (full disclosure: I work for IBM, and I'm sure there are lots more examples you can find too):

https://www.youtube.com/watch?v=VWBNoIwGEjo shows how Spark can be used with simple sentiment analysis to find correlations between real-world events and stock market changes. The Spark-specific part runs from 3:04 until the 8th minute.

https://www.youtube.com/watch?v=sDmWcuO5Rk8 is a similar example where Spark is again used with sentiment analysis.

One could also analyse financial data to identify trends. I think a lot of the machine learning APIs will be useful here, e.g. logistic regression with many features could be used to decide whether or not an investment is a good idea based on training data (so we'd look at real outcomes from previous speculations); there's a small sketch of this idea at the end of this message. In both cases you can see Spark is a very important component for performing the calculations with financial data.

I also know that Goldman Sachs have mentioned they are interested in Spark (one talk is at https://www.youtube.com/watch?v=HWwAoTK2YrQ), so this is more evidence of the financial industry paying attention to big data and Spark.

Regarding your app: I expected it to be similar to the first example, where the signals you mention are real-world events (e.g. the Fed lowers interest rates, or companies are rumoured to be about to float or be acquired). At 4:30 I think you actually identify previous index values and extrapolate what they are likely to become, so in theory your system would become more accurate over time. Would going off indexes alone be sufficient, though (if indeed this is what you're doing)? I think you'd want to combine this with real-world speculation/news to figure out *why* the price is likely to change, by how much, and in which direction.

I agree that Apache Spark can be just the right tool for doing the heavy lifting required for analysis, computation and modelling of big data, so I'm looking forward to future Spark work in this area, and I wonder how we as Spark developers can make it easier/more powerful for Spark users to do so.

From: "Taotao.Li"
To: user
Date: 30/08/2016 14:14
Subject: Hi, guys, does anyone use Spark in finance market?

Hi, guys, I'm a quant engineer in China, and I believe it's very promising to use Spark in the financial market, but I didn't find cases which combine Spark and finance. So here I want to do a small survey: do you guys use Spark in financial-market-related projects? If yes, how much data was fed into your Spark application? Thanks a lot.

A little ad: I attended the IBM Spark Hackathon, which is here: http://apachespark.devpost.com/ , and I submitted a small application which will be used in my strategies. I hope you guys can give me a vote and some suggestions on how to use Spark in the financial market to discover trade opportunities. Here is my small app: http://devpost.com/software/spark-in-finance-quantitative-investing

Thanks a lot.

--
Quant | Engineer | Boy
blog: http://litaotao.github.io
github: www.github.com/litaotao
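To make the logistic regression idea above concrete, here's a minimal sketch against the Spark 2.0 ml API. It's illustrative only: the file names, feature columns and label are all made up, and real feature engineering for trading signals would of course be far more involved.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.classification.LogisticRegression

    val spark = SparkSession.builder.appName("SpeculationClassifier").getOrCreate()

    // Historical data: one row per past speculation, label = 1.0 if it paid off.
    // "past_speculations.csv" and the column names are hypothetical.
    val history = spark.read
      .option("header", "true").option("inferSchema", "true")
      .csv("past_speculations.csv")

    // MLlib expects the features gathered into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("priceMomentum", "volume", "sentimentScore"))
      .setOutputCol("features")

    val model = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .fit(assembler.transform(history))

    // Score new opportunities: the probability column estimates P(good investment).
    val candidates = assembler.transform(
      spark.read.option("header", "true").option("inferSchema", "true").csv("candidates.csv"))
    model.transform(candidates).select("probability", "prediction").show()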
Re: Spark 2.0.0 - Java vs Scala performance difference
On Java vs Scala: Sean's right that behind the scenes you'll be calling JVM-based APIs anyway (e.g. sun.misc.Unsafe for Tungsten) and that the vast majority of Apache Spark's important logic is written in Scala.

It would be an interesting experiment to write the same functioning program using the Java APIs vs the Scala APIs just to see if there is a noticeable difference. I'm thinking in terms of how the Scala implementation libraries perform at runtime: with profiling (we use Health Center, tprof, or just microbenchmarking with prints and timers - there's a small sketch of that style at the end of this message), we've seen lots of code in Scala itself to do with (un)boxing and isInstanceOf checks that could do with some TLC for performance. Now quite outdated, but this still shows that writing what's concise (Scala) isn't always best for performance: https://jazzy.id.au/2012/10/16/benchmarking_scala_against_java.html

So if we just stick to Java we may not hit those overheads as often (there's a talk by my colleague on boosting performance from a Java implementer's perspective at https://www.youtube.com/watch?v=rcVTM-71bZk), but I don't expect the differences to be enormous. Full disclosure: I work for IBM, and one of our goals is to improve Apache Spark and our Java implementation to perform fast together.

There's also the obvious trade-off of developer productivity and code maintainability (more Java devs than Scala devs), so my suggestion is: if you're much better at writing Java or Scala code, use that for the really important, performance-critical logic - and be aware that you're going to be hitting the Apache Spark codebase written in Scala anyway, so there's only so much to be gained here.

I also think that Just-in-Time compiler implementations are generally better at optimising what's written as Java code instead of Scala code, as knowing the types well ahead of time, and where codepath shortcuts can be made in the bytecode execution, should deliver slight performance improvements. I am keen to come up with some solid recommendations, based on evidence, for us all to benefit from.

From: Aseem Bansal
To: ayan guha
Cc: Sean Owen, user
Date: 01/09/2016 13:11
Subject: Re: Spark 2.0.0 - Java vs Scala performance difference

There is already a mail thread for Scala vs Python. Check the archives.

On Thu, Sep 1, 2016 at 5:18 PM, ayan guha wrote:
How about Scala vs Python?

On Thu, Sep 1, 2016 at 7:27 PM, Sean Owen wrote:
I can't think of a situation where it would be materially different. Both are using the JVM-based APIs directly. Here and there there's a tiny bit of overhead in using the Java APIs because something is translated from a Java-style object to a Scala-style object, but this is generally trivial.

On Thu, Sep 1, 2016 at 10:06 AM, Aseem Bansal wrote:
> Hi
>
> Would there be any significant performance difference when using Java vs.
> Scala API?

--
Best Regards,
Ayan Guha
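For what it's worth, here's a minimal sketch of the prints-and-timers microbenchmarking style mentioned above, comparing a generic (boxing) fold with a hand-written primitive loop. It's deliberately crude - use a proper harness such as JMH for trustworthy numbers - and results will vary by JVM.

    // Crude prints-and-timers microbenchmark: boxed generic fold vs primitive loop.
    // Illustrative only - not a rigorous benchmark.
    object BoxingBench {
      def timed[T](label: String)(body: => T): T = {
        val start = System.nanoTime()
        val result = body
        println(f"$label%s: ${(System.nanoTime() - start) / 1e6}%.1f ms")
        result
      }

      // foldLeft is generic in its accumulator, so each step boxes the Long.
      def boxedSum(n: Int): Long = (1 to n).foldLeft(0L)((acc, i) => acc + i)

      // Hand-written loop stays on primitives throughout.
      def primitiveSum(n: Int): Long = {
        var acc = 0L
        var i = 1
        while (i <= n) { acc += i; i += 1 }
        acc
      }

      def main(args: Array[String]): Unit = {
        val n = 10000000
        // Warm up so the JIT has compiled both paths before we measure.
        (1 to 5).foreach { _ => boxedSum(n); primitiveSum(n) }
        timed("boxed foldLeft")(boxedSum(n))
        timed("primitive while")(primitiveSum(n))
      }
    }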
Re: Spark build 1.6.2 error
Looks familiar - have you got the zinc server running on a shared dev box? Run:

    ps -ef | grep "com.typesafe.zinc.Nailgun"

look for the zinc server process, kill it and try again. Spark branch-1.6 builds fine here from scratch, and I've had plenty of problems caused by a lingering zinc server (started with build/mvn). If it's not zinc, it could also be a Scala version mismatch - see the build commands after the quoted log below.

From: Nachiketa
To: Diwakar Dhanuskodi
Cc: user
Date: 31/08/2016 12:17
Subject: Re: Spark build 1.6.2 error

Hi Diwakar,

Could you please share the entire Maven command that you are using to build, and also the JDK version you are using? Could you also confirm that you executed the script to change the Scala version to 2.11 before starting the build? Thanks.

Regards,
Nachiketa

On Wed, Aug 31, 2016 at 2:00 AM, Diwakar Dhanuskodi <diwakar.dhanusk...@gmail.com> wrote:

Hi,

While building Spark 1.6.2, I'm getting the below error in spark-sql. Any help is much appreciated.

[ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'. Could not access term eclipse in package org, because it (or its dependencies) are missing. Check your build definition for missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.) A full rebuild may help if 'WebUI.class' was compiled against an incompatible version of org.
[ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'. Could not access term jetty in value org.eclipse, because it (or its dependencies) are missing. Check your build definition for missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see the problematic classpath.) A full rebuild may help if 'WebUI.class' was compiled against an incompatible version of org.eclipse.
[WARNING] 17 warnings found
[ERROR] two errors found
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [4.399s]
[INFO] Spark Project Test Tags ........................... SUCCESS [3.443s]
[INFO] Spark Project Launcher ............................ SUCCESS [10.131s]
[INFO] Spark Project Networking .......................... SUCCESS [11.849s]
[INFO] Spark Project Shuffle Streaming Service ........... SUCCESS [6.641s]
[INFO] Spark Project Unsafe .............................. SUCCESS [19.765s]
[INFO] Spark Project Core ................................ SUCCESS [4:16.511s]
[INFO] Spark Project Bagel ............................... SUCCESS [13.401s]
[INFO] Spark Project GraphX .............................. SUCCESS [1:08.824s]
[INFO] Spark Project Streaming ........................... SUCCESS [2:18.844s]
[INFO] Spark Project Catalyst ............................ SUCCESS [2:43.695s]
[INFO] Spark Project SQL ................................. FAILURE [1:01.762s]
[INFO] Spark Project ML Library .......................... SKIPPED
[INFO] Spark Project Tools ............................... SKIPPED
[INFO] Spark Project Hive ................................ SKIPPED
[INFO] Spark Project Docker Integration Tests ............ SKIPPED
[INFO] Spark Project REPL ................................ SKIPPED
[INFO] Spark Project YARN Shuffle Service ................ SKIPPED
[INFO] Spark Project YARN ................................ SKIPPED
[INFO] Spark Project Assembly ............................ SKIPPED
[INFO] Spark Project External Twitter .................... SKIPPED
[INFO] Spark Project External Flume Sink ................. SKIPPED
[INFO] Spark Project External Flume ...................... SKIPPED
[INFO] Spark Project External Flume Assembly ............. SKIPPED
[INFO] Spark Project External MQTT ....................... SKIPPED
[INFO] Spark Project External MQTT Assembly .............. SKIPPED
[INFO] Spark Project External ZeroMQ ..................... SKIPPED
[INFO] Spark Project External Kafka ...................... SKIPPED
[INFO] Spark Project Examples ............................ SKIPPED
[INFO] Spark Project External Kafka Assembly ............. SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 12:40.525s
[INFO] Finished at: Wed Aug 31 01:56:50 IST 2016
[INFO] Final Memory: 71M/830M
[INFO]
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-sql_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR]
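On the Scala 2.11 question Nachiketa raised: the documented sequence for building Spark 1.6 against Scala 2.11 (per the Spark building-spark docs; the Hadoop profile here is just an example - adjust for your environment) is roughly:

    ./dev/change-scala-version.sh 2.11
    ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

Skipping the change-scala-version step while building the _2.11 modules is a common way to end up with missing-dependency errors like the one above.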
Re: Memory allocation error with Spark 1.5, HashJoinCompatibilitySuite
Hi, I'm regularly hitting "Unable to acquire memory" problems when the code tries to acquire overflow pages, but only when running the full set of Spark tests; this happens across different platforms. The machines I'm using all have well over 10 GB of RAM, I'm running without any changes to the pom.xml file, and the standard 3 GB Java heap is specified. (A toy sketch of how the memory pool can be exhausted despite plenty of physical RAM follows at the end of this message.)

I'm working off this revision:

commit 43e0135421b2262cbb0e06aae53523f663b4f959
Author: Yin Huai
Date: Thu Aug 20 15:30:31 2015 +0800

[SPARK-10092] [SQL] Multi-DB support follow up.
https://issues.apache.org/jira/browse/SPARK-10092
This pr is a follow-up one for Multi-DB support. It has the following changes:
* `HiveContext.refreshTable` now accepts `dbName.tableName`.

I've added prints in a variety of places. When we run just the one suite we don't hit the problem, but with the whole batch of tests we do. Example below; note that it's always in the join31 test.

cat CheckHashJoinFullBatch.txt | grep -C 10 "join31"

- auto_join30
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
Creating unsafe external sorter, pageSizeBytes: 4194304
acquiring 4194304 from shuffle memory manager
memoryAcquired is: 4194304
- auto_join31
- auto_join32
- auto_join4
- auto_join5
- auto_join6
- auto_join7
- auto_join8
- auto_join9
04:53:44.685 WARN org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite: Simplifications made on unsupported operations for test auto_join_filters
- auto_join_filters
- auto_join_nulls
--
05:08:18.329 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 43.0 in stage 2993.0 (TID 130982, localhost): TaskKilled (killed intentionally)
05:08:18.330 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 40.0 in stage 2993.0 (TID 130979, localhost): TaskKilled (killed intentionally)
05:08:18.340 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 15.0 in stage 2993.0 (TID 130954, localhost): TaskKilled (killed intentionally)
05:08:18.341 ERROR org.apache.spark.executor.Executor: Managed memory leak detected; size = 12582912 bytes, TID = 130985
05:08:18.341 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 46.0 in stage 2993.0 (TID 130985, localhost): TaskKilled (killed intentionally)
05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 41.0 in stage 2993.0 (TID 130980, localhost): TaskKilled (killed intentionally)
05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 26.0 in stage 2993.0 (TID 130965, localhost): TaskKilled (killed intentionally)
05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 4.0 in stage 2993.0 (TID 130943, localhost): TaskKilled (killed intentionally)
05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 11.0 in stage 2993.0 (TID 130950, localhost): TaskKilled (killed intentionally)
05:08:18.349 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 28.0 in stage 2993.0 (TID 130967, localhost): TaskKilled (killed intentionally)
- join31 *** FAILED ***
Failed to execute query using catalyst:
Error: Job aborted due to stage failure: Task 42 in stage 2993.0 failed 1 times, most recent failure: Lost task 42.0 in stage 2993.0 (TID 130981, localhost): java.io.IOException: Unable to acquire 4194304 bytes of memory
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:371)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:350)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:489)
at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:477)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:610)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)

I run the test on its own with:

mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -DwildcardSuites=org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite -fn test > CheckHashJoin.txt 2>&1

I run the whole batch with:

mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -fn test > CheckHashJoinFullBatch.txt 2>&1

java version "1.7.0_65"
OpenJDK Runtime Environment (rhel-2.5.1.2.el7_0-x86_64 u65-b17)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed
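My working theory on why a machine with plenty of RAM can still fail to acquire 4 MB: the concurrently running tasks share a fixed-size execution memory budget, so enough simultaneous page requests exhaust it regardless of physical memory. Here's a heavily simplified toy sketch of that dynamic - this is not Spark's actual shuffle memory manager code, and all the names and sizes are made up - just to illustrate the failure mode:

    // Toy model of a bounded execution-memory pool shared by tasks.
    // NOT Spark's implementation; purely illustrative.
    class ToyMemoryPool(val capacity: Long) {
      private var used = 0L

      // Grant a page if it fits in the remaining budget, otherwise refuse.
      def tryAcquire(bytes: Long): Boolean = synchronized {
        if (used + bytes <= capacity) { used += bytes; true } else false
      }

      def release(bytes: Long): Unit = synchronized { used -= bytes }
    }

    object PoolExhaustionDemo extends App {
      val pageSize = 4L * 1024 * 1024              // 4194304 bytes, as in the logs
      val pool = new ToyMemoryPool(48L * pageSize) // pretend the pool holds 192 MB

      // 64 concurrent tasks each wanting one page: the last 16 requests fail,
      // no matter how much physical RAM the machine has.
      val results = (1 to 64).map(_ => pool.tryAcquire(pageSize))
      println(s"granted: ${results.count(identity)}, refused: ${results.count(!_)}")
    }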